FIELD OF THE INVENTION
The field of the invention is data access and storage technologies.
BACKGROUND
The background description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
The evolution of computing and networking technologies has made data collection, storage, and analysis increasingly easier to perform, and at a continuously larger scale. The ever-decreasing size of network-capable computing devices has increased the number of data sources gathering or creating data, the types of data available from these sources, and the overall amount of data available. Likewise, advancements in data communications and storage technologies have enabled entities interested in data from these sources to collect increasingly larger amounts of data in databases or other data stores. This exponential growth in digitized data generation and collection continues to be fueled by machine-generated data originating from devices such as sensors and probes that can monitor, measure and assert health, behavior, state, environment and performance of many types of machines and man-made systems, as well as humans and many aspects of the natural world.
The collection of data on such a scale allows for analysis that can result in discoveries and advancements, across various fields of study, which were not previously possible. For example, a medical researcher can use medical information gathered by wearable devices or sensors outside of a hospital setting to analyze health or medical patterns across a population. In another example, an advertiser can use online behaviors of a population of users to determine product trends, interests and advertisement effectiveness within a population.
However, certain data generating devices such as sensors or probes are capable of generating data flows that, while digital, reflect their analog nature (and, moreover, are often non-linear), or simply cannot be classified as symbolic and human-readable.
To explore, discover and extract pertinent information out of these new data flows, an incremental process is required that allows for starting at a state where very little is known about the data, and provides for development towards a full data model at both the data consumption side and the data repository level. The complexity of this task requires methods far beyond simple numeric comparison and/or textual search. For example, signal data should be stored as-is rather than processed before storage (which would result in loss of information), with signal processing techniques (e.g., FFT) then used to extract a relevant view of the signal. This process is recursive in nature and, as such, the meaning of data (e.g., classifying, categorizing, segmenting, etc.) cannot be decided a priori.
Adding to this the rapidly widening gap between digital data production capabilities and network bandwidth capacity (at any scale), it becomes imperative to store the source data close to their point of production and only distribute across the network the data relevant to the task at hand.
Unfortunately, existing data management solutions (e.g., relational databases, non-relational databases, data stores, and other data collection techniques) have traditionally required static, pre-defined database structures, rules and schema that are created for the database when the database is established. As such, users requesting the data are limited to data access according to static schema (that may be outdated), from a database whose structure might be inefficient and costly. Additionally, updating the database structure, rules or schema in existing solutions requires re-starting the database from scratch.
Others have put forth efforts towards adaptive database systems. For example, United States issued U.S. Pat. No. 5,983,218 to Syeda-Mahmood is directed to modifying a relevance ranking of databases based on query and response patterns for the databases. However, Syeda-Mahmood lacks any discussion of a modification of the databases themselves.
United States pre-grant publication number 2011/0282872 to Oksman, et al (“Oksman”) is directed to updating a system to increase the effectiveness of future queries. However, in Oksman, the system's updating is performed based on usage of query results or other feedback to the query results, rather than based on the results themselves. Similarly, United States pre-grant publication number 2012/0296743 to Velipasaouglu, et al (“Velipasaouglu”) is directed to updating a database based on a query and a user's activity following a query response.
United States pre-grant publication number 2007/0294266 to Chowdhary, et al (“Chowdhary”) is directed to using time-variant data schemas for database management based on database modification requests. However, in Chowdhary, the system simply stores new versions of schema alongside older versions. Additionally, Chowdhary lacks any discussion regarding using query responses to generate new or updated versions of data schema.
All publications identified herein are incorporated by reference to the same extent as if each individual publication or patent application were specifically and individually indicated to be incorporated by reference. Where a definition or use of a term in an incorporated reference is inconsistent or contrary to the definition of that term provided herein, the definition of that term provided herein applies and the definition of that term in the reference does not apply.
The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art or relevant to the presently claimed invention, or that any publication specifically or implicitly referenced is prior art.
In some embodiments, the numbers expressing quantities of ingredients, properties such as concentration, reaction conditions, and so forth, used to describe and claim certain embodiments of the invention are to be understood as being modified in some instances by the term “about.” Accordingly, in some embodiments, the numerical parameters set forth in the written description and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by a particular embodiment. In some embodiments, the numerical parameters should be construed in light of the number of reported significant digits and by applying ordinary rounding techniques. Notwithstanding that the numerical ranges and parameters setting forth the broad scope of some embodiments of the invention are approximations, the numerical values set forth in the specific examples are reported as precisely as practicable. The numerical values presented in some embodiments of the invention may contain certain errors necessarily resulting from the standard deviation found in their respective testing measurements.
Unless the context dictates the contrary, all ranges set forth herein should be interpreted as being inclusive of their endpoints and open-ended ranges should be interpreted to include only commercially practical values. Similarly, all lists of values should be considered as inclusive of intermediate values unless the context indicates the contrary.
As used in the description herein and throughout the claims that follow, the meaning of “a,” “an,” and “the” includes plural reference unless the context clearly dictates otherwise. Also, as used in the description herein, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range. Unless otherwise indicated herein, each individual value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g. “such as”) provided with respect to certain embodiments herein is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention otherwise claimed. No language in the specification should be construed as indicating any non-claimed element essential to the practice of the invention.
Groupings of alternative elements or embodiments of the invention disclosed herein are not to be construed as limitations. Each group member can be referred to and claimed individually or in any combination with other members of the group or other elements found herein. One or more members of a group can be included in, or deleted from, a group for reasons of convenience and/or patentability. When any such inclusion or deletion occurs, the specification is herein deemed to contain the group as modified thus fulfilling the written description of all Markush groups used in the appended claims.
Thus, there is still a need for a system that can dynamically adapt the structure, schema and/or metadata of its data archives.
SUMMARY OF THE INVENTION
The inventive subject matter provides apparatus, systems and methods in which responses (“extracts”) to requests against a data store are used to update a schema and/or structure of the data store.
In some embodiments the data store is an archive of one or more sources of data. Archives might or might not be compressed, might or might not include all of the data of the archived data sources, and might or might not have the same structure as the data source(s). The archives can store data at full fidelity (i.e., a reversible process, a bijection between source data and stored data). Among other things, archives can comprise one or more mirrors of the data source(s), collection(s) of data from the data source(s), as well as data from a sensor or other transient data source that would not otherwise be stored. Archives are typically considered to be write-once, read-many data stores, although it is contemplated that archives can grow by accruing data from additional data source(s).
Archives are preferably located logically proximate to their data sources, relative to end users, other archives or other intermediary network components.
The schema includes metadata about the archive. Contemplated metadata includes field names, data definitions, data types, access rules, traversal rules, strings used in executing historical extract requests, and statistical data regarding response data priority or other request patterns. Some or all of the metadata can advantageously be derived from requests, responses to the requests and/or processor, memory or other performance in executing the requests.
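By way of illustration only, the contemplated schema metadata can be pictured as a simple structure such as the Python sketch below; the attribute names and the statistics shown are assumptions chosen for illustration, not a prescribed format.

```python
# Illustrative sketch only: attribute names and structure are assumptions,
# not a prescribed format for the contemplated schema metadata.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ArchiveSchema:
    field_names: List[str] = field(default_factory=list)          # e.g. ["Date", "Time", "Pump Voltage"]
    data_types: Dict[str, str] = field(default_factory=dict)      # field name -> "string", "integer", ...
    access_rules: Dict[str, str] = field(default_factory=dict)    # field name -> role permitted to read it
    historical_requests: List[str] = field(default_factory=list)  # strings used in past extract requests
    request_counts: Dict[str, int] = field(default_factory=dict)  # field name -> times requested

    def record_request(self, requested_fields: List[str]) -> None:
        """Derive simple statistical metadata from an incoming extract request."""
        self.historical_requests.extend(requested_fields)
        for name in requested_fields:
            self.request_counts[name] = self.request_counts.get(name, 0) + 1
```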
Embodiments can also include an analysis engine that performs the functions of updating the schema and/or structure of the data store. Contemplated updates include adding to, deleting and modifying the data definitions, data types or other metadata. Other contemplated updates include compressing or re-arranging at least part of the data store. Some or all of the updates to the schema can advantageously be derived from requests, responses to the requests and/or processor, memory or other performance in executing the requests.
Responses from the data store are preferably stored in a response repository, and at least a portion of the response repository can be published on a network, for access by all manner of authorized entities, including for example requesting entities, and analysis engines not closely associated with the data store.
Thus, the inventive subject matter can be used, for example, to provide full-fidelity storage of data while providing end users with constantly evolving ways to access and explore the data and retrieve what they need in a network-efficient manner.
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.
BRIEF DESCRIPTION OF THE DRAWING
FIG. 1 is a schematic overview of an exemplary system according to the inventive subject matter.
FIG. 2 is an overview of the information flow diagram within the exemplary system.
FIG. 3 is a flow diagram of the execution of processes and functions of the retrieval engine.
FIG. 4 is a flow diagram of the execution of processes and functions of the analysis engine.
DETAILED DESCRIPTION
Throughout the following discussion, numerous references will be made regarding servers, services, interfaces, engines, modules, clients, peers, portals, platforms, or other systems formed from computing devices. It should be appreciated that the use of such terms is deemed to represent one or more computing devices having at least one processor (e.g., ASIC, FPGA, DSP, x86-based CPU, ARM-based CPU, ColdFire-based CPU, GPU, multi-core processors, etc.) configured to execute software instructions stored on a computer readable tangible, non-transitory medium (e.g., hard drive, solid state drive, RAM, flash, ROM, etc.). For example, a server can include one or more computers operating as a web server, data store server, database server, or other type of computer server in a manner to fulfill described roles, responsibilities, or functions. One should further appreciate the disclosed computer-based algorithms, processes, methods, or other types of instruction sets can be embodied as a computer program product comprising a non-transitory, tangible computer readable media storing the instructions that cause a processor to execute the disclosed steps. The various servers, systems, databases, data stores or interfaces can exchange data using standardized protocols, interfaces and/or algorithms, possibly based on HTTP, HTTPS, AES, public-private key exchanges, web service APIs, known financial transaction protocols, or other electronic information exchanging methods. Examples of data exchange interfaces can include Ethernet, USB, HDMI, Bluetooth, wired, wireless, near-field communication interfaces, etc. Data exchanges can be conducted over a packet-switched network, the Internet, LAN, WAN, VPN, cellular, or other type of packet switched network.
One should appreciate that the disclosed techniques provide many advantageous technical effects including enabling the constant refinement of a data archive to decrease the computational cost of executing data requests against the archive.
The following discussion provides many example embodiments of the inventive subject matter. Although each embodiment represents a single combination of inventive elements, the inventive subject matter is considered to include all possible combinations of the disclosed elements. Thus if one embodiment comprises elements A, B, and C, and a second embodiment comprises elements B and D, then the inventive subject matter is also considered to include other remaining combinations of A, B, C, or D, even if not explicitly disclosed.
As used herein, and unless the context dictates otherwise, the term “coupled to” is intended to include both direct coupling (in which two elements that are coupled to each other contact each other) and indirect coupling (in which at least one additional element is located between the two elements). Therefore, the terms “coupled to” and “coupled with” are used synonymously. Within the context of a networking environment, the terms “coupled to” and “coupled with” are also used euphemistically to mean “communicatively coupled with” where two or more network-enabled devices are able to exchange data over a network with each other, possibly via one or more intermediary devices.
FIG. 1 depicts a system 100 that generally includes a data component 110, a processing and communication component 120 and a requesting component 130.
The data component 110 generally includes at least one data source 111, at least one archive 112 at least initially derived from the at least one data source 111, and at least one schema 113 associated with the at least one archive 112. FIG. 1 illustrates an example whereby a data archive 112 is derived from a plurality of data sources 111, and has one corresponding schema 113. However, it is contemplated that a data archive 112 can have a plurality of associated schema 113. Likewise, it is contemplated that the data archive 112 can be derived from a single data source 111. Conversely, in the simplest case, there is one data source 111 that is archived at some point in time as a single archive 112, which is associated with a single schema 113.
From the perspective of the data archive 112, data source 111 can be any device from which the data used to create and update the contents of the data archive 112 are received. Thus, generally speaking, data source(s) 111 can include data storage devices (i.e., devices that store data obtained from other sources), data creation devices (i.e., devices that can generate data but do not store it), and combination data storage/generation devices (i.e., devices that can generate data, and store generated and other data). Examples of data source 111 can include sensors (e.g., accelerometers, motion sensors, biometric sensors, temperature sensors, force sensors, impact sensors, flowmeters, GPS and/or other location sensors, pressure sensors, etc.), data storage devices (e.g., server computers, non-transitory computer-readable memory components such as hard drives, solid state drives, flash drives, optical storage devices, etc.), computing devices (e.g., desktop computers, laptop computers, tablets, phablets, smartphones, etc.), and user-input devices (e.g., devices that receive data from users, which can include computing devices with user-input interfaces).
Data archive 112 can be considered to be a collection of data obtained from data source(s) 111. The data archive 112 can be embodied via at least one non-transitory computer readable storage medium that is configured to store the data of the data archive. In embodiments, the data in the data archive 112 can be of the same data type as the data of data source(s) 111. In embodiments, the data archive 112 can import schemas, data definitions, and other data properties from the data source(s) 111. In embodiments, the data in the data archive 112 can be in the same format as the source data from source(s) 111 (of the same or different data types).
In embodiments, the data of archive 112 can be a full-fidelity version of the corresponding data from source(s) 111. In these embodiments, a bijection exists between the source data and the archive data, such that the source data in its original form can be reconstructed or regenerated from the corresponding archive data.
In embodiments, the data archive 112 can comprise text data, whereby the data from source(s) 111 is converted to text data for inclusion in data archive 112.
In embodiments, the data archive 112 can be a mirror of the data source(s) 111.
In embodiments, data archive 112 can contain data in the form of attributes and tuples specified in an input schema. The input schema can be considered to be a default schema used in the creation of the archive 112. Thus, the data archive 112 is the primary physical instantiation (data written to storage) of data in the system 100.
In these embodiments, the data archive 112 can be described as a tabular data structure. Attributes are generally synonymous with columns, whereas records or rows are generally synonymous with tuples. A tuple can represent one value for each attribute in an archive relative to an ordered or fixed rank within each attribute. The rank may be relative to an ordering of some or all attributes and can be defined in the input schema. Alternatively, the collection of data within the archive 112 can default to a rank based on the structure and order of data received.
Generally speaking, the schema 113 can be considered to be the structure of the data archive 112, providing the organization of the data within the data archive 112. A schema 113 can include a definition of fields, data types, record delimiters, classes, relationships between data, compression rules, etc. The schema 113 can include performance metric thresholds for the execution of requests. The performance metric thresholds can be set according to sections of data, particular records, request types, etc. The performance metric thresholds can include targets for the execution of requests (e.g., time to completion, processor load, etc.), acceptable tolerances, etc. The thresholds can be dynamically adjusted based on factors such as network capability (overall and/or at a particular point in time), identity of the requestor, the frequency of the data being accessed, etc. For example, for data that is frequently accessed, the acceptable performance metric thresholds can be set to be more strict (i.e., only a slight drop in access speed is permissible).
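A minimal sketch of how such thresholds might be tightened for frequently accessed data is shown below; the threshold keys, the hit count, and the tightening factor are illustrative assumptions rather than required values.

```python
# Hypothetical sketch: the threshold keys and tightening rule are assumptions
# used to illustrate dynamically adjusted performance metric thresholds.
DEFAULT_THRESHOLDS = {"max_seconds": 2.0, "max_cpu_load": 0.75}

def thresholds_for(field_name, request_counts, defaults=DEFAULT_THRESHOLDS):
    """Return stricter thresholds for frequently accessed fields."""
    hits = request_counts.get(field_name, 0)
    if hits > 1000:  # frequently accessed: only a slight drop in access speed is permissible
        return {"max_seconds": defaults["max_seconds"] * 0.25,
                "max_cpu_load": defaults["max_cpu_load"]}
    return dict(defaults)
```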
As mentioned above, a data archive 112 can include a default (i.e., input) schema 113 providing an initial structure of the archive to which initial requests to the archive 112 are applied to generate the initial responses. For example, in an otherwise completely unstructured and otherwise undefined data archive, a default schema can be the designation of record boundaries within the data archive, which serve to provide an initial structural organization to the data. These boundaries can be defined a priori by an administrator, or can correspond to known properties of data from source 111. This example is provided for illustrative purposes, and it is contemplated that a default schema can include other and/or additional structural definitions, classifications, categorizations, etc. It is also contemplated that the data archive 112 can lack any default schema whatsoever, wherein an initial schema can be constructed via parsing and applying pattern recognition and rule-based algorithms to the data archive 112 by a processor (such as processor 121).
In embodiments, the pattern recognition and rule-based algorithms can be applied to the default schema, thus providing an initial step of evolution to the basic default schema. In an illustrative example, an archive can be created from sensor data, whereby the archive is created based on a simple read of the sensor data. In this example, the sensor data is loaded into an archive in tabular form where a row of values is associated with a timestamp and sensor output (e.g., “Aug. 4, 2014 17:08:35, 75.9, 234.8, . . . ”). The initial archive schema can be the known qualities of the sensor data, which in this example are the names associated with the fields of the rows—“Date”, “Time”, “Pump Temperature”, “Pump Voltage”, etc. Upon applying pattern recognition to multiple rows (preferably tens, hundreds, or even thousands of rows), aspects of the sensor data can be discovered to be periodic in nature. In this example, a voltage variation can be observed as having a baseline and a periodic multi-harmonic signal. Thus, a curve (e.g., a wave) having a mathematical equation can be derived from the voltage data. The default initial schema can be updated to incorporate this discovery (e.g., via a set of parameters for a fast Fourier transform (FFT)). Thus, a new set of operators, previously unknown, can be applied to the data in processing user requests. When users access data, they can request data associated with the harmonic signal part of the voltage data, whereby the processing component can use this updated schema to perform the operations necessary to remove the baseline voltage data from the generated responses when the field is referenced in future requests. However, the underlying voltage data in the archive is not modified or transformed at the physical/persistent level.
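The following sketch illustrates the kind of pattern-recognition step described in the example above, assuming the voltage column has already been parsed from the archive into a NumPy array; the function name, the peak-picking threshold, and the schema keys are assumptions made for illustration.

```python
# Sketch of discovering a baseline and periodic multi-harmonic structure in
# stored sensor voltages; the underlying archive data is never modified.
import numpy as np

def discover_harmonics(voltage, sample_rate_hz, threshold=0.1):
    """Return (baseline, dominant frequencies in Hz) derived from many rows of data."""
    baseline = float(np.mean(voltage))               # constant offset observed across rows
    spectrum = np.fft.rfft(voltage - baseline)       # FFT of the zero-mean signal
    freqs = np.fft.rfftfreq(len(voltage), d=1.0 / sample_rate_hz)
    magnitude = np.abs(spectrum)
    dominant = freqs[magnitude > threshold * magnitude.max()]
    return baseline, dominant.tolist()

# The discovery can then be recorded in the schema (keys are hypothetical), so
# future requests can reference the harmonic part of the signal:
#   schema["Pump Voltage"] = {"baseline": baseline, "harmonics_hz": dominant}
```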
The schema 113 can include metadata associated with the archive 112, such as system metadata and archive metadata. System metadata contains system configuration, performance, and consistency information that is both created and updated. This metadata can be published and shared with other systems. For example, a metadata structure shared with other systems (i.e., nodes in the larger ecosystem made up of a plurality of systems of the inventive subject matter) can be updated to record the creation of a new archive through a messaging service. Archive metadata contains statistical observations and indices calculated during storage engine operations, including compression and write. This metadata can be updated along with other aspects of schema 113.
The processing and communication component 120 generally includes at least one processing component 121, memory 122, a retrieval engine 123, an analysis engine 124, and a communication component 125.
The processing component 121 can be one or more computer processors that execute instructions to carry out functions and processes associated with the inventive subject matter.
The retrieval engine 123 performs the functions associated with obtaining data from the data archive 112 in response to the extract request received from a requestor, and providing the data in the form of a response back to the requestor, described in further detail herein. The analysis engine 124 performs the functions associated with modifying the schema 113 and/or the archive 112 based on the response, the schema 113 and, in embodiments, the request. The functions and processes executed by analysis engine 124 are described in further detail herein.
In embodiments, the retrieval engine 123 and analysis engine 124 can each comprise a set of computer-executable instructions that are executed by processing component 121 to carry out their respective functions. In these embodiments, the retrieval engine 123 and analysis engine 124 can be a single engine having the functions of both or be separate engines, can be stored in either the same or separate non-transitory computer-readable media, and can be executed by the same or different processing component(s) 121.
Communication component 125 can include any communication interface enabling the processing component 120 to exchange data with the user interface 132, via one or more data exchange networks, examples of which include the data exchange interfaces, protocols and/or networks discussed herein.
The requesting component 130 generally includes at least one requesting entity 131 and at least one interface 132. The requesting entity 131 can be considered to be the entity that initiates the request for data from archive 112, via the interface 132. The requesting entity 131 can be a single user (as illustrated in the example of FIG. 1), a group of users, an organization, an enterprise, etc.
User interface 132 is an interface via which a requesting entity 131 can submit requests to access data contained in data archive 112. The user interface 132 is presented to the user via a computing device, through which the user can create the requests. The user interface 132 can be a web-based interface hosted by an administrator of data archive 112 and accessible via an internet browser on the computing device, an application executed on a computing device of requesting entity 131, etc. As used herein, the term “user interface” can be considered to refer to the software application as well as the computing device used to present the interface to the user and that enables the user to create requests.
As used herein, the term “requestor” is used to refer to the interface 132 as the origin of the request, created according to the requesting entity 131. Thus, the term “requestor” may or may not include requesting entity 131, but always includes the interface 132.
As illustrated in FIG. 1, the groupings of system components into the data component 110, the processing and communication component 120, and the requesting component 130 are provided for illustrative purposes according to the various functions of the system components according to aspects of the inventive subject matter. Thus, the illustrated “grouping components” 110, 120, 130 are not intended to limit or define the contemplated physical embodiments of the system 100.
In embodiments, the data archive 112 is in relatively close data proximity to one or more of the data source(s) 111, as compared to the requesting component 130. The term “data proximity” is intended to refer to the relative difficulty in transmitting data from a sender to a recipient, which can be influenced by factors such as physical proximity, size of data being transmitted, network capacity between the sender and receiver, number of intermediary nodes between the sender and receiver, and other factors. Thus, in these embodiments, the data archive 112 is communicatively coupled to the data sources 111 in such a way that the exchange of data from the data source(s) 111 to the data archive 112 is significantly faster than the exchange between data archive 112 and requesting component 130. This can be due to factors such as the data archive 112 being in closer physical proximity to the data source(s) 111 than to requesting component 130; the network capacity between the archive 112 and data sources 111 being greater than that of the network between archive 112 and requesting component 130 (e.g., greater bandwidth, a better-optimized network connection, fewer intermediary nodes slowing down data exchange, fewer bottlenecks, etc.); the size of information sent by individual data sources 111 to archive 112 being smaller than the sending of all of the archive 112 to the requesting components; etc.; or a combination of these factors.
The system 100 can also include an operator interface (not shown) that allows an operator (e.g., a system administrator or other personnel having control over the data archive) to perform administrative and other service-related functions over the various aspects of the system. The operator interface can include one or more computing devices communicatively coupled to various components of the system 100. An operator can use the operator interface to oversee the creation and loading of data into a data archive 112, manage archive resources and the computing environment, manage access control and security functions, etc. For example, an operator can trigger a manual alteration of the fidelity of the data archive (potentially losing information). This alteration can be applied on an archive history basis, and is irreversible. However, the operator may elect to do so to conserve storage space when faced with storage constraint issues.
FIG. 2 provides an overview of the data flow processes of system 100, according to aspects of the inventive subject matter.
As shown in FIG. 2, the data archive 112 is initially created from data source 111, illustrated via arrow 210. An extract request 220 to access data is generated by requestor 130 and transmitted to processing component 120. Processing component 120 receives the extract request 220 via communication component 125.
Retrieval engine 123 executes the received extract request 220 against data archive 112 according to the schema 113. Once the extract request 220 has been executed, the retrieval engine 123 assembles the results in the form of response 230 (also referred to as extract response 230). Once the response 230 has been generated, it is transmitted back to the requestor 130. FIG. 3 is a flow chart showing the processes executed by retrieval engine 123 in greater detail.
Once the response 230 has been generated, analysis engine 124 analyzes the response 230 and performs an update 240 to at least one of (a) the schema 113 associated with the archive 112 and (b) the structure of the data archive 112 itself, based on the response 230. FIG. 4 illustrates data flow processes associated with an analysis engine 124 in greater detail.
As shown in FIG. 3, retrieval engine 123 receives the request 220 at step 310. At step 320, the retrieval engine 123 applies the request 220 to the schema 113 to determine the extent to which the information sought in the request 220 is defined by fields or another structural organizational scheme within archive 112. In embodiments, the retrieval engine 123 can include data access control functions whereby credentials of requestor 130 are verified prior to allowing any access to the archive 112. The access control functions can include verification of the identity of the requesting entity 131, verification of a network address, authentication procedures (e.g., passwords, encryption schemes, certificates from an authority, etc.), role-based authentication/verification (e.g., a role within an organization, etc.), and other forms of access control.
In embodiments, the request 220 can be formatted to include all of the fields (or other structural categorization) sought in the data request. For example, the request 220 can include one or more extract request parameters in the format “field name=field value”. Thus, the retrieval engine 123 matches each of the field names in the extract request parameters with the field names defined by the schema 113 for the archive 112. Other extract request parameters can include data type, data size, length, etc. Extract request parameters can also be combinations of single parameters. For example, in the “field name=field value” example, the request 220 can also specify that the “field value” be of a certain data type (e.g., string, integer, etc.), have a certain maximum or minimum length, etc.
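As a minimal sketch, and assuming requests arrive as “field name=field value” strings, the matching of request parameters against schema-defined fields might look like the following; the function name and return shape are assumptions.

```python
# Illustrative sketch of step 320: separate extract request parameters into
# fields defined by the schema and fields unknown to the schema.
def split_request(request_params, schema_fields):
    """Partition request parameters into known and unknown fields."""
    known, unknown = {}, {}
    for param in request_params:            # e.g. ["gender=male", "city=Orange"]
        name, _, value = param.partition("=")
        if name in schema_fields:
            known[name] = value              # matched against schema-defined field names
        else:
            unknown[name] = value            # flagged for value matching at step 340
    return known, unknown
```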
In embodiments, the request 220 can be formatted as a natural language question, in which case the processing component 120 can include a semantic database to determine that the request 220 is focusing on a particular set of fields, and then compare the fields derived from the request 220 against the fields defined according to schema 113.
At step 330, the retrieval engine 123 executes the operations associated with carrying out the request 220 according to the field values of the defined fields of the schema 113, such as filtering the records of the archive 112 such that the output of the operation is those records matching the field values of the defined fields.
It should be noted that a request 220 can include a request for data whose field (in this example), data type, data format, data definition or other organizational/structural parameter is not defined or otherwise known in the data archive 112 according to the schema 113. For these unknown fields, the matching performed at step 320 with the known/defined fields as set forth in the schema 113 will fail to produce a match, and they can be flagged or otherwise identified by the retrieval engine 123 as unknown fields. At step 340, the retrieval engine 123 executes a matching of the “field value” of the extract request parameter against the archive 112 to determine whether the field value corresponds to any part of any record within the archive 112. The match can be a literal (i.e., exact) match or can be a proximity match (i.e., matching within a defined percentage of similarity).
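One way to picture the literal and proximity matching of step 340 is the sketch below; the use of difflib similarity ratios as the "defined percentage of similarity" is an assumption made for illustration.

```python
# Sketch of value matching for unknown fields: literal (exact) match first,
# then a proximity match against same-length windows of the record.
import difflib

def value_matches(field_value, record, min_similarity=0.9):
    """Return True if field_value appears in the record literally or approximately."""
    if field_value in record:                # literal (exact) match
        return True
    window = len(field_value)
    for i in range(max(1, len(record) - window + 1)):
        chunk = record[i:i + window]
        if difflib.SequenceMatcher(None, field_value, chunk).ratio() >= min_similarity:
            return True                      # proximity match within the similarity tolerance
    return False
```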
In embodiments, any matches can be analyzed to infer other characteristics of the matched data. For example, for every match of a field value, the retrieval engine 123 can determine the data type of the field value (e.g., that the match is a string, integer, etc.).
At step 350, the retrieval engine 123 performs the operations on the archive 112 (e.g., filtered or otherwise processed) according to these matches to return the data output used in generating the response 230.
As illustrated in FIG. 3, step 340 is executed after the processes of step 330 are executed. Thus, the matching of the field value of the extract request parameter of the unknown field is limited to those records returned from the filter processes performed with the defined fields of archive 112. However, in embodiments, step 340 can be executed prior to step 330, whereby the matching of the values of unknown fields can be performed against all of the data within the archive 112. Therefore, steps 330 and 350 of these embodiments are effectively combined.
At step 360, the response 230 is generated based on the output of the execution of the extract request, and provided back to requestor 130.
The response 230 can be considered to be a view of the archive 112 presented to the requesting entity 131 via the user interface 132. In embodiments, the response 230 can be a set of scalar expressions (e.g., scaling and compare expressions, etc.) that define the set of data in the archive (or a projection/subset thereof) that corresponds to the data requested by the requestor 130.
In embodiments, the set of scalar expressions can include clauses that describe the Projection, Function and Filter type. A Projection can be considered as a selection of a sub-set of data (such as a subset of columns of all available columns, if the archive is so structured). The Projection can also be used to modify an attribute's value. The Function can be a modification of attributes within the projection. Examples can include scalar mathematical functions such as addition and subtraction. The Filter can conditionally restrict tuples within the defined Projection. In embodiments, the Function can be applied to either a Filter or a Projection.
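A compact sketch of how Projection, Function and Filter clauses might compose is given below, with records represented as Python dictionaries; that representation, and the example clauses, are assumptions for illustration only.

```python
# Illustrative composition of Projection, Function and Filter over a set of tuples.
def apply_view(records, projection, function=None, filter_=None):
    """Build a view: restrict tuples, select attributes, optionally transform them."""
    view = []
    for tuple_ in records:
        if filter_ is not None and not filter_(tuple_):
            continue                                    # Filter: conditionally restrict tuples
        selected = {attr: tuple_[attr] for attr in projection if attr in tuple_}
        if function is not None:
            selected = function(selected)               # Function: scalar modification of attributes
        view.append(selected)                           # Projection: subset of attributes
    return view

# Example: project age (converted from years to months) for male records only.
# apply_view(rows, ["age"],
#            function=lambda t: {"age": t["age"] * 12},
#            filter_=lambda t: t.get("gender") == "male")
```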
The following are examples of a request 220 executed on data archives 112 having different levels of schema definition, according to the process described in FIG. 3. The request 220 in this example is seeking data associated with males aged 35-40 years old, living in the city of Orange, Calif. Thus, the request 220 is considered to include the following “field=value” parameters: ‘gender’=‘male’, ‘age’=‘35-40’, and ‘city’=‘Orange’.
Example 1
The extract request 220 is applied to a data archive 112 having established, known data fields for all of the extract request parameters (‘gender’=‘male’, ‘age’=‘35-40’, and ‘city’=‘Orange’) of the request 220. For each record in the archive 112, there are defined fields corresponding to “gender”, “age” and “city”. Correspondingly, the extract request 220 is formatted according to these known fields of archive 112. Thus, the retrieval engine 123 executes the extract request 220 and filters the data in the archive 112 according to the gender, age, and city fields. In this example, there are no “unknown” fields in the request 220, so steps 340 and 350 of FIG. 3 are not executed. This result is then used to generate response 230 at step 360.
Example 2
The same request 220 from Example 1 is applied to a “less established” archive 112 (i.e., the schema 113 is less established), where some, but not all, of the fields corresponding to the parameters in the request 220 are known/defined in the archive 112. In this example, the schema 113 includes defined “gender” and “age” fields, but does not have a defined “city” field for archive 112. Having determined the defined fields at step 320 and executed the functions according to those defined fields at step 330, the retrieval engine 123 executes step 340 and searches within the results of step 330 for the literal match “Orange” (in this example, a literal match is preferred because the city name will not have a plural or other conjugation). Once the matches are obtained, the processes of step 350 are executed and the response generated at step 360. In the embodiments whereby step 340 is executed prior to step 330 as described above, the retrieval engine 123 performs the match of “Orange” against the entire archive 112.
Example 3
In this example, the request 220 is applied to an even less “established” archive 112 (i.e., having an even less established schema 113), where none of the field names are known, such that none of the fields contained in the request 220 will match with corresponding fields of archive 112. In this example, the schema 113 can include other defined fields (but none that match the request 220's fields), or can have no defined fields of any kind. Thus, the only “knowns” are the record boundaries defined by the schema 113. In this case, the execution of step 320 will not return any defined fields. Thus, the retrieval engine 123 executes the matching of step 340 for the literal match of “male”, a literal integer match of “35-40” (and can include matches of each integer 35, 36, 37, 38, 39, 40), and the literal match of “Orange”.
Example 4
This example is similar to Example 3, but the record boundaries are also not “known”. As defined herein, a record boundary indicates a beginning and an end of each record (e.g., a row in a spreadsheet, etc.). In other words, the data archive only has one dimension (e.g., it is flat), consisting of a long single string of data. In this example, the retrieval engine 123 searches the entire archive 112 for matches of the field values in the request 220. Based on the matches, the retrieval engine 123 can infer record boundaries by performing pattern analysis on the matched results (e.g., periodicity of repeating matches, and the distance between the repeating matches, taking into account that not all field values will match in all records, etc.).
Having inferred the record boundaries, an offset can be determined for each match from what are inferred to be separate records, to account for possible different field value lengths among a same field type, class of literals (e.g., male, male, female, male), or data type (e.g., integer, floating point, etc.). For example, “male” may return matches that in fact are for “female”. However, because a match of “male” within “female” will have an offset of two, corresponding to the “f” and “e” characters, these results can be eliminated as false positives for the purposes of generating the response 230.
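The sketch below illustrates one way the boundary inference and false-positive elimination could proceed on a flat archive; using the smallest spacing between repeated matches as the record length, and a fixed in-record offset test, are heuristic assumptions rather than the required method.

```python
# Heuristic sketch of record-boundary inference over a flat, one-dimensional archive.
def infer_record_length(flat_data, anchor_value):
    """Estimate a record length from the distances between repeated matches."""
    positions, start = [], 0
    while (idx := flat_data.find(anchor_value, start)) != -1:
        positions.append(idx)
        start = idx + 1
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return min(gaps) if gaps else None       # smallest period observed between matches

def is_false_positive(match_pos, record_start, expected_offset):
    """Reject matches such as 'male' inside 'female' whose in-record offset is wrong."""
    return (match_pos - record_start) != expected_offset
```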
In embodiments, the record boundary inference can be executed by the retrieval engine 123 even if record boundaries/delimiters exist within archive 112, such that corrections and adjustments to previously determined record boundaries can be performed over time as the archive 112 is accessed by additional requestors and additional responses are provided thereto.
FIG. 4 illustrates data flow processes associated with an analysis engine 124, whereby the analysis engine 124 uses the response 230 to modify the schema 113 and/or the structure of the archive 112 itself. The functions and processes of FIG. 4 can be considered collectively to be the functions and processes available to the system to perform one or more updates 240 of FIG. 2.
At step 410, the analysis engine 124 accesses the generated response 230 and any additional corresponding data (if not included within the response 230 itself). The use of “accessing” the generated response 230 is intended to refer generally to the logical step in which the generated response 230 becomes available to the analysis engine 124 for the purposes of carrying out its associated functions. As such, the analysis engine 124 can also be considered to be “receiving” the generated response 230. For example, the generated response 230 can be accessed by the analysis engine 124 prior to transmission as a logical step following step 360, constituting a logical “hand-off” of the response 230 from the retrieval engine 123 to the analysis engine 124. In another example, a copy of the generated response 230 is generated by the retrieval engine 123 and provided to the analysis engine 124 such that the functions of the analysis engine 124 can be performed chronologically independent of the actual transmission of the response 230. In another example, in embodiments where the retrieval engine 123 and analysis engine 124 are part of a single engine, “accessing” can refer to the invocation of the functions associated with the analysis engine 124 and applied to the response 230 (or a copy thereof).
The response 230 generated at step 360 of FIG. 3 can include performance metrics associated with the retrieval of the requested data and the generation of the response 230. Performance metrics can include a time to complete the request, a resource load indication (e.g., processor usage, energy usage, memory usage, etc.), and other performance metrics. In embodiments, the performance metrics are a part of the response 230 that is provided to the requestor 130. In embodiments, the performance metrics are generated at step 360 along with, but separate from, the response 230. In these embodiments, the performance metrics can be provided to the analysis engine 124 along with the response 230 without also having to provide the performance metrics to the requestor 130.
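A minimal sketch of gathering such metrics alongside a response is shown below; the metric names and the use of Python's time and tracemalloc modules are illustrative assumptions.

```python
# Sketch: run an extract request and capture simple performance metrics with it.
import time
import tracemalloc

def execute_with_metrics(run_extract, request):
    """Return (response, performance metrics) for one extract request."""
    tracemalloc.start()
    started = time.perf_counter()
    response = run_extract(request)                   # e.g. a retrieval-engine callable
    elapsed = time.perf_counter() - started
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    metrics = {"seconds_to_complete": elapsed, "peak_memory_bytes": peak_bytes}
    return response, metrics
```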
At step 420, the analysis engine 124 modifies the archive 112 and schema 113 according to any new archive structure component (e.g., new record, field, data type, delimiter, etc.) reflected in the response 230. Step 420 can be executed according to steps 421-422.
At step 421, the analysis engine 124 applies any new record delimiters to the schema 113 based on the location of the record delimiters inferred by the retrieval engine 123 (such as in Example 4 above).
The modifications to schema 113 can include the location data within the archive 112 of the created record delimiters. Additional updates to the schema 113 can include a determination of the sizes of newly established records. Modifications to the archive 112 itself can include insertion of record delimiters or other record boundaries at the corresponding locations according to the updated schema 113, as well as modifying the newly established records for consistency with a desired record structure (e.g., inserting or removing spaces, lines, etc. to organize the records within the archive 112). If record delimiters already exist in archive 112, or do not require adjustment, the analysis engine 124 can skip step 421.
At step 422, the analysis engine 124 applies any new fields, data types, data definitions, or other intra-record structural parameters/definitions to the schema 113. This can include the determined locations within corresponding records and/or the overall archive 112 and any correlations with other structural parameters (e.g., a particular field name also has values of a particular data type). As with the record delimiters, intra-record structural delimiters can similarly be applied to the archive 112 itself.
Thus, in Example 2 discussed above, the schema 113 is updated to include a “city” field name, at the appropriate locations within the archive 112, which can be an established offset from the record boundaries for each record. This then aligns data in records as corresponding to a “city” field, even in records that did not contain the “Orange” match. Additionally, if the schema 113 has been updated to include a correlation of the “city” field as having values of a “string” or “text” type, the analysis engine 124 can analyze the non-matched records to verify that the same data type exists in those records, and the size of the records. The analysis engine 124 can then update rules associated with the expected (or permitted, maximum, minimum, etc.) size of field values for cities within the schema 113.
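Continuing the Example 2 scenario, a hypothetical schema update for the newly discovered "city" field might look like the following; the key names, the offset value, and the validation rule are assumptions for illustration.

```python
# Hypothetical sketch: register a newly discovered field in the schema at a
# fixed offset from the record boundary, marked provisional until confirmed.
def add_discovered_field(schema, name, offset, sample_values):
    """Record a new field definition derived from matched response data."""
    schema.setdefault("fields", {})[name] = {
        "offset": offset,                              # position relative to the record boundary
        "data_type": "string",
        "max_length": max(len(v) for v in sample_values),
        "provisional": True,                           # subject to confirmation by later requests
    }
    return schema

# add_discovered_field(schema, "city", offset=42, sample_values=["Orange", "Irvine"])
```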
In embodiments, the modifications to the archive 112 can be performed by the analysis engine 124 as soon as the modifications to the schema 113 are performed. In other embodiments, the modifications to the archive 112 are applied only when a subsequent request 220 from a requestor 130 (either the same requestor or a different one) is executed. Thus, the archive 112 itself is only modified with the delimiters and other modifications at run-time.
In addition to the schema modifications discussed above, modifications to the schema 113 can include updating the schema 113 to reflect observed characteristics of the archive 112 as a whole. Examples of these characteristics can include recognition of periodicity, decay, etc., such as in the example of the sensor data illustrated above.
Preferably, requestors can only perform extraction actions against the archive (i.e., request access to data and receive responses). While the archive and/or the schema can be modified based on the response provided to the requestor, a requestor cannot directly modify the archive data itself. However, to allow requestors to narrow or filter data presented via a response, requestors can mark (e.g., via annotations, flags, etc.) data at any granularity and maintain those markings for any length of time. The markings can be used by the system to keep track of data not deemed relevant in the temporal semantic view of the archive as seen by the requestor. In embodiments, the markings can be reintroduced into the system by interpreting them as new requests, which can be constructed onto the extracted response or as a new request combining the prior request and the “marking” request.
In embodiments, certain modifications to the archive 112 can be labeled as “provisional” modifications within schema 113, such that they are considered preliminary or subject to further modification. Provisional modifications can include modifications that have been recently created (e.g., within a certain number of requests processed against the archive 112). After provisional modifications have withstood a pre-determined number of requests and responses processed against the archive 112, those modifications can be made permanent (i.e., serving to confirm that the fields, data definitions, types, record boundaries, etc. are valid).
At step 430, the analysis engine 124 analyzes the performance metrics of response 230 against the performance metric thresholds of the schema 113. If the performance metrics of the response 230 exceed or otherwise fall outside of desirable or acceptable thresholds as indicated by schema 113, the analysis engine 124 can, at step 431, execute changes to the data archive 112 to reduce the computational cost or load of executing requests on similar data in the future. For example, the analysis engine 124 can move records within the archive 112 such that the records are in a location within the archive 112 that is more quickly accessed during the execution of a retrieval process. In another example, the analysis engine 124 can modify the schema 113 such that the filter order among several fields is optimized.
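As a sketch of the step 430/431 check, the comparison and one possible schema change (re-ordering filters so the most selective field is applied first) might look like the code below; the metric and threshold keys and the selectivity heuristic are assumptions.

```python
# Sketch: compare response metrics to schema thresholds and, if exceeded,
# reorder the filter order so the most selective field is filtered first.
def maybe_reoptimize(metrics, thresholds, schema):
    """Adjust the schema's filter order when a response fell outside thresholds."""
    if metrics["seconds_to_complete"] <= thresholds["max_seconds"]:
        return schema                                  # within tolerance; leave the schema alone
    selectivity = schema.get("selectivity", {})        # field name -> fraction of records matched
    schema["filter_order"] = sorted(selectivity, key=selectivity.get)
    return schema
```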
In embodiments, the request 220 can be re-executed by the retrieval engine 123 (at step 432) after the changes of step 431 are implemented, and the new performance metrics compared against the performance metrics of the response 230 to verify that the performance of executing the extract request has improved.
At step 440, the analysis engine 124 can compress sections of data within the archive 112. For example, sections of data that remain unknown (either within records or between records) can be compressed. In another example, defined sections of data (of a particular field type, data type, data definition, etc.) can be compressed to take advantage of commonalities and redundancies.
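A minimal sketch of compressing one still-unknown span of the archive is given below, using zlib; treating unknown spans as independent byte ranges is an assumption made for illustration.

```python
# Sketch: compress an unclassified byte range of the archive and keep its bounds
# so the original bytes can be restored (the compression is lossless/reversible).
import zlib

def compress_unknown_span(archive_bytes, start, end, level=9):
    """Compress one unclassified byte range and return it with its original bounds."""
    raw = archive_bytes[start:end]
    return {"start": start, "end": end, "compressed": zlib.compress(raw, level)}
```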
In FIG. 4, the processes 420-422, 430-432 and 440 executed by analysis engine 124 are shown in parallel to indicate that they can be executed concurrently by the analysis engine 124. However, it is contemplated that the processes can be executed in series in the numerical order of the elements described in FIG. 4, or in other sequential orders. Additionally, the extent to which the archive 112 and/or the schema 113 can be modified via the processes shown in FIG. 4 can be governed by priority rules. For example, a reorganization of a record within the archive 112, such as in step 431, may be limited or outright rejected if superseded by a higher-priority rule (such as when the movement of the record pushes other records “down” that are frequently requested by users and thus must be maintained at the “top” of the accessibility list).
It is contemplated that the modifications to the archive 112 and schema 113 described in FIG. 4 can also be enacted based on a collection of historical responses generated in response to historical extract requests submitted by one or more requestors in the past. Additionally, the response 230 can be added to the collection of historical responses. Metadata of schema 113 can include statistical data and analysis, including request patterns (e.g., from which requestors, how frequently, which data has been accessed, etc.). As such, the analysis engine 124 can establish and update data access priorities for certain sections of data within the schema 113. Historical responses, requests, and other data can be stored in a response repository, which can be memory 122 of component 120, the same storage as the archive 112, or another, separate non-transitory computer-readable medium.
In embodiments, the analysis engine 124 can also modify the archive 112 and/or the schema 113 based in part on the received request 220. Similarly, where historical responses are used to shape the archive 112 and schema 113, historical requests can be used as well.
In embodiments, system 100 will be a part of a larger ecosystem having other, similar systems with corresponding archives generated based on corresponding data sources. In these embodiments, the system 100 can also include a publication module that can publish the existence of the archive 112 to other systems in the ecosystem. It is further contemplated that the schema 113, metadata within the schema 113, and other characteristics of the archive 112 can be published. Similarly, the response repository (which may or may not include a collection of requests) can be published via the publication module.
In embodiments, the response 230 can be a data stream. In these embodiments, the steps of FIG. 4 can be applied as the data stream is occurring, based on data being transferred as needed during the stream. For example, the analysis engine 124 can modify the archive 112 and/or schema 113 based on the response cost during the data stream. For data streams, the modifications to the archive 112 and/or schema 113 can be transitory for the duration of the stream so as to provide the immediate benefit of the modifications. In embodiments, these modifications can also be made permanent as described herein.
It should be apparent to those skilled in the art that many more modifications besides those already described are possible without departing from the inventive concepts herein. The inventive subject matter, therefore, is not to be restricted except in the spirit of the appended claims. Moreover, in interpreting both the specification and the claims, all terms should be interpreted in the broadest possible manner consistent with the context. In particular, the terms “comprises” and “comprising” should be interpreted as referring to elements, components, or steps in a non-exclusive manner, indicating that the referenced elements, components, or steps may be present, or utilized, or combined with other elements, components, or steps that are not expressly referenced. Where the specification or claims refer to at least one of something selected from the group consisting of A, B, C . . . and N, the text should be interpreted as requiring only one element from the group, not A plus N, or B plus N, etc.