This application is a continuation of U.S. patent application Ser. No. 16/832,041, filed Mar. 27, 2020, now U.S. Pat. No. 11,500,895, which is herein incorporated by reference in its entirety.
The present disclosure relates generally to data pipelines for transferring batch and streaming data via communications networks, and more particularly to methods, computer-readable media, and apparatuses for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least a second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination.
BACKGROUND

A data pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next. The elements of a data pipeline may operate in parallel or in a time-sliced fashion. In addition, some amount of buffer storage may be provided between elements. One subset of data pipelines includes extract, transform, and load (ETL) systems, which extract data from a data source, transform the data, and load the data into a database or data warehouse. ETL pipelines may run in batches, meaning that the data is moved in one large chunk at a specific time to the target, e.g., in regular scheduled intervals. "Data pipeline" is a broader term that refers to a system for moving data from one or more sources to one or more targets in a computing network environment. The data may or may not be transformed, and it may be processed in real time (or streaming) instead of in batches. When the data is streamed, it may be processed in a continuous flow, which is useful for data that is continuously updating, such as data from a traffic monitoring sensor. In addition, the data may be transferred to any number of targets, which may include databases or data warehouses, as well as any number of automated systems, operator/user terminals, and so forth.
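The series arrangement described above, in which each element's output is the next element's input, can be sketched as a chain of stage functions. This is a minimal illustrative example only; the ETL-style stage names and sample data are assumptions, not drawn from the disclosure:

```python
# A minimal sketch of a series data pipeline: each stage's output is the
# next stage's input. The ETL-style stages and sample data are illustrative.

def extract(source):
    # Read raw records from a source (here, an in-memory list).
    return list(source)

def transform(records):
    # Normalize each record, e.g., trim whitespace and uppercase.
    return [r.strip().upper() for r in records]

def run_pipeline(source, stages):
    # Connect the stages in series: the output of one is the input of the next.
    data = source
    for stage in stages:
        data = stage(data)
    return data
```

For example, `run_pipeline([" speed=42 "], [extract, transform])` yields `["SPEED=42"]`; a "load" stage delivering to a database or data warehouse could be appended to the stage list in the same way.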
SUMMARY

Methods, computer-readable media, and apparatuses for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least a second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination are described. For example, a processing system including at least one processor may obtain a first request for a delivery of a first data set to at least a first destination, map the first request to a first information model of a plurality of information models, obtain a second request for a delivery of a second data set to at least a second destination, map the second request to a second information model of the plurality of information models, and identify that at least a portion of data is a part of both the first data set and the second data set. The processing system may next determine a plan for configuring data pipeline components for delivering the first data set to the at least the first destination and for delivering the second data set to the at least the second destination, where the plan comprises a combination of the first information model and the second information model, and where the plan comprises at least one modification to the combination of the first information model and the second information model. The processing system may then configure the data pipeline components for delivering the first data set to the at least the first destination and for delivering the second data set to the at least the second destination in accordance with the plan.
BRIEF DESCRIPTION OF THE DRAWINGS

The teachings of the present disclosure can be readily understood by considering the following detailed description in conjunction with the accompanying drawings, in which:
FIG. 1 illustrates an example network related to the present disclosure;
FIG. 2 illustrates a flowchart of an example method for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types;
FIG. 3 illustrates a high-level block diagram of a computing device specifically programmed to perform the steps, functions, blocks, and/or operations described herein;
FIG. 4 illustrates example scenarios of data blending for different data pipelines, in accordance with the present disclosure; and
FIG. 5 illustrates a flowchart of an example method for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least a second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures.
DETAILED DESCRIPTION

Examples of the present disclosure include a system for data pipeline configuration and management, which may be referred to as a data pipeline controller, or a Data Pipeline Intelligent Controller (DPIC). A data pipeline controller may control all the elements of a data pipeline to enable the data pipeline to create a suitable response to satisfy a client request. The functions, or modules, of a data pipeline controller may include, but are not limited to: schedulers, request interpreters, various artificial intelligence/machine learning modules, policy functions, security and privacy enforcement modules, assurance functions, negotiation functions, orchestrators, databases, an abstract symbol manipulator module, a model data schema generator/updater, and so forth.
In one example, a data pipeline controller of the present disclosure may create new schemas to handle new source data retrievals and/or to integrate new data pipeline component types, and may assemble and tear down data pipelines in real time. In one example, a data pipeline controller is flexibly expandable via add-ons, plug-ins, helper applications, and the like. When a client, such as a data scientist, a network operator, or the like, seeks to obtain specified data sets from multiple sources, e.g., to provide to one or more machine learning models as target(s), the client may provide the request by specifying the desired data and the desired target(s), and the data pipeline controller may automatically generate an end-to-end plan to obtain and transmit the right data from the right source(s) to the right target(s). Thus, the present disclosure provides for intelligent control of data pipelines via a data pipeline controller that automatically integrates and directs data pipeline components at a higher level of abstraction. Data pipelines may be constructed dynamically and on an as-needed basis, such that even complex or demanding client requests may be fulfilled without (or with minimal) human interaction, and without component-specific human expertise regarding the various data pipeline components.
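The planning step described above, in which a client specifies only the desired data and target(s) and the controller resolves the right source(s), might be sketched as follows. The catalog contents, field names, and source identifiers are assumptions for illustration; in practice such a catalog would be populated by the controller's discovery modules:

```python
# Hypothetical catalog mapping data set names to known sources.
CATALOG = {
    "traffic_counts": "sensor_feed_01",
    "channel_quality": "ran_collector_07",
}

def build_plan(request, catalog=CATALOG):
    # Resolve each requested data set to a source and pair it with the
    # requested target(s), yielding an end-to-end delivery plan.
    plan = []
    for name in request["data_sets"]:
        if name not in catalog:
            raise KeyError(f"no known source for {name!r}")
        plan.append({"data": name,
                     "source": catalog[name],
                     "targets": request["targets"]})
    return plan
```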
In many cases, a data pipeline or its associated support functions exist, but the data pipeline itself may be inactive. In other cases, a data pipeline may not be physically or virtually established, but all of the support functions are available in the cloud. In response to a request for data transfer, examples of the present disclosure may activate an inactive data pipeline or may form a new data pipeline in real time. Examples of the present disclosure may further include features for security, access, authentication, and authorization (AAA) (for instance, a requestor may not have the right to a data set; the present disclosure may take on the role of gaining rights for protected data set(s)), accounting services, proxy creation, protocol setting, payment settlement, and so on.
In one example, a data pipeline component discovery module of a data pipeline controller continuously discovers new or changed conditions in a data pipeline infrastructure. In one example, the data pipeline controller may determine how to fulfill data requests with alternative mechanisms. For instance, the data pipeline controller may determine if intermediate nodes or data stores could be established to improve efficiency or other performance/quality aspects. In one example, a result of a request may be stored as a copy in a source node, in a specified intermediate node, or at one or more target nodes, such that the result may be reused for one or more subsequent requests. The purpose is not to replace the data pipeline's native data fulfillment functions, but rather to assist, suggest, or command how the data pipeline handles its fulfillment aspects.
Examples of the present disclosure also ensure that data is well understood. For instance, data sources may be indexed, and a requestor may learn upfront what data is available. In accordance with the present disclosure, a data pipeline may be dynamically established and subsequently torn down. Thus, a data pipeline may not always be a persistent entity. In one example, a data pipeline controller of the present disclosure is aware of each data pipeline that is in existence, and knows each data pipeline's history. In addition, in one example, if a request cannot be automatically satisfied, the data pipeline controller may provide a meaningful explanation of the gaps, which may allow data scientists working offline to improve tools/modules at the data pipeline level.
A data pipeline controller of the present disclosure and/or various modules thereof may be configured for several use patterns, e.g., including but not limited to: inquiry/browsing, request template/specification and analysis/planning, data source/data pipeline indexing, notification, and request and fulfillment. Interactions of the data pipeline controller with other entities in these patterns may be via any appropriate means, such as: direct or indirect communications; forwarded, routed, or switched communications; application programming interfaces (APIs); bus messages; subscribe-publish events; etc.
Inquiry/browsing—This pattern may be used to verify whether a data pipeline controller can arrange the fulfillment of an inquiry. For example, a requestor may browse a data pipeline catalog to select particular data or data set(s), and may send an inquiry to the data pipeline controller, which may then determine and respond with availability (and potentially commitments, reservations, verifications, etc.) along with associated information related to the data/data set(s) that is/are identified in the inquiry, such as: estimated freshness, latency, quality, etc.
Request template/specification and analysis/planning—A requestor may send an actual request to the data pipeline controller for simulated processing, such as a particular template/specification of desired data or data set(s). The data pipeline controller may command and coordinate with data pipeline components to perform analysis, search, planning of functional steps, and so forth, in order to provide informative responses. For example, in some cases the data pipeline controller may return one or more of three potential responses: (1) requesting that more information be provided, (2) indicating that special authorization may be needed, and (3) providing example(s) of a full data/data set response (if possible) or a partial data/data set response (e.g., if the requested data/data set(s) is/are large, if "1" or "2" also apply, etc.).
In one example, information models may have associated request templates which may be predefined (e.g., by a creator/administrator of an information model) and/or which may be learned over time as requests are matched to different information models, as feedback on the quality and correctness of the matching is provided by client request submitters, and so forth. In one example, multiple request templates may be stored and maintained in association with an information model. For instance, the same information model may be matched to different requests, which may all relate to a same general type of data delivery, but with somewhat different specifics, such as one or more different data sources, one or more different targets, with or without an intermediate storage node, etc.
It should be noted that information models and associated request templates may have more or less detail, and more or less fixed and/or configurable parameters depending upon the preferences of a system operator, a creator of an information model, etc. For instance, in one example, an information model and/or an associated request template may be for obtaining specific data from specific data sources and delivering to selected targets. In other words, the data and data sources may be fixed and are not further selectable (at least with this particular information model). However, another information model may be for obtaining selectable data from selectable data sources within a specific area for delivery to selectable targets. In other words, the location or region may be fixed, while the data and the data sources are not fixed and can be selected (e.g., via a request that is crafted in accordance with an associated request template and/or via a custom crafted request that is mapped to the information model). In one example, the request template/specification and analysis/planning use pattern may include providing access to a catalog of request templates from which a client may select a template for use (e.g., for simulated or actual fulfillment of a request).
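The distinction between fixed and selectable parameters of an information model might be expressed as in the following sketch. The model with a fixed region and selectable data, sources, and targets mirrors the example above; the field names and the `instantiate` helper are illustrative assumptions:

```python
# Hypothetical information model: the region is fixed, while the data,
# sources, and targets are selectable per request.
MODEL = {
    "fixed": {"region": "northeast"},
    "selectable": {"data", "sources", "targets"},
}

def instantiate(model, request):
    # Start from the fixed fields, then apply the request's selectable fields.
    # A request that contradicts a fixed field cannot use this model.
    spec = dict(model["fixed"])
    for key, value in request.items():
        if key in model["fixed"] and model["fixed"][key] != value:
            raise ValueError(f"{key} is fixed by this information model")
        if key in model["selectable"]:
            spec[key] = value
    return spec
```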
Data source/data pipeline indexing—The data pipeline controller may add new data sources (or even full data pipelines) to a catalog of data pipeline infrastructure components.
Notification—The data pipeline controller may notify requestors and/or subscribers of new data pipeline components or data pipeline component types. For instance, when a new data pipeline component or data pipeline component type is discovered, the data pipeline controller may notify previous requestors and/or publish/post notifications to those who previously subscribed to the notification messages (e.g., of the particular scope of the new findings).
Request and fulfillment—Stored data set(s) or stream data may be obtained by a requestor or an automated system sending a request or trigger to the data pipeline controller. The request/trigger may be simple in some cases, but may be expected to include (directly or by reference) detailed specification information such that the appropriate data or data set(s) can be identified, prepared, and provided. In one example, the data pipeline controller may first check if the same or a similar request has recently gone through the request template/specification and analysis/planning pattern (e.g., as outlined above), and if so, some portions of the fulfillment process may be omitted for the sake of efficiency (e.g., if various safety/quality assurance criteria are met). For instance, the request specification may be sent to data sources and resulting data may be joined in appropriate node(s) in order to avoid unnecessary work, with final data/data set(s) then being delivered to the requestor.
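The efficiency check described above, where fulfillment reuses a recent pass through the request template/specification and analysis/planning pattern, might look like the following sketch. The freshness window stands in for the safety/quality assurance criteria and is an assumed parameter:

```python
import time

_recent_plans = {}
ANALYSIS_TTL = 300  # seconds; an assumed freshness/safety window

def get_plan(spec_key, analyze, now=None):
    # Reuse a recent analysis of the same request specification if one
    # exists and is fresh enough; otherwise run the full analysis/planning
    # pass and cache the result for subsequent fulfillment.
    now = time.time() if now is None else now
    cached = _recent_plans.get(spec_key)
    if cached is not None and now - cached["at"] < ANALYSIS_TTL:
        return cached["plan"]
    plan = analyze(spec_key)
    _recent_plans[spec_key] = {"plan": plan, "at": now}
    return plan
```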
Feedback—This pattern enables a requestor to provide feedback to the data pipeline controller regarding its automated actions. For instance, a data requestor may provide data usage/quality feedback to the data pipeline controller, which can then use the feedback to fine-tune various relevant data manipulation processes.
Discovery—This pattern enables the data pipeline controller to discover the functionalities of data pipeline functions. The discovery pattern may include two aspects: (1) proactive discovery, in which a pre-specified model (e.g., an information model) may be provided to the data pipeline controller; based on scheduling and the information model specification, the data pipeline controller may proactively discover newly formed data pipeline components (and/or data pipeline component types) or may discover updates to data pipeline components (and/or data pipeline component types) that may have been modified; and (2) reactive discovery, in which each data pipeline component, once instantiated or modified, may notify the data pipeline controller of its existence. In some cases, where the data pipeline controller engages in a proactive discovery role, the data pipeline controller may follow what is defined in an information model and may verify the existence of underlying data pipeline components (e.g., one or more instances of data pipeline component types which is/are identified in the information model). An information model may also be leveraged in a "reactive" mode. In this case, data pipeline components may notify the data pipeline controller of the components' whereabouts and details.
In addition, in one example, when the data pipeline controller becomes aware of a new data source or other data pipeline components (or a new data source type and/or a new data pipeline component type), the data pipeline controller may attempt to derive a default data schema (and for a new data source, to also profile the data). The data schema may be in terms of the symbols that the data pipeline controller is made aware of (e.g., from a provided ontology). A system operator may also validate or correct the automatically-generated data schema. Additionally, the data pipeline controller may validate fresh batches of data from a data source against a previously defined data schema, and any differences in the statistical profile of the new batch versus previous batches may be noted.
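The schema derivation and batch validation steps described above might be sketched as follows. The type-inference rule (taking a sample record's field types as the default schema) and the drift tolerance are illustrative assumptions:

```python
from statistics import mean

def derive_schema(records):
    # Derive a default field -> type-name mapping from a sample record.
    return {field: type(value).__name__ for field, value in records[0].items()}

def validate_batch(schema, records):
    # Validate that each record in a fresh batch matches the previously
    # defined schema (same fields, same types).
    return all(
        set(record) == set(schema)
        and all(type(v).__name__ == schema[f] for f, v in record.items())
        for record in records
    )

def drifted(old_values, new_values, tolerance=0.5):
    # Note a statistical difference between batches when the mean shifts
    # by more than the tolerance ratio.
    old_mean = mean(old_values)
    return abs(mean(new_values) - old_mean) > tolerance * abs(old_mean)
```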
Thus, examples of the present disclosure provide a framework for a data pipeline controller that supports both data request and data fulfillment. Users no longer need to know the details of how to acquire or reformat the data sets. This is handled by the data pipeline controller configuring the data pipeline instances. The data pipeline controller comprises various modules which collectively function to decompose a single data request into sub-parts. In one example, a data pipeline controller of the present disclosure may dynamically decide alternative ways to obtain the requested data set(s) when one or more data sources are not available. Based on a request, a data pipeline controller may dynamically command a data pipeline to create intermediate nodes which can, for example, act as temporary staging points to optimally accomplish sharing/reuse for performance gains. In addition, a data pipeline controller of the present disclosure may generate data schema(s) for new types of data sources and/or data pipeline components (e.g., when data schemas are not provided with these new components).
In one example, a data pipeline controller architecture may include three high level subsystems: support modules, which discover components within a data pipeline environment, including modules, technologies, collectors, filters, etc.; management & assembly modules, which can read ontologies and use newly discovered information regarding data pipelines and/or data pipeline components to refine information model(s) in real-time (the information model contains a specification indicating how an existing pipeline structure can be enhanced or how a new pipeline structure can be established); and request fulfillment modules, which fulfill client data requests and hide execution details from the clients (request fulfillment modules will use the predefined information models to dynamically service each data request). In accordance with the present disclosure, a data blending module may be included in a blending layer that is added in the middle of the three abovementioned subsystems. The functions provided by this blending layer may be used by all three subsystems as described below. In another example, the data blending module may be included in the support modules or may be in communication with other component modules of a data pipeline controller via a high-speed bus.
In accordance with the present disclosure, a data blending module may establish a plan for blending data on behalf of a single request or multiple requests from multiple clients (e.g., for potential reuse and data processing efficiency). For example, when multiple data delivery requests are received, the data blending module may be invoked to provide an analysis to determine if duplicated datasets are involved. The data blending module may also check if all or a portion of a requested dataset or datasets is available in one or more of the intermediate nodes. The data blending module may also check if the data source(s) have been updated (e.g., while the intermediate node(s) may not have been synced/updated). In one example, the data blending module may determine a most efficient pipeline configuration to deliver data from the source(s) to the target(s) (or at least a more efficient pipeline configuration as compared to independently establishing data pipelines for each request according to respective information models associated with each request), e.g., with a least number of intermediate processing steps (or at least a reduced number of intermediate processing steps). In some cases, the data blending module may determine to reuse a dataset residing at a target to be redirected to a new target. In one example, the data blending module may create two data pipelines (or two separate paths in a single pipeline) to allow source data to be delivered to a target in multiple paths. In one example, when two paths for the same dataset are routed to two different intermediate nodes, real data movement occurs from the source, which may avoid making duplications from a single intermediate node. In one example, this enables a "true" redundant copy to be created in a regular session.
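The duplicated-dataset analysis performed when multiple delivery requests arrive might be sketched as follows. The request field names are illustrative assumptions; datasets requested by more than one client are candidates for staging once at a shared intermediate node:

```python
def find_overlap(requests):
    # Collect which clients requested each data set, then keep only the
    # data sets requested more than once; these are the duplicated datasets
    # that a blending plan could stage at a shared intermediate node.
    seen = {}
    for request in requests:
        for data_set in request["data_sets"]:
            seen.setdefault(data_set, []).append(request["client"])
    return {ds: clients for ds, clients in seen.items() if len(clients) > 1}
```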
In one example, the data blending module is capable of and is configured to override any existing data pipeline functions. From an external perspective, the existing data pipeline functions may appear to be unaltered, since the ultimate results are the same. However, internally, some of the data delivery paths (e.g., data pipelines) may be rearranged. This may also help to deter any attacks on the data pipeline infrastructure. For instance, an attacker may not know the real path(s) of the data pipeline(s). In addition, in one example, a data blending module of the present disclosure is capable of handling stored datasets as well as streaming datasets, or data streams.
In one example, a data blending module supports "full blending," "partial blending," or "any level of blending via policy enablement." For example, when a request is received, the data blending module may determine that a full blending may violate some authentication rules or data access rules. As long as a subset of a full dataset is allowed to be retrieved, the data blending module may establish a partial blending to facilitate the creation of an intermediate dataset, and then, from the intermediate dataset, the filtering-out of appropriate data attributes to satisfy individual data request(s). In one example, the data blending module assumes a "blending as a default" setting, unless otherwise specified via a "class setting." For instance, data requests may be classified from 1 to 10, or as otherwise specified by a provider of the data pipeline environment. In one example, classes 1-3 may be reserved for special requests (e.g., data requests for state or federal governmental entities, corporate data requests via public cloud, etc.). For these classes, data blending may be provided when specifically authorized by a requesting client or when authorized by an organization responsible for the requesting client. In one example, when data blending is not set as a default, "reverse hooks" may be specified in an information model or controlled by "policies." On the other hand, classes 4-10 may allow data blending automatically, or by default.
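The class-based policy above might be expressed as a simple check. The split between classes 1-3 and 4-10 follows the example in the text; the function itself and its name are illustrative assumptions:

```python
def blending_allowed(request_class, authorized=False):
    # Classes 1-3 are reserved for special requests and require explicit
    # authorization from the requesting client or a responsible
    # organization; classes 4-10 allow data blending by default.
    if not 1 <= request_class <= 10:
        raise ValueError("request class must be between 1 and 10")
    if request_class <= 3:
        return authorized
    return True
```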
Thus, based on policies and information modeling, a data blending module may dynamically configure intermediate nodes to perform a variety of data operations. The data blending module may provide any or all of the following functions: the data blending module may collectively process multiple requests for data delivery to determine optimal or more efficient delivery paths for each data request in a secure and reliable way; the data blending module may follow policies and information models (tuned periodically by machine learning modules) to establish blended pipelines or paths; the data blending module may dissect dataset(s) into appropriate chunks to facilitate reuse; and the data blending module may blend datasets and store the datasets in various intermediate nodes for reliability, redundancy, and/or reuse. In one example, data blending may be offered as a service. For example, a third party provider may offer an umbrella of alternative blending capabilities that can be subscribed to by clients. This also enables an open architecture for a traditionally closed data pipeline environment. These and other aspects of the present disclosure are described in greater detail below in connection with the examples of FIGS. 1-5.
To further aid in understanding the present disclosure, FIG. 1 illustrates an example system 100 in which examples of the present disclosure for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types and/or for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least a second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination may operate. The system 100 may include any one or more types of communication networks related to the current disclosure, such as a traditional circuit switched network (e.g., a public switched telephone network (PSTN)) or a packet network such as an Internet Protocol (IP) network (e.g., an IP Multimedia Subsystem (IMS) network), an asynchronous transfer mode (ATM) network, a wireless network, a cellular network (e.g., 2G, 3G, 4G, 5G, and the like), a long term evolution (LTE) network, and the like. It should be noted that an IP network is broadly defined as a network that uses Internet Protocol to exchange data packets. Additional example IP networks include Voice over IP (VoIP) networks, Service over IP (SoIP) networks, and the like.
In one example, the system 100 may comprise a telecommunication network 101. Telecommunication network 101 may combine core network components of a cellular network with components of a triple play service network, where triple-play services include telephone services, Internet services, and television services to subscribers. For example, telecommunication network 101 may functionally comprise a fixed mobile convergence (FMC) network, e.g., an IP Multimedia Subsystem (IMS) network. In addition, telecommunication network 101 may functionally comprise a telephony network, e.g., an Internet Protocol/Multi-Protocol Label Switching (IP/MPLS) backbone network utilizing Session Initiation Protocol (SIP) for circuit-switched and Voice over Internet Protocol (VoIP) telephony services. Telecommunication network 101 may further comprise a broadcast television network, e.g., a traditional cable provider network or an Internet Protocol Television (IPTV) network, as well as an Internet Service Provider (ISP) network. In one example, telecommunication network 101 may include a plurality of television (TV) servers (e.g., a broadcast server, a cable head-end), a plurality of content servers, an advertising server (AS), an interactive TV/video on demand (VoD) server, and so forth. For ease of illustration, various additional elements of telecommunication network 101 are omitted from FIG. 1.
The telecommunication network 101 may be in communication with data pipeline infrastructure 120 and the Internet in general (not shown). In one example, the data pipeline infrastructure 120 may comprise "public" cloud or "private" cloud infrastructure. For instance, all or a portion of the data pipeline infrastructure 120 may be controlled by a same entity as telecommunication network 101. In such an example, the data pipeline infrastructure 120 may be considered part of the telecommunication network 101. Alternatively, or in addition, all or a portion of the data pipeline infrastructure 120 may be controlled by and/or operated by another entity providing cloud computing services to clients/subscribers. The data pipeline infrastructure 120 may include a plurality of data pipeline components 127, such as adapters, collectors, intermediate nodes, forwarders, data stores, and so forth. The data pipeline infrastructure 120 may comprise servers/host devices (e.g., computing resources comprising processors, e.g., central processing units (CPUs), graphics processing units (GPUs), programmable logic devices (PLDs), such as field programmable gate arrays (FPGAs), or the like, memory, storage, and so forth), which may provide virtualization platforms for managing one or more virtual machines (VMs), containers, microservices, or the like. For instance, in such case the data pipeline components 127 may comprise virtual machines, containers, microservices, or the like, which may provide the various functions of data pipeline components, such as a collector, an adapter, a forwarder, etc. In one example, the data pipeline components 127 may also include dedicated hardware devices, e.g., one or more servers that may comprise one or more adapters, collectors, intermediate nodes, etc., and which may be configured to operate in various data pipelines (but which may not be readily adaptable to provide a different type of service).
In one example, the data pipeline components may each comprise a computing system or server, such as computing system 300 depicted in FIG. 3, and may be configured to provide one or more operations or functions in connection with examples of the present disclosure for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types and/or for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least a second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination, as described herein.
In one example, the data pipeline infrastructure 120 may also include one or more data sources 125 and one or more targets 129. However, in another example, these devices or systems may be considered to be outside the data pipeline infrastructure 120. The data sources 125 may include network devices, e.g., routers, switches, multiplexers, firewalls, traffic shaping devices or systems, base stations, remote radio heads, baseband units, gateways, and so forth. The data from the data sources 125 may therefore comprise various types of network operational data, such as: channel quality information, a number of endpoint devices served by a base station, records and/or alerts regarding network anomaly detections, throughput information, link connectivity information, port utilization metrics, and so on. In one example, the data sources 125 may alternatively or additionally comprise sensor devices, e.g., temperature sensors, humidity sensors, wind speed sensors, magnetometers, pressure sensors, etc. Thus, the data from data sources 125 may comprise measurements of temperature, humidity, wind speed, pressure, magnetic field strength and/or direction, and so forth. In still another example, the data sources 125 may alternatively or additionally include digital still and/or video cameras, photograph and/or video repositories, medical imaging repositories, financial data storage systems, medical records storage systems, and so forth. Accordingly, the data that is available from data sources 125 may alternatively or additionally include images, videos, documents, and so forth. It should be noted that data from various data sources 125 may be filtered and transformed to achieve one or more data sets and/or subsets of data that can be common across a set of data pipelines and data pipeline instances. In one example, the targets 129 may comprise various devices and/or processing systems, which may include various machine learning (ML) modules hosting one or more machine learning models (MLMs).
For instance, a first one of the targets 129 may comprise a MLM to process image data and may be trained to recognize images of different animals, a second one of the targets 129 may comprise a MLM to process financial data and may be trained to recognize and alert for unusual account activity, and so forth. Targets 129 may also include user endpoint devices, storage devices, and so forth.
As further illustrated in FIG. 1, telecommunication network 101 may include a data pipeline controller 110. In one example, the data pipeline controller 110 may comprise a computing system or server, such as computing system 300 depicted in FIG. 3, and may be configured to provide one or more operations or functions for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types and/or for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least the second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination, as described herein. For instance, a flowchart of an example method 200 for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types is illustrated in FIG. 2 and described in greater detail below. In addition, a flowchart of an example method 500 for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least the second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination is illustrated in FIG. 5 and described in greater detail below.
It should be noted that as used herein, the terms “configure,” and “reconfigure” may refer to programming or loading a processing system with computer-readable/computer-executable instructions, code, and/or programs, e.g., in a distributed or non-distributed memory, which when executed by a processor, or processors, of the processing system within a same device or within distributed devices, may cause the processing system to perform various functions. Such terms may also encompass providing variables, data values, tables, objects, or other data structures or the like which may cause a processing system executing computer-readable instructions, code, and/or programs to function differently depending upon the values of the variables or other data structures that are provided. As referred to herein, a “processing system” may comprise a computing device including one or more processors, or cores (e.g., as illustrated in FIG. 3 and discussed below), or multiple computing devices collectively configured to perform various steps, functions, and/or operations in accordance with the present disclosure.
In one example, the data pipeline controller 110 may include a plurality of modules 111-119 which provide for particular functions of the data pipeline controller 110. For instance, each component/module 111-119 may comprise respective code, executable images, etc., that can be loaded into memory and executed by one or more processors to collectively comprise an operational data pipeline controller 110.
As noted above, each of the data pipeline components 127 may have a data pipeline component type, such as an adapter, collector, forwarder, etc. In one example, for each data pipeline component type, the data pipeline controller 110 may store a respective data schema in the ontology and data schema repository 115. A data schema for a data pipeline component type establishes how a function of a data pipeline component (of the data pipeline component type) is performed at runtime. It includes relationships among data attributes along with a mini-flow (or micro-level flow sequence). In addition, for each data pipeline component type, the ontology and data schema repository 115 may also store a respective ontology for the data pipeline component type. An ontology defines what an instance of the data pipeline component type is (e.g., Vendor 3 Adapter 6 Version 2) and what its functions are, but does not define how the functions are used; that is provided by the data schema. It should also be noted that insofar as the data sources 125 and targets 129 may comprise part of a data pipeline, these devices or systems may also have respective data pipeline component types for which respective ontologies and associated data schemas may be stored by the ontology and data schema repository 115.
In general, an ontology defines classes (also referred to as “concepts” or “attributes”) and properties (also referred to as “slots”) defining features of the classes. As described herein, each data pipeline component type has its own ontology. However, in some taxonomies, each data pipeline component type may comprise its own “class” in a singular ontology or knowledge base of “data pipeline component types,” with additional attributes of the data pipeline component type comprising “sub-classes” in one or more layers below the “class” layer. The ontologies for different data pipeline component types may thus be considered “classes” according to some interpretations. In one example, the format of an ontology may be defined by an operator of the data pipeline controller 110. For instance, an ontology format may have a hierarchy of layers or levels, and there may be certain required classes, certain required properties, certain required class restrictions, certain required values for one or more properties, and so on.
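Purely for illustration, an ontology of the kind described above may be represented as a nested structure of classes, properties, and restrictions. The following Python sketch is non-limiting; the component type, class names, properties, and restriction shown are hypothetical and are not prescribed by the present disclosure:

```python
# Hypothetical ontology for an "adapter" data pipeline component type,
# expressed as a nested dict of classes, properties, and restrictions.
adapter_ontology = {
    "component_type": "adapter",
    "classes": {
        "Adapter": {
            "properties": {
                "vendor": {"type": "string", "required": True},
                "version": {"type": "string", "required": True},
                "input_format": {"type": "string"},
                "output_format": {"type": "string"},
            },
            # A class restriction, e.g., an adapter must transform formats.
            "restrictions": ["input_format != output_format"],
        }
    },
}

def required_properties(ontology, class_name):
    """Return the property names that a class marks as required."""
    props = ontology["classes"][class_name]["properties"]
    return [name for name, spec in props.items() if spec.get("required")]
```

Such a structure captures what an instance of the component type is (its classes and properties), while the data schema, separately, would capture how its functions are invoked.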
In one example, for each new data pipeline component type that becomes available, a vendor may provide an associated ontology. In some cases, a vendor of a new data pipeline component type may also provide an associated data schema. This is illustrated in FIG. 1, where an ontology and/or data schema for a data pipeline component type 190 may be input to the data schema generator/updater module 116. For instance, the ontology and/or data schema for a data pipeline component type 190 may be provided via one of vendor devices 185. In an example where the vendor has provided both an ontology and a data schema, the data schema generator/updater module 116 may simply store a record for the new data pipeline component type comprising the ontology and the data schema in the ontology and data schema repository 115. However, where only an ontology is provided, the data schema generator/updater module 116 may automatically generate a data schema based upon the ontology and store the record comprising the ontology and the data schema in the ontology and data schema repository 115.
In particular, the data schema generator/updater module 116 may determine a similarity between the new type of data pipeline component and one or more existing types of data pipeline components having records in the ontology and data schema repository 115. In one example, the similarity between the new type of data pipeline component and an existing type of data pipeline component may be quantified based upon a congruence between the ontology of the new type of data pipeline component (e.g., a first ontology) and the ontology of the existing type of data pipeline component (e.g., a second ontology). For example, the congruence may be based upon a number of matches between classes, properties, and/or class restrictions (broadly, “features”) of the first ontology and the classes, properties, and/or class restrictions (e.g., “features”) of the second ontology. In one example, there may be different weights applied for matches among different features, e.g., depending upon the level of the features within a hierarchy of the ontology format.
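For illustration purposes only, the weighted congruence computation described above may be sketched as follows. This is a minimal, non-limiting Python sketch in which the feature kinds and weights are illustrative assumptions:

```python
def congruence(features_a, features_b, weights=None):
    """Score the overlap between two ontologies' feature sets.

    features_a / features_b: dicts mapping a feature kind (e.g.,
    "classes", "properties", "restrictions") to a set of feature names.
    weights: optional per-kind weights, so that matches at one level of
    the ontology hierarchy may count more than matches at another.
    """
    weights = weights or {}
    score = 0.0
    for kind in set(features_a) | set(features_b):
        # Number of feature names the two ontologies share for this kind.
        matches = len(features_a.get(kind, set()) & features_b.get(kind, set()))
        score += weights.get(kind, 1.0) * matches
    return score
```

A higher score indicates a closer match between the new component type's ontology and an existing one.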
In one example, the data schema generator/updater module 116 may copy or provide the data schema for the best matching (e.g., the highest congruence measure or score) existing type of data pipeline component as a template for a data schema for the new type of data pipeline component. In one example, the data schema generator/updater module 116 may provide a notification to an operator of the data pipeline controller 110, e.g., at one of client devices 188, indicating the automatic selection of a data schema template for the new type of data pipeline component. In one example, the operator may then approve the template for use as the data schema for the new type of data pipeline component. In one example, the operator may make changes or modifications to the template, and provide the changes/modifications to the data schema generator/updater module 116. In one example, data schemas for the top X matching data pipeline components may be returned to the operator, from which the operator may browse and select one of the data schemas as a template (which may be unmodified, or which may be modified by the operator) that is returned to the data schema generator/updater module 116. Thus, the operator may verify that the data schema generator/updater module 116 is generating valid data schemas. The data schema generator/updater module 116 may then store the template (either modified or unmodified) as the data schema for the new type of data pipeline component, along with the respective ontology, in the ontology and data schema repository 115. Instances of the new type of data pipeline component may then be made available for use in the data pipeline infrastructure 120.
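The top-X template selection described above may be sketched, non-restrictively, as a ranking over the catalog of existing component types. The catalog layout and the `score_fn` parameter here are illustrative assumptions:

```python
def top_matching_schemas(new_ontology_features, catalog, score_fn, x=3):
    """Return up to x (type_name, data_schema) pairs from the catalog,
    ordered by descending congruence with the new component's ontology.

    catalog: {type_name: {"features": ..., "schema": ...}}
    score_fn: a congruence scoring function over two feature sets.
    """
    ranked = sorted(
        catalog.items(),
        key=lambda item: score_fn(new_ontology_features, item[1]["features"]),
        reverse=True,
    )
    # The operator may browse these candidates and approve or modify one
    # as the data schema template for the new component type.
    return [(name, entry["schema"]) for name, entry in ranked[:x]]
```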
To support the fulfillment of requests by the data pipeline controller 110, there may be a catalog of predefined “information models,” stored in information model repository 114. The information models may comprise specifications for data pipelines for various use cases. For instance, in one example each “task type” may have an associated information model. In another example, there may be a number of information models associated with each task type. For instance, a first information model associated with the task type of “market intelligence” may relate to “cellular,” and a second information model associated with the task type of “market intelligence” may relate to “VoIP.” In one example, each information model may be associated with or may comprise metadata relating to one or more of: a name, a region, a task type, a technology, and various other types of parameters. As illustrated in FIG. 1, an information model 195 may be submitted by an operator via one of client devices 188 to the information model updater/generator module 113, which may store the information model in information model repository 114. Once stored in the information model repository 114, the information model 195 may then be used in fulfillment of requests (e.g., requests which are matched to the information model 195).
As noted above, each information model may comprise a specification for a data pipeline. For instance, each information model may comprise hooks to a plurality of data schemas. The data schemas may be for a plurality of data pipeline component types. As also noted above, the data schemas are specific to particular component types, and provide information on how each of the data source(s) and/or data pipeline components 127 may be utilized, accessed, interacted with, etc. For instance, data pipeline components 127 may include components of various component types, such as: adapters, collectors, intermediate nodes, forwarders, data stores, and so forth. For instance, data pipeline components 127 may include two components of type “A” (e.g., A1 and A2), two components of type “B” (e.g., B1 and B2), and one component each of component types “C” and “D.” In the present example, information model 195 may comprise or provide a specification which may result in the establishment and/or reconfiguration of the data pipeline 121, which may include A1, B1, C, and D from data pipeline components 127. In one example, the information model, or specification, may include a plurality of mini-specifications for driving data retrievals and data joins. For instance, each mini-specification may be tailored to a respective data source (or data source type). In one example, a higher-level specification may be delivered to intermediate points to merge data streams. The specification(s) may be configured based upon the data schemas of respective data pipeline component types and the overall sequence of the information model 195.
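For illustration, an information model with hooks to data schemas may be sketched as an ordered sequence of stages, each stage naming the data schema for its component type. The field names and hook identifiers below are hypothetical:

```python
# Hypothetical information model: a pipeline specification whose
# "schema_hook" entries name the data schemas (by component type and
# version) to be invoked at each stage of the pipeline sequence.
information_model_195 = {
    "name": "base-station-performance",
    "metadata": {"task_type": "market intelligence", "technology": "cellular"},
    "sequence": [
        {"component_type": "adapter", "schema_hook": "adapter/v2"},
        {"component_type": "collector", "schema_hook": "collector/v1"},
        {"component_type": "forwarder", "schema_hook": "forwarder/v1"},
        {"component_type": "data_store", "schema_hook": "data_store/v1"},
    ],
}

def schema_hooks(model):
    """List the data schema hooks invoked by an information model, in order."""
    return [stage["schema_hook"] for stage in model["sequence"]]
```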
A mini-specification may also be tailored to a set of pipeline instances where data from a more general view is filtered or enriched for their instance-specific scopes. For example, data fulfillment, management, and assembly modules may efficiently optimize synergies across pipeline requirements, maintain data source updates from sources, utilize transformation processes to map those updates to pipeline instance requirements, and manage filtering, enriching, and propagating the updates into pipeline instances for data ingestion. Using the information models and data pipeline requirements, the data pipeline controller 110 may optimize pipeline infrastructure workload requirements to maximize and manage synergies across existing/new data pipeline component types, to ensure that data source updates occur to fulfill data pipeline instance requirements and service level agreements (SLAs), and to further achieve economies of scale.
In one example, a new information model, such as information model 195, may lead to the discovery of a new data pipeline component type. For instance, an information model may assume the existence of a data pipeline component type for which there is no record in the ontology and data schema repository 115. In such case, the information model updater/generator module 113 may notify the operator via the client device 188 that an ontology and data schema are missing for this assumed-to-be new data pipeline component type. In one example, the operator of client device 188 may provide an ontology, a data schema, or both, which may be provided to the data schema generator/updater module 116. In another example, the operator may contact a vendor, which may be requested to provide an ontology and/or a data schema.
To further illustrate the functions and features of data pipeline controller 110, an example request 197 for delivery of data from one or more of the data sources 125 to one or more of the targets 129 may be processed by the data pipeline controller 110 as follows. First, the request 197 may be crafted via a client device 188, which may specify a desired delivery of data from one or more of data sources 125 to one or more of the targets 129. It should be noted that in one example, the request 197 may comprise a “trigger,” e.g., where the requesting client device 188 is an automated system. The request 197 may identify specific types of data, specific fields of data, specific sources or types of sources, geographic locations of sources, or logical groupings of sources (e.g., all routers within a given network region, all devices in a subnet, all base stations in a selected state, wind speed information for a selected geographic area for a selected time period, all captured images or video in a selected area for a selected period of time, etc.). In one example, a user may generate the request 197 in accordance with a request template, such as in accordance with the example request template/specification and analysis/planning use pattern described above.
In one example, the request 197 may initially be received and processed via the request interpreter and fulfillment module 111 of data pipeline controller 110. The request interpreter and fulfillment module 111 may first attempt to match the request 197 to a most applicable information model. For instance, the request interpreter and fulfillment module 111 may first parse the request to determine which data sources 125 are implicated, the data of data sources 125 that is being requested, the target(s) 129 to which the data is to be delivered, etc. The request 197 may be simple in some cases, but may include (directly or by reference) detailed specification information such that the appropriate data or dataset(s) can be identified, prepared, and provided. Note that in some cases, the request interpreter and fulfillment module 111 may first check if a same request has recently been processed by the data pipeline controller 110, and if so, some portion of the fulfillment process may be omitted for the sake of efficiency (e.g., if various safety/quality assurance criteria are met). For instance, a specification for the request 197 may be sent to data sources 125 and resulting data may be joined in appropriate node(s) (e.g., data pipeline components 127) in order to avoid unnecessary work, with final data/dataset(s) then being delivered to the desired target(s) 129. Otherwise, additional analysis and planning may first be executed.
In one example, the request interpreter and fulfillment module 111 may be configured to process requests that may be in accordance with various data definition languages (e.g., Structured Query Language (SQL), eXtensible Markup Language (XML) Schema Definition (XSD) Language, JavaScript Object Notation (JSON) Schema, etc.). In one example, the request interpreter and fulfillment module 111 comprises an abstract symbol manipulator that extracts symbols from data definition languages and handles rules relating the symbols. As such, the data pipeline controller 110 may handle any data for which descriptor symbols have been provided.
In one example, the data pipeline controller 110 may map the request 197 to a most appropriate information model. For instance, the request 197 may comprise metadata relating to one or more names (e.g., of one or more of the data sources 125, targets 129, types of data sources, and/or types of targets, etc.), one or more regions (e.g., a town, a county, a state, a numbering plan area (NPA), a cell and/or a cluster of cells, a subnet, a defined network region (e.g., a marketing area), etc.), one or more task types (e.g., “market intelligence,” “network load balancing,” “media event support” (e.g., data analysis for large network-impacting events, such as for large concerts, sporting events, etc.), and so forth), a technology (e.g., cellular, Voice over Internet Protocol (VoIP), fiber optic broadband, digital subscriber line (DSL), satellite, etc.), and/or various additional parameters. Such metadata, or parameters, may be explicitly defined in the request 197 as particular metadata fields or may be extracted from the terms of the request 197 (e.g., identified in a query in accordance with a particular data definition language). In any case, the request interpreter and fulfillment module 111 may identify various metadata/parameters of the request 197 and may provide such terms to the information model repository 114.
The information model repository 114 may store a plurality of “information models” (e.g., a catalog or data store). The information models may comprise specifications for data pipelines for various use cases. For instance, in one example each “task type” may have an associated information model. In another example, there may be a number of information models associated with each task type. For instance, a first information model associated with the task type of “market intelligence” may relate to “cellular,” and a second information model associated with the task type of “market intelligence” may relate to “VoIP.” In one example, each information model may be associated with or may comprise metadata relating to one or more of: a name, a region, a task type, a technology, and various other types of parameters.
In one example, the information model repository 114 may map the request to one or more of the information models. For instance, the information model repository 114 may map the request to at least a first information model based upon a congruence between the metadata of the request and the metadata of each of the one or more information models. For instance, an information model having metadata that most closely matches the metadata of the request 197 may be identified. In one example, the top X information models having the closest matches to the metadata of the request 197 may be identified. The matching of the request 197 to each information model may be scored based upon a number of metadata fields that match. In one example, some fields may be weighted such that a match (or lack thereof) with respect to a given metadata field may have a greater or lesser impact on an overall score for the congruence, or match, between a given request and a particular information model. In one example, the top matching information model, or the top X matching information models, may then be returned to the request interpreter and fulfillment module 111. It should be noted that in another example, the matching may be performed via the request interpreter and fulfillment module 111. For instance, the request interpreter and fulfillment module 111 may scan the information models in the information model repository 114 to determine matching scores for different information models. However, in any case, the request interpreter and fulfillment module 111 may select one of the information models (e.g., the top matching information model) for use in establishing and/or reconfiguring a data pipeline to fulfill the request 197. It should be noted that in one example, the request 197 may be submitted in accordance with a request template that may be matched to the information model 195.
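The weighted metadata matching described above may be sketched, for illustration only, as follows. The metadata field names and weights are hypothetical examples, not a prescribed format:

```python
def match_score(request_meta, model_meta, weights):
    """Weighted count of metadata fields on which a request and an
    information model agree; a field absent from the model simply
    contributes no score."""
    return sum(
        weights.get(field, 1.0)
        for field, value in request_meta.items()
        if model_meta.get(field) == value
    )

def best_models(request_meta, models, weights=None, x=1):
    """Return the top-x information models by descending match score."""
    weights = weights or {}
    return sorted(
        models,
        key=lambda m: match_score(request_meta, m["metadata"], weights),
        reverse=True,
    )[:x]
```

For example, a request tagged with task type “market intelligence” and technology “cellular” would score higher against a “market intelligence”/“cellular” information model than against a “market intelligence”/“VoIP” one.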
In such case, the request interpreter and fulfillment module 111 may select the information model 195 based upon the stored association between the request 197 and the information model 195. It should also be noted that in one example, the request interpreter and fulfillment module 111 may provide user tendency and behavioral tracking and analytics. For instance, the request interpreter and fulfillment module 111 may provide an enhanced user experience in which the request interpreter and fulfillment module 111 may recognize the requestor and may use past tendencies to quickly identify and suggest one or more relevant information models.
In one example, the request interpreter and fulfillment module 111 may provide a notification of the selected information model(s) to the client device 188 that submitted the request 197. In one example, the notification may provide an opportunity for the client device 188 to submit a confirmation to the request interpreter and fulfillment module 111 to proceed with the selected information model (or to select one of the suggested information models for use). Likewise, the notification may provide an opportunity for the client device 188 to decline a selected information model. In such case, the request interpreter and fulfillment module 111 may provide one or more additional information models as suggestions (e.g., one or more of the next top X of the closest matching information models). Alternatively, or in addition, the notification may provide the client device 188 with the opportunity to modify a selected information model, or to create a new information model using the selected information model as a template (e.g., along with possible additional modifications). For instance, a user of the client device 188 submitting the request 197 may be aware of a new type of data pipeline component that is desired to be included in the eventual data pipeline. As such, the user may modify the information model and submit it as a change to the information model, or may submit it as a new information model.
In one example, for each new information model that is submitted, and/or for each information model that is modified, the information model updater/generator module 113 may verify that data source(s) 125, data pipeline component(s) 127, and/or target(s) 129 exist that are of the types of data source(s), data pipeline component(s), and/or target(s) indicated in the specification of the information model, and which are permitted to be controlled via the data pipeline controller 110. In other words, the information model updater/generator module 113 may first verify that the data pipeline infrastructure 120 is able to fulfill requests that may invoke the information model. In one example, the information model updater/generator module 113 may communicate with the data pipeline component discovery module 118 to complete this task. For instance, data pipeline component discovery module 118 may maintain an inventory of all of the available data pipeline infrastructure 120 (e.g., data source(s) 125, data pipeline components 127, target(s) 129, etc.).
In one example, each time a component is added to the data pipeline infrastructure 120, a notification may be provided to the data pipeline component discovery module 118. For instance, each of the data pipeline components 127 may be configured to self-report an instantiation and/or a deployment. Alternatively, or in addition, a software defined network (SDN) controller that is responsible for deploying one of the data pipeline components 127 may transmit a notification to the data pipeline component discovery module 118. Similarly, a user who is responsible for deploying one of the data pipeline components 127 may be responsible for a notification to the data pipeline component discovery module 118 (e.g., via one of client devices 188).
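The inventory maintained by the discovery module may be sketched, non-restrictively, as a registry to which components (or an SDN controller or user on their behalf) report deployments, and which can be queried by component type. The class and method names are illustrative:

```python
class ComponentInventory:
    """Sketch of a discovery-module inventory: deployed components are
    registered (self-reported or reported on their behalf), and other
    modules may query the inventory by component type."""

    def __init__(self):
        self._components = {}

    def register(self, component_id, component_type, region=None):
        # Record a newly instantiated/deployed component.
        self._components[component_id] = {"type": component_type, "region": region}

    def of_type(self, component_type):
        # Return the IDs of all registered components of the given type.
        return [
            cid for cid, info in self._components.items()
            if info["type"] == component_type
        ]
```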
It should be noted that new information models may be submitted in connection with a request fulfillment process, or may be submitted without connection to a particular request. For instance, a user may develop an information model for a new anticipated use case, without having a specific request for which a data pipeline is to be immediately built. In one example, a user, e.g., via one of client devices 188, may browse the catalog of the information model repository 114 and may utilize any existing information models as a template for a new information model. As illustrated in FIG. 1, the interactions of data pipeline controller 110 and one of client devices 188 for generating and/or submitting a new information model may be via information model updater/generator module 113. However, in another example, the information model repository 114 may alternatively or additionally comprise an application programming interface (API) which may allow more direct access to the catalog of information models from the one of client devices 188. In one example, user objects, information model objects, and data pipeline component type objects are all first-class citizens in the architecture, so any user could act on (view) any information model/template or data pipeline component type. In addition, there may be no unnecessary hierarchical control imposed over the inventories that would reduce data sharing and limit automation. In accordance with the present disclosure, each object may have an intrinsic identity, may be dynamically constructed at runtime, and may be passed as a parameter.
Once an information model is selected and finalized (e.g., approved for use and/or not objected to), the request interpreter and fulfillment module 111 may also verify that the client device 188 and/or a user thereof is authorized to create a data pipeline with regard to the data being requested, that the desired target(s) 129 are permitted to receive the requested data, that the client device 188 and/or a user thereof is permitted to utilize particular data pipeline component types that are indicated in the specification, and so forth. For instance, the request interpreter and fulfillment module 111 may submit the specification to authorization module 112 along with an identification of the one of client devices 188 and/or an identification of a user thereof. Authorization module 112 may maintain records of the permissions for various ones of client devices 188 and/or various users or user groups, the permissions of various data pipeline component types, the permissions for specific ones of the data pipeline components 127, data source(s) 125, and/or target(s) 129, and so forth. In one example, authorization module 112 may additionally include information regarding user preferences, limitations, exception handling procedures, etc. If the records associated with the user, the one of client devices 188, the data pipeline component type(s), etc. are indicative that a data pipeline may be built or adapted to fulfill the request 197 in accordance with the selected information model, then the authorization module 112 may return a positive confirmation, or authorization, to the request interpreter and fulfillment module 111. In addition, upon receipt of a positive confirmation/authorization, the request interpreter and fulfillment module 111 may submit the selected information model (e.g., along with parameters of the request 197) to the data pipeline management and assembly (DPMA) module 117.
In one example, the DPMA module 117 is responsible for generating a data pipeline or reconfiguring a data pipeline to fulfill the request 197 in accordance with the information model that is selected (such as information model 195). For instance, the DPMA module 117 may decompose the specification of the information model 195 into mini-specifications for driving data retrieval and data joins, e.g., one mini-specification per data source. For instance, in the present example, information model 195 may comprise or provide a specification which may result in the establishment and/or reconfiguration of the data pipeline 121, which may include A1, B1, C, and D from data pipeline components 127. In one example, a higher-level specification may be delivered to intermediate points to merge data streams. To illustrate, the DPMA module 117 may determine that the information model provides a roadmap for establishing a data pipeline for delivering base station performance data from one or more data sources to one or more targets. The request parameters may provide information regarding the geographic scope of the request. In one example, the DPMA module 117 may select particular data sources of data sources 125 having the requisite base station performance data in accordance with the geographic scope information. In one example, the determination may be made using information stored in data pipeline component discovery module 118.
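The decomposition into per-source mini-specifications plus a higher-level merge specification may be sketched, for illustration only, as follows. The specification fields (`retrieval`, `filters`, `join_key`) are hypothetical names:

```python
def decompose(specification, selected_sources):
    """Split a pipeline specification into one mini-specification per
    selected data source, plus a higher-level merge specification for
    the intermediate node(s) that join the per-source streams."""
    mini_specs = [
        {
            "source": src,
            "retrieval": specification["retrieval"],
            "filters": specification.get("filters", []),
        }
        for src in selected_sources
    ]
    merge_spec = {
        "join_key": specification.get("join_key"),
        "inputs": [m["source"] for m in mini_specs],
    }
    return mini_specs, merge_spec
```

Each mini-specification would be delivered toward its data source to drive retrieval, while the merge specification would be delivered to the intermediate node that joins the resulting streams.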
In one example, the information model may indicate that an aggregator component is called for as a first intermediate node in data pipeline 121. DPMA module 117 may determine that there are multiple aggregator components available in the data pipeline infrastructure (e.g., A1 and A2). However, DPMA module 117 may select one of these in accordance with the request parameters, e.g., using the geographic scope information, using information regarding the distance or latency from the data source(s) 125 (e.g., after selecting the appropriate data source(s) 125), and so forth. For instance, in the present example, DPMA module 117 may select an aggregator component A1 from the available data pipeline components 127. It should be noted that DPMA module 117 may select additional data pipeline components B1, C, and D from the available data pipeline components 127 following a similar analysis.
In one example, DPMA module 117 may instantiate the data pipeline 121 in response to the request 197 (or in response to an instruction from the request interpreter and fulfillment module 111 containing the selected information model and parameters of the request 197). In one example, DPMA module 117 may configure the data pipeline components A1, B1, C, and D in accordance with hooks in the information model and/or specification which invoke data schemas associated with the respective data pipeline component types of the data pipeline components A1, B1, C, and D. For instance, a data schema for data pipeline component A1 may indicate the available commands which may be used to configure data pipeline component A1, the values of different arguments or parameters which may be used in one or more commands, and so forth. In one example, the hooks in the information model (e.g., information model 195) may be executed by DPMA module 117 to retrieve or to invoke the respective data schemas. However, specific configuration commands may be tailored to the particular data pipeline components 127 that are selected (e.g., to direct configuration commands to A1 (and not to A2), to B1 (and not to B2), to C, and to D). Accordingly, using the various data schemas, DPMA module 117 may configure the data pipeline components A1, B1, C, and D to function as data pipeline 121 and to move the requested data from the one or more of data sources 125 to one or more of targets 129.
To illustrate, data pipeline component A1 may be configured to obtain base station operational data from at least two of the data sources 125 and to aggregate the data at the node. For instance, data pipeline component A1 may utilize Apache Kafka, Data Movement as a Platform (DMaaP), nanomsg, or the like to "subscribe" to the data from the relevant data sources 125. In one example, data pipeline component A1 may also be configured to periodically forward the aggregated data to data pipeline component B1. Data pipeline component B1 may be configured to generate summary data, such as 5-minute moving averages, etc., and to pare the data, such as removing extra fields, and so forth. Data pipeline component C may be configured to obtain summary data from data pipeline component B1 (e.g., again using Kafka, DMaaP, nanomsg, or the like), to smooth the data and remove any outliers, and to place the processed data into a JSON format. Lastly, data pipeline component D may be configured to periodically obtain the data that is further processed from data pipeline component C, to store a copy of the processed data, and to forward the processed data to the desired one or more of targets 129.
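For illustration only, the schema-driven configuration described above might be sketched as follows; the component types, schema fields, and command names are hypothetical assumptions and not part of the disclosure:

```python
# Hypothetical sketch: a DPMA uses per-component-type data schemas to
# validate and generate configuration commands for selected components.

DATA_SCHEMAS = {
    "aggregator": {"commands": ["subscribe", "set_interval", "forward_to"]},
    "summarizer": {"commands": ["set_window", "drop_fields", "forward_to"]},
}

def configure_component(component_id, component_type, settings):
    """Validate requested settings against the component type's schema
    and emit (component_id, command, argument) tuples."""
    schema = DATA_SCHEMAS[component_type]
    commands = []
    for command, argument in settings.items():
        if command not in schema["commands"]:
            raise ValueError(f"{command!r} not supported by {component_type}")
        commands.append((component_id, command, argument))
    return commands

# Configure aggregator A1 to subscribe to two sources and forward to B1.
cmds = configure_component("A1", "aggregator", {
    "subscribe": ["source-1", "source-2"],
    "forward_to": "B1",
})
```

The same validation path would reject, for example, a `set_window` command directed to an aggregator, since it is absent from that type's schema.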
It should be noted that in one example, parameters of the request 197 may indicate a limited temporal scope of the requested data. As such, in one example, DPMA 117 may configure the data pipeline components A1, B1, C, and D to cease the specific functions configured for data pipeline 121 after the temporal scope of the request has passed. However, it should also be noted that as indicated above, the data pipeline component discovery module 118 may maintain information regarding the availability and current configurations of data pipeline components 127, the data pipeline 121, other data pipelines, etc. As such, in one example, all or a portion of the data pipeline 121 (e.g., the configurations of any one or more of the data pipeline components A1, B1, C, and D) may be maintained after the fulfillment of the request 197, such as if a new request is received and processed by data pipeline controller 110 and if it is determined that the same data is being requested. Thus, for example, the data may be maintained in data pipeline component D for an additional duration so as to fulfill this additional request. For instance, there may be one or more predictors that suggest that one or more of the data sources 125 may be reused again based on historical trends.
Alternatively, or in addition, the new request may be for obtaining data that partially overlaps with the data requested in request 197. For instance, the new request may be for similar base station operational data having the same geographic scope, but for a more extended time period, or for a time period that partially overlaps with a time period specified in the request 197. In such case, DPMA 117 may maintain the data pipeline 121 for an additional duration so as to obtain the additional data associated with the time period of the new request. Additional scenarios may also lead to the full or partial reuse of data pipeline 121 or other data pipelines. For instance, in another example, data pipeline 121 may be integrated with another data pipeline, may be expanded with one or more additional data pipeline components to fulfill a new request (such as adding an additional aggregator for obtaining additional base station operational data from an additional geographic region), and so forth. DPMA 117 may maintain an underlying source feed process that a plurality of data pipeline instances depend on, as long as a subset of the data pipeline instances continues to exist. DPMA 117 may be able to reduce the frequency of enrichment, or lower other characteristics of one or more of the remaining data pipeline instances, to compensate for new resulting requirements of any or all of the remaining data pipeline instances.
To further illustrate, in one example, data pipeline 121 may be in existence (e.g., having been created, configured, and either in-use or remaining idle/in standby mode) prior to the request 197. In such case, similar to the example above, DPMA module 117 may determine that the information model provides a roadmap for establishing a data pipeline for delivering base station performance data from one or more data sources to one or more targets. The request parameters may provide information regarding the geographic scope of the request. Thus, the DPMA 117 may select particular data sources of data sources 125 having the requisite base station performance data in accordance with the geographic scope information. In one example, the determination may be made using information stored in data pipeline component discovery module 118. However, the information stored in data pipeline component discovery module 118 may also indicate that data pipeline 121 is operational within the data pipeline infrastructure 120 and is available to fulfill the request 197. In this case, the nodes of data pipeline 121 (e.g., data pipeline components A1, B1, C, and D) may be reconfigured to fulfill the request 197. For instance, the data pipeline components A1, B1, C, and D may be configured/reconfigured using commands via the respective data schemas to obtain additional data within the temporal and geographic scope of the request 197, to forward the processed data to one or more of the targets 129 via data pipeline component D, and so forth.
In still another example, the DPMA module 117 may determine in accordance with the information model selected for request 197 that the requested data may already be stored, e.g., at data pipeline component D. For instance, data pipeline component D may have come into possession of the data in accordance with a different request for which the data pipeline 121 was established. In such an example, data pipeline component D may also store extra data that is not relevant to request 197. However, in such case, DPMA 117 may establish a new, shortened data pipeline to fulfill request 197. For instance, the data pipeline may comprise data pipeline component D (and in one example, the one or more target(s) 129, which may also be considered part of the data pipeline). In such case, the configuration may involve configuring the target(s) 129 as subscribers to a data feed from data pipeline component D comprising the portion of the data stored therein that is pertinent to the request 197.
In addition to the foregoing, data pipeline controller 110 may also include a data blending module 119. In accordance with the present disclosure, data blending module 119 may establish a plan for blending data on behalf of a single request or multiple requests from multiple clients (e.g., for potential reuse and data processing efficiency). For example, when multiple data delivery requests are received, the data blending module 119 may be invoked to provide an analysis to determine if duplicated datasets are involved. The data blending module 119 may also check if all or a portion of a requested dataset or datasets is available in one or many of the intermediate nodes, e.g., data pipeline components 127. The data blending module 119 may also check if the relevant data source(s) 125 have been updated (e.g., while the intermediate node(s) may not have been synced/updated). In one example, the data blending module 119 may determine a most efficient pipeline configuration to deliver data from the relevant data source(s) 125 to the relevant target(s) 129 (or at least a more efficient pipeline configuration as compared to independently establishing data pipelines for each request according to respective information models associated with each request), e.g., with a least number of intermediate processing steps (or at least a reduced number of intermediate processing steps).
As noted above, request interpreter and fulfillment module 111 may initially obtain and process each data request. In particular, request interpreter and fulfillment module 111 may select an associated information model to use as a specification for establishing a data pipeline to fulfill each request. However, in one example, during the processing, whenever an information model indicates an available data blending capability, the request interpreter and fulfillment module 111 may invoke the data blending module 119. Alternatively, or in addition, the data blending module 119 may also consult policies of one or more of the requesting clients, and/or policies that are applicable to classes of clients and/or categories of requests, to determine whether, and for which request(s), data blending is permitted.
In an illustrative example, multiple requests (e.g., from different client devices 188) may be received via request interpreter and fulfillment module 111 at or around the same time, e.g., within five seconds, within five minutes, etc. The request interpreter and fulfillment module 111 may determine, for any two or more of the requests, that data blending is permitted. The request interpreter and fulfillment module 111 may then invoke the data blending module 119 for creating a combined plan for data path/data pipeline configuration. More specifically, the data blending module 119 may access information models associated with respective requests as its path(s) creation guidance. In one example, the request interpreter and fulfillment module 111 may match requests to information models, e.g., as described above, and provide the information models to the data blending module 119. In another example, the data blending module 119 may perform the same or similar operations to independently match information models to the respective requests.
In one example, the data blending module 119 may determine common denominators among all data requests. For instance, the data blending module 119 may identify that at least a portion of data is a part of both a first data set that is specified in at least a first request and a second data set that is specified in at least a second request (e.g., requests 197 and 198, as illustrated in FIG. 1). The data blending module 119 may then create a new data delivery plan, or roadmap, for delivering requested data set(s) from source(s) to target(s) in accordance with the multiple data requests. This integrated roadmap may include (but is not limited to) a specification which details one or more of the data source(s) 125 or intermediate nodes (e.g., one or more of data pipeline components 127) from which data is to be initially obtained, how the data is to be obtained from the different data source(s) 125 and/or data pipeline components 127, where to create one or more intermediate nodes (e.g., from data pipeline components 127) to store interim datasets, how long data is to be stored at the intermediate node(s), which data manipulations are to be performed on the interim datasets, and so forth. In one example, the data blending module 119 may confirm or obtain authorizations from authorization module 112, which may delimit the ability to leverage the new data plan as a standard data view for requests that may not conform to the data scope, filtering, transformation, encryption, etc. that may be performed on a specific data requirement.
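A minimal sketch of the common-denominator determination, modeling each requested data set as a set of (source, field) pairs; the request contents below are hypothetical:

```python
def common_portion(data_set_a, data_set_b):
    """Return the portion of data common to both requested data sets,
    each modeled here as a set of (source, field) pairs."""
    return data_set_a & data_set_b

# Hypothetical contents of requests 197 and 198.
req_197 = {("bs-east", "throughput"), ("bs-east", "latency")}
req_198 = {("bs-east", "throughput"), ("bs-west", "throughput")}

# The shared portion is a candidate for blended delivery via a common
# intermediate node rather than two independent pipelines.
shared = common_portion(req_197, req_198)
```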
In one example, the plan, or roadmap, may comprise a combination of a first information model associated with the first request (e.g., request 197) and a second information model associated with the second request (e.g., request 198), and with at least one modification to the combination of the first information model and the second information model. It should be noted that in other examples, the plan may comprise a combination of a first information model, a second information model, a third information model, etc., along with one or more modifications to the combination (e.g., depending upon the number of requests that are identified as having overlapping requests for data). The modification(s) may include an omission of one or more data pipeline components (e.g., as compared to a plan that would comprise a straight combination of the first information model and the second information model), an addition of at least one data pipeline component (e.g., a data pipeline component that is not present in the first information model and the second information model), an alteration to at least one setting for at least one data pipeline component that is present in the first information model or the second information model, and so on.
In one example, the at least one modification to the combination of the first information model and the second information model may be selected for the plan based upon a number of factors, which may include a reduction (or an increase) of one or more metrics according to the plan as compared to the combination of the first information model and the second information model without the modification, such as: a determination of a reduction in an overall number of data pipeline components (and/or a reduction in a total number of intermediate processing steps), a determination of a reduction in a network bandwidth utilization, a reduction in a latency of a delivery of at least one of the first data set or the second data set (or an increase in an anticipated speed of delivery of the data set(s)), a determination of a reduction in a cost of a delivery of at least one of the first data set or the second data set, and so forth.
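One way the metric-based selection of modifications might be realized, for illustration, is as a weighted score compared against the unmodified combination; the metric names, weights, and candidate modifications below are illustrative assumptions:

```python
def score(metrics, weights):
    """Weighted sum of plan metrics (component count, bandwidth, latency);
    lower is better."""
    return sum(weights[k] * metrics[k] for k in weights)

def select_modifications(baseline, candidates, weights):
    """Keep only candidate modifications whose projected metrics improve
    upon the straight combination of the information models."""
    base = score(baseline, weights)
    return [name for name, metrics in candidates.items()
            if score(metrics, weights) < base]

weights = {"components": 1.0, "bandwidth": 0.5, "latency": 2.0}
# Straight combination of the two information models, no modification.
baseline = {"components": 8, "bandwidth": 10.0, "latency": 4.0}
candidates = {
    "share-aggregator": {"components": 6, "bandwidth": 9.0, "latency": 4.0},
    "add-cache-node":   {"components": 9, "bandwidth": 10.0, "latency": 4.5},
}
chosen = select_modifications(baseline, candidates, weights)
```

Here only the modification that reduces the weighted score relative to the baseline survives the selection.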
In one example, the data blending module 119 may select one or more modifications that do not violate a client policy (or client policies of respective clients). For instance, the client policy, or policies, may be contained in the first request (e.g., request 197), the second request (e.g., request 198), or both. Alternatively, or in addition, the policy, or policies, may be maintained by the data pipeline controller 110 on behalf of the client(s), e.g., at authorization module 112, or by the data blending module 119 itself. Each client policy may specify a restriction on a location of at least one data pipeline component, a restriction on a sharing of at least one data pipeline component, a restriction on an access of other clients to at least a portion of the first data set or the second data set (for instance, a client may own a data source and may wish for the data to be exclusively processed), and so forth. Alternatively, or in addition, the at least one modification may be selected in accordance with an operator policy of an operator of the data pipeline infrastructure 120. For example, the operator policy may balance a reduction in an overall number of data pipeline components with a reduction in a latency of a delivery of at least one of the first data set or the second data set. Examples of the types of data pipeline configurations that may result from a combined plan generated via data blending module 119 are illustrated in FIG. 4 and described in greater detail below.
In one example, the data blending module 119 may create a plan that first attempts to reduce an overall number of data pipeline components by consolidating separate data pipeline components from two data pipelines into a single function (a single data pipeline component that is shared by the two data pipelines), and then selecting a location (or providing criteria for selecting a location) for the shared data pipeline component. For instance, the plan may comprise a specification that indicates to use a data pipeline component in an existing location if one of the two pipelines is already established, or to select a location that minimizes latency or maximizes throughput based upon location criteria of both data pipelines (e.g., locations of sources and/or destinations, and/or locations of preceding and/or following nodes according to the respective information models, which may be combined to create the plan).
In one example, the data blending module 119 may select to add storage when at least a portion of data is identified as being from a partial overlap in time of a first data set in accordance with the first request (e.g., request 197) and a second data set in accordance with the second request (e.g., request 198). For instance, the data blending module 119 may select to add storage to an existing node, to a node of a first data pipeline, or to a node of a second data pipeline, or to add a new node that is not called for in either of the first information model or the second information model, based upon overall efficiency in data delivery, such as a reduced latency, or a reduction in latency balanced with cost of deployment of a new node, and so forth. In one example, the cost can be monetary, or the "cost" may be an additional resource utilization (e.g., a processor utilization, a memory utilization, an available storage in the data pipeline environment 120 or in a portion/region of the data pipeline environment 120, a number of available nodes of a particular type in the data pipeline environment 120 or in a portion/region of the data pipeline environment 120, etc.), and so forth. In one example, the cost can be additional network bandwidth incurred to store data instead of streaming the data directly to one or more of the target(s) 129.
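The partial overlap in time between two requested data sets could be detected with a simple interval intersection; the window values below are illustrative, not from the disclosure:

```python
def time_overlap(window_a, window_b):
    """Return the overlapping (start, end) of two time windows,
    or None if they do not overlap."""
    start = max(window_a[0], window_b[0])
    end = min(window_a[1], window_b[1])
    return (start, end) if start < end else None

# e.g., request 197 covers hours 9-12 and request 198 covers hours 11-14;
# the shared hour 11-12 is a candidate for added storage at a blended node.
overlap = time_overlap((9, 12), (11, 14))
```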
In one example, the plan, or roadmap, may be provided to the request interpreter and fulfillment module 111 as an executable information model. The request interpreter and fulfillment module 111 may then provide the plan (e.g., as an executable information model) to the DPMA module 117 to establish appropriate data pipelines or data paths to fulfill the multiple data requests. In one example, the plan may also be stored in the support modules (e.g., in the information model repository 114, as an information model for future reuse). The DPMA module 117 may then establish data pipelines according to the shared plan. In particular, the DPMA module 117 may utilize the shared plan in the same or similar manner as described above.
It should be noted that in one example, the data blending module 119 may request information from data pipeline component discovery module 118 regarding the availability of data pipeline components 127 and the status of the data pipeline components 127, such as whether each of the data pipeline components 127 is configured as part of an existing data pipeline, whether there is spare capacity to configure the data pipeline component for use in additional data pipelines, the data that may be stored in the data pipeline components 127, and so forth. In one example, the data blending module 119 may utilize this topology and/or configuration information to determine the plan for one or more shared paths, or one or more alternate paths, for reliability, redundancy, and/or reuse.
For instance, the data blending module 119 may learn that data requested in one or both of the request 197 or the request 198 may already be stored at an intermediate node (one of data pipeline components 127). In one example, this particular intermediate node may be included in the plan. For instance, the plan (e.g., a specification) may leave no choice to the DPMA module 117 with respect to this particular data pipeline component. On the other hand, for additional data pipeline components, the data blending module 119 may create a plan with various criteria, such as directives, preferences, etc., which may be used by DPMA 117 to make selections from among the data sources 125 and data pipeline components 127 in accordance with the plan. For instance, these criteria may remain the criteria as specified in one or both of the information models associated with the requests 197 and 198, which may be combined to create the shared plan. Thus, the data blending module 119 may use information models as inputs, which can be adjusted based on the particulars of each request, as well as current conditions of the data pipeline infrastructure 120, operator policy, and/or one or more client policies, and so forth.
It should be noted that the system 100 has been simplified. Thus, the system 100 may be implemented in a different form than that which is illustrated in FIG. 1, or may be expanded by including additional endpoint devices, access networks, network elements, application servers, etc., without altering the scope of the present disclosure. In addition, system 100 may be altered to omit various elements, substitute elements for devices that perform the same or similar functions, combine elements that are illustrated as separate devices, and/or implement network elements as functions that are spread across several devices that operate collectively as the respective network elements. For example, the system 100 may include other network elements (not shown) such as border elements, routers, switches, policy servers, security devices, gateways, a content distribution network (CDN), and the like, additional clouds, and so forth.
It should also be noted that the modules of data pipeline controller 110, the interrelationships and connections shown in FIG. 1, and so forth are illustrative of just one example of how data pipeline controller 110 may be organized and configured. For example, data pipeline component discovery module 118 may be split into two modules, with a separate module to keep track of active and inactive data pipelines, while data pipeline component discovery module 118 may continue to maintain an inventory of individual data pipeline components 127. In still another example, an additional module may be provided to store previously processed requests as request templates, to store request templates and the associations between request templates and information models, to provide the request templates to clients, to obtain feedback on the matching of requests and/or request templates to information models (and/or the resulting data pipelines), to learn and update associations between request templates and information models, and so forth. Thus, these and other modifications are all contemplated within the scope of the present disclosure.
FIG. 2 illustrates a flowchart of an example method 200 for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types, in accordance with the present disclosure. In one example, the method 200 is performed by a component of the system 100 of FIG. 1, such as by the data pipeline controller 110, and/or any one or more components thereof (e.g., a processor, or processors, performing operations stored in and loaded from a memory and comprising any one or more of the modules 111-119). In one example, the steps, functions, or operations of method 200 may be performed by a computing device or system 300, and/or processor 302 as described in connection with FIG. 3 below. For instance, the computing device or system 300 may represent any one or more components of a data pipeline controller that is/are configured to perform the steps, functions, and/or operations of the method 200. Similarly, in one example, the steps, functions, or operations of method 200 may be performed by a processing system comprising one or more computing devices collectively configured to perform various steps, functions, and/or operations of the method 200. For instance, multiple instances of the computing device or processing system 300 may collectively function as a processing system. For illustrative purposes, the method 200 is described in greater detail below in connection with an example performed by a processing system. The method 200 begins in step 205 and proceeds to step 210.
At step 210, the processing system obtains a first ontology of a first type of data pipeline component. For instance, an operator of the processing system (e.g., a data pipeline controller) may define an ontology format such that a vendor providing a new type of data pipeline component may also provide an ontology associated with the new type of data pipeline component.
At step 215, the processing system maps the first ontology to a second ontology for a second type of data pipeline component that is stored in a catalog of data pipeline component types. In one example, the mapping comprises determining a similarity between the second type of data pipeline component and the first type of data pipeline component. For example, the similarity may be based upon a congruence between the first ontology of the first type of data pipeline component and the second ontology of the second type of data pipeline component. For instance, the congruence (e.g., a metric or score that quantifies the extent of the matching) may be based upon a number of matches between features of the first ontology (e.g., at least one of classes, properties, or class restrictions) and features of the second ontology (e.g., at least one of classes, properties, or class restrictions). In one example, the congruence may have different weights for matches among different features, e.g., depending upon the level of a feature within a hierarchy.
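One way such a weighted congruence score might be computed, for illustration; the feature sets and weights below are assumptions, not values from the disclosure:

```python
def congruence(ontology_a, ontology_b, weights=None):
    """Weighted count of matching features between two ontologies, each
    modeled as a dict mapping a feature kind to a set of feature names."""
    weights = weights or {"classes": 3, "properties": 2, "restrictions": 1}
    return sum(w * len(ontology_a.get(kind, set()) & ontology_b.get(kind, set()))
               for kind, w in weights.items())

# Hypothetical ontologies for two collector versions and a router type.
collector_v1 = {"classes": {"Collector", "Buffer"},
                "properties": {"pollInterval", "protocol"}}
collector_v2 = {"classes": {"Collector", "Buffer"},
                "properties": {"pollInterval", "protocol", "batchSize"}}
router = {"classes": {"Router"}, "properties": {"protocol"}}

# The new collector scores far higher against the cataloged collector
# than against the router, so the collector's schema becomes the template.
best_match = congruence(collector_v1, collector_v2)
weak_match = congruence(collector_v1, router)
```

Weighting class matches above property matches reflects the idea that higher-level features of the hierarchy are stronger evidence of similarity.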
At step 220, the processing system provides a second data schema for the second type of data pipeline component as a template for a first data schema for the first type of data pipeline component. For instance, step 215 may identify a second type of data pipeline component that is most similar to the first type of data pipeline component. For example, both of these types of data pipeline components comprise "collectors." In addition, in one example, both of these types of data pipeline components may be provided by a same vendor and/or may "match" with respect to one or more alternative or additional features. Accordingly, it may be observed that the second data schema is likely to provide the relevant configuration information for all or at least a significant portion of the available functions of the first type of data pipeline component.
At optional step 225, the processing system may present the template to an operator, e.g., via an endpoint device of the operator. The presentation may include options for the operator to modify, and/or to approve or deny, the adoption of the template as the first data schema. It should be noted that in one example, step 220 may comprise identifying a plurality of best matching existing types of data pipeline components from the catalog, and optional step 225 may comprise presenting the plurality of associated data schemas to the operator as template options.
Alternatively, the processing system may instead present the template(s) to an operator that is implemented by an automated system, e.g., a self-learning processing system or neural network. For instance, the automated system may comprise one or more artificial intelligence (AI) and/or machine learning (ML) modules which may be configured to analyze the template, to approve the template, and/or to provide a modification to the template, and so forth. For instance, the automated system may be trained from past user behaviors regarding presented templates and modifications (or lack thereof) made to such templates. Over time, the automated system may learn and predict how certain modifications should be made in response to new templates that are presented. For example, several vendors of a similar type of component may have recently provided new versions which include functionality defined in a newly released industry standard for which new data schemas have already been created and/or obtained. When a next vendor releases its own new version of the same component type, the automated system may implement a similar change to the template so as to provide a new data schema (e.g., one that incorporates changes to address the new functionality that is shared across all vendors' newly released versions). In addition, feedback may be received over time regarding the automated decisions to further impact the learning of the automated system (e.g., via a reinforcement learning process), such that additional user observations may be omitted. It should be noted that such an automated system may be instantiated in accordance with any number of different machine learning models (MLMs) or machine learning algorithm(s). For example, a deep reinforcement learning (DRL) algorithm may be used in accordance with the present disclosure to train a deep neural network (DNN), such as a double deep Q network, and so forth.
At optional step 230, the processing system may obtain at least one change to the template from the operator (e.g., from a human or from an automated system). For instance, the operator may be aware that the new "first" type of data pipeline component is an upgraded version of the older "second" type of data pipeline component and has at least one new function. In this case, the operator may alter the template so as to include the configuration information for the at least one new function. In one example, the operator may also select among a plurality of possible templates (e.g., if presented at optional step 225).
At optional step 235, the processing system may change the template in accordance with the at least one change. In one example, the processing system may send test instructions to at least one instance of the first type of data pipeline component to verify that the function added by the modification exists.
At optional step 240, the processing system may obtain an approval of the operator to deploy the template (modified or unmodified) as the first data schema.
At step 245, the processing system adds the first type of data pipeline component to the catalog of data pipeline component types, where the adding comprises storing the first ontology and the first data schema for the first type of data pipeline component in the catalog. In one example, the first data schema that is stored in the catalog may comprise the template that is changed at optional step 235. In one example, steps 210-245 may include functions as described above in connection with the data schema generator/updater module 116 and the ontology and data schema repository 115 of FIG. 1.
At optional step 250, the processing system may identify a first information model that may be impacted by the adding of the first type of data pipeline component to the catalog, where the first information model comprises a flow sequence for a data pipeline (as well as data attribute relationships, in one example). For instance, optional step 250 may comprise determining that the at least one information model includes at least one hook that identifies the second type of data pipeline component.
At optional step 255, the processing system may provide at least one suggestion to an operator comprising at least one of: a suggestion to modify the first information model to incorporate the first type of data pipeline component or a suggestion to create a new information model (e.g., based upon the first information model and that incorporates the first type of data pipeline component). It should be noted that incorporating the first data pipeline component may comprise replacing the second data pipeline component or inserting the first data pipeline component (without replacing the second data pipeline component). It should also be noted that optional steps 250 and 255 may further apply to additional information models that may be identified as potentially being impacted by the adding of the first type of data pipeline component to the catalog. In one example, optional steps 250 and 255 may include functions as described above in connection with the information model updater/generator module 113 and/or information model repository 114 of FIG. 1.
At optional step 260, the processing system may obtain a request for a delivery of a data set to at least one destination. In one example, the request may be in accordance with a request template. In one example, the request may comprise a plurality of parameters, such as the desired data set, a specific data source or data sources, one or more target(s), a relevant time period for obtaining the data of the data set (e.g., for streaming and/or real-time data) and/or a relevant time period for which stored data is being requested, a specification of geographic bounds of the requested data set, one or more network regions for which data is being requested, other keywords, and so forth. In one example, the request may be formulated in accordance with a data definition language (DDL) that may be understood by the processing system.
At optional step 265, the processing system may map the request to the first information model from among a plurality of information models. For instance, the first information model may comprise first metadata relating to at least one of a name, a region, a task type, and so forth. Similarly, the request may comprise second metadata relating to at least one of: the name (e.g., an identification of one or more specific data sources and/or classes of data sources, one or more specific targets/destinations or classes of targets/destinations, an identifier of the requester and/or an organization of the requester, etc.), the region (e.g., a geographic indicator, an indicator of a portion of a network, a market segment, etc.), or the task type (e.g., “market intelligence,” “network load balancing,” “media event support,” etc.). As such, the mapping may comprise mapping the request to the first information model based upon a congruence between the first metadata and the second metadata. For instance, the congruence (e.g., a metric or score that quantifies the extent of the matching) may be based upon a number of matches between the metadata parameters.
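Purely as an illustrative sketch (the metadata field names, example values, and equal weighting of matches are assumptions for illustration only, not part of the disclosure), the congruence-based mapping of optional step 265 could resemble:

```python
def congruence(request_meta, model_meta):
    """Count matching metadata parameters shared by a request and an
    information model; a higher score indicates a closer mapping."""
    shared = set(request_meta) & set(model_meta)
    return sum(1 for key in shared if request_meta[key] == model_meta[key])

def map_request(request_meta, models):
    """Select the information model with the highest congruence score."""
    return max(models, key=lambda m: congruence(request_meta, m["metadata"]))

# Hypothetical catalog of information models and an incoming request.
models = [
    {"id": "model-1", "metadata": {"name": "traffic-feed", "region": "northeast",
                                   "task_type": "network load balancing"}},
    {"id": "model-2", "metadata": {"name": "sales-feed", "region": "west",
                                   "task_type": "market intelligence"}},
]
request = {"name": "traffic-feed", "region": "northeast",
           "task_type": "media event support"}
best = map_request(request, models)
# best["id"] == "model-1" (two of the three metadata parameters match)
```

In practice, matches on different metadata parameters could be weighted unequally, or ties broken by additional criteria, without departing from the congruence-based mapping described above.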
At optional step 270, the processing system may select a plurality of data schemas of a plurality of data pipeline component types in accordance with the first information model. For instance, the first information model may comprise hooks to the plurality of data schemas. In one example, optional steps 260-270 may include functions as described above in connection with the request interpreter and fulfillment module 111 and/or information model repository 114 of FIG. 1.
At optional step 275, the processing system may determine whether an existing data pipeline is available to handle the request. For example, the existing data pipeline may be determined to be available when the plurality of data pipeline components are arranged in the same manner as indicated in the first information model (and hence the plurality of data pipeline components are of the correct data pipeline component types). Alternatively, the existing data pipeline may be determined to be available: (1) when it has the correct components that can be reconfigured to alternatively or additionally handle the current data delivery request, or (2) when the existing data pipeline does not have all of the specified components, but it has a sufficient number or percentage of the requisite components such that the processing system may select to modify/update this data pipeline to alternatively or additionally handle the current data request, rather than instantiate and arrange a new data pipeline.
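As a minimal sketch of the determination at optional step 275 (the component type labels and the 75 percent reuse threshold are hypothetical choices for illustration, not values specified by the disclosure), the existing pipeline could be classified as follows:

```python
def pipeline_availability(existing_types, required_types, reuse_threshold=0.75):
    """Classify an existing pipeline against the component types an
    information model requires: 'ready' when every required type is
    present, 'modifiable' when at least reuse_threshold of the required
    types are present (the pipeline may be updated rather than rebuilt),
    and 'unavailable' otherwise (a new pipeline would be instantiated)."""
    required = set(required_types)
    present = required & set(existing_types)
    if present == required:
        return "ready"
    if len(present) / len(required) >= reuse_threshold:
        return "modifiable"
    return "unavailable"

print(pipeline_availability({"ingest", "filter", "store"},
                            {"ingest", "filter", "store"}))            # ready
print(pipeline_availability({"ingest", "filter", "store"},
                            {"ingest", "filter", "store", "publish"})) # modifiable
print(pipeline_availability({"ingest"},
                            {"ingest", "filter", "store", "publish"})) # unavailable
```

Other availability factors described herein (capacity, geographic proximity, latency) could be folded into the same determination as additional predicates.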
The processing system may perform optional step 280 when it is determined that an existing data pipeline is available to fulfill the request. Specifically, optional step 280 may comprise transmitting instructions to a plurality of data pipeline components of the existing data pipeline in accordance with the plurality of data schemas to configure the plurality of data pipeline components for delivering the data set to the at least one destination. In one example, optional step 280 may further include adding one or more additional data pipeline components to the data pipeline by transmitting instructions to the one or more additional data pipeline components in accordance with respective data schemas associated with the one or more additional data pipeline components to configure the one or more additional data pipeline components to function as part of the data pipeline for delivering the data set to the at least one destination.
On the other hand, the processing system may perform optional steps 285 and 290 when it is determined that no existing data pipeline is available to fulfill the request. Specifically, optional step 285 may comprise determining an availability of a plurality of data pipeline components. For instance, optional step 285 may comprise identifying the right data pipeline components of the right data pipeline component types, e.g., finding ones that are available, have capacity, are geographically proximate or provide the best latency or other performance considering where the data is located (at the source(s) or intermediate nodes where the requested data may have been previously copied), where the destination(s) is/are located, and so forth. The availabilities, capacities, proximity, and so forth may be determined based upon information stored by the processing system (such as by data pipeline component discovery module 118 of FIG. 1, for example).
At optional step 290, the processing system may transmit instructions to the plurality of data pipeline components in accordance with the plurality of data schemas to configure the plurality of data pipeline components into a data pipeline, where the data pipeline is for delivering the data set to the at least one destination. In one example, optional steps 275-290 may include functions as described above in connection with the request interpreter and fulfillment module 111, data pipeline management and assembly module 117, data pipeline component discovery module 118, and/or authorization module 112 of FIG. 1.
Following step 280 or optional step 290, the method 200 proceeds to step 295 where the method ends.
It should be noted that the method 200 may be expanded to include additional steps, or may be modified to replace steps with different steps, to combine steps, to omit steps, to perform steps in a different order, and so forth. For instance, in one example the processing system may repeat one or more steps of the method 200, such as steps 210-245 for adding additional types of data pipeline components to the catalog, steps 260-280 or steps 260-290 for additional requests for delivery of data, and so forth. In one example, the method 200 may be expanded to include obtaining an ontology and a data schema for a new type of data pipeline component (such as from a vendor of the new type of data pipeline component) and adding the new type of data pipeline component to the catalog (e.g., without performing steps 215 and 220, since the data schema is already provided). In one example, the method 200 may be expanded to include obtaining a request to search a catalog of data pipeline components and providing access to all or a portion of the catalog. In another example, the method 200 may be expanded to include providing one or more request templates to a client device and receiving a selection of one of the request templates. For instance, in such an example, the request that is obtained at optional step 260 may be in accordance with a request template. For example, a client, via a client device, may provide certain details which may be plugged in to the template, such as specific dates, times, source(s), locations or regions, target(s), etc. In still another example, the order of optional steps 235 and 240 may be reversed. Thus, these and other modifications are all contemplated within the scope of the present disclosure.
In addition, although not expressly specified above, one or more steps of the method 200 may include a storing, displaying and/or outputting step as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the method can be stored, displayed and/or outputted to another device as required for a particular application. Furthermore, operations, steps, or blocks in FIG. 2 that recite a determining operation or involve a decision do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step. However, the use of the term “optional step” is intended only to reflect different variations of a particular illustrative embodiment and is not intended to indicate that steps not labelled as optional are to be deemed essential steps. Furthermore, operations, steps or blocks of the above described method(s) can be combined, separated, and/or performed in a different order from that described above, without departing from the example embodiments of the present disclosure.
FIG. 3 depicts a high-level block diagram of a computing device or processing system specifically programmed to perform the functions described herein. For example, any one or more components or devices illustrated in FIG. 1 or described in connection with the example methods 200 or 500 may be implemented as the processing system 300. As depicted in FIG. 3, the processing system 300 comprises one or more hardware processor elements 302 (e.g., a microprocessor, a central processing unit (CPU) and the like), a memory 304 (e.g., random access memory (RAM), read only memory (ROM), a disk drive, an optical drive, a magnetic drive, and/or a Universal Serial Bus (USB) drive), a module 305 for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types and/or for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least the second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination, and various input/output devices 306, e.g., a camera, a video camera, storage devices, including but not limited to, a tape drive, a floppy drive, a hard disk drive or a compact disk drive, a receiver, a transmitter, a speaker, a display, a speech synthesizer, an output port, and a user input device (such as a keyboard, a keypad, a mouse, and the like).
Although only one processor element is shown, it should be noted that the computing device may employ a plurality of processor elements. Furthermore, although only one computing device is shown in the Figure, if the method(s) as discussed herein is/are implemented in a distributed or parallel manner for a particular illustrative example, i.e., the steps of the method(s) or the entire method(s) are implemented across multiple or parallel computing devices, e.g., a processing system, then the computing device of this Figure is intended to represent each of those multiple computers. Furthermore, one or more hardware processors can be utilized in supporting a virtualized or shared computing environment. The virtualized computing environment may support one or more virtual machines representing computers, servers, or other computing devices. In such virtualized virtual machines, hardware components such as hardware processors and computer-readable storage devices may be virtualized or logically represented. The hardware processor 302 can also be configured or programmed to cause other devices to perform one or more operations as discussed herein. In other words, the hardware processor 302 may serve the function of a central controller directing other devices to perform the one or more operations as discussed above (and/or below).
It should be noted that the present disclosure can be implemented in software and/or in a combination of software and hardware, e.g., using application specific integrated circuits (ASIC), a programmable logic array (PLA), including a field-programmable gate array (FPGA), or a state machine deployed on a hardware device, a computing device, or any other hardware equivalents, e.g., computer readable instructions pertaining to the method(s) discussed herein can be used to configure a hardware processor to perform the steps, functions and/or operations of the herein described method(s). In one example, instructions and data for the present module or process 305 for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types and/or for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least the second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination (e.g., a software program comprising computer-executable instructions) can be loaded into memory 304 and executed by hardware processor element 302 to implement the steps, functions or operations as discussed above in connection with the example method 200 and/or as discussed below in connection with the example method 500. Furthermore, when a hardware processor executes instructions to perform “operations,” this could include the hardware processor performing the operations directly and/or facilitating, directing, or cooperating with another hardware device or component (e.g., a co-processor and the like) to perform the operations.
The processor executing the computer readable or software instructions relating to the described method(s) can be perceived as a programmed processor or a specialized processor. As such, the present module 305 for generating a data schema for a type of data pipeline component and storing an ontology and the data schema for the type of data pipeline component in a catalog of data pipeline component types and/or for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least the second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination (including associated data structures) of the present disclosure can be stored on a tangible or physical (broadly non-transitory) computer-readable storage device or medium, e.g., volatile memory, non-volatile memory, ROM memory, RAM memory, magnetic or optical drive, device or diskette and the like. Furthermore, a “tangible” computer-readable storage device or medium comprises a physical device, a hardware device, or a device that is discernible by the touch. More specifically, the computer-readable storage device may comprise any physical devices that provide the ability to store information such as data and/or instructions to be accessed by a processor or a computing device such as a computer or an application server.
To further aid in understanding the present disclosure, FIG. 4 illustrates a set 400 of example scenarios of establishing a new plan for configuring data pipeline infrastructure via data blending in accordance with multiple requests for overlapping data. To illustrate, scenario 1 shows a first example relating to two requests having associated information models (information model 1 and information model 2, respectively). For instance, information model 1 may comprise a specification for an example data pipeline configuration for delivery of data from data sources A and E to target D. The data pipeline configuration according to information model 1 further includes intermediate nodes B and C, as illustrated. Similarly, information model 2 may comprise a specification for an example data pipeline configuration for delivery of data from data sources E and F to target I. The data pipeline configuration according to information model 2 further includes intermediate nodes G and H, as illustrated. In the present example, a data pipeline controller (e.g., via a data blending module (such as the data blending module 119 of FIG. 1)) may establish a new plan which may comprise a combination of information model 1 and information model 2, along with at least one modification.
For instance, in the example scenario 1 of FIG. 4, the new plan may include a configuration of node J to include the functions of node G, along with a new intermediate storage function (that is not specified in the information models 1 and 2). For instance, node J may store the data from source E for distribution to nodes B and H for respective data pipelines associated with the first and second requests. For example, the first request may be for streaming data from source A and from source E, while the time period for obtaining data from source E ends before the time period for obtaining data from source A. In addition, the second request may be for streaming data from sources E and F, where the time period for obtaining data from source E ends before the time period for obtaining data from source F. However, the data source E may be bandwidth-limited or may be in a geographic region or network zone that has a large variance in utilization at different times of day (e.g., daytime and overnight). In such case, the data pipeline controller (e.g., via the data blending module) may determine that it is more efficient (e.g., any one or more of: faster, less latency, less expensive bandwidth utilization, etc.) to copy the data from node E to an intermediate node J for temporary storage, and then to distribute to nodes B and H (e.g., at respective times when the last data from sources A and/or F are available, when the network utilization between nodes J, B, C, D and/or nodes J, H, and I is least costly, etc.).
In the next example, scenario 2, a first request may be associated with information model 1, which may comprise a specification for an example data pipeline configuration for delivery of data from data source A to target D. The first request may specify a first data set (data set 1) comprising data from source A between a time T1 and a time T2. Similarly, a second request may be associated with information model 2, which may comprise a specification for an example data pipeline configuration for delivery of data from data source A to target E. However, the second request may specify a second data set (data set 2) comprising data from source A between a time T3 and a time T4. As can be seen in FIG. 4, these time periods overlap between T3 and T2. Thus, the requests are for at least a portion of the same data. In this case, both requests may be received by a data pipeline controller, which may determine that data blending is permitted, and which may then determine a new plan for configuring data pipeline components of a data pipeline infrastructure to fulfill both requests. For instance, in this case, the data pipeline controller (e.g., via a data blending module) may determine a new plan as illustrated in FIG. 4 in which source A, and nodes B and C are shared by both requests (e.g., these components will be considered part of two data pipelines or a shared data pipeline that may be established in accordance with the new plan). From node C, the respectively requested data may be distributed to the targets D and E.
For example, since there is a temporal overlap, the new plan may comprise a specification that calls for storage to be added at node C (if neither of information models 1 and 2 specifies storage at node C) or for additional storage to be added at node C (e.g., if information model 1 specifies storage at node C, but the designated amount of storage may be insufficient for a longer duration and/or quantity of data storage to additionally fulfill the second request for data set 2).
In the third example, scenario 3, a first request may be associated with information model 1, which may comprise a specification for an example data pipeline configuration for delivery of a data set (data set 1) from data source A to target D. A second request may be associated with information model 2, which may comprise a specification for an example data pipeline configuration for delivery of a data set (data set 2) from data source A to target G. In this example, the data pipeline controller (e.g., via a data blending module) may determine that data sets 1 and 2 partially overlap. For instance, both data sets include a same subset of data (data subset 2). For illustrative purposes, non-overlapping subsets (data subsets 1 and 3) that are exclusive to data set 1 and data set 2, respectively, are also labeled in FIG. 4. Continuing with the present example, it may be determined that A-E-F may be a most efficient path (e.g., least or lower cost, faster/reduced latency, higher bandwidth availability, etc.) and that nodes E and F may provide the same or similar functions as nodes B and C. Initially, it may be considered that all of data sets 1 and 2 should be routed via A-E-F, with final distribution from node F to targets D and G, respectively. However, as noted above, a data pipeline controller (e.g., via a data blending module) may ensure that shared plans via data blending do not violate operator and/or client policies. In this case, a client associated with the first request and the first information model may have a policy that specifies a geographic restriction on processing of certain sensitive data, which may include data subset 1, whereas data subset 2 may not have a similar protection/restriction. In this case, a new plan may be selected which efficiently routes data subsets 2 and 3 via A-E-F to fulfill all of the second request, and to partially fulfill the first request.
The remaining part of the requested data set 1 for the first request (i.e., data subset 1) may be routed via A-B-C, and from C to target D for final delivery. For instance, nodes B and C may be established in allowable geographic locations, in allowable vendor infrastructure (e.g., in specified cloud provider infrastructure(s)), and so forth, in accordance with the policy of the first client associated with the first request. Notably, the overall cost, latency, bandwidth utilization, or other factors may be improved by this split routing as compared to a plan comprising information model 1 and information model 2 without coordination and without such modification as indicated in the example scenario 3.
FIG. 5 illustrates a flowchart of an example method 500 for configuring data pipeline components for delivering a first data set to at least a first destination and for delivering a second data set to at least the second destination in accordance with a plan comprising a combination of a first information model associated with a first request and a second information model associated with a second request and including at least one modification to the combination, in accordance with the present disclosure. In one example, the method 500 is performed by a component of the system 100 of FIG. 1, such as by the data pipeline controller 110, and/or any one or more components thereof, such as a data blending module 119 (e.g., a processor, or processors, performing operations stored in and loaded from a memory). In one example, the steps, functions, or operations of method 500 may be performed by a computing device or system 300, and/or processor 302 as described in connection with FIG. 3. For instance, the computing device or system 300 may represent any one or more components of a data pipeline controller that is/are configured to perform the steps, functions and/or operations of the method 500. Similarly, in one example, the steps, functions, or operations of method 500 may be performed by a processing system comprising one or more computing devices collectively configured to perform various steps, functions, and/or operations of the method 500. For instance, multiple instances of the computing device or processing system 300 may collectively function as a processing system. For illustrative purposes, the method 500 is described in greater detail below in connection with an example performed by a processing system. The method 500 begins in step 505 and proceeds to step 510.
At step 510, the processing system obtains a first request for a delivery of a first data set to at least a first destination. For instance, the processing system may comprise a data pipeline controller and may obtain the first request via a request interpreter and fulfillment module (such as request interpreter and fulfillment module 111 of data pipeline controller 110 of FIG. 1). Step 510 may comprise the same or similar operations as described above in connection with optional step 260 of the example method 200.
At step 520, the processing system maps the first request to a first information model of a plurality of information models. For instance, the first information model may comprise first metadata relating to at least one of a name, a region, a task type, a technology, and so forth. Similarly, the first request may comprise second metadata relating to at least one of: the name, the region, the task type, the technology, etc. As such, the mapping may comprise mapping the request to the first information model based upon a congruence between the first metadata and the second metadata. In one example, step 520 may comprise the same or similar operations as described above in connection with optional step 265 of the example method 200.
At step 530, the processing system obtains a second request for a delivery of a second data set to at least a second destination. For instance, step 530 may comprise the same or similar operations as described above in connection with step 510 (and/or in connection with optional step 260 of the example method 200), but with respect to the second request, where the second request has different parameters from the first request.
At step 540, the processing system maps the second request to a second information model of the plurality of information models. For instance, step 540 may comprise the same or similar operations as described above in connection with step 520 (and/or in connection with optional step 265 of the example method 200), but with respect to the second request, where the second request has different parameters from the first request.
At step 550, the processing system identifies that at least a portion of data is a part of both the first data set and the second data set. For example, the respective parameters of the first request and the second request may specify the data sets being requested, such as by including one or more of: an identifier or identifiers of the data source(s), the data set name/label, fields within the respective data sets, date/time range(s) being requested, and so forth. In one example, the identification of the at least the portion of the data being a part of both the first data set and the second data set may additionally be discerned by the processing system from one or both of the respective information models. For instance, the first request may indicate desired data of the first data set by indicating a database name. On the other hand, the second request may indicate desired data of the second data set by indicating a type of data being requested. However, the second information model may include a specification from which the database containing the type of data being requested may be identified (and which may be the same database as that which is identified by name in the first request). Thus, the processing system may determine that at least a portion of the data in the database is being requested as part of both the first data set and the second data set.
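For the temporal-overlap case of the identification at step 550 (such as scenario 2 of FIG. 4), the check can be sketched as an interval intersection; the numeric time values below are hypothetical placeholders for the times T1 through T4:

```python
def shared_interval(range1, range2):
    """Return the overlapping (start, end) portion of two requested time
    ranges, or None when the requested data sets do not intersect in
    time (i.e., no portion of data is part of both data sets)."""
    start = max(range1[0], range2[0])
    end = min(range1[1], range2[1])
    return (start, end) if start < end else None

# First request: data from source A between T1=0 and T2=10.
# Second request: data from source A between T3=6 and T4=14.
print(shared_interval((0, 10), (6, 14)))  # (6, 10): a blending candidate
print(shared_interval((0, 5), (6, 14)))   # None: no shared portion
```

Comparable checks on source identifiers, field lists, or database names (as discerned from the information models) could be combined with this interval test to confirm that the same underlying data is being requested.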
At step 560, the processing system determines a plan for configuring data pipeline components for delivering the first data set to the at least the first destination and for delivering the second data set to the at least the second destination, where the plan comprises a combination of the first information model and the second information model, and where the plan comprises at least one modification to the combination of the first information model and the second information model. In one example, step 560 is performed in response to a determination that a data blending is permitted.
In one example, the at least one modification may include an omission of at least one data pipeline component that is present in at least one of the first information model or the second information model. For example, the processing system may determine that at least the portion of the data that is being requested according to both the first request and the second request may be stored at an intermediate node (e.g., a data pipeline component) such that the at least the portion of the data does not need to be again obtained from the source. Thus, the source and/or one or more other intermediate nodes/data pipeline components between the source and the intermediate node where the at least the portion of the data is stored may be omitted. In one example, the at least one modification may include an addition of at least one data pipeline component that is not present in the first information model and the second information model. For instance, an intermediate storage node may be added to store the at least the portion of the data, where the intermediate storage node would not be utilized according to the first information model and according to the second information model. This could be the case where the first data set and the second data set are partially overlapping in time (and where the overlap comprises the at least the portion of the data).
In one example, the at least one modification may include an alteration to at least one setting for at least one data pipeline component that is present in at least one of the first information model or the second information model. For instance, in one example, the alteration to the at least one setting may comprise changing a storage duration of at least the portion of the data at the at least one data pipeline component. In one example, the changing of the storage duration may comprise changing from a “no storage” setting to a “storage” setting (e.g., with a duration of the storage being specified). Alternatively, or in addition, the alteration to the at least one setting may include changing a location criterion for the at least one data pipeline component. For instance, an intermediate node of a first data pipeline in accordance with the first information model may be specified to be located as close as possible to one or more preceding nodes, or to one or more following nodes, specified to be located intermediate between one or more preceding nodes and one or more following nodes, etc. However, the at least the portion of the data may be obtained from a node other than the source (or one or more sources), which may alter the ideal or preferred location(s) for the at least one data pipeline component (e.g., as would otherwise be specified according to the first information model or the second information model).
In one example, the at least one modification to the combination of the first information model and the second information model is selected for the plan based upon a determination of a reduction in an overall number of data pipeline components (and/or intermediate processing steps) according to the plan as compared to the combination of the first information model and the second information model without the modification. In one example, the at least one modification is selected for the plan based upon a determination of a reduction in a network bandwidth utilization according to the plan as compared to the combination of the first information model and the second information model without the modification. In one example, the at least one modification is selected for the plan based upon a determination of a reduction in a latency of a delivery of at least one of the first data set or the second data set for the plan as compared to the combination of the first information model and the second information model without the modification. In one example, the at least one modification is selected for the plan based upon a determination of a reduction in a cost of a delivery of at least one of the first data set or the second data set for the plan as compared to the combination of the first information model and the second information model without the modification. It should be noted that other factors above can also result in reduced costs.
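One simplified, illustrative way to render this selection (the metric names, values, and the rule that a modification must not worsen any metric are assumptions for illustration, not a required embodiment) is to compare each candidate modification's plan metrics against the unmodified combination:

```python
def select_modifications(baseline, candidates):
    """Keep only candidate modifications that improve upon the
    unmodified combination of the two information models, judged on
    component count, bandwidth, latency, and cost (lower is better)."""
    metrics = ("components", "bandwidth", "latency", "cost")

    def improves(plan):
        # No metric may get worse, and at least one must get better.
        return (all(plan[m] <= baseline[m] for m in metrics)
                and any(plan[m] < baseline[m] for m in metrics))

    return [c for c in candidates if improves(c["metrics"])]

# Hypothetical metrics for the unmodified combination and two candidates.
baseline = {"components": 8, "bandwidth": 100, "latency": 40, "cost": 50}
candidates = [
    {"name": "share node C",
     "metrics": {"components": 6, "bandwidth": 80, "latency": 40, "cost": 45}},
    {"name": "reroute via J",
     "metrics": {"components": 9, "bandwidth": 70, "latency": 55, "cost": 60}},
]
print([c["name"] for c in select_modifications(baseline, candidates)])
# ['share node C']
```

A deployed controller could instead trade metrics off against one another (e.g., accept more components for lower latency) via a weighted score.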
In one example, the processing system may create the plan by initially attempting to reduce the number of data pipeline components by consolidating separate data pipeline components from two pipelines into a single function (a single data pipeline component that is shared by the two pipelines). The processing system may then select a location (or provide location selection criteria) for the shared data pipeline component. For instance, the location selection criteria can specify using an existing location, if one of the two data pipelines is already established and contains an instance of the data pipeline component, and otherwise placing the shared data pipeline component in a location that minimizes latency or maximizes throughput based upon location criteria of both data pipelines (e.g., locations of source(s) and/or target(s), and/or locations of preceding and/or following nodes according to the first information model and second information model (upon which the plan is initially based)).
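The consolidation step above can be sketched as follows. The function name, the string component labels, and the "existing"/"new" placement markers are hypothetical illustrations for this sketch, not terminology from the disclosure; the sketch shows only the core idea of identifying components common to two pipelines and reusing an already-deployed instance where one exists.

```python
# Illustrative sketch: consolidate components that appear in both pipelines
# into a single shared instance, preferring an existing deployment when one
# pipeline is already established.
def consolidate(pipeline_a, pipeline_b, deployed):
    """pipeline_a / pipeline_b: lists of component type names in pipeline order.
    deployed: set of component types already instantiated in the environment.
    Returns the shared component types and a placement decision for each."""
    shared = [c for c in pipeline_a if c in pipeline_b]
    placements = {}
    for c in shared:
        # Reuse an existing instance if available; otherwise place a new shared
        # instance according to the location criteria of both pipelines
        # (location selection itself is not shown in this sketch).
        placements[c] = "existing" if c in deployed else "new"
    return shared, placements

shared, placements = consolidate(
    ["ingest", "filter", "format"],
    ["ingest", "filter", "aggregate"],
    deployed={"filter"},
)
# shared -> ["ingest", "filter"]
# placements -> {"ingest": "new", "filter": "existing"}
```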
In one example, the processing system may select to add storage when the at least the portion of the data is identified as arising from a partial overlap in time between the first data set and the second data set. In addition, the processing system may further select between: (1) adding storage to an existing node, (2) adding storage to a node specified in the first data pipeline according to the first information model or to a node specified in the second data pipeline according to the second information model, or (3) adding a new node that is not specified in either of the first information model or the second information model. The selection may be based upon a calculated overall efficiency in data delivery (e.g., reduced latency) or based upon a calculated reduction in latency balanced with a cost of deployment of the new node. It should be noted that the cost can be monetary, or the cost can be additional resource utilization (e.g., processor, memory, available storage in the data pipeline environment or in a portion/region of the data pipeline environment, a number of available nodes of a particular type in the data pipeline environment or in a portion/region of the data pipeline environment, additional network bandwidth incurred to store at the new node instead of stream directly to target(s), etc.).
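The overlap test and the selection among the three storage options can be sketched as below. The function names and the numeric latency/cost figures are assumptions made for illustration; the sketch shows a time-interval overlap check and a selection that balances a calculated latency reduction against a deployment cost, as described above.

```python
# Hypothetical sketch: detect a partial time overlap between two requested
# data sets, then score the three storage-placement options by latency
# reduction balanced against deployment cost.
def time_overlap(start_a, end_a, start_b, end_b):
    """Return the overlapping interval, or None if the sets do not overlap."""
    start, end = max(start_a, start_b), min(end_a, end_b)
    return (start, end) if start < end else None

def best_storage_option(options):
    """options: list of (name, latency_reduction, deployment_cost) tuples.
    Select the option with the greatest net benefit."""
    return max(options, key=lambda o: o[1] - o[2])[0]

overlap = time_overlap(0, 10, 5, 15)   # partial overlap -> (5, 10)
choice = best_storage_option([
    ("existing_node", 3.0, 0.5),   # option (1): existing node
    ("pipeline_node", 2.5, 0.2),   # option (2): node from an information model
    ("new_node",      4.0, 2.0),   # option (3): newly deployed node
])
```

The cost term here could equally represent monetary cost or additional resource utilization (processor, memory, bandwidth, etc.), per the note above.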
In one example, the at least one modification is selected in accordance with an operator policy of an operator of the data pipeline environment. In one example, the operator policy balances a reduction in an overall number of data pipeline components with a reduction in a latency of a delivery of at least one of the first data set or the second data set. In one example, the at least one modification may be selected to balance reduced bandwidth with reduced latency. In one example, the selection of the at least one modification may involve balancing multiple factors. In addition, in one example, the balancing may include a weighting to favor and/or disfavor certain factors in contributing to a solution.
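The weighted balancing of factors under an operator policy can be sketched as a simple weighted score over candidate modifications. The factor names, candidate names, and weight values below are hypothetical illustrations, not values from the disclosure.

```python
# Illustrative weighted scoring of candidate modifications under an operator
# policy; higher weights favor a factor, lower (or zero) weights disfavor it.
def score(modification, weights):
    """modification: dict of factor -> estimated reduction (higher is better).
    weights: dict of factor -> operator-policy weight for that factor."""
    return sum(weights.get(f, 0.0) * v for f, v in modification.items())

# Example policy balancing component-count reduction, latency, and bandwidth.
policy_weights = {"component_count": 1.0, "latency": 2.0, "bandwidth": 1.5}

candidates = {
    "consolidate_formatter": {"component_count": 2, "latency": 0.1, "bandwidth": 0.0},
    "add_shared_storage":    {"component_count": 0, "latency": 1.5, "bandwidth": 0.8},
}
best = max(candidates, key=lambda m: score(candidates[m], policy_weights))
```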
At optional step 570, the processing system may verify that the at least one modification does not violate a client policy of a client associated with the first request or the second request. In one example, the client policy is contained in one of the first request or the second request. Alternatively, or in addition, the client policy may be maintained by the processing system on behalf of the client. The client policy may specify a restriction on at least one of: a location of at least one data pipeline component, a sharing of the at least one data pipeline component (e.g., among different data pipelines), or an access of other clients to at least a portion of the first data set or the second data set. It should be noted that in one example, optional step 570 may include verifying that the modification does not violate policies of both (or several) clients (e.g., if different clients are associated with the respective requests).
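The policy verification of optional step 570 can be sketched as below. The policy keys and function names are hypothetical; the sketch checks a proposed modification against the three restriction categories noted above (location, sharing, and third-party data access) for every client associated with the requests.

```python
# Hypothetical client-policy check: does a proposed modification violate a
# client's restrictions on sharing, location, or third-party data access?
def violates(policy, modification):
    if modification.get("shared_component") and not policy.get("allow_sharing", True):
        return True
    allowed = policy.get("allowed_locations")
    if allowed and modification.get("location") not in allowed:
        return True
    if modification.get("exposes_data") and not policy.get("allow_third_party_access", False):
        return True
    return False

def verify_all(policies, modification):
    """Verify against the policies of all clients associated with the requests."""
    return all(not violates(p, modification) for p in policies)

ok = verify_all(
    [{"allow_sharing": True, "allowed_locations": {"east", "central"}}],
    {"shared_component": True, "location": "east", "exposes_data": False},
)
```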
At step 580, the processing system configures the data pipeline components for delivering the first data set to the at least the first destination and for delivering the second data set to the at least the second destination in accordance with the plan. In one example, step 580 may be performed via a data pipeline management and assembly (DPMA) module (such as DPMA module 117 of FIG. 1). For example, step 580 may include various operations as described in connection with any one or more of optional steps 270-290 of the example method 200. For instance, as noted above, the plan may be processed similar to an information model, and may comprise a specification for selecting and configuring data pipeline components. In this case, the plan may be for configuring two data pipelines (which may alternatively or additionally be considered to be a single, shared data pipeline). However, the plan may still include hooks to the plurality of data schemas for a plurality of data pipeline components, specific configuration parameters, and so forth, similar to the information models upon which the plan is based.
It should be noted that in some cases, the plan may have more specificity (and hence less flexibility) in selecting and configuring data pipeline components as compared to the information models. For instance, the plan may be generated by the processing system via a data blending module, provided to a request interpreter and fulfillment module, and forwarded to a data pipeline management and assembly (DPMA) module for execution. For example, as described above in connection with the example of FIG. 1, in order to select a specific instance of a data pipeline component to include in a data pipeline, DPMA module 117 may evaluate criteria contained in the information model(s)/specifications in view of the current topology of the data pipeline infrastructure 120, the availability of data pipeline components 127, the operator policy, the client policies, etc. However, as also noted above, in one example, the data blending module 119 may refer to the data pipeline component discovery module 118, in which case, the data blending module 119 may include specific data pipeline components in the plan. This may be in contrast to an information model, which may more generally specify a type of data pipeline component, location criteria, and settings/configurations to apply, where the DPMA 117 may select the actual instance of the data pipeline component based upon these factors.
Following step 580, the method 500 proceeds to step 595 where the method ends.
It should be noted that the method 500 may be expanded to include additional steps, or may be modified to replace steps with different steps, to combine steps, to omit steps, to perform steps in a different order, and so forth. For instance, in one example the processing system may repeat one or more steps of the method 500, such as steps 510-580 for additional requests for delivery of data, and so forth. In one example, the method 500 may be expanded to include processing additional requests for at least the portion of the data, e.g., a third request, a fourth request, etc., any or all of which may be requesting the same portion of data. In one example, the method 500 may include additional operations in connection with determining that a request is not requesting the same portion of data. For instance, the method 500 may include operations to fulfill the request via an associated information model, without coordination and modification via a shared plan for other requests. In another example, the method 500 may be expanded to include determining that a data blending is permitted. For instance, prior to step 550 and/or step 560, the processing system may determine that a “blending by default” applies to both requests. In one example, the processing system may determine that data blending is permitted for both requests according to respective client policies that may be stored by the processing system, and/or according to a policy of an operator of the data pipeline environment. Alternatively, or in addition, the processing system may determine that data blending is authorized via parameters contained in either or both of the first request and the second request. For instance, the first client may have a policy that is “no data blending by default.” However, the client policy may also allow that this default may be overridden by a specific authorization contained in the first request.
In still another example, operations of the method 500 may be expanded to include (and/or combined with) operations described above in connection with the example method 200 of FIG. 2. Thus, these and other modifications are all contemplated within the scope of the present disclosure.
In addition, although not expressly specified above, one or more steps of the method 500 may include a storing, displaying and/or outputting step as required for a particular application. In other words, any data, records, fields, and/or intermediate results discussed in the method can be stored, displayed and/or outputted to another device as required for a particular application. Furthermore, operations, steps, or blocks in FIG. 5 that recite a determining operation or involve a decision do not necessarily require that both branches of the determining operation be practiced. In other words, one of the branches of the determining operation can be deemed as an optional step. However, the use of the term “optional step” is intended to only reflect different variations of a particular illustrative embodiment and is not intended to indicate that steps not labelled as optional steps are to be deemed essential steps. Furthermore, operations, steps or blocks of the above described method(s) can be combined, separated, and/or performed in a different order from that described above, without departing from the example embodiments of the present disclosure.
While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described example embodiments, but should be defined only in accordance with the following claims and their equivalents.