US10805241B2

Publication number: US10805241B2
Application number: US15/408,206
Authority: US
Inventors: Yu Dong; Qingqing Zhou; Guogen Zhang
Original assignee: FutureWei Technologies Inc
Current assignee: FutureWei Technologies Inc
Priority date: 2017-01-17
Filing date: 2017-01-17
Publication date: 2020-10-13
Also published as: CN110169019B; EP3560148A1; EP3560148A4; US20180205672A1; WO2018133781A1; CN110169019A; EP3560148B1

Abstract

A computer-implemented method and system are provided, including executing an application programming interface (API) in a network switch to define at least one of one or more database functions, performing, using one or more processors, the one or more database functions on at least a portion of data contained in a data message received at the switch, to generate result data, and routing the result data to one or more destination nodes. A database function-defined network switch includes a network switch and one or more processors to perform a pre-defined database function on query data contained in data messages received at the switch, to produce result data, wherein the pre-defined database function is performed on the query data in a first mode of operation to a state of full completion, generating complete result data and no skipped query data, or to a state of partial completion, generating partially completed result data and skipped query data.

Description

FIELD OF THE INVENTION

The present disclosure is related to distributed databases, and in particular to network switches and related methods used to route data between nodes of a distributed database system.

RELATED APPLICATIONS

This application is related to U.S. patent application Ser. No. 15/408,130, filed Jan. 17, 2017, and entitled “Best-Efforts Database Functions.”

BACKGROUND

A modern distributed database, for example a massively parallel processing (MPP) database, may deploy hundreds or thousands of data nodes (DNs). Data nodes in a distributed database are interconnected by a network that includes network interface cards (NICs) on each node, network switches connecting nodes and other switches, and routers connecting the network with other networks, e.g., the Internet. Data nodes often need to exchange data messages to carry out database operations (e.g., join, aggregation, and hash) when processing a query received by the database system. These data messages can be, for example, table row data, certain column data, intermediate aggregation results of grouping, maximum or minimum of a subset of certain table rows, or an intermediate result of a hash join.

The data messages are routed by the switches in the network to be delivered to the destination data nodes. A data node may send a data message to some or all of the other data nodes in the network to fulfill an operation of a query. Since a conventional network switch is not aware of the contents of data messages it forwards, it may forward duplicated or unnecessary data messages, which results in the waste of tight and highly demanded network bandwidth and computation capacity on the destination data nodes.

SUMMARY

A computer-implemented method performed by a network switch, comprises executing an application programming interface (API) in the network switch to define at least one of one or more database functions, and performing, using one or more processors, the one or more database functions on at least a portion of data contained in a data message received at the switch, to generate result data, and routing the result data to one or more destination nodes.

A network switch comprises a non-transitory memory storage comprising instructions, and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to execute an application programming interface (API) to define one or more database functions, perform the one or more database functions on data carried in data messages arriving at a network node, with the performing producing processed result data, and perform one or more network switch functions to route the processed result data, and/or the data carried in the data messages, to one or more destination nodes.

A database system comprises a database server configured to process a database query requiring data to be retrieved from one or more data storage sources, the retrieved data being carried in data messages; a plurality of network nodes connecting the one or more data storage sources and the database server, at least one of the network nodes comprising: a database functions handling logic unit performing a pre-defined database function on data carried in data messages arriving at a network node, with the performing producing processed result data; a network switch logic unit coupled to the database functions handling logic and performing one or more network switch functions to route the processed result data, and/or the data carried in the data messages, to one or more destination nodes; and an application programming interface (API) in communication with the network switch logic unit, with the API executing in the switch and defining the one or more functions.

A method comprises processing a database query requiring data to be retrieved from one or more data storage sources, the retrieved data being carried in data messages; performing a pre-defined database function on data carried in data messages arriving at a network node, with the performing producing processed result data; performing one or more network switch functions to route the processed result data, and/or the data carried in the data messages, to one or more destination nodes; and defining one or more of the database functions using an application programming interface (API).

Various examples are now described to introduce a selection of concepts in a simplified form that are further described below in the detailed description. The Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

In example 1, there is provided a computer-implemented method performed by a network switch, comprising executing an application programming interface (API) in the network switch to define at least one of one or more database functions; and performing, using one or more processors, the one or more database functions on at least a portion of data contained in a data message received at the switch, to generate result data, and routing the result data to one or more destination nodes.

In example 2, there is provided a method according to example 1 including storing, in a storage device, at least one database function rule used to perform the database function.

In example 3, there is provided a method according to examples 1 or 2 wherein the routing is performed by a network switch logic unit that performs at least one of routing, classification, or flow control functions.

In example 4, there is provided a method according to examples 1-3 further comprising including the result data in one or more data messages that are queued for forwarding to the one or more destination nodes.

In example 5, there is provided a method according to examples 1-4, wherein a destination node of the one or more destination nodes comprises a destination database node or a network switch node.

In example 6, there is provided a method according to examples 1-5 further wherein the database function is selected from an aggregation function, a caching function, a hashing function, a union/merge function or an ordering/ranking function.

In example 7, there is provided a network switch comprising a non-transitory memory storage comprising instructions, and one or more processors in communication with the memory, wherein the one or more processors execute the instructions to: execute an application programming interface (API) to define one or more database functions; perform the one or more database functions on data carried in data messages arriving at a network node, with the performing producing processed result data; and perform one or more network switch functions to route the processed result data, and/or the data carried in the data messages, to one or more destination nodes.

In example 8, there is provided a network switch according to example 7 further comprising a data storage configured to store at least one database function rule to perform the database function.

In example 9, there is provided a network switch according to examples 7 or 8 wherein performing the one or more network switch functions performs routing, classification, or flow control functions.

In example 10, there is provided a network switch according to examples 7-9 wherein the processed result data is included in one or more data messages that are queued for forwarding to the one or more destination nodes.

In example 11, there is provided a network switch according to examples 7-10 wherein a destination node of the one or more destination nodes comprises a destination database node or a network switch node.

In example 12, there is provided a network switch according to examples 7-11 further wherein the database function is selected from an aggregation function, a caching function, a union/merge function, or an ordering/ranking function.

In example 13, there is provided a database system comprising a database server configured to process a database query requiring data to be retrieved from one or more data storage sources, the retrieved data being carried in data messages; a plurality of network nodes connecting the one or more data storage sources and the database server, at least one of the network nodes comprising: a database functions handling logic unit performing a pre-defined database function on data carried in data messages arriving at a network node, with the performing producing processed result data; a network switch logic unit coupled to the database functions handling logic and performing one or more network switch functions to route the processed result data, and/or the data carried in the data messages, to one or more destination nodes; and an application programming interface (API) in communication with the network switch logic unit, with the API executing in the switch and defining the one or more functions.

In example 14, there is provided a database system according to example 13 further comprising a data storage configured to store at least one database function rule to perform the database function.

In example 15, there is provided a database system according to examples 13 or 14 wherein the network switch logic unit performs one or more of routing, classification, or flow control functions.

In example 16, there is provided a database system according to examples 13-15 further comprising the processed result data included in one or more data messages that are queued for forwarding to the one or more destination nodes.

In example 17, there is provided a database system according to examples 13-16 wherein the destination nodes are selected from a database server or a network switch node.

In example 18, there is provided a method comprising processing a database query requiring data to be retrieved from one or more data storage sources, the retrieved data being carried in data messages; performing a pre-defined database function on data carried in data messages arriving at a network node, with the performing producing processed result data; performing one or more network switch functions to route the processed result data, and/or the data carried in the data messages, to one or more destination nodes; and defining one or more of the database functions using an application programming interface (API).

In example 19, there is provided a method according to example 18 further comprising a repository of database function rules used to define the pre-defined database function.

In example 20, there is provided a method according to examples 18 or 19 further comprising adding to the data messages at least one instruction that specifies the database function performed on the data carried in the data messages.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a distributed database system according to an example embodiment.

FIG. 2 is a dataflow diagram of a distributed database system according to an example embodiment.

FIG. 3 is a database function-defined (DFD) network switch according to an example embodiment.

FIG. 4 is a flow chart of a process according to an example embodiment.

FIG. 5 is a dataflow diagram of a distributed database system according to an example embodiment.

FIG. 6 is a flow chart of a process according to an example embodiment.

FIG. 7 is a massively parallel processing (MPP) distributed database system according to an example embodiment.

FIG. 8 is a flow chart of a process according to an example embodiment.

FIG. 9 is a distributed database system according to an example embodiment.

FIG. 10 is a flow chart of a process according to an example embodiment.

FIG. 11 is a flow chart of a process according to an example embodiment.

FIG. 12 is a flow chart of a process according to an example embodiment.

FIG. 13 is a flow chart of a process according to an example embodiment.

FIG. 14 is a flow chart of a process according to an example embodiment.

FIG. 15 is a data flow diagram and process according to an example embodiment.

FIG. 16 is a data flow diagram and process according to an example embodiment.

FIG. 17 is a block diagram illustrating circuitry for clients, servers, cloud based resources for implementing algorithms and performing methods according to example embodiments.

DETAILED DESCRIPTION

In the following description, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments which may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical and electrical changes may be made without departing from the scope of the present invention. The following description of example embodiments is, therefore, not to be taken in a limited sense, and the scope of the present invention is defined by the appended claims.

The functions or algorithms described herein may be implemented in software in one embodiment. The software may consist of computer executable instructions stored on computer readable media or computer readable storage device such as one or more non-transitory memories or other type of hardware based storage devices, either local or networked. Further, such functions correspond to modules, which may be software, hardware, firmware or any combination thereof. Multiple functions may be performed in one or more modules as desired, and the embodiments described are merely examples. The software may be executed on a digital signal processor, ASIC, microprocessor, or other type of processor operating on a computer system, such as a personal computer, server or other computer system, turning such computer system into a specifically programmed machine.

Distributed Database with Database Function Defined (DFD) Network Switch

Referring to FIG. 1, there is illustrated an example embodiment of a distributed database system 100. According to one embodiment, the distributed database is optionally a massively parallel processing (MPP) database. As illustrated in FIG. 1, database system 100 includes a master host 102 that hosts a master database, where the user data is distributed across data segments hosted on a plurality of segment hosts 104, 106, and 108 that maintain respective segment databases. The master host 102 includes a separate physical server with its own operating system (OS), processor, storage, and random access and/or read-only memory. In one example embodiment, there is no user data stored in the master host 102, but the master host 102 stores metadata about database segments in segment hosts 104, 106, and 108 of the database. Segment hosts 104, 106, and 108 each also include physical servers with their own OS, processor, storage and memory. As used herein, the term “processor” shall include software-programmable computing devices such as programmable central processing units (as for example shown in FIG. 17), hardware circuits that are not programmable such as ASICs, and/or devices such as FPGAs that are configurable circuits.

Master host 102 and segment hosts 104, 106, and 108 communicate through a network interface, such as a network interface card, to one or more database function-defined (DFD) network switches 110. According to one example embodiment, a DFD network switch 110 includes components that perform database functions, described below with respect to FIG. 3, and components to perform network switching functions. According to one embodiment, the network switching functions are performed by a multiport network bridge that uses hardware addresses to process and forward data at the data link layer of the Open Systems Interconnection (OSI) model. In another example embodiment, the DFD network switch 110 can, in addition or in the alternative, process data at the network layer by additionally incorporating routing functionality that most commonly uses IP addresses to perform packet forwarding.

According to one embodiment, data is distributed across each segment host 104, 106 and 108 to achieve data and processing parallelism. For example, this is achieved by automatically distributing data across the segment databases using hash or round-robin distribution. When a query 112 is issued by a client computer 114, the master host 102 parses the query and builds a query plan. In one example embodiment, the query is filtered on a distribution key column, so that the plan will be sent only to the segment database(s) 104, 106 and 108 containing data applicable for execution of the query.
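By way of illustration only, the hash and round-robin distribution described above, and the routing of a plan to only the relevant segment database, might be sketched in Python as follows. The function names, the use of CRC32 as the hash, and the reuse of reference numerals 104, 106 and 108 as host identifiers are assumptions made for this sketch and are not taken from the disclosure.

```python
import zlib
from collections import defaultdict

# Assumption: the reference numerals of FIG. 1 double as segment host IDs.
SEGMENT_HOSTS = [104, 106, 108]


def hash_segment(key):
    """Pick a segment host from a stable hash of the distribution key."""
    digest = zlib.crc32(str(key).encode("utf-8"))
    return SEGMENT_HOSTS[digest % len(SEGMENT_HOSTS)]


def round_robin_segment(row_index):
    """Pick a segment host by cycling through the hosts in order."""
    return SEGMENT_HOSTS[row_index % len(SEGMENT_HOSTS)]


def distribute(rows, key_column):
    """Place each row on the host chosen by hashing its distribution key."""
    placement = defaultdict(list)
    for row in rows:
        placement[hash_segment(row[key_column])].append(row)
    return dict(placement)


def hosts_for_query(key_value):
    """A query filtered on the distribution key needs only the owning host."""
    return [hash_segment(key_value)]
```

In this sketch, a query filtered on the distribution key resolves to a single host, whereas round-robin placement carries no key information and would require the plan to be sent to every segment host.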

Referring now to FIG. 2, there is illustrated a data flow diagram of a database system 200, wherein a database function or operation may require data exchanges through a DFD network switch 110 among different data nodes 0 to N. According to one embodiment, a data node 0-N can take the form, for example, of a database server such as master host 102, or a data source such as, but not limited to, a data storage system such as segment hosts 104, 106 and 108 of FIG. 1. A distributed database may deploy hundreds or thousands of data nodes 0-N. These data nodes are, for example, interconnected by a plurality of network switches, including but not limited to DFD network switches 110 in this embodiment, connecting nodes and other switches, and routers connecting the network with other networks, for example the Internet. In this example embodiment, data messages 202 originating from a data node 0-N are routed by the switches 110 in the network to be delivered to destination data nodes 0-N. According to one embodiment, data messages 202 are encapsulated in network packets, and contain, for example, table row data, column data, intermediate aggregation results of grouping, maximum or minimum of a subset of certain table rows, or an intermediate result of a hash join, as will be described in more detail below.

In the example of FIG. 2, a DFD network switch 110 operates in a first mode of operation, not using its database function capabilities, to route data messages 202 from nodes 0, 2, 3, 4, 5, N, to data node 1, to fulfill an operation of a database query. For example, data node 1 may require data for a database function, i.e., retrieval of distinct values 204 from all other data nodes. For this function, each of the data nodes sends individual intermediate results of distinct values, contained in data messages 202, through the switch to node 1. In this first mode of operation, the transmission of these distinct values is “transparent” to the switch. In other words, the switch 110 routes the messages in a conventional fashion without performing database functions on or with the data messages 202, as explained in more detail below. All of these data messages 202 are thus forwarded to and received by node 1. In many cases, however, the majority of the data messages 202 from the various data nodes are redundant, i.e., contain values not distinct from values contained in other messages, as illustrated wherein for example the value “7” originates from five different nodes: 0, 2, 4, 5, and N. As a result, network bandwidth and computation capacity of node 1 are wasted. Because, for example, a distributed database system may have hundreds, thousands, or even millions of such database functions being concurrently performed by all the data nodes, wasted or redundant messaging can have a large impact on and result in sub-optimal overall database system performance.

As referred to above, and as illustrated in FIG. 3 and FIG. 4, DFD network switch 110 can carry out database operations as well as perform conventional switching and routing functions. As explained below and illustrated in FIG. 3, DFD network switch 110/300 includes one or more database functions definition rule application programming interfaces (APIs) 302, a database functions rules repository 304, a database functions handling logic unit 306, a network switch core logic unit 308, and switch fabric hardware 310. FIG. 4 illustrates a process 400 showing the operation of the components 302, 304, 306 and 308.

The set of APIs 302 is provided to configure the rules for the switch to handle and process certain types of database functions associated with the data messages 202. According to one embodiment, “configuring” the rules includes creating, defining, modifying, or deleting the rules for database functions. As illustrated in FIG. 4, the APIs 302 allow a distributed database system, such as database system 100 of FIG. 1, to create and maintain (at operation 402) customized and extendable rules that are stored (at operation 404) in database functions rules repository 304. For example, the rules in the rule repository 304 can be dynamically created, modified, or removed via APIs 302. This enables support for different distributed database systems that may have different database functions or operations, as well as different formats of the data messages being exchanged. When defining a rule for a database function, APIs 302 specify (at operation 406) the query data format, output data format, as well as internal processing logic. The database functions that can be defined by rules may include, but are not limited to, the following: aggregation (e.g., distinct, sum, count, min, max, etc.); caching of exchange data (e.g., intermediate results, hash table, etc.); union/merge of results; and ordering/ranking of data, for example.
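As a non-limiting illustration, a rule carrying a query data format, an output data format, and internal processing logic, together with a repository supporting creation, modification, and deletion, might be sketched in Python as follows. The class names `Rule` and `RuleRepository`, the field names, and the string format labels are hypothetical and chosen for this sketch only; the disclosure does not specify a concrete API signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    """Hypothetical rule record: formats plus the processing logic."""
    rule_id: str
    input_format: str   # assumed label for the query data format
    output_format: str  # assumed label for the output data format
    logic: Callable[[List[int]], List[int]]


class RuleRepository:
    """Hypothetical stand-in for rules repository 304."""

    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}

    def create(self, rule: Rule) -> None:
        if rule.rule_id in self._rules:
            raise KeyError(f"rule {rule.rule_id!r} already exists")
        self._rules[rule.rule_id] = rule

    def modify(self, rule: Rule) -> None:
        if rule.rule_id not in self._rules:
            raise KeyError(f"rule {rule.rule_id!r} does not exist")
        self._rules[rule.rule_id] = rule

    def delete(self, rule_id: str) -> None:
        del self._rules[rule_id]

    def lookup(self, rule_id: str) -> Optional[Rule]:
        return self._rules.get(rule_id)
```

A database system could, for example, register an aggregation rule with `repo.create(Rule("sum", "int_list", "int_list", lambda v: [sum(v)]))` and later replace or remove it without changing the switch itself.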

As referred to above, rule repository 304 stores rules for database functions, wherein the rules can be dynamically created, modified, or removed via APIs 302 described above, or otherwise introduced into the repository 304. In one example embodiment, a data message 202 carries a rule identifier or information identifying a rule so that the switch 110, upon receiving (408) network packets encapsulating data messages, is able to locate (410) the rule in its rule repository 304.

Once the switch 110 locates (410) the applicable rule or rules in rule repository 304, the data messages 202 are then processed (411) by the database functions handling logic unit 306 to perform the pre-defined database functions. The execution of function logic unit 306 is carried out (412) by switch fabric hardware 310. After the functions are performed, the resulting data messages 202 are queued (414) for the switch's core logic unit 308 to forward (416) to the destination data nodes or next data nodes such as switches 110.
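The receive, locate, process, queue, and forward operations described above might be sketched in Python as follows, by way of example only. The message layout (a dictionary with `rule_id`, `dst`, and `payload` keys), the class name `SwitchSketch`, and the choice to pass a message through unchanged when no rule matches are assumptions of this sketch rather than details taken from the disclosure.

```python
from collections import deque

# Hypothetical rule table keyed by rule identifier.
RULES = {
    "max": lambda values: [max(values)],
    "count": lambda values: [len(values)],
}


class SwitchSketch:
    """Toy model of operations 408-416 of process 400."""

    def __init__(self, rules):
        self.rules = rules
        self.out_queue = deque()

    def receive(self, message):
        # Operation 410: locate the rule named by the message, if any.
        rule = self.rules.get(message.get("rule_id"))
        if rule is None:
            # Assumption: messages without a known rule are routed as-is.
            self.out_queue.append(message)
            return
        # Operations 411/412: apply the database function to the payload.
        result = dict(message, payload=rule(message["payload"]))
        # Operation 414: queue the resulting message for forwarding.
        self.out_queue.append(result)

    def forward(self):
        # Operation 416: drain the queue in arrival order.
        sent = []
        while self.out_queue:
            sent.append(self.out_queue.popleft())
        return sent
```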

The network switch core logic unit 308 in the switch 110 performs the common functionalities of a network switch, e.g., routing, classification, flow control, etc. The unit 308 serves as the basic component of a network switch and is shared by both conventional network switches and the architecture of the DFD network switch 110.

The switch fabric hardware 310 includes the general hardware utilized by conventional network switches, e.g., processor and memory. In one example embodiment, it also incorporates specialized hardware, such as but not limited to a co-processor, field programmable gate arrays (FPGAs), and/or application specific integrated circuits (ASICs), to efficiently perform certain database functions. Such functions include, but are not limited to, hash calculation, sorting, encryption/decryption, and compression/decompression. With the specialized hardware, the performance of processing data messages and performing database functions can improve significantly. However, such specialized hardware is optional and serves only to improve performance; the majority of the defined database functions can be performed without it.

The data flow diagram of FIG. 5 and the process 600 illustrated in the flow chart of FIG. 6 illustrate an example embodiment wherein a DFD network switch 110 in a distributed database system 200 operates in a database functions defined mode. In this example, instead of transparently forwarding all the individual data messages 202, redundant or not, to the destination data node 1, the DFD network switch 110 processes (at operation 603) the data messages 202 from all other data nodes 0, 2-N, and only forwards (at operation 604) the resulting data messages 504 containing the unique values 502 to the destination data node 1. This saves the network bandwidth and computation capacity on the destination data node 1. Furthermore, with the help of the specialized hardware, the process overhead and delay can be largely reduced. Thus the overall performance of the same database function, for example, retrieving distinct values from previous database operations, can be improved accordingly.
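To make the contrast between transparent forwarding and the database functions defined mode concrete, the following Python sketch compares the two for a distinct-values function. The message contents are made-up illustrative values and do not reproduce the values shown in FIG. 2 or FIG. 5; the function names and message layout are likewise assumptions of this sketch.

```python
def forward_transparent(messages):
    """Conventional mode: every message reaches the destination unchanged."""
    return list(messages)


def forward_distinct(messages, dst):
    """DFD mode: the switch keeps only first occurrences of each value."""
    seen = set()
    unique = []
    for msg in messages:
        for value in msg["values"]:
            if value not in seen:
                seen.add(value)
                unique.append(value)
    return [{"src": "switch", "dst": dst, "values": unique}]


def values_on_wire(messages):
    """Count the values delivered to the destination node."""
    return sum(len(msg["values"]) for msg in messages)


# Illustrative intermediate results sent toward data node 1.
incoming = [
    {"src": 0, "dst": 1, "values": [7]},
    {"src": 2, "dst": 1, "values": [7, 3]},
    {"src": 3, "dst": 1, "values": [9]},
    {"src": 4, "dst": 1, "values": [7]},
    {"src": 5, "dst": 1, "values": [7, 9]},
]
```

With this input, transparent forwarding delivers five messages carrying seven values, while the distinct function at the switch delivers one message carrying three.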

Referring now to FIG. 7, there is illustrated an embodiment of a DFD network switch 110 deployed as a network node 706 of a massively parallel processing (MPP) database infrastructure 700, wherein a coordinator node 702 may be, for example, a database host such as master host 102 of FIG. 1, and a data node 704 may be a data source such as a segment host 104, 106, and 108, also discussed with respect to FIG. 1. In this example embodiment, there is also provided an optimizer 708 and an executor 710, operative on a distributed database system, to plan and coordinate the use of the database functionality in DFD network switches 110. According to one embodiment, illustrated in FIG. 7, coordinator node 702 contains or utilizes both an optimizer 708 and an executor 710, and data node 704 contains or utilizes only an executor 710. According to one example embodiment discussed below with respect to FIG. 11, optimizer 708 accesses information concerning the capabilities of the network nodes 706 stored in a distributed database system catalog table. According to another example embodiment discussed below with respect to FIG. 12, executor 710 obtains query plan information from optimizer 708, and uses the query plan information to execute database query operations.

While this example embodiment shows the DFD network switch 110 deployed in a distributed database infrastructure 700, the DFD network switch 110 is not limited to application in distributed database infrastructures, and may be deployed in other database infrastructures or systems. In one example embodiment, the optimizer 708 and the executor 710 are resident on (and execute on) a database server system such as database server 102, which may be deployed for example as a coordinator node 702 in the system of FIG. 7.

In this example embodiment, the DFD network switches 110 perform not only conventional network routing and switching functions, to route data messages among data nodes, for example between coordinator nodes 702 and data nodes 704, but also perform pre-defined database functions, such as referred to above and described more below, that reduce and optimize the network communication among these data nodes. The DFD network switches 110 acting as network nodes 706 thus optimize database operations performance. Thus, in this embodiment and others described herein, the DFD network switch 110 is not just a network device transparent to the database system, but actively performs database functions and operations on data being routed through the network.

Optimizer and Executor

According to one example embodiment, as noted above, there is provided an optimizer 708 and an executor 710, operative on a distributed database system, to take advantage of the database functionality in DFD network switches 110. As noted above, according to one embodiment, coordinator node 702 contains or utilizes both optimizer 708 and executor 710, and data node 704 contains or utilizes only executor 710. Also as noted above, a database function or operation is defined in the DFD network switch 110. Such database functions include, but are not limited to: (i) aggregating intermediate results from data nodes, (ii) buffering data and building a hash table, (iii) ordering and ranking certain data, as well as (iv) making a union of or merging intermediate results.

According to an example mode of operation illustrated in the process flow chart 800 of FIG. 8, the optimizer 708 makes a decision (at operation 802) whether to take advantage of DFD network switches 110 when it selects the optimal plan for a query. If the optimizer 708 identifies (at operation 804) that a certain database operation can benefit from one or more database operations in DFD network switches 110, it asks the data nodes 704 to mark (at operation 806) the data messages they send with pre-defined flags that identify the data operations to be performed by the DFD network switches 110, and to transmit them (at operation 808).

When the data messages carrying the matched function arrive at the node, the database function is performed (at operation 810) by the software and hardware of the DFD network switch 110, described in more detail herein below. The final or intermediate results are then forwarded (at operation 812) to the destination data nodes (coordinator nodes or data nodes) or next switches, or DFD network switches 110, depending on the network interconnection topology and query plan. As a result, the network traffic is optimized for the distributed database, for example resulting in reduced data to transport and thus reduced bandwidth requirement. Furthermore, the time to process data messages and the corresponding computation time on the associated data can be greatly reduced on destination data nodes.
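For illustration only, the decision and marking operations of process 800 might be sketched in Python as follows. The capability set, the `db_function_flag` field, and the function names are hypothetical; the disclosure states only that messages are marked with pre-defined flags, not how those flags are encoded.

```python
# Assumption: the set of functions the switches are known to support,
# for example as read from a catalog of network node capabilities.
SWITCH_CAPABILITIES = {"distinct", "sum", "max"}


def plan_operation(operation):
    """Operations 802/804: decide whether the switch can take the operation."""
    return {"operation": operation, "offload": operation in SWITCH_CAPABILITIES}


def mark_message(payload, plan, dst):
    """Operation 806: flag the message only when the plan calls for offload."""
    message = {"dst": dst, "payload": payload}
    if plan["offload"]:
        message["db_function_flag"] = plan["operation"]
    return message


def switch_should_process(message):
    """Switch side: a flagged message requests a database function."""
    return "db_function_flag" in message
```

In this sketch an unflagged message is simply routed in the conventional manner, so a plan that does not use the switches requires no change on the data nodes.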

As noted above, in most scenarios, a distributed database system 100, 200 or 700 may include tens of data nodes, or even hundreds or thousands of data nodes. In such cases, according to one embodiment, multiple DFD network switches 110 are deployed and inter-connected in a hierarchical or tree-like network topology 900 illustrated in FIG. 9. As illustrated in the process flow chart 1000 of FIG. 10, the upstream switches 904 receive (at operation 1002) the data messages 902 (such as message 202) from the sending data nodes, and then process (at operation 1004) the data messages using the pre-defined database function-defined rules stored in a rules repository 304 (see FIG. 3). The resulting data messages are forwarded (at operation 1006) to the downstream switches 906 on the routing path of the data messages 902. Upon receiving the data messages 202 from upstream switches 904, the downstream switches 906 process (at operation 1008) the data messages again using the pre-defined database function definition rules associated with the data messages 902, and then forward (at operation 1006) the new resulting data messages to their downstream switches 906 on the routing path of the data messages. The process continues until the data messages reach (at operation 1010) the destination 908. The embodiment of FIG. 9 and process 1000 thus illustrate such a case of multiple DFD network switches 110, where the database functions are performed at each switch 110 on the routing path of the data messages 902.

The DFD network switch 110 also handles the transport layer control messages associated with the data messages 902 it processes at operations 1004 and 1008. As an example, for connection-oriented transmission, it sends control information such as an ACK back to the source node on behalf of the destination nodes if it processes and aggregates the original data messages 202. For connectionless transmission, the processed data contains the original message ID. In either case, the distributed database executor 710 is aware of the process and handles the follow-up process, as explained below with respect to an example embodiment of an MPP executor design.

As described in more detail herein below, because DFD network switches 110 may have limited system resources, for example but not by way of limitation, limited memory/cache size and/or constrained computation power, it is possible that the database functions or operations on DFD network switches 110 cannot keep pace with the rate of data streaming expected for the main data routing task of the switch. In such a case, according to one embodiment, the DFD network switches 110 receive streaming query data contained in data messages, perform operations/functions only on the query data that is within their capacity within a specified or desired time frame, and forward the partially processed results to the next destination nodes, together with the “skipped”, unprocessed, query data.

According to one embodiment, skipped data bypasses any components of the switch 110 used to perform database functions, or alternatively is input to such components but is output without being processed. These types of database operations are defined herein as “best-effort operations.” In other words, a respective database function can be performed to a completed state, yielding complete result data, or to a partially performed, incomplete state, yielding incomplete result data. If the resources of a DFD network switch 110 are sufficient to complete the desired database function in the switch, then it is performed to a completed state. In a first mode of operation, if the resources are insufficient to perform the desired database function on all available data within a time frame, such as a desired or specified time frame, then with “best-effort” operation the DFD network switch 110 only performs the desired database function on as much data as resources allow, and passes along the unprocessed, incomplete data, together with the processed, completed data. In another mode of operation, the database function is performed to the completed state if sufficient resources are available. Any distributed database operations involving DFD network switches 110 can be potential candidates to operate as and be categorized as best-effort operations. Example embodiments of algorithms for different best-effort operations are described further herein below.

According to another example embodiment, the optimizer selects (at operation 1108) the optimal query plan based on the cost estimation with and/or without DFD network switches 110. Costs of query plans both with and without DFD network switches 110 and best-effort operations are estimated and saved in the plan choices search space of the optimizer 708. Using an effective searching algorithm, the optimizer 708 selects (at operation 1108) the most efficient query plan and decides whether to include best-effort operations or not. Based on the optimal query plan it selects (at operation 1108), the optimizer generates plan operators of best-effort operations. Once the optimal query plan is decided, the optimizer transforms (at operation 1110) the best-effort operations into individual best-effort operators, e.g., best-effort aggregations, best-effort hash, etc. The details of best-effort operations and operators are described in more detail below.
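As a rough illustration of the cost-based choice described above, the following Python sketch keeps candidate plans in a list and picks the cheapest one. The `PlanChoice` type, the plan names, and the cost figures are all hypothetical values invented for the example; the disclosure does not specify a cost model or a search algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlanChoice:
    """Hypothetical entry in the optimizer's plan choices search space."""
    name: str
    uses_best_effort: bool
    estimated_cost: float  # made-up, unit-less cost estimate


def select_plan(choices: List[PlanChoice]) -> PlanChoice:
    """Return the lowest-cost plan (a stand-in for the searching algorithm)."""
    if not choices:
        raise ValueError("no candidate plans")
    return min(choices, key=lambda plan: plan.estimated_cost)


search_space = [
    PlanChoice("aggregate at data nodes only", False, 120.0),
    PlanChoice("best-effort aggregation on DFD switches", True, 75.0),
]

best = select_plan(search_space)

# Only a plan that relies on the switches yields best-effort operators.
operators = ["best-effort aggregation"] if best.uses_best_effort else []
```

Estimating both kinds of plan side by side lets the sketch fall back to a conventional plan whenever the switch-assisted plan is not estimated to be cheaper.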

According to another example embodiment, a process 1200, illustrated in the flow chart of FIG. 12, is performed by an executor 710 in the distributed database to coordinate or direct database operations in the DFD network switches 110. The executor 710 identifies (at operation 1210) the best-effort operators in the query plan, and executes the corresponding processing logic. According to one example embodiment, the executor 710 concurrently executes multiple best-effort operators based on a scheduling strategy to improve system utilization. As illustrated in FIG. 12, the executor 710, in one embodiment, prepares (at operation 1210) the data in an appropriate format for best-effort operations. Each data message is tagged (at operation 1220) with the operation and database function-defined rule IDs that can be identified by the DFD network switches 110, along with necessary control information (e.g., corresponding plan operator information, message sequence IDs). The executor 710 schedules (at operation 1230) data exchanges with the connections involving both DFD network switches 110 and data nodes. The executor 710 sets up (at operation 1236) virtual connections for data exchanges, and schedules the transmission of the data messages upon their availability. The executor 710 processes data received for best-effort operations from both DFD network switches 110 and data nodes. After receiving the data messages, the executor 710 processes (at operation 1240) the data and fulfills the best-effort operations if they are not fully accomplished by the DFD network switches 110. In one embodiment, at operation 1260, when a best-effort operator message is received, if it is an aggregated message from DFD network switches 110, the original data messages' IDs are encoded (at operation 1270) so that the executor 710 can identify the missing data messages in case a transmission error occurs.
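The tagging and missing-message check can be sketched as follows. This Python fragment is only an assumption about how such bookkeeping might look: the `TaggedMessage` fields, the example rule ID, and the helper `missing_ids` are hypothetical names, not part of the disclosure. An aggregated message carries the sequence IDs of every original message it covers, so the receiver can compare them with the IDs it expected.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class TaggedMessage:
    """Hypothetical message layout used only for this sketch."""
    rule_id: int                # database function-defined rule ID
    operator_id: int            # plan operator the message belongs to
    source_ids: FrozenSet[int]  # sequence IDs of the original messages covered
    payload: int                # partial SUM, or a raw value if unprocessed


def missing_ids(expected: Iterable[int], received: List[TaggedMessage]) -> Set[int]:
    """Return the sequence IDs that no received message accounts for."""
    seen: Set[int] = set()
    for message in received:
        seen |= message.source_ids
    return set(expected) - seen


received = [
    # Messages 1 to 3 were aggregated by a switch into one partial SUM.
    TaggedMessage(rule_id=7, operator_id=1, source_ids=frozenset({1, 2, 3}), payload=60),
    # Message 5 was skipped by the switch and forwarded unprocessed.
    TaggedMessage(rule_id=7, operator_id=1, source_ids=frozenset({5}), payload=50),
]

lost = missing_ids(range(1, 6), received)  # message 4 never arrived
partial_total = sum(message.payload for message in received)
```

In this example the executor would finish the SUM from the partial result and the skipped value, and would know that message 4 still has to be retransmitted before the total is final.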

Thus, as described above, the disclosed embodiments provide a more complete network switch infrastructure and enhanced functionalities of a distributed database. Further, the DFD network switches 110 require no hardware changes on data nodes, and hardware customization on the switches is optional and serves only to further improve performance.

Best-Effort Processing on DFD Network Switch

Moreover, as described further below, there are provided example embodiments of logic for best-effort aggregation, best-effort sorting, and best-effort hash join, which are three major categories of performance-critical distributed database operations.

A flow chart of an example embodiment of processing logic 1300 of a best-effort aggregation algorithm is illustrated in FIG. 13. Here, aggregation processing is an abstraction of any one of the specified MPP aggregation operations mentioned hereinabove, for example, the DISTINCT, SUM, MIN/MAX or other operations. These operations share the same best-effort operation processing flow.

The first step in aggregation processing is to determine (at operation 1310) if there are enough resources to perform all the desired aggregation, for example by checking if the memory, cache, and computation power can satisfy the requirement to carry out the desired best-effort aggregation. If there are enough resources, the aggregation is carried out (at operation 1320). If not, some or all of the data that could have been aggregated had enough resources been available is forwarded (at operation 1330). If more streaming data has arrived that is seeking aggregation (at operation 1340), the process returns to check for enough resources at operation 1310. If there is no more streaming data to aggregate, the availability of aggregation results is determined (at operation 1360). If results are available, the aggregated results are forwarded (at operation 1370); if no results are available, no results are forwarded. The aggregation operation finishes at operation 1380.
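The flow of FIG. 13 can be approximated by a short loop. The Python sketch below assumes SUM as the aggregation and models the resource check as a caller-supplied function; the names `best_effort_sum` and `enough_resources` are hypothetical and chosen only for this illustration.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def best_effort_sum(
    stream: Iterable[int],
    enough_resources: Callable[[], bool],
) -> Tuple[Optional[int], List[int]]:
    """Return (aggregated result or None, values forwarded without aggregation)."""
    aggregate: Optional[int] = None
    forwarded: List[int] = []
    for value in stream:            # operation 1340: more streaming data?
        if enough_resources():      # operation 1310: resource check
            # operation 1320: carry out the aggregation
            aggregate = value if aggregate is None else aggregate + value
        else:
            forwarded.append(value)  # operation 1330: forward the data as-is
    # operations 1360/1370: a result is forwarded only if one exists
    return aggregate, forwarded


# Simulated resource availability: the third value arrives while the switch is busy.
availability = iter([True, True, False, True])
result, skipped = best_effort_sum([10, 20, 30, 40], lambda: next(availability))

# The next node completes the operation from the partial result and skipped data.
completed = (result or 0) + sum(skipped)
```

Here `result` is 70 and `skipped` is `[30]`, so the downstream node reaches the full total of 100 while handling two values instead of four.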

Sorting in a distributed database is in some cases a resource-intensive computation, so a DFD network switch 110 may be unable to finish the entire process of sorting all the data transmitted through it. Accordingly, in one example embodiment of a best-effort sorting process 1400 illustrated in the flow chart of FIG. 14, the best-effort sorting will not process all the data in one pass, but will try to separate and process the data in consecutive rounds of “mini-batches”. According to one embodiment, during each round only the data within its processing capacity (which is termed a “mini-batch” herein) will be processed and the intermediate results will be transmitted to the next destination DFD network switches 110, coordinator node or data node. Process 1400 starts by determining (at operation 1410) the sufficiency of resources to perform the desired sort. If there are inadequate resources, the data is forwarded (at operation 1460). If there are sufficient resources, the process determines (at operation 1420) if the limit of the mini-batch size has been reached. According to one embodiment, the mini-batch size is the upper limit, or threshold, on the amount of data that a DFD network switch 110 may hold and process in one mini-batch. If the process has not hit the limit, the batch is processed (at operation 1450). If it has hit the limit, the mini-batch result is forwarded (at operation 1430), a new mini-batch is formed (at operation 1440), and the new mini-batch is then processed (at operation 1450).

If (at operation 1470) more streaming data is ready to be processed, the process returns to operation 1410. If not, the process determines (at operation 1480) if mini-batch results are available, and if so, the results are forwarded (at operation 1490), and if not, the process finishes (at operation 1496). This process thus logically divides the streaming data into small processing batches that fit within the resource limits of the DFD network switch. According to an example embodiment, distributed database operations that can leverage best-effort sorting include, but are not limited to, order, group and/or rank. Each of these sorting operations may incorporate different individual sorting algorithms, e.g., hash sort, quick sort, tournament sort, etc. These detailed sorting algorithms are mature and readily known to those of skill in the art.
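A minimal sketch of the mini-batch flow of FIG. 14 follows. It is an illustration under stated assumptions rather than the disclosed implementation: the function name `best_effort_sort`, the `("sorted", ...)` and `("raw", ...)` tags, and the use of Python's `bisect.insort` and `heapq.merge` are choices made for the example.

```python
import bisect
import heapq
from typing import Callable, Iterable, List, Tuple

Output = Tuple[str, List[int]]  # ("sorted", mini-batch) or ("raw", [skipped value])


def best_effort_sort(
    stream: Iterable[int],
    batch_limit: int,
    enough_resources: Callable[[], bool],
) -> List[Output]:
    """Sort the stream in mini-batches; forward values untouched when resources run out."""
    outputs: List[Output] = []
    batch: List[int] = []
    for value in stream:                        # operation 1470: more data?
        if not enough_resources():              # operation 1410
            outputs.append(("raw", [value]))    # operation 1460: forward unsorted
            continue
        if len(batch) >= batch_limit:           # operation 1420: limit reached
            outputs.append(("sorted", batch))   # operation 1430: forward result
            batch = []                          # operation 1440: new mini-batch
        bisect.insort(batch, value)             # operation 1450: keep batch ordered
    if batch:                                   # operation 1480: results available?
        outputs.append(("sorted", batch))       # operation 1490
    return outputs


out = best_effort_sort([5, 1, 4, 2, 3], batch_limit=2, enough_resources=lambda: True)

# A downstream node finishes the sort by merging the already-sorted runs.
merged = list(heapq.merge(*(sorted(run) for _, run in out)))
```

With a limit of two values per mini-batch, `out` holds the runs `[1, 5]`, `[2, 4]` and `[3]`, and the downstream merge yields `[1, 2, 3, 4, 5]`. Because each run arrives pre-sorted, the receiver only has to merge rather than sort from scratch.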

When a hash join is carried out in a distributed database, one of the commonly employed processes 1500, as illustrated in the data flow diagram of FIG. 15, wherein for example network switches may be deployed as the switching network and operated in a first, conventional mode of operation, is that the inner table of the hash join is broadcast to all the data nodes involved in the hash join. In this process, each data node broadcasts its own local data of the inner table and receives the inner table data from all other data nodes to build a complete hash table. Then the local outer table data is joined with the rebuilt inner table by probing the hash table. For example, for the hash join in the distributed database system that is illustrated in FIG. 15, the same local inner table data from each data node (e.g., node 1) is broadcast as (n−1) copies to all other (n−1) data nodes. Then, after receiving all the inner table data from all other (n−1) data nodes, each data node builds the same hash table on the inner table data, which means the same hash table building process is repeated n times in the whole cluster. This whole process wastes significant network bandwidth and computation capacity.


Pseudo Code Example for Hash Table Merge Processing

/* Hash table processing algorithm at network nodes (NN) */
NN_Build_Hash_Table( ) {
    while (ingress_data data != null) {
        if (enough_resource( )) {
            Check_data_flag(data);
            if (data->type == raw_data)
                hash_table = Build_Hash_Table(data->tableID, data);
            else if (data->type == hash_data)
                hash_table = Merge_Hash_Table(data->tableID, data);
            else
                Error(data);
            Free(data);
        }
        else {
            /* mark the destination as well as data type, etc. */
            Make_flag(hash_table);
            /* put the hash table or data to destination queue */
            Enqueue(egress, hash_table);
            Enqueue(egress, data);
        }
    }
    if (hash_table != null) {
        /* mark the destination as well as data type, etc. */
        Make_flag(hash_table);
        /* put the hash table or data to destination queue */
        Enqueue(egress, hash_table);
    }
}

Accordingly, in the above example embodiment, instead of sending and receiving inner table data to/from all other (n−1) data nodes, in a best case scenario, each data node can reduce its data exchange to only one DFD network switch 110, without a need to build a hash table locally, which can save significant network bandwidth and computation capacity of each data node.
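The build and merge steps named in the pseudo code can be illustrated with a small runnable sketch. The Python below is an assumption-laden example: the table layout (join key mapped to a list of inner-table rows) and the function bodies are hypothetical, since the disclosure names `Build_Hash_Table` and `Merge_Hash_Table` without defining them.

```python
from typing import Dict, List, Tuple

HashTable = Dict[int, List[str]]  # join key -> inner-table rows (hypothetical layout)


def build_hash_table(rows: List[Tuple[int, str]]) -> HashTable:
    """Build a hash table from raw inner-table rows received from a data node."""
    table: HashTable = {}
    for key, row in rows:
        table.setdefault(key, []).append(row)
    return table


def merge_hash_tables(target: HashTable, other: HashTable) -> HashTable:
    """Merge a partial hash table built elsewhere into the target table."""
    for key, rows in other.items():
        target.setdefault(key, []).extend(rows)
    return target


# The switch builds a table from node 1's raw rows, then merges a partial table
# that an upstream switch already built from node 2's rows.
switch_table = build_hash_table([(1, "a"), (2, "b")])
partial_from_upstream = build_hash_table([(2, "c")])
switch_table = merge_hash_tables(switch_table, partial_from_upstream)

# A data node probes the merged table with an outer-table join key.
matches = switch_table.get(2, [])
```

In this sketch each data node sends its inner-table rows once and probes a table built by the switch, rather than every node rebuilding the same table from n−1 broadcasts.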

FIG. 17 is a block diagram illustrating circuitry for performing methods according to example embodiments. In particular, in one example embodiment, computing devices as illustrated in FIG. 17 are used to implement the data nodes described above, the master host 102, the segment hosts 104, the DFD network switches 110, the DFD rules APIs 302, the database functions rules repository 304, the database functions handling logic unit 306, the network switch core logic unit 308, and/or the switch fabrics hardware 310. However, not all components shown in FIG. 17 need be used in all of the various embodiments. For example, database system 100 and DFD network switch 110 may each use a different sub-set of the components illustrated in FIG. 17, or additional components.

One example computing device in the form of a computer 1700 may include a processing unit 1702, memory 1703, removable storage 1710, and non-removable storage 1712. Although the example computing device is illustrated and described as computer 1700, the computing device may be in different forms in different embodiments. For example, the computing device may instead be a smartphone, a tablet, a smartwatch, or other computing device including the same or similar elements as illustrated and described with regard to FIG. 17. Devices such as smartphones, tablets, and smartwatches are generally collectively referred to as mobile devices or user equipment. Further, although the various data storage elements are illustrated as part of the computer 1700, the storage may also or alternatively include cloud-based storage accessible via a network, such as the Internet or server-based storage.

Memory 1703 may include volatile memory 1714 and non-volatile memory 1708. Computer 1700 may include, or have access to a computing environment that includes, a variety of computer-readable media, such as volatile memory 1714 and non-volatile memory 1708, removable storage 1710 and non-removable storage 1712. Computer storage includes random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM) or electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD ROM), Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium capable of storing computer-readable instructions.

Computer 1700 may include or have access to a computing environment that includes input interface 1706, output interface 1704, and a communication interface 1716. Output interface 1704 may include a display device, such as a touchscreen, that also may serve as an input device. The input interface 1706 may include one or more of a touchscreen, touchpad, mouse, keyboard, camera, one or more device-specific buttons, one or more sensors integrated within or coupled via wired or wireless data connections to the computer 1700, and other input devices. The computer may operate in a networked environment using a communication connection to connect to one or more remote computers, such as database servers. The remote computer may include a personal computer (PC), server, router, network PC, a peer device or other common DFD network switch, or the like. The communication connection may include a Local Area Network (LAN), a Wide Area Network (WAN), cellular, WiFi, Bluetooth, or other networks. According to one embodiment, the various components of computer 1700 are connected with a system bus 1720.

Computer-readable instructions stored on a computer-readable medium, such as a program 1718, are executable by the processing unit 1702 of the computer 1700. The program 1718 in some embodiments comprises software that, when executed by the processing unit 1702, performs network switch operations according to any of the embodiments included herein. A hard drive, CD-ROM, and RAM are some examples of articles including a non-transitory computer-readable medium such as a storage device. The terms computer-readable medium and storage device do not include carrier waves to the extent carrier waves are deemed too transitory. Storage can also include networked storage, such as a storage area network (SAN).

Thus, as described above, the embodiments described herein provide an advantageous switch and network of switches in a distributed database system, and an innovative infrastructure for a distributed database which includes special DFD network switches 110 beside conventional coordinator nodes and data nodes. Instead of just routing and forwarding data messages as a conventional network switch does, in one example embodiment the DFD network switches 110: i) define database functions to be performed as rules via a set of APIs; ii) dynamically maintain the supported database functions rules in a repository; iii) perform the database functions on data messages matching pre-defined rules; and/or iv) forward intermediate results to destination nodes or next switches.

Further, there are described herein example embodiments of an infrastructure of a distributed database including database functions-defined (DFD) switches that include processing logic and algorithms to carry out three major performance-critical best-effort distributed database operations: aggregation, sorting and hash join. The operation of the distributed database takes advantage of such data nodes so that unprocessed or partially processed data can be continuously processed in a best-effort manner by the downstream data nodes, and eventually processed by the destination coordinator or data nodes with much reduced and processed data. Accordingly, with the example embodiments of the best-effort operations for a distributed database, the DFD network switches 110 in an infrastructure of a distributed database are leveraged to optimize network traffic, reduce data transfer and bandwidth requirements, and save computation capacity on coordinator and data nodes. The overall distributed database system performance can thus be improved.

Although a few embodiments have been described in detail above, other modifications are possible. For example, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems. Other embodiments may be within the scope of the following claims.