CROSS REFERENCE TO RELATED APPLICATIONS The present application is related to U.S. patent application Ser. No. ______, attorney docket no. YOR920040154US1, entitled “DISTRIBUTED MESSAGING SYSTEM SUPPORTING STATEFUL SUBSCRIPTIONS,” filed on an even date herewith, assigned to the same assignee, and incorporated herein by reference.
BACKGROUND OF THE INVENTION 1. Technical Field
The present invention relates to data processing systems and, in particular, to messaging systems in a distributed processing environment. Still more particularly, the present invention provides a distributed messaging system supporting stateful subscriptions.
2. Description of Related Art
A publish-subscribe messaging middleware is a system in which there are two types of clients. Publishers generate messages, also referred to as events, containing a topic and some data content. Subscribers request a criterion, also called a subscription, specifying what kind of information, based on published messages, the system is to deliver in the future. Publishers and subscribers are anonymous, meaning that publishers do not necessarily know how many subscribers there are or where they are and, similarly, subscribers do not necessarily know where publishers are.
A topic-based, or content-based, publish-subscribe system is one in which the delivered messages are a possibly filtered subset of the published messages and the subscription criterion is a property that can be tested on each message independent of any other message. For example, a filter may determine whether “topic=stock-ticker” or “volume>1000.” Content-based or topic-based publish-subscribe systems are referred to herein as “stateless.”
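For illustration only, such a stateless filter might be sketched as follows. The message layout and field names are hypothetical; the essential property is that the predicate is tested on each message independently of any other message:

```java
import java.util.Map;

// Sketch of a content-based (stateless) filter predicate. A subscription
// like "topic=stock-ticker AND volume>1000" reads only fields of the
// individual message; no history or cross-message state is needed.
public class ContentFilter {
    static boolean matches(Map<String, Object> message) {
        return "stock-ticker".equals(message.get("topic"))
                && ((Number) message.get("volume")).longValue() > 1000;
    }

    public static void main(String[] args) {
        Map<String, Object> m = Map.of("topic", "stock-ticker", "volume", 5000L);
        System.out.println(matches(m)); // prints "true"
    }
}
```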
There are pre-existing and emerging alternative technologies to solve the deficiencies of content-based publish-subscribe systems. Message mediators may be introduced into the flow of traditional messaging middleware. This is a useful concept; however, in their current manifestations, mediators are complex to program, require external database services in order to store and access state, and groups of mediators cannot be automatically combined.
Traditional database systems may also be used. Each published message can be fed to a database and give rise to a cascade of transactions updating the message history. Subscriptions can be expressed as views of these histories. Technologies are being developed to allow views to be updated incrementally. Such an approach is easier to program; however, it can be costly and slow if each new message results in a transaction involving a large number of subscribers.
An emerging technology still being researched is continuous queries on data streams. These approaches preserve the simpler programming model of the database system approach above and attempt to reduce the cost of traditional databases by a combination of approaches, including batching message updates and restricting the available operations to ones allowing the use of bounded-sized, in-memory sliding windows. However, this approach is restrictive and limited.
SUMMARY OF THE INVENTION The present invention solves the disadvantages of the prior art and provides a distributed messaging system supporting stateful subscriptions. A stateful publish-subscribe system extends the functionality of the content-based approach to include more general state-valued expressions. Stateful subscriptions may refer to one or more message histories and may include more complex expressions. Therefore, subscribers may receive different information than that provided in the published messages. A plurality of broker machines is provided to deliver messages sent by publishing clients toward subscribing clients based upon the contents of the messages and stateful transformations requested by the subscribing clients. These broker machines form an overlay network. Subscription specifications are either given as a directed transformation graph (with nodes being the transforms required to derive the subscription from published messages and edges representing the ordering of the transforms) or in a language that can be analyzed by a compiler and converted into a collection of message transforms and views. The messaging system builds a structure containing all message transforms and views needed for all intermediate and subscribed views of all subscriptions.
The messaging system uses this structure to allocate message transforms and views to broker machines in the overlay network. A deployment service component deploys tasks to optimize system performance. A monitoring services component detects a possible need to reconfigure. A performance optimization service component computes a new assignment of transforms. A continuous deployment service implements a redeployment protocol that installs changes to transform placement while the existing publish-subscribe system continues to operate.
BRIEF DESCRIPTION OF THE DRAWINGS The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
FIG. 1 depicts a pictorial representation of a network of data processing systems in which the present invention may be implemented;
FIG. 2 is a block diagram of a data processing system that may be implemented as a server in accordance with a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a data processing system that may serve as a client of a service in accordance with a preferred embodiment of the present invention;
FIG. 4 illustrates a broker network for a publish-subscribe system in accordance with a preferred embodiment of the present invention;
FIG. 5 illustrates how a stateful publish-subscribe service of the present invention appears to clients;
FIG. 6 illustrates an example of an operator that transforms input view objects to an output view object in accordance with a preferred embodiment of the present invention;
FIG. 7 illustrates an example dataflow hypergraph distributed over multiple brokers in accordance with a preferred embodiment of the present invention;
FIG. 8 depicts a process for deploying transform objects and view objects when a dataflow specification is a declarative specification in accordance with a preferred embodiment of the present invention;
FIG. 9 illustrates a redeployment mechanism in a distributed messaging system in accordance with a preferred embodiment of the present invention; and
FIG. 10 is a flowchart illustrating the operation of moving a message transform from a source broker to a target broker in a distributed messaging system in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT The present invention provides a method, apparatus and computer program product for continuous feedback-controlled deployment of message transforms in a distributed messaging system. The data processing device may be a stand-alone computing device or may be a distributed data processing system in which multiple computing devices are utilized to perform various aspects of the present invention. Therefore, the following FIGS. 1-3 are provided as exemplary diagrams of data processing environments in which the present invention may be implemented. It should be appreciated that FIGS. 1-3 are only exemplary and are not intended to assert or imply any limitation with regard to the environments in which the present invention may be implemented. Many modifications to the depicted environments may be made without departing from the spirit and scope of the present invention.
With reference now to the figures, FIG. 1 depicts a pictorial representation of a network of data processing systems in which the present invention may be implemented. Network data processing system 100 is a network of computers in which the present invention may be implemented. Network data processing system 100 contains a network 102, which is the medium used to provide communications links between various devices and computers connected together within network data processing system 100. Network 102 may include connections, such as wire, wireless communication links, or fiber optic cables.
In the depicted example, servers 112, 114, 116 are connected to network 102 along with storage unit 106. In addition, clients 122, 124, and 126 are connected to network 102. These clients 122, 124, and 126 may be, for example, personal computers or network computers. In the depicted example, servers 112, 114, 116 provide data, such as boot files, operating system images, and applications to clients 122, 124, 126. Clients 122, 124, and 126 are clients to servers 112, 114, 116. Network data processing system 100 may include additional servers, clients, and other devices not shown.
In accordance with a preferred embodiment of the present invention, network data processing system 100 provides a distributed messaging system that supports stateful subscriptions. A subset of clients 122, 124, 126 may be publishing clients, while others of clients 122, 124, 126 may be subscribing clients, for example. Published events may also be generated by one or more of servers 112, 114, 116. Any computer may be both a publisher and a subscriber.
A stateful publish-subscribe system is a distributed messaging system in which at least one subscription is stateful. Other subscriptions may be content-based or, in other words, stateless. That is, a stateful publish-subscribe system must compute information that requires multiple messages of one or more streams. For example, a stateful subscription may request, “Give me the highest quote within each one-minute period.” A stateful subscription may entail delivering information other than simply a copy of the published messages. For example, a stateful subscription may request, “Tell me how many stocks fell during each one-minute period.”
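By way of illustration only, the “highest quote within each one-minute period” subscription above might be sketched as follows. The class and method names are hypothetical and do not correspond to any particular product API; the point is that the answer depends on a history of messages, not on any single message in isolation:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a stateful subscription: the running maximum quote per
// one-minute window. Unlike a stateless filter, this requires state
// accumulated across multiple messages of the stream.
public class HighQuoteWindow {
    // Highest quote seen so far, keyed by minute bucket.
    private final Map<Long, Double> highPerMinute = new HashMap<>();

    // Called once per published quote; returns the running maximum
    // for that quote's one-minute window.
    double onQuote(long timestampMillis, double price) {
        long minute = timestampMillis / 60_000;
        return highPerMinute.merge(minute, price, Math::max);
    }
}
```

A stateless content-based broker could not evaluate this subscription, because no single message carries the window maximum.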
The stateful publish-subscribe system is implemented within an overlay network, which is a collection of service machines, referred to as brokers, that accept messages from publisher clients, deliver subscribed information to subscriber clients, and route information between publishers and subscribers. One or more of servers 112, 114, 116, for example, may be broker machines.
Both content-based and stateful publish-subscribe systems support a message delivery model based on two roles: (1) publishers produce information in the form of streams of structured messages; and, (2) subscribers specify in advance the kinds of information in which they are interested. As messages are later published, relevant information is delivered in a timely fashion to subscribers.
Content-based subscriptions are restricted to Boolean filter predicates that can only refer to fields in individual messages. For example, a content-based subscription may request, “Deliver message if traded volume>1000 shares.” On the other hand, stateful subscriptions are more general state-valued expressions and may refer to one or more messages, either by referring to multiple messages of a single message stream or by referring to multiple message streams or both. In a content-based publish-subscribe system, because subscriptions can only specify filtering, all published messages are either passed through to subscribers or filtered out. Therefore, messages received by subscribers are identically structured copies of messages published by publishers. In contrast, in a stateful publish-subscribe system, subscriptions may include more complex expressions and, therefore, subscribers may receive information that is not identical to the published messages and that may have a different format. For example, a published message may have only integer prices, while subscriptions to average prices may have non-integer averages.
Published message streams are associated with topics. Each topic is associated with a base relation. A base relation is a table of tuples, each tuple corresponding to an event in the particular message stream. Subscriptions are expressed as view expressions in a relational algebraic language, although other representations may be used, such as eXtensible Markup Language (XML), for example. The language defines a cascade of views of base relations and derived views computed from either base relations or other views. At compile-time, the set of subscriptions is compiled into a collection of objects that are deployed and integrated into messaging brokers. At run-time, publishers and subscribers connect to these brokers. Published events are delivered to objects associated with base relations. The events are then pushed downstream to other objects that compute how each derived view changes based on the change to the base relation. Those derived views associated with subscriptions then deliver events to the subscriber informing the subscriber of each change in state.
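The run-time path just described can be sketched in miniature as follows. The class, the listener shape, and the choice of an average-price view are illustrative assumptions for exposition: each published event is appended to its base relation, the derived view is updated incrementally, and the change in view state is pushed to the subscriber:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Sketch of the push-based update path: a published trade event is
// appended to a base relation, and a derived view (average price per
// issue) is recomputed incrementally, with each state change delivered
// downstream to the subscriber.
public class ViewCascade {
    record Trade(String issue, double price) {}

    final List<Trade> baseRelation = new ArrayList<>();    // one tuple per event
    private final Map<String, double[]> sumAndCount = new HashMap<>();
    private BiConsumer<String, Double> subscriber = (issue, avg) -> {};

    // A subscriber (or a further derived view) registers for change events.
    void subscribe(BiConsumer<String, Double> s) { subscriber = s; }

    // Publishing appends to the base relation and pushes the change
    // through the derived "average price" view to the subscriber.
    void publish(String issue, double price) {
        baseRelation.add(new Trade(issue, price));
        double[] sc = sumAndCount.computeIfAbsent(issue, k -> new double[2]);
        sc[0] += price;
        sc[1] += 1;
        subscriber.accept(issue, sc[0] / sc[1]);
    }
}
```

Note that, consistent with the discussion above, the delivered averages may be non-integer even though each published price is an integer.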
In the depicted example, network data processing system 100 is the Internet with network 102 representing a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, government, educational and other computer systems that route data and messages. Of course, network data processing system 100 also may be implemented as a number of different types of networks, such as, for example, an intranet, a local area network (LAN), or a wide area network (WAN). FIG. 1 is intended as an example, and not as an architectural limitation for the present invention.
Referring to FIG. 2, a block diagram of a data processing system that may be implemented as a server, such as server 112 in FIG. 1, is depicted in accordance with a preferred embodiment of the present invention. Data processing system 200 may be a symmetric multiprocessor (SMP) system including a plurality of processors 202 and 204 connected to system bus 206. Alternatively, a single processor system may be employed. Also connected to system bus 206 is memory controller/cache 208, which provides an interface to local memory 209. I/O bus bridge 210 is connected to system bus 206 and provides an interface to I/O bus 212. Memory controller/cache 208 and I/O bus bridge 210 may be integrated as depicted.
Peripheral component interconnect (PCI) bus bridge 214 connected to I/O bus 212 provides an interface to PCI local bus 216. A number of modems may be connected to PCI local bus 216. Typical PCI bus implementations will support four PCI expansion slots or add-in connectors. Communications links to clients 122, 124, and 126 in FIG. 1 may be provided through modem 218 and network adapter 220 connected to PCI local bus 216 through add-in connectors.
Additional PCI bus bridges 222 and 224 provide interfaces for additional PCI local buses 226 and 228, from which additional modems or network adapters may be supported. In this manner, data processing system 200 allows connections to multiple network computers. A memory-mapped graphics adapter 230 and hard disk 232 may also be connected to I/O bus 212 as depicted, either directly or indirectly.
Those of ordinary skill in the art will appreciate that the hardware depicted in FIG. 2 may vary. For example, other peripheral devices, such as optical disk drives and the like, also may be used in addition to or in place of the hardware depicted. The depicted example is not meant to imply architectural limitations with respect to the present invention. The data processing system depicted in FIG. 2 may be, for example, an IBM eServer™ pSeries® system, a product of International Business Machines Corporation in Armonk, N.Y., running the Advanced Interactive Executive (AIX) operating system or LINUX operating system.
With reference now to FIG. 3, a block diagram of a data processing system that may serve as a client of a service is depicted in accordance with a preferred embodiment of the present invention. Data processing system 300 is an example of a computer, such as client 122 in FIG. 1, in which code or instructions implementing the processes of the present invention may be located. In the depicted example, data processing system 300 employs a hub architecture including a north bridge and memory controller hub (MCH) 308 and a south bridge and input/output (I/O) controller hub (ICH) 310. Processor 302, main memory 304, and graphics processor 318 are connected to MCH 308. Graphics processor 318 may be connected to the MCH through an accelerated graphics port (AGP), for example.
In the depicted example, local area network (LAN) adapter 312, audio adapter 316, keyboard and mouse adapter 320, modem 322, read only memory (ROM) 324, hard disk drive (HDD) 326, CD-ROM drive 330, universal serial bus (USB) ports and other communications ports 332, and PCI/PCIe devices 334 may be connected to ICH 310. PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, PC cards for notebook computers, etc. PCI uses a cardbus controller, while PCIe does not. ROM 324 may be, for example, a flash binary input/output system (BIOS). Hard disk drive 326 and CD-ROM drive 330 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. A super I/O (SIO) device 336 may be connected to ICH 310.
An operating system runs on processor 302 and is used to coordinate and provide control of various components within data processing system 300 in FIG. 3. The operating system may be a commercially available operating system such as Windows XP™, which is available from Microsoft Corporation. An object oriented programming system such as Java may run in conjunction with the operating system and provide calls to the operating system from Java programs or applications executing on data processing system 300. “JAVA” is a trademark of Sun Microsystems, Inc. Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 326, and may be loaded into main memory 304 for execution by processor 302. The processes of the present invention are performed by processor 302 using computer implemented instructions, which may be located in a memory such as, for example, main memory 304, memory 324, or in one or more peripheral devices 326 and 330.
Those of ordinary skill in the art will appreciate that the hardware in FIG. 3 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives and the like, may be used in addition to or in place of the hardware depicted in FIG. 3. Also, the processes of the present invention may be applied to a multiprocessor data processing system.
For example, data processing system 300 may be a personal digital assistant (PDA), which is configured with flash memory to provide non-volatile memory for storing operating system files and/or user-generated data. The depicted example in FIG. 3 and above-described examples are not meant to imply architectural limitations. For example, data processing system 300 also may be a tablet computer, laptop computer, or telephone device in addition to taking the form of a PDA.
In accordance with a preferred embodiment of the present invention, a plurality of broker machines is responsible for delivering messages sent by publishing clients toward subscribing clients based upon the content of the messages and the stateful transformations requested by the subscribing clients. These broker machines form an overlay network. Some broker machines may be specialized for hosting publishing clients, referred to as publisher hosting brokers (PHB), and others for hosting subscribing clients, referred to as subscriber hosting brokers (SHB). Between the PHBs and the SHBs, there may be any number of intermediate nodes that perform routing and filtering. The brokers at the intermediate nodes are referred to as intermediate brokers or IBs. For expository purposes, this separation is assumed; however, in actual deployment, some or all of the broker machines may combine the functions of PHB, SHB, and/or IB.
FIG. 4 illustrates a broker network for a publish-subscribe system in accordance with a preferred embodiment of the present invention. A publishing client, such as one of publishers 402a-402d, establishes a connection to a PHB, such as PHB 404a or PHB 404b, over a corresponding one of client connections 406a-406d. The client connection may be, for example, any reliable synchronous or asynchronous first-in/first-out (FIFO) connection, such as a Transmission Control Protocol/Internet Protocol (TCP/IP) socket connection. Independently, a subscribing client, such as one of subscribers 412a-412d, establishes a connection to an SHB, such as SHB 410a or SHB 410b, over a corresponding one of client connections 414a-414d, which may be similar to client connections 406a-406d. The PHBs and SHBs are connected, via intermediate brokers 408a-408b, through broker-to-broker links.
The publish-subscribe system of the present invention may include a fault-tolerant protocol that tolerates link failures and message re-orderings, in which case it is not necessary for the broker-to-broker connections to use reliable FIFO protocols, such as TCP/IP, but may advantageously use faster, less reliable protocols. Each broker machine may be a stand-alone computer, a process within a computer, or, to minimize delay due to failures, a cluster of redundant processes within multiple computers. Similarly, the links may be simple socket connections, or connection bundles that use multiple alternative paths for high availability and load balancing.
FIG. 5 illustrates how a stateful publish-subscribe service of the present invention appears to clients. Clients are unaware of the physical broker network or its topology. A client application may connect to any broker in the role of publisher and/or subscriber. Publishing clients are aware only of particular named published message streams, such as message streams 502, 504. Multiple clients may publish to the same message stream.
Administrators and clients may define derived views based on functions of either published message streams or of other derived views. In the depicted example, message streams may be represented as relations. Derived views are represented as relations derived from published message streams or from other derived views by means of relational algebraic expressions in a language, such as Date and Darwen's Tutorial D, Structured Query Language (SQL), or XQuery. For example, derived view 510 is defined as a function of stream relations 502 and 504 by means of a JOIN expression with relations 502 and 504 as inputs and relation 510 as an output. Similarly, relation 512, indicated as a subscriber view, is derived from relation 510 by client-specified relational expressions. For example, subscriber view 512 may be a request to group the stock trades of relation 510 by issue and hour and compute the running total volume and max and min price for each issue-hour pair.
Each subscribing client subscribes to one or more particular derived views. As published events enter the system from publishing clients, they are saved in their respective streams. The system is then responsible for updating each derived view according to the previously specified relational expressions and then continuously delivering messages to each subscribing client representing the changes to the state of the respective subscribed view.
As used herein, the term “message transform” denotes a software component, such as a Java™ programming language object instance, which takes as inputs one or more sequences of messages and produces as output one or more sequences of messages. In typical use, a message transform may keep some local state and is repeatedly triggered upon receipt of a new message (or a batch of new messages) on one of its inputs. The message transform then performs some computation the effect of which is to possibly modify its state and to possibly generate one or more new messages on one or more of its output sequences. By way of example, the input messages of a transform may represent published events, such as stock trades, and the output messages of the transform might represent changes to a table of total stock volume per day grouped by stock issue. The function of the transform is to compute the change to the total in the appropriate row based upon the current state whenever a new published event arrives.
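One way the message transform component just described might be typed is sketched below, together with the running example from this paragraph: on each trade event, the transform updates its local state and emits the change to the total volume for that issue. The interface, record types, and names are illustrative assumptions, not an API of any existing product:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the message transform contract: a transform is triggered on
// each input message, may update local state, and may emit zero or more
// output messages on its output sequence.
public class TransformSketch {
    interface MessageTransform<I, O> {
        // Triggered for each arriving input message (or batch of messages).
        List<O> onMessage(I input);
    }

    record Trade(String issue, long volume) {}
    record TotalUpdate(String issue, long newTotal) {}

    // The paragraph's example: compute the change to the total volume
    // in the appropriate group whenever a new published event arrives.
    static class VolumeTotals implements MessageTransform<Trade, TotalUpdate> {
        private final Map<String, Long> totals = new HashMap<>(); // local state

        public List<TotalUpdate> onMessage(Trade t) {
            long total = totals.merge(t.issue(), t.volume(), Long::sum);
            return List.of(new TotalUpdate(t.issue(), total));
        }
    }
}
```

Only the delta (the updated total) flows downstream; the full history stays in the transform's local state.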
A sequence processing graph consists of a multiplicity of such message transforms connected together into a flow graph such that the output sequence produced by one message transform is fed into the instance of another transform. On a distributed network, it is possible for some of the transforms in the sequence processing graph to be deployed on different machines. In the system of the present invention, subscribers are interested in the results of one or more functions computed over one or more published sequences (subscriptions). Thus, the system may compute and propagate many message sequences concurrently.
In a preferred embodiment of the present invention, subscription specifications are analyzed by a compiler and converted into a collection of message transform objects and view objects. Each operator that derives a view from one or more inputs corresponds to one or more transform objects. Each view corresponds to a view object. View objects hold the state of a view. Transform objects express the logic for incrementally updating an output view constituting the result of an operator in response to individual changes to input views constituting the arguments to that operator. Without a compiler, the transform objects can be specified directly. The invention does not depend on the method of transform specification.
FIG. 6 illustrates an example of an operator that transforms input view objects to an output view object in accordance with a preferred embodiment of the present invention. In the depicted example, views 610 and 620 are view objects that are inputs to some operator, such as, for example, a JOIN operator. Transform 650 is a transform object for that operator, which produces a derived view shown as view object 670. When one of the input objects changes, either because it itself is a published input stream or because it is a derived view that has changed as a result of changes to its inputs, messages reflecting the changes are sent to transform object 650. Transform 650 receives the messages representing changes to its inputs 610, 620, computes how the result of the operator changes given the announced changes it has received, and then delivers the computed results to its output view object 670 in the form of change messages. Output view object 670 then propagates in its turn such change messages, either to further transforms, if view object 670 is an intermediate view, or to subscribers, if view object 670 is a subscriber view.
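The incremental behavior of such a JOIN transform can be sketched as follows. The key-to-value layout and the string encoding of joined pairs are illustrative assumptions: when an insertion arrives on either input view, the transform emits to the output view only the newly joinable pairs, rather than recomputing the whole join:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of an incremental JOIN transform: state is kept per input view,
// and each input change produces only the delta of the join result.
public class JoinTransform {
    private final Map<String, List<String>> left = new HashMap<>();
    private final Map<String, List<String>> right = new HashMap<>();
    final List<String> outputView = new ArrayList<>(); // joined pairs, in arrival order

    // A new tuple on the left input joins against all matching right tuples.
    void insertLeft(String key, String value) {
        left.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String r : right.getOrDefault(key, List.of()))
            outputView.add(value + "|" + r);
    }

    // Symmetrically for the right input.
    void insertRight(String key, String value) {
        right.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        for (String l : left.getOrDefault(key, List.of()))
            outputView.add(l + "|" + value);
    }
}
```

In the full system the additions to `outputView` would themselves be sent onward as change messages to the output view object, which propagates them in turn.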
FIG. 6 illustrates the objects and message pathways for a single transform implementing a single computational operation. When subscriptions are entered, the mechanism of the present invention builds a structure containing all of the transform objects and view objects needed for all intermediate and subscribed views of all subscriptions. This structure is called a dataflow hypergraph. The dataflow hypergraph has nodes corresponding to each view object and hyperedges, which may possibly have more than one input feeding an output, representing each transform object associated with an operation in the subscription specification.
The view objects and transform objects are then allocated to actual brokers in the overlay network, either manually by an administrator or automatically via a continuous deployment service as described herein. The published streams and the subscribed views may be constrained to be located on brokers where the publishers and subscribers actually connect. The placement of the intermediate transform objects and view objects is not constrained. That is, intermediate transform objects and view objects may be placed wherever suitable, taking into consideration the capacities of the broker machines and the links, as well as the desired performance. After such allocation of objects to brokers, the result is a distributed transform graph.
FIG. 7 illustrates an example dataflow hypergraph distributed over multiple brokers in accordance with a preferred embodiment of the present invention. In the depicted example, the physical network consists of brokers 710, 720, 730, and 740. There are three publishing clients 702, 704, and 706, and one subscribing client 750. The publishing clients are publishing to three separate published message streams: “buys” 722 on broker 720, “sells” 734 on broker 730, and “matches” 712 on broker 710. The subscribing client 750 subscribes to derived view 748 on broker 740.
Broker 710 also includes transforms 714 and 716, which feed change messages to brokers 720 and 730, respectively. Broker 720 includes view objects 724, 726 and transform objects 725, 727. As an example, view object 726 represents an intermediate derived view or relation, which is based on transform 725, published stream 722, and view 724. Broker 730 includes views 732 and 736, in addition to published stream 734, and also includes transforms 735, 737. Broker 740 includes views 742, 744, 748, and transform 746. View 748 is a subscriber view for subscriber 750.
As shown in FIG. 7, the transform graph consists of multiple transform and view objects distributed over all brokers. The paths between objects will sometimes lie within a broker, as is the case between transform object 725 and intermediate view object 726. In other cases, such as the path between transform 727 and intermediate view object 742 (shown with a dotted line), the path must cross over an inter-broker link. It is clear to those versed in the art that the within-broker communications between objects may use cheaper communications means, such as parameter passing between objects, whereas inter-broker communication requires generating physical messages or packets that will cross the link. In a preferred embodiment, the protocols of all view objects and transform objects will be able to recover from lost, out-of-order, or duplicate messages, and, therefore, will work correctly regardless of which paths between objects cross broker boundaries and which do not.
In order to support stateful subscriptions, a history of states is stored in a data storage device. For example, messages from the “matches” published stream 712 are stored in storage 782, messages from the “buys” published stream 722 are stored in storage 784, and messages from the “sells” published stream 734 are stored in storage 786. Storage 782, 784, 786 may be a portion of system memory or may be a persistent storage, such as a hard drive, or a combination thereof. Messages may be stored as records in a table, spreadsheet, or database, for example. While the example shown in FIG. 7 stores published messages in storage, other changes to derived views may also be stored in history storage. For example, changes to views 742, 744 may be stored in history storage for use by transform 746.
To compute and deliver message sequences to subscribers efficiently, the transforms involved in the computation must be placed on the nodes of the broker network intelligently in order to maximize some measure of performance. In accordance with a preferred embodiment of the present invention, the placement of transforms is performed continuously in response to system state changes, resource availability, and client demands. The system monitors itself using a set of agents that submit performance feedback information to the performance optimization service that uses the data to issue new transform deployment instructions to the broker network. Regardless of the placement algorithm, such a closed-loop system may sometimes decide to move some transforms from resource to resource to improve performance.
Designs with various types of dynamic process migration and task assignment for mainframe, distributed, and parallel systems are known in the art. These designs often deal with tasks that exchange data rarely, such as at the end of a computational block. In a stream-processing system, the tasks performing the transforms exchange data very often. A transform object continuously receives message sequence data from its parent transform objects and sends the derived data to its children. If a transform needs to be redeployed to another network node, it should be done in a way that causes minimal disruption to the subscriptions that use it. The problem of redeploying a transform at runtime is analogous to changing a segment of a water pipe without interrupting the water flow, spilling as little water as possible.
FIG. 8 depicts a process for deploying transform objects and view objects when a dataflow specification is a declarative specification in accordance with a preferred embodiment of the present invention. A declarative specification is a program, such as middleware program source 802. Program source 802 may be written in a language similar to SQL syntax, for example. Program source 802 may be compiled using compiler 804 into a dataflow hypergraph. Compiler 804 includes particular algorithms for compiling relational algebraic operators into tailored object code that implements transform objects specialized to incrementally evaluate operators, such as join, select, aggregate, etc., for relations of known data types and key signatures. The object code may be, for example, Java™ programming language code. Compiler 804 represents values of derived states as monotonic knowledge. A monotonic state can change value in only one direction. Even state values that oscillate back and forth are represented as the difference between two monotonically increasing values. Hence, the system can always detect which of two values of a single data field is older. As another example, the system distinguishes internally between a "missing value," which is missing because its value is not yet known, and one that is missing because it is known to be absent. As yet another example, the system distinguishes between a field having a value that is currently ten, but which may get larger later, and a field whose value is currently ten, but is final and will never change.
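The monotonic-knowledge representation described above can be sketched as follows. The class is hypothetical, but it follows the text directly: an oscillating value is kept as the difference of two totals that only ever increase, so any two snapshots of the same field can be ordered by their combined total.

```java
// Illustrative sketch of "monotonic knowledge": a possibly oscillating
// value represented as the difference between two monotonically
// increasing counters. Names are assumptions, not from the specification.
public class MonotonicValue {
    private long increases = 0; // running total of all upward changes
    private long decreases = 0; // running total of all downward changes

    public void change(long delta) {
        if (delta >= 0) {
            increases += delta;
        } else {
            decreases += -delta;
        }
    }

    // The visible value may go up or down.
    public long value() {
        return increases - decreases;
    }

    // The combined total only grows, so of two snapshots of the same
    // field, the one with the larger version() is always the newer one,
    // even when both report the same visible value.
    public long version() {
        return increases + decreases;
    }
}
```

This is how a field "currently ten but possibly growing" differs from a field "finally ten": both report value() of ten, yet later changes to the former strictly increase its version().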
Compiler 804 generates an object for each relation and an object for each transform. Each algebraic operation with n operands is associated with one transform object with n input interfaces and one relation object with an output interface. The compiler may then generate a script that is executed at deployment time to instantiate all transform and relation objects, connecting an instance of the output interface of each relation to an instance of the input of each transform that uses the relation as an input. Base relations are fed from an "input transform" object, which delivers published messages to the relation. The objects form the knowledge flow graph, or hypergraph, described earlier. Each transform in the hypergraph, such as join, project, group-by, select, etc., is associated with a template used by the compiler to produce an incremental transform. As discussed above, a transform has one or more relations feeding inputs to it and a single relation receiving its output. The incremental transform is an object that, given a message indicating that one component of its input has changed, computes the appropriate changes to its output and invokes a method in the output relation indicating what has changed.
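A minimal sketch of one such incremental transform follows, using a select operator as the example. The class, the predicate, and the single-insert interface are illustrative assumptions; a compiled transform would also handle deletions and updates and would invoke its output relation's change method rather than appending to a local list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an incremental select transform: on each
// input change it computes only the corresponding output change,
// rather than re-evaluating the whole relation.
public class IncrementalSelect {
    private final List<Integer> outputRelation = new ArrayList<>();

    // Called when one component of the input relation changes (here,
    // a single inserted row). Only the delta is propagated.
    public void onInsert(int volume) {
        if (volume > 1000) {             // e.g. the "volume > 1000" filter
            outputRelation.add(volume);  // push the change to the output relation
        }
    }

    public List<Integer> output() {
        return outputRelation;
    }
}
```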
The hypergraph may be specified explicitly or, alternatively, the hypergraph may be optimized using transform placement service 806 and deployed using deployment service 808. The hypergraph is optimized and deployed to broker network 824 by passing broker instructions 812 to broker network 824 and receiving performance information 814 from broker network 824. An optimization step consolidates multiple subscriptions to exploit common computations and performs other simplifications to reduce the total computational load on the system. A placement step allocates the transform objects of the knowledge flow graph to particular brokers for the purposes of load balancing, bandwidth reduction, increased update throughput, and overall reduction of delays between publishers and subscribers.
Broker network 824 receives the knowledge flow graph at deployment time. At execution time, publishers 822 publish message streams to broker network 824. The brokers receive events and propagate messages representing view changes towards the subscribers. Subscribers 828 may then receive messages from subscriber views in broker network 824.
FIG. 9 illustrates a redeployment mechanism in a distributed messaging system in accordance with a preferred embodiment of the present invention. The continuous feedback-controlled deployment system of the present invention includes deployment system 910 that has three components. Monitoring services component 912 detects the possible need to reconfigure. Performance optimization services component 914 computes a new assignment of transforms. Continuous deployment services component 916 implements a redeployment protocol that installs the changes to transform placement while the existing publish-subscribe system continues to operate.
An efficient stream transform redeployment algorithm addresses the problem of disruption that occurs when a transform needs to be moved from one hosting resource to another. Since transforms continuously exchange message sequence data, there is a high probability of having messages in the transform-to-transform links at the time of the move. Furthermore, new messages are likely to appear at the moving transform's inputs. If the transform is absent, the effects of these messages would not reach the subscribers, causing information gaps and delays.
The continuous transform redeployment service of the present invention limits the disruption in the message sequences as perceived by subscribers while one or more of the subscription transforms are being redeployed. The redeployment algorithm does this by ensuring that each message sequence continues to reach its subscribers without interruption. The redeployment system may temporarily route message sequences through redundant paths during periods of switchover.
The continuous deployment service of the present invention in the computing middleware works in two phases. The first phase is initial deployment of the service itself. The second phase is transform deployment instructed by performance optimization service 914. In the first phase, a program (agent) is installed on each computer (broker) that needs to participate in the computing middleware. In the example depicted in FIG. 9, agent 922 is installed on broker 920, agent 932 is installed on broker 930, and agent 942 is installed on broker 940.
Each agent turns the computer into a network broker and has several capabilities: a) the capability to start/stop a particular transform in the broker program; b) the capability to receive a transform assignment from continuous deployment service 916; c) the capability to copy a transform state to an agent on another machine; d) the capability to receive/send instructions to establish/break connections between its broker program and other broker programs; e) the capability to monitor the broker program and machine performance and send the statistics to monitoring service 912; f) the capability to register its network broker as a resource with deployment system 910 so that the broker can be used for computation and message routing; g) the capability to perform diagnostics of the network broker it manages to ensure that the broker works properly; and, h) the capability to request a diagnostic report from another agent about its broker.
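A condensed sketch of part of this capability surface follows. All names, signatures, and the in-memory bookkeeping are assumptions made for illustration; a real agent would drive an actual broker process rather than local collections.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of a broker agent covering a subset of the
// capabilities (a), (d), (f), and (g) listed above.
public class BrokerAgent {
    private final Set<String> runningTransforms = new HashSet<>();
    private final Set<String> connections = new HashSet<>();
    private boolean registered = false;

    public void startTransform(String id) { runningTransforms.add(id); }    // (a)
    public void stopTransform(String id)  { runningTransforms.remove(id); } // (a)
    public void connect(String broker)    { connections.add(broker); }      // (d)
    public void disconnect(String broker) { connections.remove(broker); }   // (d)
    public void register()                { registered = true; }            // (f)
    public boolean selfDiagnose()         { return registered; }            // (g), simplified
    public boolean isRunning(String id)   { return runningTransforms.contains(id); }
}
```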
Each agent enters the second phase of the continuous deployment service as soon as its registration with deployment system 910 is complete. In this phase, the agent is always ready to execute the steps of the efficient stream transform redeployment algorithm discussed above, when instructed by continuous deployment service 916. The efficient stream transform redeployment algorithm is as follows:
Assuming task T needs to be moved from network broker A to network broker B,
- 1. Instantiate a copy of T on system B;
- 2. Duplicate all inputs into A to additionally go to B;
- 3. Copy T's state from A to B;
- 4. Duplicate all outputs from A to also go from B to the same children;
- 5. Terminate T on A when B passes self-diagnostics and works properly.
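The five steps above can be sketched in code. The Broker stand-in and its fields are assumptions made for illustration; in the actual system each step is carried out by the agents associated with brokers A and B exchanging instructions, as detailed below.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the five-step move of transform T from
// network broker A to network broker B.
public class TransformMove {
    static class Broker {
        final Map<String, String> transformState = new HashMap<>();
        boolean inputsConnected;
        boolean outputsConnected;

        boolean selfDiagnostics() {
            return inputsConnected && outputsConnected;
        }
    }

    // Returns true when T has been terminated on A, i.e. the move completed.
    public static boolean move(String t, Broker a, Broker b) {
        b.transformState.put(t, null);                    // 1. instantiate a copy of T on B
        b.inputsConnected = true;                         // 2. duplicate A's inputs to B
        b.transformState.put(t, a.transformState.get(t)); // 3. copy T's state from A to B
        b.outputsConnected = true;                        // 4. duplicate outputs from B to the same children
        if (b.selfDiagnostics()) {                        // 5. terminate T on A once B works properly
            a.transformState.remove(t);
            return true;
        }
        return false;
    }
}
```

Note the ordering: inputs are duplicated before the state is copied and outputs before A is terminated, so at every instant at least one of the two brokers is carrying the stream.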
To perform task 1 of the algorithm, the agent associated with network broker B receives the program for transform T either from its local storage or from deployment system 910 or a delegate thereof. The agent then instructs the broker to load the program and prepare to execute the transform.
To perform task 2 of the algorithm, the agent associated with network broker A sends instructions to the agent associated with network broker B about the connections network broker B would need to establish as inputs. Once the instructions are received by the agent associated with broker B, the agent instructs the broker to initialize the connections and ensure that the messages start flowing into B using these connections.
To perform task 3 of the algorithm, the agent associated with network broker A sends a copy of the state of transform T to the agent associated with network broker B. This state is a tuple that contains all of the information necessary for T to continue being evaluated properly at broker B as if it were at broker A, as well as a sequence marker indicating the last message in the sequence that was processed to compute this state. The latter is necessary in order for broker B to know which messages it can use to compute the next correct version of the state. To complete the task, the agent associated with broker B loads the received state of T into the broker and ensures that T functions correctly with the new state.
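The transferred tuple can be sketched as follows. The class and field names are assumptions; the essential point, taken from the text, is that the state travels together with the sequence marker of the last message folded into it.

```java
// Hypothetical sketch of the state tuple shipped in task 3: the
// transform's evaluation state plus the sequence marker of the last
// message processed to compute that state.
public class TransformState {
    final String state;       // everything T needs to continue evaluation at B
    final long lastProcessed; // sequence number of the last message applied

    TransformState(String state, long lastProcessed) {
        this.state = state;
        this.lastProcessed = lastProcessed;
    }

    // Broker B applies only messages newer than the marker, so messages
    // that arrived during the move are neither lost nor applied twice.
    boolean shouldApply(long messageSeq) {
        return messageSeq > lastProcessed;
    }
}
```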
To perform task 4 of the algorithm, the agent associated with network broker A sends instructions to the agent associated with network broker B about the connections network broker B would need to establish as outputs. Once the instructions are received by the agent associated with broker B, the agent instructs the broker to initialize the connections and ensure that the messages start flowing out of B using these connections.
To perform task 5 of the algorithm, the agent associated with network broker A requests a diagnostic report from the agent associated with broker B. Once the agent associated with broker A receives a satisfactory diagnostic report, the agent associated with network broker A stops the transform T in its broker program.
FIG. 10 is a flowchart illustrating the operation of moving a message transform from a source broker to a target broker in a distributed messaging system in accordance with a preferred embodiment of the present invention. The process begins and instantiates a copy of the transform on the target broker (block 1002). Then, the process duplicates all of the inputs of the source broker to the target broker (block 1004). The process then copies the state of the transform from the source broker to the target broker (block 1006). Next, the process duplicates all of the outputs of the source broker from the target broker to the same children (block 1008). The process terminates the transform on the source broker when the target broker passes a self-diagnostic and works properly (block 1010). Thereafter, the process ends.
The stateful publish-subscribe system of the present invention can be deployed on a wide-area, distributed overlay network of broker machines, communicating by message passing and subject to the possibility of duplicate and out-of-order messages on links. The distributed messaging system of the present invention allows subscriptions expressed as relational algebraic expressions on published message histories. Relational algebra may include, in particular, operators such as select, project, join, extend, group-by, sum, count, and average, for example. The relational algebraic expressions may be mapped to various query languages, such as SQL and XQuery. Furthermore, the messaging system of the present invention may allow service specifications that are deterministic and "eventually consistent," meaning that multiple identical subscriptions eventually receive the same result, but weaker, and hence cheaper to implement, than fully consistent database systems.
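As a purely illustrative example of such a subscription, the following shows an SQL-like expression over a published message history, held as a string; the table and column names ("trades", "issue", "volume") are assumptions, not part of the specification.

```java
// Illustrative only: a stateful subscription over a message history,
// expressed in SQL-like syntax and combining group-by and aggregation.
public class SubscriptionExample {
    public static String highVolumeByIssue() {
        return "SELECT issue, SUM(volume) AS total "
             + "FROM trades "
             + "GROUP BY issue "
             + "HAVING SUM(volume) > 1000";
    }
}
```

Unlike a stateless filter such as "volume > 1000" on a single message, this expression depends on the accumulated history, which is what makes the subscription stateful.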
It is important to note that while the present invention has been described in the context of a fully functioning data processing system, those of ordinary skill in the art will appreciate that the processes of the present invention are capable of being distributed in the form of a computer readable medium of instructions and a variety of forms and that the present invention applies equally regardless of the particular type of signal bearing media actually used to carry out the distribution. Examples of computer readable media include recordable-type media, such as a floppy disk, a hard disk drive, a RAM, CD-ROMs, DVD-ROMs, and transmission-type media, such as digital and analog communications links, wired or wireless communications links using transmission forms, such as, for example, radio frequency and light wave transmissions. The computer readable media may take the form of coded formats that are decoded for actual use in a particular data processing system.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.