This application claims the benefit of U.S. Provisional Application No. 60/526,437, filed Dec. 2, 2003, and is a continuation-in-part of pending U.S. Patent Application Serial No. 10/304,378, entitled "Interactive Broadband Server System," filed Nov. 26, 2002, which itself claims priority to U.S. Provisional Application No. 60/333,856, filed Nov. 28, 2001, all of which have common inventorship, are commonly assigned, and are incorporated herein by reference for all purposes.
Detailed Description
The following description is presented to enable any person skilled in the art to make and use the invention as provided within the context of a particular application and its requirements. Various modifications to the preferred embodiment will, however, be apparent to one skilled in the art, and the general principles defined herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the particular embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The architecture described herein accommodates elements of varying capabilities so that the facility is not limited to the technology available at the time of the initial system purchase. The use of commodity elements ensures recently proven technology, avoidance of single sources, and the lowest cost per stream. The architecture tolerates the failure of individual components. In many cases, there is no significant change in performance from the user's perspective. In other cases, there is a brief "self-repair" cycle. In many cases, multiple failures may be tolerated. Moreover, in most, if not all, cases, the system can recover without requiring immediate attention, making it ideal for "lights out" operation.
The content storage allocation and internal bandwidth are automatically managed by a Least Recently Used (LRU) algorithm that ensures that the content in the RAM cache and the hard drive array cache is appropriate for the current needs, and that the backplane switch bandwidth is used in the most efficient manner. The bandwidth within the system is rarely, if ever, oversubscribed, so there is no need to drop or delay the transmission of packets. The architecture provides the ability to fully utilize the composite bandwidth of each element so that guarantees can be met, and the network is private and under full control, so that no data path is overloaded even in the case of unexpected peak demand. Any bit rate stream can be accommodated, but a typical stream is expected to remain in the 1 to 20 Mbps range. Non-isochronous content is accommodated on an available-bandwidth basis, and bandwidth may be reserved deliberately if required by the application. Files may be of any size without significant loss of storage efficiency.
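The cache-eviction behavior described above can be illustrated with a minimal Python sketch of an LRU cache keyed by sub-block identifier. The class name, capacity parameter, and dictionary-backed storage are illustrative assumptions and are not part of the specification:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache for sub-blocks (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of cached sub-blocks
        self.entries = OrderedDict()      # key -> sub-block bytes, oldest first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)     # mark as most recently used
        return self.entries[key]

    def put(self, key, subblock):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = subblock
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
```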
FIG. 1 is a simplified block diagram of a portion of an Interactive Content Engine (ICE) 100 implemented according to an exemplary embodiment of the present invention. For the sake of clarity, only those elements necessary for an adequate and complete understanding of the present invention are shown. The ICE 100 includes a suitable multi-port Gigabit Ethernet (GbE) switch 101 as a backplane fabric having multiple Ethernet ports coupled to a plurality of Storage Processor Nodes (SPNs) 103. Each SPN 103 is a simplified server including two GbE ports, one or more processors 107, memory 109 (e.g., Random Access Memory (RAM)), and an appropriate number (e.g., four to eight) of disk drives 111. A first GbE port 105 on each SPN 103 is connected to a corresponding port of the switch 101 for full duplex operation (simultaneous transmission and reception at each SPN/port connection) and is used to move data within the ICE 100. The other GbE port (not shown) delivers content output to downstream users (not shown).
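The hardware configuration just described can be summarized in the following purely illustrative sketch; the field names, the RAM amount, and the drive count defaults are assumptions drawn from this description rather than a definitive specification:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageProcessorNode:
    """One SPN: a simplified server with two GbE ports, RAM, and local drives."""
    spn_id: int
    num_processors: int = 1
    ram_mb: int = 4096                    # illustrative amount of RAM
    disk_drives: int = 4                  # four to eight drives per SPN

@dataclass
class InteractiveContentEngine:
    """The ICE: a multi-port GbE backplane switch plus many SPNs."""
    switch_ports: int
    spns: List[StorageProcessorNode] = field(default_factory=list)
```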
Each SPN 103 has high speed access to its local disk drives and to the disk drives of the other four SPNs in its group of five SPNs. The switch 101 serves as a backplane for the ICE 100 rather than merely a means of communication between the SPNs 103. For illustrative purposes, only five SPNs 103 are shown, with the understanding that the ICE 100 typically includes a much larger number of servers. Each SPN 103 functions as storage, processor, and transmitter of content. In the configuration shown, each SPN 103 is built from off-the-shelf components, but it is not a computer in the usual sense. Although a standard operating system could be considered, such interrupt-driven operating systems may present unnecessary bottlenecks.
Each title (e.g., a video, movie, or other media content) is not stored entirely on any single disk drive 111. Instead, the data for each title is divided and stored across several disk drives within the ICE 100 to achieve the speed advantage of interleaved access. The content of a single title is distributed across multiple disk drives of multiple SPNs 103. Short "time frames" of title content are gathered from each drive in each SPN 103 in round-robin fashion. In this manner, the actual load is spread beyond the drive-count limits of SCSI and IDE, a form of fail-safe operation is achieved, and the organization and management of a large number of titles is facilitated.
In the particular construction shown, each content title is divided into fixed-size discrete blocks (typically about 2 Megabytes (MB) per block). Each block is stored on a different set of SPNs 103 in a round-robin fashion. Each block is divided into four sub-blocks, and a fifth sub-block representing parity is created. Each sub-block is stored on a disk drive of a different SPN 103. In the configuration shown and described, a sub-block size of approximately 512 Kilobytes (KB) (where "K" is 1024) matches the nominal data unit of the disk drives 111. The SPNs 103 are grouped five at a time, and each group or set of SPNs stores one data block of a title. As shown, the five SPNs 103 are labeled 1-4 and "Parity," and collectively store a block 113 as five separate sub-blocks 113a, 113b, 113c, 113d, and 113e stored on SPNs 1, 2, 3, 4, and Parity, respectively. The sub-blocks 113a-113e are shown stored in a distributed manner on a different drive of each different SPN (e.g., SPN1/DRIVE1, SPN2/DRIVE2, SPN3/DRIVE3, etc.), but may be stored in any other suitable combination (e.g., SPN1/DRIVE1, SPN2/DRIVE1, SPN3/DRIVE3, etc.). Sub-blocks 1-4 contain data, and sub-block Parity contains the parity information for the data sub-blocks. The size of each set of SPNs, although typically five, is arbitrary and could easily be any other suitable number, such as, for example, from two SPNs to ten SPNs. With two SPNs, 50% of the storage is devoted to redundancy; with ten, only 10%. Five is a trade-off between storage efficiency and probability of failure.
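The division of a block into four data sub-blocks plus one parity sub-block can be illustrated with a simple XOR-parity sketch. The 2 MB block and 512 KB sub-block sizes follow the description above, while the function name and structure are illustrative assumptions:

```python
SUB_BLOCK_SIZE = 512 * 1024          # 512 KB, where "K" is 1024
DATA_SUB_BLOCKS = 4                  # four data sub-blocks per block

def split_block(block: bytes):
    """Split a ~2 MB block into four data sub-blocks and one XOR parity sub-block."""
    assert len(block) == DATA_SUB_BLOCKS * SUB_BLOCK_SIZE
    subs = [block[i * SUB_BLOCK_SIZE:(i + 1) * SUB_BLOCK_SIZE]
            for i in range(DATA_SUB_BLOCKS)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*subs))
    return subs, parity
```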
By distributing the content in this manner, at least two goals are achieved. First, the number of users that can view a single title is not limited to the number that can be served by a single set of SPNs, but rather by the bandwidth of all the sets of SPNs taken together. Thus, only one copy of each content title is required. The trade-off is a limit on the number of new viewers of a given title that can be started per second, which is far less of a constraint than the wasted space and administrative overhead of redundant storage. A second goal is an increase in the overall reliability of the ICE 100. The failure of a single drive is masked by regenerating its contents in real time using the parity drive, similar to a Redundant Array of Independent Disks (RAID). The failure of an SPN 103 is masked by the fact that it contains one drive from each of several RAID sets, each of which continues to operate. Users connected to a failed SPN are very quickly taken over by shadow processes running on other SPNs. In the event of a failure of a disk drive or of an entire SPN, the operator is notified to repair or replace the failed equipment. When a missing sub-block is reconstructed by a user process, it is transmitted back to the SPN that would have provided it, where it is cached in RAM (as if it had been read from the local disk drive). This avoids wasting the time of other user processes in repeating the same reconstruction for popular titles, since later requests are filled from RAM as long as the sub-block remains popular enough to stay cached.
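The real-time regeneration of a lost sub-block from the surviving sub-blocks reduces to the same XOR operation used to create the parity; the following is a minimal sketch under that assumption, not the patented implementation:

```python
from functools import reduce

def rebuild_missing(survivors):
    """Recover a lost sub-block by XOR-ing the four surviving sub-blocks
    (any three data sub-blocks plus parity, or all four data sub-blocks)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*survivors))
```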
The goal of the User Process (UP) running on each "user" SPN 103 is to gather sub-blocks from its own disk plus the corresponding four sub-blocks from other user SPNs to assemble each block of video content for transmission. The user SPNs are distinct from one or more management (MGMT) SPNs, which are constructed in the same manner but perform different functions, as described further below. A pair of redundant MGMT SPNs is contemplated to improve reliability and performance. The gathering and assembling functions performed by each UP are performed many times on each user SPN 103 for multiple users. As a result, a significant amount of data transfer occurs between the user SPNs 103. Typical Ethernet protocols, with packet collision detection and retries, would otherwise be overwhelmed. Such protocols are designed for random transmissions and depend on idle time between those events, so that approach is not used. In the ICE 100, collisions are avoided by using a full-duplex, fully switched architecture and by carefully managing bandwidth. Most communication occurs synchronously. The switch 101 itself is managed in a synchronized manner, as described further below, to coordinate transmissions. Because it is determined which SPNs 103 transmit and when, a port is not overwhelmed by more data than it can handle within a given period of time. Indeed, the data is first gathered in the memory 109 of a user SPN 103, and its transfer is then managed synchronously. As part of the coordination, status signals pass between the user SPNs 103; unlike the actual content delivered to the end user, the signaling data exchanged between the user SPNs is small in size.
If the transmission of the sub-blocks were allowed to occur randomly or asynchronously, the length of each sub-block (approximately 512 KB, where "K" is 1024) would overwhelm any buffering available in the GbE switch 101. The time period for transmitting this much information is about 4 milliseconds (ms), and it is desirable to ensure that several ports do not attempt to transmit to a single port at the same time. Thus, as described further below, the switch 101 is managed in a manner such that it operates synchronously, with all ports fully utilized under full load conditions.
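The 4 ms figure follows directly from the sub-block size and the port speed; a quick back-of-the-envelope check (ignoring packet overhead, which lengthens the period slightly):

```python
SUB_BLOCK_BYTES = 512 * 1024            # 512 KB, where "K" is 1024
PORT_BITS_PER_SECOND = 1_000_000_000    # 1 Gbps GbE port

transfer_seconds = SUB_BLOCK_BYTES * 8 / PORT_BITS_PER_SECOND
print(f"{transfer_seconds * 1000:.2f} ms")   # ~4.19 ms per sub-block
```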
A redundant directory process that manages the file system (or Virtual File System, VFS) is responsible for reporting where a given content title is stored when it is requested by a user. It is also responsible for allocating the required storage space when a new title is loaded. All allocations are in integral blocks, each of which comprises five sub-blocks. Space on each disk drive is managed within the drive by Logical Block Address (LBA). The sub-blocks are stored at consecutive sectors or LBAs on a disk drive. The capacity of each disk drive in the ICE 100 is represented by its maximum LBA divided by the number of sectors per sub-block.
Each title map, or "directory entry," contains a list indicating where the blocks of its title are stored, and more specifically where the sub-blocks of each block are located. In the illustrated embodiment, each item in the list representing a sub-block includes an SPNID identifying a specific user SPN 103, a disk drive number (DD#) identifying a specific disk drive 111 of the identified user SPN 103, and a sub-block pointer (or Logical Block Address, LBA), packed together into a 64-bit value. Each directory entry contains the sub-block list for about half an hour of content at a nominal 4 Mbps. This is equal to 450 blocks, or 2250 sub-blocks. Each directory entry is about 20 KB including auxiliary data. When a UP executing on an SPN requests a directory entry, the entire entry is sent and stored locally for the corresponding user. Even if an SPN supports 1,000 users, only 20 MB of memory is consumed by the local lists or directory entries.
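A sub-block locator packed into a 64-bit value, as described above, might be laid out as in the following sketch. The exact bit widths are illustrative assumptions, since the text states only that the SPNID, disk drive number, and LBA fit together in 64 bits:

```python
# Illustrative bit layout (not specified by the text): 16-bit SPNID,
# 8-bit disk drive number, 40-bit LBA of the sub-block.
def pack_locator(spnid: int, dd: int, lba: int) -> int:
    return (spnid << 48) | (dd << 40) | lba

def unpack_locator(value: int):
    return (value >> 48) & 0xFFFF, (value >> 40) & 0xFF, value & 0xFFFFFFFFFF

locator = pack_locator(spnid=12, dd=3, lba=1_048_576)
assert unpack_locator(locator) == (12, 3, 1_048_576)
```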
The ICE 100 maintains a database of all titles available to the user. This list includes the local optical disc library, real-time network programming, and titles at remote locations for which licensing and transport arrangements have been made. The database contains all the metadata for each title, including management information (licensing period, bit rate, resolution, etc.) and information of interest to the user (production, director, cast, play, author, etc.). When the user makes a selection, the directory of a Virtual File System (VFS) 209 (FIG. 2) is queried to determine whether the title is already loaded in the disk array. If not, a loading process (not shown) is initiated for that piece of content, and the UP is notified when it becomes viewable, if necessary. In most cases, the delay is no greater than the mechanical delay of the optical disc retrieval machine (not shown), or about 30 seconds.
The information stored on each optical disc (not shown) includes all of the metadata that is read into the database when the disc is first loaded into the library, as well as the compressed digital video and audio representing the title and all of the information that can be gleaned in advance about these data streams. For example, it contains pointers to all relevant information in the data streams, such as clock values and time stamps. The content has already been divided into sub-blocks, with the parity sub-blocks pre-computed and stored on the disc. In general, anything that can be pre-processed to save loading time and processing overhead is included on the optical disc.
Included in the resource management system is a scheduler (not shown), which the UP consults to receive a start time for its stream (typically within a few milliseconds of the request). The scheduler ensures that the load on the system remains uniform, that latency is minimized, and that the bandwidth required within the ICE 100 never exceeds what is available. When a user requests a stop, pause, fast forward, rewind, or other operation that interrupts the flow of their stream, their bandwidth is de-allocated and a new allocation is made for any newly requested service (e.g., a fast-forward stream).
FIG. 2 is a logical block diagram of a portion of the ICE 100 illustrating a synchronous data transfer system 200 implemented according to an embodiment of the present invention. The switch 101 is shown coupled to several exemplary SPNs 103, including a first user SPN 201, a second user SPN 203, and a management (MGMT) SPN 205. As previously mentioned, many SPNs 103 are coupled to the switch 101; only two user SPNs 201, 203 are shown for purposes of illustrating the present invention, and each is implemented in the same manner as any SPN 103 previously described. The MGMT SPN 205 is likewise implemented as any other SPN 103, but generally performs management functions rather than user-specific functions. SPN 201 illustrates certain functions, and SPN 203 illustrates other functions, of each user SPN 103. It is understood, however, that each user SPN 103 is configured to perform similar functions, so that the functions (and processes) described for SPN 201 are also provided on SPN 203, and vice versa.
As previously described, the switch 101 operates at 1 Gbps per port, so that each sub-block (about 512 KB) takes about 4 ms to pass from one SPN to another. Each user SPN 103 executes one or more User Processes (UPs), each supporting one downstream user. When a new block of a title is needed to refill a user output buffer (not shown), the next five sub-blocks on the list are requested from the other user SPNs storing them. Since many UPs may request multiple sub-blocks at substantially the same time, the sub-block transmission duration would otherwise overwhelm the buffering capacity of almost any GbE switch for a single port, let alone for the entire switch. This is true of the illustrated switch 101. If the sub-block transmissions were not managed, all five sub-blocks for every UP could be returned at the same time, overwhelming the output port bandwidth. It is therefore desirable to manage the timing of transmissions among the SPNs of the ICE 100 so that the most critical data is transmitted first and intact.
SPN 201 is shown executing a UP 207 serving a corresponding downstream user. The user requests a title (e.g., a movie), and the request is forwarded to the UP 207. The UP 207 passes a Title Request (TR) to a VFS 209 (described further below) located on the MGMT SPN 205. The VFS 209 returns a Directory Entry (DE) to the UP 207, and the UP 207 stores the DE locally, as shown at 211. The DE 211 includes a list locating each sub-block of the title (SC1, SC2, etc.), each entry including an SPNID identifying a specific user SPN 103, a disk drive number (DD#) identifying a specific disk drive 111 of the identified SPN 103, and an address or LBA identifying the specific location of the sub-block on that disk drive. SPN 201 issues Time Stamped Read Requests (TSRRs), one at a time, for each sub-block in the DE 211. In the ICE 100, each request is made immediately and directly; that is, SPN 201 immediately and directly requests a sub-block from the specific user SPN 103 that stores it. In the illustrated configuration, requests are made in the same manner even for locally stored data; in other words, even if the requested sub-block resides on the local disk drives of SPN 201, the request is sent out via the switch 101 as in the remote case. The SPN could be configured to recognize when a request is being sent from an SPN to itself, but it is simpler to handle all cases in the same way, especially in larger installations where the probability that a request is actually local is smaller.
Although the requests are sent out immediately and directly, the sub-blocks are each returned in a fully managed manner. Each TSRR is directed, using the SPNID, to the specific user SPN 103 that stores the sub-block, and includes the DD# and LBA used by that target user SPN to retrieve and return the data. Each TSRR may also include any other identifying information sufficient to ensure that the requested sub-block is properly returned to the appropriate requester and to enable the requester to identify the sub-block (e.g., a UP identifier to distinguish among multiple UPs executing on the destination SPN, a sub-block identifier to distinguish among the sub-blocks of each data block, etc.). Each TSRR also includes a Time Stamp (TS) identifying the specific time at which the original request was made. The TS identifies the priority of the request for isochronous transmission purposes, where priority is based on time such that earlier requests assume higher priority. When received, the returned sub-blocks of the requested title are stored in a local title memory 213 for further processing and delivery to the user who requested the title.
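The fields carried by a TSRR, as described above, can be sketched as a simple record; the class and field names below are illustrative assumptions rather than the claimed format:

```python
from dataclasses import dataclass
import time

@dataclass
class TimeStampedReadRequest:
    """One TSRR sent from a requesting UP to the SPN that stores a sub-block."""
    spnid: int              # target user SPN that stores the sub-block
    dd: int                 # disk drive number on that SPN
    lba: int                # logical block address of the sub-block
    requester_spnid: int    # where the sub-block should be returned
    up_id: int              # which UP on the requester issued the request
    timestamp: float        # time of the original request; earlier = higher priority

request = TimeStampedReadRequest(spnid=7, dd=2, lba=384_000,
                                 requester_spnid=1, up_id=42,
                                 timestamp=time.time())
```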
The user SPN 203 illustrates the operation and support functions of a Transfer Process (TP) 215 executing on each user SPN (e.g., 201, 203) to receive TSRRs and return the requested sub-blocks. The TP 215 includes or otherwise interfaces with a storage process (not shown) that interfaces with the local disk drives 111 on the SPN 203 to request and access the stored sub-blocks. The storage process may be implemented in any desired manner, such as a state machine or the like, and may be a separate process interfacing between the TP 215 and the local disk drives 111, as known to those skilled in the art. As shown, the TP 215 receives one or more TSRRs from one or more UPs executing on other user SPNs 103 and stores each request in a Read Request Queue (RRQ) 217 in its local memory 109. The RRQ 217 stores a list of requests for sub-blocks SCA, SCB, etc. The disk drive storing the requested sub-blocks removes the corresponding requests from the RRQ 217, sorts them into physical order, and then performs each read in the sorted order. Accesses to the sub-blocks on each disk are managed in groups. Each group is sorted into physical order according to an "elevator seek" operation (one sweep from low to high, the next sweep from high to low, and so on, sweeping back and forth across the disk surface, pausing to read the next sequential sub-block). Requests for successful reads are stored in a Successful Read Queue (SRQ) 218 sorted in TS order. Requests for failed reads (if any) are stored in a Failed Read Queue (FRQ) 220, and the failure information is forwarded to a network management system (not shown), which determines the cause of the error and the appropriate corrective action. Note that in the illustrated configuration, the queues 217, 218, and 220 store request information rather than the actual sub-blocks.
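A minimal sketch of the elevator-style ordering of queued reads described above; representing each request as an (LBA, timestamp) pair and alternating the scan direction with a flag are assumptions made for illustration:

```python
def elevator_order(requests, ascending=True):
    """Order queued read requests by LBA, alternating scan direction each pass.

    `requests` is a list of (lba, timestamp) pairs taken from the read request
    queue; the drive sweeps low-to-high on one pass and high-to-low on the next.
    """
    return sorted(requests, key=lambda r: r[0], reverse=not ascending)

# One pass low-to-high, the next high-to-low:
queue = [(900, 3.0), (120, 1.0), (450, 2.0)]
first_pass = elevator_order(queue, ascending=True)    # [(120,...), (450,...), (900,...)]
second_pass = elevator_order(queue, ascending=False)  # [(900,...), (450,...), (120,...)]
```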
Each sub-block that is successfully read is placed in memory reserved for an LRU cache of the most recently requested sub-blocks. For each retrieved sub-block, the TP 215 creates a corresponding Message (MSG) that includes the TS for the sub-block, the Source (SRC) of the sub-block (e.g., the SPNID from which the sub-block is to be transmitted, its actual memory location, and any other identifying information), and the Destination (DST) SPN (e.g., SPN 201) to which the sub-block is to be transmitted. As shown, the SRQ 218 includes messages MSGA, MSGB, etc. for the sub-blocks SCA, SCB, etc., respectively. After reading and caching the requested sub-blocks, the TP 215 sends the corresponding MSGs to a Synchronous Switch Manager (SSM) 219 executing on the MGMT SPN 205.
The SSM 219 receives and prioritizes multiple MSGs from the TPs on the user SPNs and eventually sends a transmit request (TXR) to the TP 215 identifying one of the MSGs in its SRQ 218, such as by a message identifier (MSGID). When the SSM 219 sends a TXR with a MSGID identifying a sub-block in the SRQ 218 of the TP 215, the corresponding request entry moves from the SRQ 218 to a Network Transfer Process (NTP) 221 (where "move" denotes that the request is removed from the SRQ 218), and the NTP 221 builds the packets used to transfer the sub-block to the destination user SPN. The order in which sub-block request entries are removed from the SRQ 218 need not be sequential, even though the list is kept in timestamp order, because only the SSM 219 determines the proper ordering. The SSM 219 sends one TXR to every other SPN 103 having at least one sub-block to send, unless the sub-block is destined for a UP on an SPN 103 that is already scheduled to receive an equal or higher priority sub-block, as described further below. The SSM 219 then broadcasts a single transmit command (TX CMD) to all of the user SPNs 103. The TP 215 responds to the TX CMD broadcast by the SSM 219 by instructing the NTP 221 to transmit the sub-block to the user SPN 103 of the requesting UP. In this manner, each SPN 103 that has received a TXR from the SSM 219 transmits simultaneously to a requesting user SPN 103.
The VFS 209 on the MGMT SPN 205 manages the list of titles and their locations within the ICE 100. In a typical computer system, the directory (information about the data) typically resides on the same disk as the data itself. In the ICE 100, however, the VFS 209 is centrally located to manage the distributed data, since the data of each title is distributed across multiple disks in the disk array, which in turn are distributed across multiple user SPNs 103. As previously described, the disk drives 111 on the user SPNs 103 primarily store the sub-blocks of the titles. The VFS 209 includes identifiers for the location of each sub-block, as described above, using the SPNID, DD#, and LBA. The VFS 209 also includes identifiers for content located external to the ICE 100 (e.g., in optical storage). When a user requests a title, a full set of directory information (IDs/addresses) is made available to the UP executing on the user SPN 103 that received the user's request. From there, the task is to transfer the sub-blocks from the disk drives into memory (buffers) and move them via the switch 101 to the requesting user SPN 103, which assembles the complete block in a buffer, delivers it to the user, and repeats the process until the title is complete.
The SSM 219 creates a list of "ready" messages in timestamp order in a Ready Message (RDY MSG) list 223. The messages received from the TPs on the user SPNs 103 need not arrive in timestamp order, but they are placed in TS order in the RDY MSG list 223. Just before the next set of transfers, the SSM 219 scans the RDY MSG list 223 beginning with the earliest timestamp. The SSM 219 first identifies the earliest TS in the RDY MSG list 223 and generates and sends the corresponding TXR message to the TP 215 of the user SPN 103 storing that sub-block, to initiate its transfer in the current round. The SSM 219 continues scanning the list 223 in TS order for each subsequent sub-block, generating a TXR message for each sub-block whose source and destination are not already involved in the current round of sub-block transmissions. For each TX CMD broadcast to all of the user SPNs 103, each user SPN 103 transmits at most one sub-block at a time and receives at most one sub-block at a time, although it may do both simultaneously. For example, if a TXR message is sent to the TP of SPN #10 scheduling a sub-block transmission to SPN #2, then SPN #10 cannot simultaneously transmit another sub-block. SPN #10 may, however, simultaneously receive a sub-block from another SPN. Likewise, SPN #2 cannot simultaneously receive another sub-block while receiving the sub-block from SPN #10, although SPN #2 may simultaneously transmit to another SPN, because of the full-duplex nature of each port of the switch 101.
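A minimal sketch of the per-round selection just described: scan ready messages in timestamp order and pick at most one transmission per source SPN and one per destination SPN. The dictionary representation of a message is an illustrative assumption:

```python
def schedule_round(ready_msgs):
    """Select sub-block transfers for one round.

    `ready_msgs` is a list of dicts like {"ts": ..., "src": spn_id, "dst": spn_id},
    kept in timestamp order. Each SPN may send at most one sub-block and receive
    at most one sub-block per round (full duplex: it may do both at once).
    """
    sending, receiving, selected = set(), set(), []
    for msg in sorted(ready_msgs, key=lambda m: m["ts"]):
        if msg["src"] in sending or msg["dst"] in receiving:
            continue                      # that port is already busy this round
        sending.add(msg["src"])
        receiving.add(msg["dst"])
        selected.append(msg)              # the SSM would send a TXR for this MSG
    return selected                       # followed by a single broadcast TX CMD
```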
The SSM 219 continues scanning the RDY MSG list 223 until all of the user SPNs 103 have been accounted for or until the end of the RDY MSG list 223 is reached. Each entry in the RDY MSG list 223 for which a TXR message has been sent is eventually removed from the RDY MSG list 223 (either when the TXR message is sent or after the transfer is completed). When the last transmission of the current period has ended, the SSM 219 broadcasts a TX CMD packet signaling all of the user SPNs 103 to begin the next round of transmissions. For the particular configuration described, each round of transfers occurs simultaneously over a period of approximately 4 to 5 milliseconds. During each transmission round, additional MSGs are sent to the SSM 219 and new TXR messages are sent out to the user SPNs 103 to schedule the next round of transmissions, and the process repeats. The time period between successive TX CMDs is approximately equal to the time needed to transmit all of the bytes of a sub-block, including packet overhead and inter-packet delay, plus a period for clearing any caching that may have occurred in the switch during the sub-block transfers, typically 60 microseconds (µs), plus a period to account for any jitter caused by delays in recognizing the TX CMD at the individual SPNs, typically less than 100 µs.
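Under the figures given above, the period between successive TX CMD broadcasts works out to roughly 4.5 ms; the following back-of-the-envelope calculation assumes a 5% packet-overhead allowance, which is an illustrative figure rather than one stated in the text:

```python
SUB_BLOCK_BITS = 512 * 1024 * 8          # one 512 KB sub-block
PORT_RATE = 1_000_000_000                # 1 Gbps per switch port
PACKET_OVERHEAD = 1.05                   # assumed ~5% for headers and inter-packet gaps

transfer_time = SUB_BLOCK_BITS * PACKET_OVERHEAD / PORT_RATE   # ~4.4 ms
switch_clear_time = 60e-6                # clear switch buffering, ~60 µs
jitter_allowance = 100e-6                # TX CMD recognition jitter, <100 µs

period = transfer_time + switch_clear_time + jitter_allowance
print(f"{period * 1000:.2f} ms per round")   # ~4.56 ms
```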
In one embodiment, a duplicate or mirror MGMT SPN (not shown) mirrors the primary MGMT SPN 205, so that the SSM 219, the VFS 209, and the scheduler are each duplicated on a redundant pair of dedicated MGMT SPNs. In one embodiment, the synchronizing TX CMD broadcast serves as a heartbeat indicating the health of the MGMT SPN 205. The heartbeat signals to the secondary MGMT SPN that all is well. In the absence of the heartbeat for a predetermined period of time, such as, for example, 5 ms, the secondary MGMT SPN takes over all of the management functions.
Although the present invention has been described in considerable detail with reference to certain preferred versions thereof, other versions and variations are possible and contemplated. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the following claims.