This application claims the benefit of U.S. Provisional Application No. 60/526,437, filed Dec. 2, 2003, and is a continuation-in-part of pending U.S. Patent Application Serial No. 10/304,378, entitled "Interactive Broadband Server System," filed Nov. 26, 2002, which itself claims priority to U.S. Provisional Application No. 60/333,856, filed Nov. 28, 2001, all of which have common inventorship, are commonly assigned, and are incorporated herein by reference for all purposes.
Detailed Description
The following description is presented to enable any person skilled in the art to make and use the invention as provided within the context of a particular application and its requirements. Various modifications to the preferred embodiment will, however, be apparent to one skilled in the art, and the general principles defined herein may be applied to other embodiments. Thus, the present invention is not intended to be limited to the particular embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The architecture described herein accommodates elements of varying capabilities so that the facility is not limited to the technology available at the time of the initial system purchase. The use of commodity elements ensures recently proven technology, avoidance of single sources, and the lowest cost per stream. The architecture tolerates the failure of individual components. In many cases, there is no significant change in performance from the user's perspective. In other cases, there is a brief "self-repair" cycle. In many cases, multiple failures may be tolerated. Moreover, in most, if not all, cases, the system can recover without requiring immediate attention, making it ideal for "lights out" operation.
The content storage allocation and internal bandwidth are automatically managed by a Least Recently Used (LRU) algorithm that ensures that the content in the RAM cache and the hard drive array cache is appropriate for the current needs, and that the backplane switch bandwidth is used in the most efficient manner. The bandwidth within the system is rarely, if ever, oversubscribed, so there is no need to drop or delay the transmission of packets. The architecture provides the ability to fully utilize the composite bandwidth of each element so that guarantees can be met, and the network is private and under full control, so that no data path is overloaded even in the case of unexpected peak demand. Any bit rate stream can be accommodated, but a typical stream is expected to remain in the 1 to 20 Mbps range. Non-isochronous content is accommodated on an available-bandwidth basis, and bandwidth may be reserved deliberately if required by the application. Files may be of any size without significant loss of storage efficiency.
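The cache-eviction behavior described above can be illustrated with a minimal Python sketch of an LRU cache keyed by sub-block identifier. The class name, capacity parameter, and dictionary-backed storage are illustrative assumptions and are not part of the specification:

```python
from collections import OrderedDict

class LRUCache:
    """Minimal least-recently-used cache for sub-blocks (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity          # maximum number of cached sub-blocks
        self.entries = OrderedDict()      # key -> sub-block bytes, oldest first

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)     # mark as most recently used
        return self.entries[key]

    def put(self, key, subblock):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = subblock
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
```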
FIG. 1 is a simplified block diagram of a portion of an Interactive Content Engine (ICE) 100 implemented according to an exemplary embodiment of the present invention. For the sake of clarity, only those elements necessary for an adequate and complete understanding of the present invention are shown. The ICE 100 includes a suitable multi-port Gigabit Ethernet (GbE) switch 101 as a backplane fabric having multiple Ethernet ports coupled to a plurality of Storage Processor Nodes (SPNs) 103. Each SPN 103 is a simplified server including two GbE ports, one or more processors 107, memory 109 (e.g., Random Access Memory (RAM)), and an appropriate number (e.g., four to eight) of disk drives 111. A first GbE port 105 on each SPN 103 is connected to a corresponding port of the switch 101 for full duplex operation (simultaneous transmission and reception at each SPN/port connection) and is used to move data within the ICE 100. The other GbE port (not shown) delivers content output to downstream users (not shown).
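The hardware configuration just described can be summarized in the following purely illustrative sketch; the field names, the RAM amount, and the drive count defaults are assumptions drawn from this description rather than a definitive specification:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageProcessorNode:
    """One SPN: a simplified server with two GbE ports, RAM, and local drives."""
    spn_id: int
    num_processors: int = 1
    ram_mb: int = 4096                    # illustrative amount of RAM
    disk_drives: int = 4                  # four to eight drives per SPN

@dataclass
class InteractiveContentEngine:
    """The ICE: a multi-port GbE backplane switch plus many SPNs."""
    switch_ports: int
    spns: List[StorageProcessorNode] = field(default_factory=list)
```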
Each SPN 103 has high speed access to its local disk drives and to the disk drives of the other four SPNs in its group of five SPNs. The switch 101 serves as a backplane for the ICE 100 rather than merely a means of communication between the SPNs 103. For illustrative purposes, only five SPNs 103 are shown, with the understanding that the ICE 100 typically includes a much larger number of servers. Each SPN 103 functions as storage, processor, and transmitter of content. In the configuration shown, each SPN 103 is built from off-the-shelf components, but it is not a computer in the usual sense. Although a standard operating system could be considered, such interrupt-driven operating systems may present unnecessary bottlenecks.
Each title (e.g., a video, movie, or other media content) is not stored entirely on any single disk drive 111. Instead, the data for each title is divided and stored across several disk drives within the ICE 100 to achieve the speed advantage of interleaved access. The content of a single title is distributed across multiple disk drives of multiple SPNs 103. Short "time frames" of title content are gathered from each drive in each SPN 103 in round-robin fashion. In this manner, the actual load is spread beyond the drive-count limits of SCSI and IDE, a form of fail-safe operation is achieved, and the organization and management of a large number of titles is facilitated.
In the particular construction shown, each content title is divided into fixed-size discrete blocks (typically about 2 Megabytes (MB) per block). Each block is stored on a different set of SPNs 103 in a round-robin fashion. Each block is divided into four sub-blocks, and a fifth sub-block representing parity is created. Each sub-block is stored on a disk drive of a different SPN 103. In the configuration shown and described, a sub-block size of approximately 512 Kilobytes (KB) (where "K" is 1024) matches the nominal data unit of the disk drives 111. The SPNs 103 are grouped five at a time, and each group or set of SPNs stores one data block of a title. As shown, the five SPNs 103 are labeled 1-4 and "Parity," and collectively store a block 113 as five separate sub-blocks 113a, 113b, 113c, 113d, and 113e stored on SPNs 1, 2, 3, 4, and Parity, respectively. The sub-blocks 113a-113e are shown stored in a distributed manner on a different drive of each different SPN (e.g., SPN1/DRIVE1, SPN2/DRIVE2, SPN3/DRIVE3, etc.), but may be stored in any other suitable combination (e.g., SPN1/DRIVE1, SPN2/DRIVE1, SPN3/DRIVE3, etc.). Sub-blocks 1-4 contain data, and sub-block Parity contains the parity information for the data sub-blocks. The size of each set of SPNs, although typically five, is arbitrary and could easily be any other suitable number, such as, for example, from two SPNs to ten SPNs. With two SPNs, 50% of the storage is devoted to redundancy; with ten, only 10%. Five is a trade-off between storage efficiency and probability of failure.
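The division of a block into four data sub-blocks plus one parity sub-block can be illustrated with a simple XOR-parity sketch. The 2 MB block and 512 KB sub-block sizes follow the description above, while the function name and structure are illustrative assumptions:

```python
SUB_BLOCK_SIZE = 512 * 1024          # 512 KB, where "K" is 1024
DATA_SUB_BLOCKS = 4                  # four data sub-blocks per block

def split_block(block: bytes):
    """Split a ~2 MB block into four data sub-blocks and one XOR parity sub-block."""
    assert len(block) == DATA_SUB_BLOCKS * SUB_BLOCK_SIZE
    subs = [block[i * SUB_BLOCK_SIZE:(i + 1) * SUB_BLOCK_SIZE]
            for i in range(DATA_SUB_BLOCKS)]
    parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*subs))
    return subs, parity
```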
By distributing the content in this manner, at least two goals are achieved. First, the number of users that can view a single title is not limited to the number that can be served by a single set of SPNs, but rather by the bandwidth of all the sets of SPNs taken together. Thus, only one copy of each content title is required. The trade-off is a limit on the number of new viewers of a given title that can be started per second, which is far less of a constraint than the wasted space and administrative overhead of redundant storage. A second goal is an increase in the overall reliability of the ICE 100. The failure of a single drive is masked by regenerating its contents in real time using the parity drive, similar to a Redundant Array of Independent Disks (RAID). The failure of an SPN 103 is masked by the fact that it contains one drive from each of several RAID sets, each of which continues to operate. Users connected to a failed SPN are very quickly taken over by shadow processes running on other SPNs. In the event of a failure of a disk drive or of an entire SPN, the operator is notified to repair or replace the failed equipment. When a missing sub-block is reconstructed by a user process, it is transmitted back to the SPN that would have provided it, where it is cached in RAM (as if it had been read from the local disk drive). This avoids wasting the time of other user processes in repeating the same reconstruction for popular titles, since later requests are filled from RAM as long as the sub-block remains popular enough to stay cached.
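The real-time regeneration of a lost sub-block from the surviving sub-blocks reduces to the same XOR operation used to create the parity; the following is a minimal sketch under that assumption, not the patented implementation:

```python
from functools import reduce

def rebuild_missing(survivors):
    """Recover a lost sub-block by XOR-ing the four surviving sub-blocks
    (any three data sub-blocks plus parity, or all four data sub-blocks)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*survivors))
```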
The goal of the User Process (UP) running on each "user" SPN 103 is to gather sub-blocks from its own disk plus the corresponding four sub-blocks from other user SPNs to assemble each block of video content for transmission. The user SPNs are distinct from one or more management (MGMT) SPNs, which are constructed in the same manner but perform different functions, as described further below. A pair of redundant MGMT SPNs is contemplated to improve reliability and performance. The gathering and assembling functions performed by each UP are performed many times on each user SPN 103 for multiple users. As a result, a significant amount of data transfer occurs between the user SPNs 103. Typical Ethernet protocols, with packet collision detection and retries, would otherwise be overwhelmed. Such protocols are designed for random transmissions and depend on idle time between those events, so that approach is not used. In the ICE 100, collisions are avoided by using a full-duplex, fully switched architecture and by carefully managing bandwidth. Most communication occurs synchronously. The switch 101 itself is managed in a synchronized manner, as described further below, to coordinate transmissions. Because it is determined which SPNs 103 transmit and when, a port is not overwhelmed by more data than it can handle within a given period of time. Indeed, the data is first gathered in the memory 109 of a user SPN 103, and its transfer is then managed synchronously. As part of the coordination, status signals pass between the user SPNs 103; unlike the actual content delivered to the end user, the signaling data exchanged between the user SPNs is small in size.
If the transmission of the sub-blocks were allowed to occur randomly or asynchronously, the length of each sub-block (approximately 512 KB, where "K" is 1024) would overwhelm any buffering available in the GbE switch 101. The time period for transmitting this much information is about 4 milliseconds (ms), and it is desirable to ensure that several ports do not attempt to transmit to a single port at the same time. Thus, as described further below, the switch 101 is managed in a manner such that it operates synchronously, with all ports fully utilized under full load conditions.
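The 4 ms figure follows directly from the sub-block size and the port speed; a quick back-of-the-envelope check (ignoring packet overhead, which lengthens the period slightly):

```python
SUB_BLOCK_BYTES = 512 * 1024            # 512 KB, where "K" is 1024
PORT_BITS_PER_SECOND = 1_000_000_000    # 1 Gbps GbE port

transfer_seconds = SUB_BLOCK_BYTES * 8 / PORT_BITS_PER_SECOND
print(f"{transfer_seconds * 1000:.2f} ms")   # ~4.19 ms per sub-block
```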
A redundant directory process that manages the file system (or Virtual File System, VFS) is responsible for reporting where a given content title is stored when it is requested by a user. It is also responsible for allocating the required storage space when a new title is loaded. All allocations are in integral blocks, each of which comprises five sub-blocks. Space on each disk drive is managed within the drive by Logical Block Address (LBA). The sub-blocks are stored at consecutive sectors or LBAs on a disk drive. The capacity of each disk drive in the ICE 100 is represented by its maximum LBA divided by the number of sectors per sub-block.
Each title map, or "directory entry," contains a list indicating where the blocks of its title are stored, and more specifically where the sub-blocks of each block are located. In the illustrated embodiment, each item in the list representing a sub-block includes an SPNID identifying a specific user SPN 103, a disk drive number (DD#) identifying a specific disk drive 111 of the identified user SPN 103, and a sub-block pointer (or Logical Block Address, LBA), packed together into a 64-bit value. Each directory entry contains the sub-block list for about half an hour of content at a nominal 4 Mbps. This is equal to 450 blocks, or 2250 sub-blocks. Each directory entry is about 20 KB including auxiliary data. When a UP executing on an SPN requests a directory entry, the entire entry is sent and stored locally for the corresponding user. Even if an SPN supports 1,000 users, only 20 MB of memory is consumed by the local lists or directory entries.
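A sub-block locator packed into a 64-bit value, as described above, might be laid out as in the following sketch. The exact bit widths are illustrative assumptions, since the text states only that the SPNID, disk drive number, and LBA fit together in 64 bits:

```python
# Illustrative bit layout (not specified by the text): 16-bit SPNID,
# 8-bit disk drive number, 40-bit LBA of the sub-block.
def pack_locator(spnid: int, dd: int, lba: int) -> int:
    return (spnid << 48) | (dd << 40) | lba

def unpack_locator(value: int):
    return (value >> 48) & 0xFFFF, (value >> 40) & 0xFF, value & 0xFFFFFFFFFF

locator = pack_locator(spnid=12, dd=3, lba=1_048_576)
assert unpack_locator(locator) == (12, 3, 1_048_576)
```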
The ICE 100 maintains a database of all titles available to the user. This list includes the local optical disc library, real-time network programming, and titles at remote locations for which licensing and transport arrangements have been made. The database contains all the metadata for each title, including management information (licensing period, bit rate, resolution, etc.) and information of interest to the user (production, director, cast, play, author, etc.). When the user makes a selection, the directory of a Virtual File System (VFS) 209 (FIG. 2) is queried to determine whether the title is already loaded in the disk array. If not, a loading process (not shown) is initiated for that piece of content, and the UP is notified when it becomes viewable, if necessary. In most cases, the delay is no greater than the mechanical delay of the optical disc retrieval machine (not shown), or about 30 seconds.
The information stored on each optical disc (not shown) includes all of the metadata that is read into the database when the disc is first loaded into the library, as well as the compressed digital video and audio representing the title and all of the information that can be gleaned in advance about these data streams. For example, it contains pointers to all relevant information in the data streams, such as clock values and time stamps. The content has already been divided into sub-blocks, with the parity sub-blocks pre-computed and stored on the disc. In general, anything that can be pre-processed to save loading time and processing overhead is included on the optical disc.
Included in the resource management system is a scheduler (not shown), which the UP consults to receive a start time for its stream (typically within a few milliseconds of the request). The scheduler ensures that the load on the system remains uniform, that latency is minimized, and that the bandwidth required within the ICE 100 never exceeds what is available. When a user requests a stop, pause, fast forward, rewind, or other operation that interrupts the flow of their stream, their bandwidth is de-allocated and a new allocation is made for any newly requested service (e.g., a fast-forward stream).
FIG. 2 is a logical block diagram of a portion of the ICE 100 illustrating a synchronous data transfer system 200 implemented according to an embodiment of the present invention. The switch 101 is shown coupled to several exemplary SPNs 103, including a first user SPN 201, a second user SPN 203, and a management (MGMT) SPN 205. As previously mentioned, many SPNs 103 are coupled to the switch 101; only two user SPNs 201, 203 are shown for purposes of illustrating the present invention, and each is implemented in the same manner as any SPN 103 previously described. The MGMT SPN 205 is likewise implemented as any other SPN 103, but generally performs management functions rather than user-specific functions. SPN 201 illustrates certain functions, and SPN 203 illustrates other functions, of each user SPN 103. It is understood, however, that each user SPN 103 is configured to perform similar functions, so that the functions (and processes) described for SPN 201 are also provided on SPN 203, and vice versa.
As previously described, the switch 101 operates at 1 Gbps per port, so that each sub-block (about 512 KB) takes about 4 ms to pass from one SPN to another. Each user SPN 103 executes one or more User Processes (UPs), each supporting one downstream user. When a new block of a title is needed to refill a user output buffer (not shown), the next five sub-blocks on the list are requested from the other user SPNs storing them. Since many UPs may request multiple sub-blocks at substantially the same time, the sub-block transmission duration would otherwise overwhelm the buffering capacity of almost any GbE switch for a single port, let alone for the entire switch. This is true of the illustrated switch 101. If the sub-block transmissions were not managed, all five sub-blocks for every UP could be returned at the same time, overwhelming the output port bandwidth. It is therefore desirable to manage the timing of transmissions among the SPNs of the ICE 100 so that the most critical data is transmitted first and intact.
SPN 201 is shown executing a UP 207 serving a corresponding downstream user. The user requests a title (e.g., a movie), and the request is forwarded to the UP 207. The UP 207 passes a Title Request (TR) to a VFS 209 (described further below) located on the MGMT SPN 205. The VFS 209 returns a Directory Entry (DE) to the UP 207, and the UP 207 stores the DE locally, as shown at 211. The DE 211 includes a list locating each sub-block of the title (SC1, SC2, etc.), each entry including an SPNID identifying a specific user SPN 103, a disk drive number (DD#) identifying a specific disk drive 111 of the identified SPN 103, and an address or LBA identifying the specific location of the sub-block on that disk drive. SPN 201 issues Time Stamped Read Requests (TSRRs), one at a time, for each sub-block in the DE 211. In the ICE 100, each request is made immediately and directly; that is, SPN 201 immediately and directly requests a sub-block from the specific user SPN 103 that stores it. In the illustrated configuration, requests are made in the same manner even for locally stored data; in other words, even if the requested sub-block resides on the local disk drives of SPN 201, the request is sent out via the switch 101 as in the remote case. The SPN could be configured to recognize when a request is being sent from an SPN to itself, but it is simpler to handle all cases in the same way, especially in larger installations where the probability that a request is actually local is smaller.
Although the requests are sent out immediately and directly, the sub-blocks are each returned in a fully managed manner. Each TSRR is directed, using the SPNID, to the specific user SPN 103 that stores the sub-block, and includes the DD# and LBA used by that target user SPN to retrieve and return the data. Each TSRR may also include any other identifying information sufficient to ensure that the requested sub-block is properly returned to the appropriate requester and to enable the requester to identify the sub-block (e.g., a UP identifier to distinguish among multiple UPs executing on the destination SPN, a sub-block identifier to distinguish among the sub-blocks of each data block, etc.). Each TSRR also includes a Time Stamp (TS) identifying the specific time at which the original request was made. The TS identifies the priority of the request for isochronous transmission purposes, where priority is based on time such that earlier requests assume higher priority. When received, the returned sub-blocks of the requested title are stored in a local title memory 213 for further processing and delivery to the user who requested the title.
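The fields carried by a TSRR, as described above, can be sketched as a simple record; the class and field names below are illustrative assumptions rather than the claimed format:

```python
from dataclasses import dataclass
import time

@dataclass
class TimeStampedReadRequest:
    """One TSRR sent from a requesting UP to the SPN that stores a sub-block."""
    spnid: int              # target user SPN that stores the sub-block
    dd: int                 # disk drive number on that SPN
    lba: int                # logical block address of the sub-block
    requester_spnid: int    # where the sub-block should be returned
    up_id: int              # which UP on the requester issued the request
    timestamp: float        # time of the original request; earlier = higher priority

request = TimeStampedReadRequest(spnid=7, dd=2, lba=384_000,
                                 requester_spnid=1, up_id=42,
                                 timestamp=time.time())
```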
The user SPN 203 illustrates the operation and support functions of a Transfer Process (TP) 215 executing on each user SPN (e.g., 201, 203) to receive TSRRs and return the requested sub-blocks. The TP 215 includes or otherwise interfaces with a storage process (not shown) that interfaces with the local disk drives 111 on the SPN 203 to request and access the stored sub-blocks. The storage process may be implemented in any desired manner, such as a state machine or the like, and may be a separate process interfacing between the TP 215 and the local disk drives 111, as known to those skilled in the art. As shown, the TP 215 receives one or more TSRRs from one or more UPs executing on other user SPNs 103 and stores each request in a Read Request Queue (RRQ) 217 in its local memory 109. The RRQ 217 stores a list of requests for sub-blocks SCA, SCB, etc. The disk drive storing the requested sub-blocks removes the corresponding requests from the RRQ 217, sorts them into physical order, and then performs each read in the sorted order. Accesses to the sub-blocks on each disk are managed in groups. Each group is sorted into physical order according to an "elevator seek" operation (one sweep from low to high, the next sweep from high to low, and so on, sweeping back and forth across the disk surface, pausing to read the next sequential sub-block). Requests for successful reads are stored in a Successful Read Queue (SRQ) 218 sorted in TS order. Requests for failed reads (if any) are stored in a Failed Read Queue (FRQ) 220, and the failure information is forwarded to a network management system (not shown), which determines the cause of the error and the appropriate corrective action. Note that in the illustrated configuration, the queues 217, 218, and 220 store request information rather than the actual sub-blocks.
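A minimal sketch of the elevator-style ordering of queued reads described above; representing each request as an (LBA, timestamp) pair and alternating the scan direction with a flag are assumptions made for illustration:

```python
def elevator_order(requests, ascending=True):
    """Order queued read requests by LBA, alternating scan direction each pass.

    `requests` is a list of (lba, timestamp) pairs taken from the read request
    queue; the drive sweeps low-to-high on one pass and high-to-low on the next.
    """
    return sorted(requests, key=lambda r: r[0], reverse=not ascending)

# One pass low-to-high, the next high-to-low:
queue = [(900, 3.0), (120, 1.0), (450, 2.0)]
first_pass = elevator_order(queue, ascending=True)    # [(120,...), (450,...), (900,...)]
second_pass = elevator_order(queue, ascending=False)  # [(900,...), (450,...), (120,...)]
```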
Each sub-block that is successfully read is placed in memory reserved for an LRU cache of the most recently requested sub-blocks. For each retrieved sub-block, the TP 215 creates a corresponding Message (MSG) that includes the TS for the sub-block, the Source (SRC) of the sub-block (e.g., the SPNID from which the sub-block is to be transmitted, its actual memory location, and any other identifying information), and the Destination (DST) SPN (e.g., SPN 201) to which the sub-block is to be transmitted. As shown, the SRQ 218 includes messages MSGA, MSGB, etc. for the sub-blocks SCA, SCB, etc., respectively. After reading and caching the requested sub-blocks, the TP 215 sends the corresponding MSGs to a Synchronous Switch Manager (SSM) 219 executing on the MGMT SPN 205.
The SSM 219 receives and prioritizes multiple MSGs from the TPs on the user SPNs and eventually sends a transmit request (TXR) to the TP 215 identifying one of the MSGs in its SRQ 218, such as by a message identifier (MSGID). When the SSM 219 sends a TXR with a MSGID identifying a sub-block in the SRQ 218 of the TP 215, the corresponding request entry moves from the SRQ 218 to a Network Transfer Process (NTP) 221 (where "move" denotes that the request is removed from the SRQ 218), and the NTP 221 builds the packets used to transfer the sub-block to the destination user SPN. The order in which sub-block request entries are removed from the SRQ 218 need not be sequential, even though the list is kept in timestamp order, because only the SSM 219 determines the proper ordering. The SSM 219 sends one TXR to every other SPN 103 having at least one sub-block to send, unless the sub-block is destined for a UP on an SPN 103 that is already scheduled to receive an equal or higher priority sub-block, as described further below. The SSM 219 then broadcasts a single transmit command (TX CMD) to all of the user SPNs 103. The TP 215 responds to the TX CMD broadcast by the SSM 219 by instructing the NTP 221 to transmit the sub-block to the user SPN 103 of the requesting UP. In this manner, each SPN 103 that has received a TXR from the SSM 219 transmits simultaneously to a requesting user SPN 103.
The VFS 209 on the MGMT SPN 205 manages the list of titles and their locations within the ICE 100. In a typical computer system, the directory (information about the data) typically resides on the same disk as the data itself. In the ICE 100, however, the VFS 209 is centrally located to manage the distributed data, since the data of each title is distributed across multiple disks in the disk array, which in turn are distributed across multiple user SPNs 103. As previously described, the disk drives 111 on the user SPNs 103 primarily store the sub-blocks of the titles. The VFS 209 includes identifiers for the location of each sub-block, as described above, using the SPNID, DD#, and LBA. The VFS 209 also includes identifiers for content located external to the ICE 100 (e.g., in optical storage). When a user requests a title, a full set of directory information (IDs/addresses) is made available to the UP executing on the user SPN 103 that received the user's request. From there, the task is to transfer the sub-blocks from the disk drives into memory (buffers) and move them via the switch 101 to the requesting user SPN 103, which assembles the complete block in a buffer, delivers it to the user, and repeats the process until the title is complete.
The SSM 219 creates a list of "ready" messages in timestamp order in a Ready Message (RDY MSG) list 223. The messages received from the TPs on the user SPNs 103 need not arrive in timestamp order, but they are placed in TS order in the RDY MSG list 223. Just before the next set of transfers, the SSM 219 scans the RDY MSG list 223 beginning with the earliest timestamp. The SSM 219 first identifies the earliest TS in the RDY MSG list 223 and generates and sends the corresponding TXR message to the TP 215 of the user SPN 103 storing that sub-block, to initiate its transfer in the current round. The SSM 219 continues scanning the list 223 in TS order for each subsequent sub-block, generating a TXR message for each sub-block whose source and destination are not already involved in the current round of sub-block transmissions. For each TX CMD broadcast to all of the user SPNs 103, each user SPN 103 transmits at most one sub-block at a time and receives at most one sub-block at a time, although it may do both simultaneously. For example, if a TXR message is sent to the TP of SPN #10 scheduling a sub-block transmission to SPN #2, then SPN #10 cannot simultaneously transmit another sub-block. SPN #10 may, however, simultaneously receive a sub-block from another SPN. Likewise, SPN #2 cannot simultaneously receive another sub-block while receiving the sub-block from SPN #10, although SPN #2 may simultaneously transmit to another SPN, because of the full-duplex nature of each port of the switch 101.
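A minimal sketch of the per-round selection just described: scan ready messages in timestamp order and pick at most one transmission per source SPN and one per destination SPN. The dictionary representation of a message is an illustrative assumption:

```python
def schedule_round(ready_msgs):
    """Select sub-block transfers for one round.

    `ready_msgs` is a list of dicts like {"ts": ..., "src": spn_id, "dst": spn_id},
    kept in timestamp order. Each SPN may send at most one sub-block and receive
    at most one sub-block per round (full duplex: it may do both at once).
    """
    sending, receiving, selected = set(), set(), []
    for msg in sorted(ready_msgs, key=lambda m: m["ts"]):
        if msg["src"] in sending or msg["dst"] in receiving:
            continue                      # that port is already busy this round
        sending.add(msg["src"])
        receiving.add(msg["dst"])
        selected.append(msg)              # the SSM would send a TXR for this MSG
    return selected                       # followed by a single broadcast TX CMD
```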
The SSM 219 continues scanning the RDY MSG list 223 until all of the user SPNs 103 have been accounted for or until the end of the RDY MSG list 223 is reached. Each entry in the RDY MSG list 223 for which a TXR message has been sent is eventually removed from the RDY MSG list 223 (either when the TXR message is sent or after the transfer is completed). When the last transmission of the current period has ended, the SSM 219 broadcasts a TX CMD packet signaling all of the user SPNs 103 to begin the next round of transmissions. For the particular configuration described, each round of transfers occurs simultaneously over a period of approximately 4 to 5 milliseconds. During each transmission round, additional MSGs are sent to the SSM 219 and new TXR messages are sent out to the user SPNs 103 to schedule the next round of transmissions, and the process repeats. The time period between successive TX CMDs is approximately equal to the time needed to transmit all of the bytes of a sub-block, including packet overhead and inter-packet delay, plus a period for clearing any caching that may have occurred in the switch during the sub-block transfers, typically 60 microseconds (µs), plus a period to account for any jitter caused by delays in recognizing the TX CMD at the individual SPNs, typically less than 100 µs.
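Under the figures given above, the period between successive TX CMD broadcasts works out to roughly 4.5 ms; the following back-of-the-envelope calculation assumes a 5% packet-overhead allowance, which is an illustrative figure rather than one stated in the text:

```python
SUB_BLOCK_BITS = 512 * 1024 * 8          # one 512 KB sub-block
PORT_RATE = 1_000_000_000                # 1 Gbps per switch port
PACKET_OVERHEAD = 1.05                   # assumed ~5% for headers and inter-packet gaps

transfer_time = SUB_BLOCK_BITS * PACKET_OVERHEAD / PORT_RATE   # ~4.4 ms
switch_clear_time = 60e-6                # clear switch buffering, ~60 µs
jitter_allowance = 100e-6                # TX CMD recognition jitter, <100 µs

period = transfer_time + switch_clear_time + jitter_allowance
print(f"{period * 1000:.2f} ms per round")   # ~4.56 ms
```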
In one embodiment, a duplicate or mirror MGMT SPN (not shown) mirrors the primary MGMT SPN 205, so that the SSM 219, the VFS 209, and the scheduler are each duplicated on a redundant pair of dedicated MGMT SPNs. In one embodiment, the synchronizing TX CMD broadcast serves as a heartbeat indicating the health of the MGMT SPN 205. The heartbeat signals to the secondary MGMT SPN that all is well. In the absence of the heartbeat for a predetermined period of time, such as, for example, 5 ms, the secondary MGMT SPN takes over all of the management functions.
Although the present invention has been described in considerable detail with reference to certain preferred versions thereof, other versions and variations are possible and contemplated. Those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the spirit and scope of the invention as defined by the following claims.