





Recovering from storage transaction failures using checkpoints

Info

Publication number
HK1113206A
Authority
HK
Hong Kong
Prior art keywords
data
time
store
storage
processor module
Application number
HK08102498.5A
Other languages
Chinese (zh)
Inventor
R. Perry
Original Assignee
Symantec Operating Corporation
Application filed by Symantec Operating Corporation
Publication of HK1113206A


Description

Recovery from storage transaction failures using checkpoints
Technical Field
[0001] The disclosed technology relates to the field of data storage and, more particularly, to time-dependent data storage and retrieval.
Background
[0002] Enterprises are increasingly dependent on computer systems that allow data sharing across the enterprise. Data storage systems that have been developed to store large amounts of data are often of paramount importance to enterprises. Thus, an interruption or failure of the data storage system may disrupt the operation of the entire enterprise.
[0003] Typically, data used by applications running on a computer system is stored on both primary storage (e.g., a disk) and secondary storage (e.g., tape and relatively inexpensive disk drives) for protection. When these applications run, data changes as business activities progress. Information technology departments typically deal with many problems with data storage systems. However, in general, these problems can be divided into two broad categories: hardware failures and data corruption.
[0004] The commercial significance of data storage systems and the importance of the integrity of the data they store and maintain have generated a corresponding tremendous interest in systems that provide data protection and data recovery. Currently, mirroring and snapshot technology are the two main approaches available to businesses interested in data recovery. In the event of a system failure, data recovery allows the enterprise to recover data from a previous point in time and resume operations using uncorrupted data. Once a hardware failure or the time of the corruption event (or events) is identified, recovery can be achieved by returning to a known point in time at which the stored data was not corrupted.
[0005] Typically, the data storage device includes individual storage units, such as cells, blocks, sectors, and the like. A read command generated by a host system (generally meaning one or more host systems) directs the information system to provide the data specified in the request to the host. Traditionally, the designation of information is based on its location within the data storage device, e.g., one or more particular blocks. Write commands are executed in a similar manner. For example, data is written to a particular location of memory in response to an I/O request generated by the host system. The location identifier provides a direct association between the data and the storage unit in which the data is stored. Accordingly, the location identifier is used to read and update the data.
[0006] With respect to the hardware failure category of data protection problems, vendors provide several different mechanisms to help prevent hardware failures from affecting application availability and performance, e.g., disk mirroring. This is a mechanism in which multiple disks are grouped together to store the same information, allowing a disk to fail without preventing the application from retrieving its data. In a typical setup, a user allocates 1-4 mirror disks for each application data disk. Each write request sent to the application's primary disk is also sent to the mirror copy disks, so the user actually has N (where N is typically between 2 and 5) disks on which the data is identical. As a result, the mirroring method provides at least one full copy of the current data at any given time. Thus, if one disk fails, the user still has the application data preserved on another mirror disk. A Redundant Array of Independent Disks (RAID) provides one example of a mirroring system.
[0007] The mirroring method is ineffective, however, when data corruption occurs. Data corruption comes in many forms, but it is typically recognized when the user's application stops functioning as a result of the data that has been written to disk. There are many possible sources of data corruption, such as a failed attempt to upgrade an application, a user accidentally deleting critical information, a rogue user intentionally corrupting application data, a computer virus, and so forth. Whatever the cause, mirroring actually works against users who have experienced data corruption, because mirroring copies the bad data to all of the mirrors simultaneously. Thus, all copies of the data are corrupted.
[0008] In addition, because the disks are continually updated, a backup of historical data, i.e., a snapshot of the data as it appeared in the data store at a past time T, can be created only if the system is instructed to save the backup at or before time T. Thus, at time T+1 the system cannot provide a backup of the data as it existed at time T. Furthermore, each storage unit is preserved regardless of whether the data stored in it has changed since the previous backup was created. This approach is inefficient and costly because it increases the storage capacity required to back up the data store at multiple points in time. Also, when used with larger data storage systems, mirroring becomes less efficient and more error prone, because large systems span hundreds of disks and the system cannot guarantee that each disk is backed up at the same point in time. Thus, complex and error-prone processes are used to attempt to establish a simultaneous backup of the entire data storage system.
[0009] As described above, snapshots, also known as single point-in-time images, are often created in conjunction with a mirroring system. Alternatively, the snapshot approach may be used as an independent data storage and recovery approach. In the snapshot approach, the user selects periodic points in time at which the current contents of the disk are copied and written to a different storage device or to an assigned group of storage units within the same storage device. However, this method has the same disadvantage as the mirroring method, namely that a snapshot captures only the point in time at which it is created, whether in response to a user request or to a previously scheduled instruction to create a snapshot of the stored data. Neither data mirroring nor data snapshots, alone or in combination, allow a user to use later-acquired knowledge to re-create a data set as it existed at some past time. Because the data stored in each storage unit is not associated with an individual time identifier, the user cannot go back and view the data from a particular point in time unless a historical backup was created at exactly that time. Restoring the data at an intermediate time, e.g., time T-1, which falls between the current time (T) and the time the last backup disk was saved (e.g., T-2), is not possible. Also, generating a single point-in-time image is often a lengthy process. As storage capacity and data set sizes have increased, image generation time has become increasingly significant.
[0010] Thus, the storage industry has focused on faster and more frequent generation of images. Providers of tape-based data recovery systems have attempted to provide larger, more scalable tape libraries by increasing system capacity and the number of tape heads, thereby allowing parallel operation. Vendors of disk-based systems have focused on how to use disk drives to provide more single point-in-time images with improved response times. In one approach, one of the N mirror disks is taken offline at a specified time in order to create a single point-in-time image at that time. If the number of mirror disks is increased sufficiently, this method allows an increased number of images. However, this approach significantly increases the storage capacity required at each point in time; for example, a 5 terabyte application requires 30 terabytes of storage to support 2 standard mirror disks and 4 point-in-time images. Because these solutions merely attempt to scale existing approaches, they fail to provide a solution that can keep pace as the capacity of data storage systems continues to increase.
Disclosure of Invention
[0011] The disclosed techniques address the shortcomings of current systems by facilitating data recovery at substantially any previous point in time, even when a request is made at a time after the recovery time.
[0012] In one embodiment, the disclosed techniques may be used to checkpoint a copy-on-write sequence of operations, which is useful in recovering from a subsequent storage transaction failure. For example, in a redundant system, there may be one primary processor handling I/O operations and one or more secondary processors that can complete any in-process I/O operations of the primary processor upon detection of an error or failure in the primary processor. The disclosed checkpoint embodiments provide information useful for successfully processing outstanding I/O operations when a secondary processor is called upon to replace a primary processor. At the same time, embodiments of the disclosed technology facilitate the use of these checkpoints in a manner that is integrated with the storage of other transactional information. Further, embodiments of the disclosed technology allow the primary processor to employ processing optimizations, because the secondary processor does not need to know about any optimizations attempted by the primary processor in order to successfully step in and assist it; the secondary processor can use the disclosed checkpoint information to determine what processing it needs to perform for any outstanding I/O operations. This is particularly advantageous for high-speed systems in which there may be thousands, tens of thousands, or more outstanding I/O transactions at any given time.
[0013] An illustrative copy-on-write operation sequence incorporating one or more checkpoints may include, for example, the following sequence of operations: in one embodiment, a method includes receiving a first write request identifying payload data to be written starting at a first address of a first data store, reading original data associated with the first address of the first data store, copying the original data to a second data store starting at a second address, recording transaction information associated with the first write request (e.g., a flag associated with the first address, the second address, and/or a time at which the first write request was received), generating a first checkpoint to confirm that the transaction information was successfully recorded and the original data was successfully copied to the second data store, writing the payload data to the first data store starting at the first address, confirming successful completion of a copy-on-write operation sequence, and generating a second checkpoint to confirm successful completion of such operation sequence.
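For illustration only, the following minimal Python sketch walks through this checkpointed copy-on-write sequence. The dictionary-based stores, the transaction log, and the function and field names are illustrative stand-ins, not structures described by the embodiments.

```python
import time

# Minimal in-memory stand-ins for the current store, time store, and transaction log.
current_store = {}          # address -> data currently visible to the host
time_store = {}             # time-store address -> saved (overwritten) data
transaction_log = []        # list of transaction-information records
_next_time_store_addr = 0

def copy_on_write(first_addr, payload):
    """Handle a write request to `first_addr`, checkpointing as described above."""
    global _next_time_store_addr

    # 1. Read the original data that the write request will overwrite.
    original = current_store.get(first_addr)

    # 2. Copy the original data to the time store at a newly allocated address.
    second_addr = _next_time_store_addr
    _next_time_store_addr += 1
    time_store[second_addr] = original

    # 3. Record transaction information for the write request.
    record = {"first_addr": first_addr, "second_addr": second_addr,
              "received_at": time.time(), "checkpoints": set()}
    transaction_log.append(record)

    # 4. First checkpoint: transaction info recorded, original data safely copied.
    record["checkpoints"].add("first")

    # 5. Write the payload data over the original data in the current store.
    current_store[first_addr] = payload

    # 6. Second checkpoint: the whole copy-on-write sequence completed.
    record["checkpoints"].add("second")
    return record
```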
[0014] The first and second checkpoints, either individually or in combination, may serve as a basis for forming a representation of one or more storage units (or portions thereof) as they existed prior to a storage transaction failure (e.g., hardware failure, power failure, etc.). The first checkpoint, which may be stored as part of the transactional information, may serve as a basis for storing information associated with the first write request into one or more queues of the processor module. The second checkpoint, which may be stored as part of the transactional information, may serve as a basis for removing information associated with the second write request from one or more queues of the processor module.
[0015] In one embodiment, the disclosed techniques may be used to develop systems and execute methods to recover from failures associated with copy-on-write operation sequences (e.g., hardware/power failures of processor modules). One or more write requests queued prior to the failure and corresponding to operations within the copy-on-write operation sequence may be identified. It may be determined whether a first checkpoint was formed in response to completing a first portion of the copy-on-write sequence. The first portion of the copy-on-write operation sequence may include one or more operations associated with copying original data stored at a location (identified by a queued write request) within the first data store to the second data store. A further determination may also be made as to whether a second checkpoint has been formed in response to completion of at least a second portion of the copy-on-write operation sequence. The second portion of the copy-on-write operation sequence may include one or more operations associated with writing payload data specified by the queued write request to a location within the first data store.
[0016] Based on the first and/or second checkpoints, one or more queued write requests can be processed to recover, at least in part, from the failure. For example, when the first checkpoint cannot be located, processing of the queued write request may include queuing for execution substantially all operations associated with the copy-on-write operation sequence. Alternatively, upon locating the first checkpoint but failing to locate the second checkpoint, processing of the queued write requests may include queuing a subset of operations for execution, where such queued subset includes the operations associated with the second portion of the copy-on-write operation sequence rather than the first portion. Alternatively, upon locating both the first and second checkpoints, processing of the queued write requests may include removing the operations of the copy-on-write operation sequence from one or more operation queues.
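A minimal sketch of this recovery decision is shown below, assuming transaction-information records that carry checkpoint markers as in the earlier sketch; the queue representation and operation names are illustrative only.

```python
def recover_queued_request(record, op_queue):
    """Decide how to handle one write request that was queued before the failure.

    `record` is a transaction-information dict carrying a set of checkpoint
    markers (as in the earlier sketch); `op_queue` is a plain list of pending
    operations, each a (name, record) tuple.
    """
    checkpoints = record.get("checkpoints", set())

    if "first" not in checkpoints:
        # No checkpoint found: re-queue the entire copy-on-write sequence.
        op_queue.append(("copy_original_to_time_store", record))
        op_queue.append(("write_payload_to_current_store", record))
    elif "second" not in checkpoints:
        # Original data is already safe in the time store: only the second
        # portion (writing the payload into the current store) must still run.
        op_queue.append(("write_payload_to_current_store", record))
    else:
        # Both checkpoints present: the sequence finished before the failure,
        # so drop any of its operations still sitting in the queue.
        op_queue[:] = [op for op in op_queue if op[1] is not record]
```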
[0017] In one embodiment, the disclosed techniques may be used to develop systems and execute methods to facilitate recovery from storage transaction failures (i.e., errors that occur when writing to and/or otherwise manipulating storage units, such as may be encountered in the event of a power failure, hardware failure, or data corruption). A write request identifying payload data to be written beginning at a first address of a first data store may be received by the storage management device. Prior to fulfilling the write request, the original data associated with the first address of the first data store is copied to a second data store beginning at a second address. Transaction information associated with the write request may be recorded and may include, for example, the first and second addresses and the time at which the write request was received. A first indicator (i.e., a first checkpoint) may also be generated to confirm successful recording of the transaction information in order to aid future recovery efforts. The first indicator may further confirm that the original data associated with the first address of the first data store has been successfully copied to the second data store.
[0018] Once the original data starting at the first address of the first data store has been successfully copied to the second address of the same or a different data store, the payload data identified by the write request may be written to the first data store starting at the first address — effectively overwriting the original data, rather than the copy now stored at the second address. Further, a second indicator (e.g., a second checkpoint) may be generated to confirm that the payload data was correctly written to the first data store and this second indicator may be detected in future recovery work to indicate that a second write request, which may occur after the time that the earlier write request was received, should be re-executed. The first indicator, the second indicator, and/or at least some of the transaction information may then be used, at least in part, to recover from the storage transaction failure. For example, a representation of at least one memory location as it existed prior to the failure of the storage transaction may be formed based on at least some of the original data copied in the second memory and the time at which the write request was received. The transaction information may also be searched to identify a time at which the write request was received based on a selected time substantially consistent therewith. Further, based on the discovery of the second indicator in the transaction information, information associated with the second write request may be removed from one or more queues of the standby processor module where such standby processor module assumes the role of the primary processor module in response to a storage transaction failure (e.g., a power and/or other hardware failure that disables the primary processor module). Based on the first indicator being found in the transaction information, the one or more queues of the standby processor module may also be loaded with information associated with the third write request.
[0019] In one embodiment, the disclosed technique detects a storage transaction failure (e.g., a hardware or power failure of the primary processor module) that occurs at time T1. One or more first indicators associated with a first write request received prior to time T1 may be identified confirming that previously stored data starting at a first address specified by the write request has been successfully copied to a second address and such first indicators may serve as a basis for loading information associated with the first write request into one or more queues of the standby processor module. At least some of the copied data may serve as a basis for forming a representation of at least one memory location (as the at least one memory location existed prior to the failure of the memory transaction). A second indicator that confirms that the payload data specified by the first write request has been successfully written to the data store beginning at the first address may also be identified and such second indicator may, for example, serve as a basis for removing information associated with the second write request from one or more queues of a standby processor module that may assume the role of the main processor module in response to a storage transaction failure.
Drawings
[0020] In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention.
[0021] FIG. 1 is a block diagram of a storage system including a current store and a time store according to an embodiment of the present invention.
[0022] FIG. 2 is a diagram depicting an embodiment of an input/output (I/O) request sent by a host to a storage management device.
[0023] FIG. 3 is a table depicting a series of write commands directed to a data store in one embodiment of the invention.
[0024] FIG. 4 is a block diagram depicting the generation of a plurality of prior images of a data store according to an embodiment of the present invention.
[0025] FIG. 5 is a block diagram depicting the generation of a dynamic current store in accordance with an embodiment of the present invention.
[0026] FIG. 6 is a timeline depicting the generation of a restore data store.
[0027] FIGS. 7A and 7B are tables describing the contents of the current store and the time store when a series of write commands are directed to the current store. FIG. 7A depicts a current store. FIG. 7B depicts a time store.
[0028] FIG. 8 is a table depicting the generation of a prior image of a data store according to an embodiment of the present invention.
[0029] FIG. 9 is a block diagram depicting a processor module according to an embodiment of the invention.
[0030] FIG. 10 is a block diagram depicting further details of a storage management device according to an embodiment of the invention.
[0031] FIG. 11 is a block diagram of an I/O manager according to an embodiment of the invention.
[0032] FIG. 12 is a block diagram of a storage management device according to an embodiment of the present invention.
[0033] FIG. 13 is a block diagram of a memory system according to an embodiment of the present invention.
[0034] FIG. 14A is a flow diagram of an illustrative embodiment of a method for providing a modification history for a location within a data store in accordance with the present invention.
[0035] FIG. 14B is a flow chart of another illustrative embodiment of a method for providing a modification history for a location within a data store in accordance with the present invention.
[0036] FIG. 15 is a diagram depicting an embodiment of an I/O request sent by a host to a storage management device.
[0037] FIG. 16 is a diagram illustrating an embodiment of an I/O response sent by a storage management device to a host.
[0038] FIG. 17 is a timeline depicting a series of write operations directed to a data store in an embodiment of the present invention.
[0039] FIG. 18 is a diagram illustrating an embodiment of a history index generated by a storage management device according to the present invention.
[0040] FIG. 19 is a diagram depicting an embodiment of an I/O request sent by a host to a storage management device.
[0041] FIG. 20 is a diagram illustrating an embodiment of an I/O response sent by a storage management device to a host.
[0042] FIG. 21 is a block diagram of a storage management device according to an embodiment of the present invention.
[0043] FIG. 22 is a flow chart of an illustrative embodiment of a method for storing data in accordance with the present invention.
[0044] FIG. 23 is a block diagram of a multiprocessor system according to an embodiment of the present invention.
[0045] FIG. 24 is a flow diagram of an illustrative embodiment of a method for maintaining a substantially coherent running clock for a multiprocessor system in accordance with the present invention.
[0046] FIG. 25 is a graph of time according to a slave processor module internal clock within a multiprocessor system versus time according to a master processor module internal clock within the multiprocessor system.
[0047] FIG. 26 is a block diagram of a storage management device according to an embodiment of the present invention.
[0048] FIG. 27 is a table depicting record indices for a set of write commands according to an embodiment of the present invention.
[0049] FIG. 28 depicts a map generated in accordance with an embodiment of the present invention.
[0050] FIG. 29 is a block diagram of a system for processing I/O requests according to an embodiment of the present invention.
[0051] FIG. 30 is a flowchart of an illustrative embodiment of a method for processing I/O requests in accordance with the present invention.
[0052] FIG. 31 is a table corresponding to I/O requests according to an embodiment of the present invention.
[0053] FIG. 32 depicts a queue for processing I/O requests according to an embodiment of the invention.
[0054] FIG. 33 is a block diagram of a system in accordance with an embodiment of the present invention.
[0055] FIG. 34 is a block diagram of a system according to an embodiment of the present invention.
[0056] FIG. 35 is a block diagram of a method according to an embodiment of the invention.
[0057] FIGS. 36A-36D depict exemplary embodiments of binary trees according to embodiments of the present invention.
[0058] FIG. 37 depicts a block diagram of a storage management device, according to an embodiment of the invention.
[0059] FIG. 38 depicts an exemplary method for checkpointing according to an embodiment of the present invention.
[0060] FIG. 39 depicts a block diagram of an exemplary embodiment of the present invention.
[0061] FIG. 40 depicts an exemplary method for checkpointing according to an embodiment of the present invention.
[0062] FIG. 41 is a block diagram of a storage management device according to an embodiment of the present invention.
[0063] FIG. 42 is a flow diagram of an illustrative embodiment of a method for recording write requests directed to a data store and for enabling generation of at least a portion of a time map of at least a portion of the data store for past times.
[0064] FIG. 43 is an exemplary block diagram illustrating an illustrative embodiment of a method for recording write requests directed to a data store and enabling generation of at least a portion of a time map of at least a portion of the data store for a past time as described in FIG. 42.
Detailed Description
[0065] FIG. 1 provides an overview of a storage system 30 that allows for the generation of an image of a data store from a point in time prior to the request time. A host 34 communicates with physical storage 36 through a storage management device 38. In one embodiment, the physical storage 36 stores digital data. In one version of this embodiment, the physical storage 36 is one or more disk drives. For example, the disk drives may be magnetic disk drives, optical disk drives, or a combination of both types of disk drives. In another version of this embodiment, the physical storage 36 comprises one or more tape drives. The physical storage 36 may be a single drive or a combination of drives, or a storage area network. The physical storage 36 may itself be a virtual drive presented by any of a variety of storage networks, devices, or controllers. For example, the physical storage 36 may be a mirrored disk or RAID system, or some other storage device.
[0066] The host may be any type of network or system(s) that accesses the physical storage 36 and/or any other form of data storage. In one embodiment, the host 34 comprises a number of computers on a computer network. The host may include a storage network that is accessed by one or more users through a plurality of workstations, personal computers, or a combination of the two.
[0067] In one embodiment, the storage management device 38 may itself be a "storage device". For example, it may be a separate device having a processor and memory. The functionality of the storage management device 38 described herein may also be integrated into an existing enterprise system storage area network. In one embodiment, the storage management device 38 is implemented as a firmware layer of the storage system. In one embodiment, the storage management device 38 uses data from current store A44 and time store A46 for disk volume A. Although the figure shows the current store A44 and the time store A46 as being located within the storage management device 38, preferably, data associated with one or both of the current store A44 and the time store A46 is stored in the physical storage 36. In this case, the storage management device 38 keeps track of the data in current store A and time store A in its memory, for example in the form of indices and pointers, and reads and writes the data from the physical storage 36. For example, separate groups of storage locations in the physical storage 36 may be allocated to current store A44 and time store A46, or their data may be intermixed on the physical storage.
[0068] The current store A44 and the time store A46 may also be implemented in Random Access Memory (RAM) or other memory located within the storage management device 38. In a version of this embodiment, the current store A44 and the time store A46 are located in different memories. Further, the type of media storing the current store A44 may be different from the media storing the time store A46, e.g., the current store A44 may be located on a disk drive while the time store A46 is in RAM. In another version, the current store A44 and the time store A46 comprise different sections (partitions) of the same memory. In another embodiment, both current store A44 and time store A46 comprise physical disks, which may be the physical storage 36 or otherwise. The current store A44 and the time store A46 may be stored on the same physical disk, or they may both be stored in portions of many different physical disks.
[0069] The current store A44 stores the current data, and the time store A46 stores older data from the current store A44 that has since been replaced (i.e., overwritten) by newer data. The storage management device 38 uses information from one or both of the current store A44 and the time store A46 to generate and present to the host 34 current and past images of disk volume A. In one embodiment, each pair of current store A44 and time store A46 implements one or more logical devices. In one version of this embodiment, the storage management device 38 does not include a disk drive, but uses the physical storage 36 to store the data for such virtual drives.
[0070] The storage management device 38 communicates with the host 34 over a first communication link 40. The first communication link 40 may be any type of data communication link, such as a local area network (LAN), a storage network, or a bus, including Fibre Channel and Small Computer System Interface ("SCSI"). Ethernet (e.g., gigabit Ethernet) and wireless communications are other possible links that may be used for the first communication link 40. In one embodiment, the storage management device communicates the SCSI protocol at the logical layer and can communicate using one or more of a variety of physical layers, including SCSI bus, Fibre Channel 2, or iSCSI over Ethernet. In response to I/O requests from the host 34 over the communication link 40, the storage management device 38 acts as if it were the physical storage 36. The host 34 I/O requests may include read and write commands directed to units of storage.
[0071] The storage management device 38 communicates with the physical storage 36 over a second communication link 42. The second communication link 42 may also be any type of data communication link, such as a LAN, a storage network, or a bus including, but not limited to, fibre channel and Small Computer System Interface (SCSI), Integrated Drive Electronics (IDE), FCon, and FiCon. Ethernet (e.g., gigabit ethernet) and wireless communications are other possible links that may be used for the second communication link 42. In one embodiment, the physical storage 36 and the second communication link 42 are implemented in a storage area network.
[0072] In mainstream storage systems to date, data stored on a device is indexed by an address, which consists of the device and an offset. The storage address space is divided into blocks (e.g., sectors), where each block is 512 bytes long. An I/O request is sent to a particular device/disk/storage unit, and the address within it is given as a Logical Block Address (LBA) and a length. In this example, the block is the unit of storage, and the LBA indicates the unit of storage where the I/O operation begins, i.e., a particular 512-byte block that is part of the device. The length indicates how many 512-byte blocks the I/O request will operate on. For example, to read 4096 bytes from a device starting at byte 8192, the LBA would be set to 16 and the length would be 8. Block sizes smaller or larger than 512 bytes may also be used; for example, the block length may be 520 bytes. Additionally, a unit of storage may be any portion of a uniquely addressable storage address space.
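The block arithmetic in the example above can be stated compactly; the helper below is an illustrative sketch, not part of the described embodiments.

```python
BLOCK_SIZE = 512  # bytes per block (sector); some devices use e.g. 520

def byte_range_to_lba(offset_bytes, length_bytes, block_size=BLOCK_SIZE):
    """Convert a block-aligned byte range into (LBA, length in blocks)."""
    assert offset_bytes % block_size == 0 and length_bytes % block_size == 0
    return offset_bytes // block_size, length_bytes // block_size

# The example from the text: reading 4096 bytes starting at byte 8192
# gives LBA 16 and a length of 8 blocks.
print(byte_range_to_lba(8192, 4096))  # (16, 8)
```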
[0073] In one embodiment, time is an additional dimension in the second portion of the address space for a given storage device. The user may request a particular LBA (and associated block range) and the user is also provided the option of requesting a particular LBA/range combination at a particular point in time. The time is selected from substantially continuous time intervals and need not be determined in advance. This function may be provided at the block addressing level and it may be applied to the entire device to create variable time storage points.
[0074] In one embodiment, a command to the storage management device 38 includes an address that includes a location identifier and a time identifier. In one implementation, the location identifier may include at least one of a logical device identifier and a unit of storage within the logical device. The time identifier may be the current time or may be a recovery time, i.e., a previous point in time for which the data stored at the storage location is requested. In this description, the previous time for which the host 34 requests data is referred to as the "recovery time". The "request time" refers to the time at which the host 34 requests data from the recovery time. The stored digital data may be accessed by specifying an address that includes a location, or an address that includes both a location and a time. The storage management device 38 may thus present to the host 34 successive "prior images" of the data store, regardless of whether a snapshot was generated prior to the request time, where each prior image is a view of the disk at a recovery time. In one embodiment, the increment defining the minimum elapsed time between successive time identifiers is sufficiently small that it allows prior data stores to be generated from a substantially continuous time interval. In one version of this embodiment, requests for the current image may be satisfied using data located entirely in the current store A44, without using any data from the time store A46. However, as will be described in more detail below, a request for data from a previous time (i.e., a previous image) may require data from both the current store A44 and the time store A46.
[0075] In one embodiment, each host 34 I/O request includes a target unit of storage indicated by a device identifier (e.g., a physical disk, logical device, virtual device, etc.), a first unit of storage (e.g., an LBA, etc.), a length, and, for read commands, a time identifier. A write command includes a data payload containing the data to be written to the target storage unit.
[0076] In another embodiment, the time identifier is implicit, in the sense that the storage management device 38 provides a second logical device that is a view of a first logical device at an earlier time. The second logical device may be established by out-of-band communication (e.g., at a console of the storage management device) or via in-band communication between the host 34 and the storage management device 38. In one embodiment, once the second logical device is established, the storage units associated with it may be accessed by requesting data from the second logical device, rather than by an explicit request for data at a particular time.
[0077] In one embodiment, the time store includes control information, also referred to as "metadata", and payload data. In one version of this embodiment, the control information includes a timestamp indicating when a particular unit of storage in the current store 44 was directed to be overwritten (as a result of a write operation), the location in the current store 44 of the unit of storage from which the data originated, and the location in the time store 46 where the old data is now stored. The payload data stored in the time store 46 may include data that previously appeared in the current store 44 but has since been replaced by new data.
[0078] FIG. 2 depicts one embodiment of an I/O request, and in particular, a time-based read command sent by the host 34 to the storage management device 38. In one embodiment, the I/O request is a SCSI command. FIG. 2 identifies each bit included in the 32 bytes of the command block 88. In byte 0, the opcode identifies the type of command to be executed, i.e., a time-based read command. Bytes 2-9 are for a logical block address that identifies the first unit of storage on which the read command operates. Bytes 10-13 are used for a transfer length that indicates the number of blocks being read, starting with the unit of storage (i.e., block) identified by the logical block address. Bytes 14 and 15 are reserved for future use. Byte 16 is a RelativeChk field that indicates whether the time field is relative or absolute. If the RelativeChk field is 0, the time specified in the command block is relative to the present time; that is, 0 indicates that the specified time is a past time calculated back from the current time. For example, a recovery time of T-5000 specified at request time T provides an example of a read command having a recovery time that is relative to the current time T, i.e., the recovery time is 5000 time increments prior to the current time. If the RelativeChk field is non-zero, the specified time is absolute, i.e., specified without reference to another time. For example, such I/O requests may include a relative time, and the storage management device 38 may have a minimum time increment of 1 second or less. In another embodiment, the I/O request may include an absolute time and the minimum time increment may be 1 millisecond or less.
[0079] Bytes 17-24 include the specified read time, which is relative or absolute. If the read time is absolute, the recovery time is contained in bytes 17-24. If the read time is relative, the recovery time is calculated based on subtracting the specified read time from the current time. Bytes 25-30 are reserved for future use. Byte 31 is a control field of command block 88.
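As an illustration of the field layout described for command block 88, the following sketch packs such a 32-byte time-based read command. The function name, the big-endian field encoding, and the treatment of byte 1 and the reserved bytes are assumptions made for the example.

```python
import struct

def build_time_based_read(opcode, lba, transfer_length, relative_chk, read_time, control):
    """Pack a 32-byte command block with the field offsets given in the text.
    Byte 1 and the reserved ranges are left as zero."""
    cmd = bytearray(32)
    cmd[0] = opcode                                  # byte 0: operation code
    cmd[2:10] = struct.pack(">Q", lba)               # bytes 2-9: logical block address
    cmd[10:14] = struct.pack(">I", transfer_length)  # bytes 10-13: number of blocks to read
    cmd[16] = relative_chk                           # byte 16: 0 = relative time, non-zero = absolute
    cmd[17:25] = struct.pack(">Q", read_time)        # bytes 17-24: specified read time
    cmd[31] = control                                # byte 31: control field
    return bytes(cmd)
```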
[0080] In operation, data is provided to the host 34 in response to I/O requests generated by the host 34 and communicated to the storage management device 38 over the first communication link 40. To maintain a history of the data that was previously stored in current store A44, in one embodiment, the storage management device 38 uses a copy-on-write process when an I/O request from the host 34 instructs the storage management device 38 to replace existing data with new data. Upon receiving such a write request from the host 34, the copy-on-write operation is performed by copying the existing data that is about to be replaced from current store A44 to time store A46. The location from which the data is copied in current store A44 is referred to as the original location. The location where the old (i.e., overwritten) data is stored in time store A46 is referred to as the destination location.
[0081] In certain instances, an actual copy of the data may not be performed as soon as the write operation occurs, e.g., because the data to be overwritten has already been saved (e.g., because it was saved along with other nearby blocks), or because the data is held in memory and not written immediately. Here, a copy-on-write operation may mean an actual copy, and it may also include optimizations that achieve the effect of a copy-on-write. The storage management device 38 keeps track of the data in a unit of storage before it is overwritten and, after the block is overwritten, there is sufficient information in the time store so that the saved data can be obtained from somewhere within the storage management device 38, the physical storage, and/or elsewhere. For simplicity of explanation, the examples described below generally present the operation of the storage management device 38 as if a copy-on-write were always performed, with the understanding that optimizations may be used in practice.
[0082] In one embodiment, the storage management device 38 indexes each copy-on-write and maintains a record of the original location, the destination location, and a timestamp. In various embodiments, the timestamp comprises the time at which the data was written to current store A44 or time store A46. In another embodiment, the timestamp comprises the time at which the write request was received and processed by the storage management device 38.
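An illustrative record for such an index might look like the following sketch; the class and field names are assumptions, not terms used by the embodiments.

```python
from dataclasses import dataclass

@dataclass
class CowRecord:
    """One index entry kept per copy-on-write, as described above."""
    original_location: int     # unit of storage in current store A that was overwritten
    destination_location: int  # where the old data now lives in time store A
    timestamp: float           # time the copy-on-write (or the write request) occurred
```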
[0083] As an illustrative example, the storage management device 38 provides data store A to the host 34. In this example, data store A is a disk volume. In one embodiment, data store A is implemented with current store A44 and time store A46. The storage management device 38 is capable of storing each change made to volume A and, in addition, can provide the host 34 with a "prior image" of the volume, i.e., the volume as it existed at a past time. As described above, the storage management device 38 can be accessed using a time specification.
[0084] Typically, because of the large number of I/O requests found in the data management systems used in enterprise applications, each prior image of data store A in those applications will include at least some data from time store A46. For example, if at the present time T, the host 34 requests a prior image of data store A at a past time T-100, the storage management device 38 reviews its index and determines which units of storage on data store A have been updated between time T-100 and the present time (T). The host 34 then receives the prior image of data store A at time T-100, which includes the storage units from current store A44 that have not been updated since T-100 and, for those storage units that have been updated since T-100, the storage units from time store A46 that represent data store A at time T-100.
[0085] As another example, at a current time T, the host 34 requests an image of data store A from a previous time T-30. In response, the storage management device 38 generates a prior image for T-30 using the data present in current store A44 for each storage unit that has not been updated since time T-30. However, for each record that has been updated since time T-30, the data from current store A44 is combined with data from time store A46. For example, if the data stored in block 100 of current store A44 was written to once since time T-30 (e.g., at time T-20), the old data moved from current store A44 to time store A46 by the copy-on-write operation that occurred at time T-20 would be found at a particular address in time store A46. That is, the data in time store A46 is indexed by its location and a timestamp indicating that it was written at time T-20. Because this is the only time since T-30 that block 100 was written to, the data stored in time store A46 and identified by block 100 and time T-20 is the representative data for block 100 that will be provided to the host 34 when the image of data store A at time T-30 is created.
[0086] Referring to FIG. 3, in a much more simplified illustrative example, the storage management device 38 presents volume A as including five storage units, shown for simplicity as 100 byte blocks, block 100, block 200, block 300, block 400, and block 500. In this example, five updates of data store A occur between the current time T and the past time. The past write times are shown in this example and for simplicity these times are identified as times T-60, T-48, T-33, T-29, and T-15. In this representation, time T-60 is 60 units (e.g., seconds, milliseconds, microseconds) before time T. In a practical implementation, the units will be small increments of time, so these numbers (i.e., 60, 48, 33, 29, 15) will likely be much larger.
[0087] In this example, block 100 is updated at time T-60. Block 300 is updated at time T-48. Block 200 is updated at time T-33 and again at time T-29. Block 400 is updated at time T-15. As described above, prior to writing to block 100, the information of block 100 will be read and stored in time store 46 for volume A. The same copy-on-write operation occurs for other blocks. Thus, time store A46 would include five records corresponding to data copied from current store A44 before a write request was directed to current store A44.
[0088] In one embodiment, the storage management device 38 uses the location of the unit of storage (e.g., block 100, block 200, etc.) and a timestamp associated with the time at which the copy-on-write was performed to index each record stored in time store A46. Thus, a prior image of data store A at a time earlier than T-60 can be generated by taking the data in time store A46 for blocks 100-400 and the data in current store A44 for block 500, because block 500 was not updated between the earlier time T-60 and the current time T. Likewise, if a view of data store A at time T-35 is desired (i.e., a prior image), current store A44 can provide three blocks, namely block 100, block 300, and block 500, because none of them changed after time T-35. Blocks 200 and 400 have been modified since time T-35, so those blocks are provided to volume A from time store A46.
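The lookup described in this example can be sketched as follows. For brevity the index entries are plain tuples rather than the record structure sketched earlier, and the stores are plain dictionaries; all names and the sample data are illustrative.

```python
def read_prior_image(block, recovery_time, current_store, time_store, cow_index):
    """Return the contents of `block` as it existed at `recovery_time`.

    `cow_index` is a list of (block, timestamp, destination) tuples, one per
    copy-on-write, ordered oldest first: `block` is the original location in
    the current store, `timestamp` the time of the overwrite, and `destination`
    the time-store address holding the data saved by that overwrite.
    """
    for original_location, timestamp, destination in cow_index:
        if original_location == block and timestamp >= recovery_time:
            # First overwrite at or after the recovery time: the data saved by
            # that copy-on-write is what the block held at the recovery time.
            return time_store[destination]
    # Never overwritten since the recovery time: the current store still holds it.
    return current_store[block]

# FIG. 3 example with the current time T taken as 0 (so T-33 is -33, etc.):
cow_index = [(100, -60, 0), (300, -48, 1), (200, -33, 2), (200, -29, 3), (400, -15, 4)]
time_store = {0: "old 100", 1: "old 300", 2: "200 at T-35", 3: "200 at T-30", 4: "old 400"}
current_store = {100: "new 100", 200: "new 200", 300: "new 300", 400: "new 400", 500: "500"}
print(read_prior_image(200, -35, current_store, time_store, cow_index))  # "200 at T-35"
print(read_prior_image(500, -35, current_store, time_store, cow_index))  # "500"
```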
[0089] Thus, as demonstrated in this simplified example, by saving the data on the volume in the time store 46 before it is overwritten, and by indexing the data stored in the time store 46 as it is overwritten, the system can provide a complete current version of the volume in the current store 44 and an image of the data on volume A at any time within the interval for which data exists in the time store 46. The storage management device 38 may thus present a "virtual" volume that reflects the original volume as it was at some time in the past. Further, the storage management device 38 may provide a virtual volume from any time in a substantially continuous time interval, "substantially" continuous because of the quantization limit defined by the minimum time increment. The virtual volume need not be generated prior to the request time.
[0090] In one example implementation, if the example volume is referred to as volume A, another volume, volume B, may be provided based on a "prior image" of volume A, i.e., the contents of volume A at an earlier time. The data for volume B may be copied from the prior image of volume A onto a new volume, such that volume B is a complete copy of volume A at the earlier time. Volume B may also remain "virtual", meaning that volume B may exist only as a combination of current store A44 and time store A46, with the storage management device 38 providing the data from current store 44 or time store 46 in response to accesses to volume B.
[0091] Referring to FIG. 4, it is possible, for example, to provide simultaneously a current image of volume A, a prior image of volume A at one time (e.g., time T-3000), and a prior image of volume A at another time (e.g., time T-6100). Because these prior images are "virtual", the storage management device 38 can provide both virtual prior images 48, 50 at the same time.
[0092] The host 34 and the storage management device 38 may use one or more of a variety of protocols to refer to a prior image of a data store. For example, the host 34 may request in an out-of-band communication that the storage management device 38 make available a virtual data store that is a prior image of another volume. The host 34 may also request, via in-band communication, for example using an existing protocol or an extension of an existing protocol, that the storage management device 38 make a new volume available. A system administrator may also operate a console or control panel of the storage management device 38, or otherwise provide input to the storage management device 38, instructing it to make a volume available as a virtual image of another volume. In some implementations, the new volume may be assigned a volume or device identifier (e.g., a SCSI ID or a Fibre Channel world wide name).
[0093] Thus, in one embodiment, a storage management device receives a request to create a virtual data store that reflects the state of the original data store at a specified time. The virtual data store may be, for example, a new logical unit. The specified time may be selected from a substantially continuous time interval between a past time and a current time. The size of the interval (and the value of the past time) is a function of the size of the time store and the amount of change directed to the data store. Virtual data storage, because it is virtual, can be provided substantially instantaneously with minimal or no data movement.
[0094] The storage management device receives a storage protocol request for data at a specified address in the virtual data store, and transmits data stored at the specified address in the raw data store for a specified time in response to the storage protocol request.
[0095] The request to create a new virtual data store may take the form of some manipulation of the user interface. The user interface may be on one or more host systems and communicated to the storage management device, and/or the user interface may be on a console of the storage management device. The request may be communicated through various networking technologies and protocols, and/or through a storage protocol (e.g., the same protocol through which the request for data is generated). The request may even be part of the same storage protocol packet as the request for data. Requests for data from past times may even automatically trigger the provisioning of virtual data storage.
[0096] The request for data may be, for example, a standard read request via a storage protocol, such as a SCSI read request. The request may specify an address, which may include a logical unit identifier and a location identifier. The address may include an identifier of the virtual data store.
[0097] As described herein, the raw data store itself may be a virtual data store. There may be a series of virtual data stores, each formed from a previous image of the other data store.
[0098] As described, because the virtual data store is virtual, it can be provided substantially instantaneously, with minimal or no data movement. It is possible, however, particularly if the virtual data store will see sustained use, to copy the data from the virtual data store, for example in the background, to another data store, thereby creating a complete copy of the virtual data store. Once the copy is complete, the copy can be used in place of the virtual data store. In this way, the prior image can be provided by the virtual data store substantially instantaneously, while the time-consuming copying of data from one data store to another remains substantially transparent to the user of the storage management device.
[0099] In another embodiment, the host 34 may communicate with the storage management device 38 using a protocol that allows the host 34 to access a unit of storage by referring to an address and a time. Thus, a time dimension is added to the access request. Time may be referenced in a variety of ways. For example, the host 34 may reference an absolute time as kept by it, or an absolute time as kept by the storage management device 38, such as 4:07.33 on a particular day. A time may also be referenced relatively, that is, it may be specified as a time relative to another time. In one embodiment, the time is referenced by subtracting a number of time units from (and thus relative to) the current time. This approach does not require that the host 34 and the storage management device 38 have precisely synchronized clocks. Time may be referenced using any applicable unit, including, without limitation, nanoseconds, microseconds, milliseconds, seconds, and the like.
[0100] Thus, in one approach, the host 34 (or a system administrator) may first instruct the creation of a new virtual volume, volume B, that is a prior image of volume A at time T-3000. The host 34 (or a system administrator) can then instruct the creation of another new virtual volume, volume C, that is a prior image of volume A at time T-6100. The host can thus compare the actual data on volumes A, B, and C as needed to determine what files or records, etc., on the volumes differ, for example for forensic purposes.
[0101] In another approach (which may be used additionally or alternatively), the host 34 may use requests that include a time specification in addition to a data address. The storage management device 38 may respond to such a request by providing the data at the specified address as it existed at the specified time.
[0102] It should also be noted that, in some implementations, current store A44 may be a mirror of a disk 60 (shown in phantom), or may be used in any other configuration as would one or more actual volumes.
[0103] Time images may also be fixed or dynamic. A fixed time image, also referred to as a clone, is similar to a snapshot of data store A at a particular point in time. It is called fixed because it is not updated, i.e., no data is written to it once it is created. A fixed (frozen) image generated by the storage management device 38 may, however, differ from a snapshot in that the image can be generated for the first time at a request time that is later than the recovery time, i.e., the storage management device 38 re-creates an image that may not have existed at any time since the recovery time. By contrast, a snapshot is a replica that was generated at what was then the current time.
[0104] A dynamic time image is created as an image of current store A at a particular point in time. However, unlike a fixed time image, once generated, a dynamic time image is continually updated in the same manner as current store A. As a result, the contents of the dynamic time image are the same as current store A44 up to the recovery time. For example, if the first prior image 48 is dynamic, it will match current store A up to T-3000. Thereafter, beginning at the present request time (T), updates to the current store A are copied over the first prior image 48. The resulting dynamic time image functions as a current store B, which includes the results of all I/O requests directed to current store A44 except those that occurred between the recovery time (T-3000) and the request time (T). Accordingly, current store B also has a time store associated with it, namely time store B.
[0105] Referring to FIG. 5, a fixed and dynamic time map is shown. A frozen prior image is a view of a data store at a particular point in time. It is fixed in the sense that it is not updated, e.g. it is read-only. In one embodiment, a time image is frozen by identifying it as a read-only image when it is created. Frozen images can be used to view data stores at a particular time, for forensic purposes (i.e., to identify the cause of a problem), or to recover erased data. The dynamic image begins as a view of the first data store (e.g., data store A) at a particular point in time, but the prior image may be modified. The dynamic image appears to the host as if it were a new data store on which the previous image was copied. Dynamic maps can be used to recover quickly from failures.
[0106] For example, when a failure occurs due to corruption of data in the first data store, frozen prior images (as described above) may be specified, each presenting data in the first data store as it existed at a specified time in the past. These prior images may be examined to determine the approximate time of the corruption. As the minimum timestamp increment is decreased, the approximate time may be determined with increased accuracy. In one embodiment, a prior image showing data from just before the time of corruption is designated as dynamic, a software application using the data in the data store begins using the prior image instead, and transactional activity resumes using the most recent uncorrupted version of the first data store. The application may use the image, for example, by reconfiguring the business application in some manner, or instructing the storage management device 38 to display a dynamic prior image in place of the first current data store, i.e., by using the prior image to create a second data store (e.g., data store B). In one embodiment, the dynamic image appears to the host as a new data store (e.g., a new device with a target identifier).
[0107] In one embodiment, the storage management device 38 provides dynamic images without copying (or initially copying) previous images to another data store. In contrast, the storage management device, as described above, provides a prior image of the first data store by using the current store and the time store associated with the first data store, as the case may be. The storage management device also associates a second current store and a second time store with the dynamic image (i.e., the second data store) such that changes to the dynamic image are stored in the second current store and changed blocks are stored (e.g., in a copy-on-write manner) in the second time store.
[0108] In one embodiment of this implementation, when a request is received for current data in the dynamic image, the storage management device 38 first checks for the data in the second current store, then in the first time store, and finally in the first current store. Upon receiving a write request directed to the dynamic image, the storage management device 38 determines where the data of the target block currently resides (i.e., the second current store, the original current store, or the original time store), stores the block being overwritten in the second time store, and then writes the new block to the second current store. The second time store, the second current store, the first time store, and the first current store may be used to respond to requests for data from a prior image of the dynamic image.
[0109] In another embodiment, the dynamic image is stored entirely in the time store. In this embodiment, the data store has a single current store and a single time store. In a version of this embodiment, the frozen image, the dynamic image, the index information, and the control block are all stored in a time store. The dynamic image may be created by writing data in the data store specifying the recovery time to a segment of the time store. In another version of this embodiment, when the dynamic image is written, a copy-on-write operation is not performed.
[0110] Because the storage management device 38 may (at least initially) provide the dynamic image as a "virtual" device, meaning that the data in the dynamic image is a combination of the data in the first and second current stores and the first and second time stores, the dynamic image can be provided quickly and without copying data from one data store to another. Once the dynamic image exists and is in use, the contents of the first current store and/or the first time store can be copied (as storage management device capacity permits) to the second current store and the second time store of the dynamic image. In other words, the "virtual" second data store may be used to create a new data store that can independently replace the first data store. This may be done in the background, or when transaction activity of the storage management device is relatively low. In addition, a background copy operation may be initiated manually or automatically. In one embodiment, the host 34 or a system administrator may initiate the background copy operation and a data store replacement operation.
[0111] Referring to FIG. 5, as a simplified illustrative example of this embodiment, assume that the dynamic image is created from a first data store, referred to in this example as data store A143. The prior image that is the basis for the dynamic image is an image of data store A143 at a particular time (e.g., 11:00 am). The prior image of data store A143 may be provided using the current store A144 and the time store A146 associated with data store A143. Upon an indication by the host 34 or a system administrator that the prior image should be dynamic (thus allowing modification), the second data store is assigned an identifier, which in this example is data store B147, and a current store B148 and a time store B152 are assigned to the dynamic image.
[0112] The storage management device 38 responds to a read request directed to data store B at the current time by first checking current store B148; if the requested block is not in current store B, then time store A146 and current store A144 are used to obtain the block as it existed at the time of the prior image that is the basis of the dynamic image. That is, to use the data from the prior image of data store A143, the index of data store A143 is checked to determine whether current store A144 or time store A146 contains the required block.
[0113] The storage management device 38 responds to a write request directed to data store B (for the current time) by locating the current contents of the target block (e.g., by first checking current store B148, then time store A146, then current store A144) in the manner just described for the read request, reading those contents, and then writing the read data to time store B152 to complete the copy-on-write operation. The data associated with the write request to the target block is written to current store B148.
[0114] A read request directed to data store B147 for a past time may be responded to by first checking time store B152. For example, the index of time store B152 may be checked to determine whether it contains the desired block. If not, current store B148 is checked, and if the block is not in current store B, then time store A146 and current store A144 are used to obtain the block as it existed at the time of the prior image that is the basis of the dynamic image. That is, the index of time store A146 is checked to determine whether it contains the desired block for the desired time, and if not, the block in current store A144 is used. It should be understood that the order in which the indices of time store A146 and current store A144 are checked may be reversed. Alternatively, a composite index of time store A146 and current store A144 may be used.
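The resolution order described in paragraphs [0112]-[0114] can be illustrated with a short sketch. The following Python fragment is purely illustrative and is not part of the described embodiments: the dictionary-based stores, the helper names, and the representation of time store entries as (timestamp, old data) pairs are all assumptions introduced for clarity.

```python
# Illustrative only.  current_store_a / current_store_b map block -> data;
# time_store_a / time_store_b map block -> list of (timestamp, old data) recorded
# by copy-on-write operations on the corresponding current store.

def read_current(block, current_store_b, time_store_a, current_store_a, recovery_time):
    """Read a block of data store B at the current time (paragraph [0112])."""
    if block in current_store_b:                      # block written to B since its creation
        return current_store_b[block]
    # Otherwise fall back to the prior image of data store A at the recovery time.
    return read_prior_image_a(block, time_store_a, current_store_a, recovery_time)

def read_past(block, as_of, time_store_b, current_store_b,
              time_store_a, current_store_a, recovery_time):
    """Read a block of data store B as it existed at a past time (paragraph [0114])."""
    entry = oldest_entry_after(time_store_b.get(block, []), as_of)
    if entry is not None:                             # block of B overwritten after 'as_of'
        return entry
    if block in current_store_b:                      # written to B, but not after 'as_of'
        return current_store_b[block]
    return read_prior_image_a(block, time_store_a, current_store_a, recovery_time)

def read_prior_image_a(block, time_store_a, current_store_a, as_of):
    """Contents of a block of data store A as of time 'as_of'."""
    entry = oldest_entry_after(time_store_a.get(block, []), as_of)
    if entry is not None:                             # block of A overwritten after 'as_of'
        return entry
    return current_store_a.get(block)                 # block of A unchanged since 'as_of'

def oldest_entry_after(entries, as_of):
    """The oldest copy-on-write entry made after 'as_of' holds the data as of 'as_of'."""
    later = [(t, data) for (t, data) in entries if t > as_of]
    return min(later)[1] if later else None
```

A composite index over time store A146 and current store A144, as mentioned above, would simply collapse the last two checks into a single lookup.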
[0115] It should be noted that data store A143 may continue to be an active data store with ongoing transactions directed to it, but those later changes will not be reflected in data store B147, because the storage management device 38 continues to access data store A143 at the particular past time (i.e., the prior image) in order to serve data store B147, and blocks of current store A144 that are later changed are saved in time store A146 and are therefore not lost. In practice, the size of the past time interval that can be captured by the time store will depend on the frequency of write operations directed to data store A143 and on the size of time store A146. Depending on the particular implementation, it may therefore be advantageous, at some time after the dynamic image begins to be used, to copy the prior image that is the basis for the dynamic image (e.g., data store A at 11:00 am in the above example) to another data store, or to time store B152 and current store B148. As mentioned, this transfer may be done in the background while the storage management device 38 is functioning normally.
[0116] In one embodiment, for a specified recovery time, transferring the prior-image blocks to current store B148 is accomplished as follows. If a block in current store A144 has not been overwritten since the recovery time (i.e., if the block in current store A144 is the same as in the prior image underlying data store B147), and if the block is not already contained in current store B148 (i.e., if the block has not been "overwritten" in the dynamic image since the time the dynamic image was created), then the block is copied from current store A144. If a block in time store A146 represents the data of a block of data store A143 at the recovery time, and if the block is not already in current store B148 (i.e., the block has not been "overwritten" in the dynamic image), then the block is copied from time store A146 to current store B148. Optionally, blocks in time store A146 from times prior to the prior image may also be copied from time store A146 to time store B152, so that data store B147 can respond to requests for data for times prior to the prior image.
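As a purely illustrative sketch of these background-copy rules, the fragment below expresses paragraph [0116] in the same dictionary-based model used above; the helper names and data layout are assumptions of this example, not part of the described device.

```python
# Illustrative sketch of the background copy of the prior image into current store B.

def background_copy(blocks, recovery_time,
                    current_store_a, time_store_a, current_store_b):
    for block in blocks:
        if block in current_store_b:
            # Block already "overwritten" in the dynamic image: nothing to copy.
            continue
        later = [(ts, d) for (ts, d) in time_store_a.get(block, []) if ts > recovery_time]
        if not later:
            # Block of current store A unchanged since the recovery time:
            # it still equals the prior image, so copy it directly.
            current_store_b[block] = current_store_a.get(block)
        else:
            # Block of A overwritten since the recovery time: the prior-image data
            # is the oldest copy-on-write entry recorded after that time.
            current_store_b[block] = min(later)[1]
```

Copying older time store A146 entries into time store B152, as the optional last step describes, would follow the same pattern for entries with timestamps at or before the recovery time.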
[0117] The dynamic image (e.g., third data store) may be created based on other existing dynamic images (e.g., data store B) such that the data in the third data store is provided by other current stores and time stores (e.g., by data store a and data store B). Such dynamic images can also be generated without copying (or without initially copying) a prior image to another data store.
[0118] For example, the storage management device 38, as described above, may provide a prior image of a dynamic data store (e.g., data store B) by using the original current store (e.g., current store A), the original time store (e.g., time store A), the second current store (e.g., current store B), and the second time store (e.g., time store B) as described in the above example. If this new prior image is designated as dynamic, the storage management device 38 may also associate a third current store and a third time store with the new dynamic image (e.g., a third data store) such that changes to the new dynamic image are stored in the third current store and changed blocks of the third data store are stored (e.g., via a copy-on-write operation) in the third time store.
[0119] Using the example above, when data corruption is detected in data store B147, the system administrator can again use multiple prior images to identify the approximate (or even exact) time of the corruption. The system administrator can thereby identify a prior image of data store B from a time before the data was corrupted. For example, assume that this image is the image of data store B at 1 PM. The system administrator may specify that the image of data store B at 1 PM is a dynamic image, and this new dynamic image will be referred to as data store C. Data store C153 is allocated a current store C154 and a time store C156.
[0120] When a request for current data in data store C153 is received, the storage management device will first check for the data in current store C154, and then in current store B148 and time store B152 as of the time at which the dynamic image was created. If the data block is not in current store B148 or time store B152, as the case may be, the storage management device 38 will obtain the data from time store A146 or current store A144.
[0121] When a write request arrives at data store C153, the storage management device 38 determines the location of the data currently in the dynamic image (i.e., current store C154, current store B148, time store B152, current store A144, and time store A146), stores the overwritten block in time store C156, and then writes the new block to current store C154. The time store C156 and the current store C154 may be used in conjunction with the current store B148, the time store B152, the current store A144, and the time store A146 as appropriate to provide requests for data from prior images of the dynamic image.
[0122] Referring to FIG. 6, in another example, which presents a timeline 190, the topmost horizontal line represents data store A from an initial time T1 to a subsequent time T5, i.e., timeline 192. Throughout the time period T1 to T5, the host 34 directs I/O requests to the data store. Data store A is used first, and in this example the application directs read and write transactions to data store A.
[0123] At time T4, the system administrator recognizes that corruption has occurred in data store A143, likely as the result of a corruption event. The storage management device 38 enables review of prior images of data store A143, which allows the system administrator to determine when the data corruption occurred by identifying the most recent time at which the data was not corrupted. In other words, the corruption event likely occurred at the earliest time at which corrupted data appears. The storage management device 38 may be used to search any past version of data store A143 in order to determine the time of the corruption event. The precision with which the time of the corruption event can be identified is determined, at least in part, by the minimum timestamp increment.
[0124] A first search is conducted to check the validity of the data in data store A143 and to identify the time of the corruption event. The first set of vertical lines appearing on timeline 192 between T3 and T4 provides a simplified example of the points in time at which the search is conducted (i.e., T14, T15, and T16). They represent a search backwards from time T4, at which time the fact that corruption had occurred was first identified, towards time T3. The system administrator, for example, begins the search at time T4 and reviews the data at a first search time T16. The data is corrupted at time T16, so the system administrator reviews the data from earlier points in time, i.e., times T15 and T14. The data is corrupted at times T15 and T14 as well, so the search continues with a review at time T11. The data is not corrupted at time T11, so the administrator checks time T12, time T13, and time T3. The search continues in this manner until the most recent time at which valid data exists is identified, which in this example is time T3.
[0125] Various search methods may be used to guide the search. For example, a larger time increment between the first and second search times may be used in an effort to more quickly locate the time of the corruption event. Also, the search need not start from the point in time at which the corruption was discovered. For example, if the system administrator knows the approximate time of the corruption event, the search may begin at an earlier point in time. The search may also begin at a time well before the corruption event, e.g., T1, T2, etc. In that case, a search beginning at a first search time, for example time T2, continues to subsequent points in time until corrupted data is first discovered. It should be appreciated that any search strategy may be employed, because the storage management device 38 is able to provide any version of data store A143, with a precision down to the minimum timestamp increment, over the interval covered by time store A146. In one implementation, the time precision is one millisecond.
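Because any version of data store A143 can be presented on demand, the search over prior images can, for example, be organized as a binary search between a time known to be valid and the time at which corruption was first observed. The sketch below is illustrative only; the function image_is_valid stands in for whatever inspection (manual or automated) is performed on a prior image and is an assumption of this example, not part of the described device.

```python
# Illustrative binary search for the most recent time at which the data is still valid.

def find_recovery_time(earliest_valid, corruption_seen_at, min_increment, image_is_valid):
    """earliest_valid: a time at which the data is known to be good (e.g., T1).
    corruption_seen_at: a time at which corruption has been observed (e.g., T4).
    image_is_valid(t): examines the prior image at time t and reports whether it is good."""
    lo, hi = earliest_valid, corruption_seen_at
    while hi - lo > min_increment:
        mid = (lo + hi) // 2
        if image_is_valid(mid):
            lo = mid            # corruption happened after 'mid'
        else:
            hi = mid            # corruption happened at or before 'mid'
    return lo                   # latest examined time known to hold valid data
```

With a one-millisecond minimum timestamp increment, narrowing a one-hour window in this way requires roughly twenty-two prior-image inspections.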
[0126] In this example, time T3 is designated as the recovery time because it is the most recent identified point in time at which the data is known to be uncorrupted. Of course, the user may also select an earlier point in time (before T3) as the recovery time. A second data store, data store B147, is created using the data of data store A at time T3. Recall that, at time T4, the user identified time T3 as the most recent point in time at which valid data existed in data store A143. At time T4 (i.e., the request time), the user creates data store B147 as a prior image of the first data store, data store A143, at time T3 (i.e., the recovery time). In FIG. 6, timeline 194 is associated with data store B147.
[0127] Data store B147 is a dynamic image; thus, a second current store (current store B) 148 and a second time store (time store B) 152 are associated with data store B147. Once current store B148 is created, the storage management device 38 can make data store B147 available to the host 34, and the application can use data store B147 in place of data store A143. Thereafter, the I/O requests of the host 34 can be directed to data store B147 instead of data store A143. In this example, between times T4 and T5, I/O requests continue to be directed to both data store A143 and data store B147. In another embodiment, data store B147 is a dynamic image consisting of a second current store that is not associated with a second time store. In one version of this embodiment, current store B148 is implemented as a write pool, in which write commands directed to data store B147 cause the newly written data to replace the existing data in current store B148, i.e., no record of the old data in current store B148 is maintained.
[0128] As previously described, data store B147 can be created without copying the contents of data store A143. Data store B147 can therefore be virtually created immediately, and it can be brought online quickly. Data originally associated with data store B147 is present in current store A144 and time store A146.
[0129] Upon receiving a read request directed to data store B147 at the current time, the storage management device 38 determines which of current store A144 and time store A146 contains the data for the block being read. The data in current store A144 is used for all blocks that have not been written to since time T3, and the data in time store A146 is used for all blocks of current store A144 that were overwritten after time T3. Once some data has been written to current store B148, a response to a read command directed to data store B147 at the current time may come from current store B148, current store A144, or time store A146. Upon receiving the read request, the storage management device 38 determines which of current store B148, current store A144, and time store A146 contains the data for the block being read. The storage management device 38 will use the data in current store B148 for all requests for blocks of data store B147 that were written after time T4 (i.e., timeline segments (e), (f), and (g)). The data in current store A144 will be used for all blocks that have not been written to since time T3 (timeline segments (a) and (b)), and the data in time store A146 will be used for all blocks of data store A143 that were written to between T3 and T4 (timeline segment (c)).
[0130] Data store A143 may continue to be active after time T4; however, changes that occur to data store A143 after time T4 affect only the data locations used to respond to requests for blocks of data store B147. Such changes do not affect the actual contents of data store B147. For example, if the corresponding block 100 of data store A143 has not been overwritten since time T3, then the data source for block 100 of data store B147 is the corresponding block in current store A144. However, if the corresponding block 100 in current store A144 has been overwritten since time T3, e.g., a copy-on-write command was executed on the corresponding block in data store A143, then the data source for block 100 of data store B147 is the corresponding block of time store A146. Of course, the immediately preceding description assumes that block 100 has not been the target of a write command since data store B147 was created. Further, while data store A143 remains active, data written to data store A143 after time T4 is processed using copy-on-write operations, so that time store A146 continues to preserve the data that is overwritten after time T4.
[0131] When a write request is directed to data store B147, the storage management device 38 determines where the data for the target block currently resides in data store B147 (i.e., current store B148, current store A144, or time store A146). The location of the data will be as follows (a combined sketch of this write handling appears after the two lists below):
[0132]1) if the block in current store B148 has been overwritten since time T4, the data is in current store B148;
[0133]2) if the block in current store A144 has not been written to since time T3, the data is in current store A144; and
[0134]3) if the block in current store A144 is overwritten at any time after time T3, the data is in time store A146.
[0135] It then follows:
[0136]1) if the data is located in current store B148, then the existing data will be read from current store B148 and written to time store B152 (e.g., copy-on-write). The new data will be written to current store B148. In one embodiment, updates to current store B148 may be accomplished without the use of a copy-on-write operation or time store B152. In one version of this embodiment, the old data is not saved when the write command is directed to current store B148.
[0137]2) if the data is located in current store A144, then the existing data will be copied from current store A144 and written to time store B152, without overwriting the existing data in current store A144. The new data will be written to current store B148.
[0138]3) if the data is located in time store A146, then the existing data will be copied from time store A146 and written to time store B152, without overwriting the existing data in time store A146. The new data will be written to current store B148.
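The following fragment combines the location rules of paragraphs [0132]-[0134] with the actions of paragraphs [0136]-[0138] into one sketch. It is illustrative only; the dictionary-based stores and the explicit T3 bookkeeping are assumptions of this example and not part of the described embodiments.

```python
# Illustrative write handling for data store B (prior image of data store A at time T3).

def write_to_data_store_b(block, new_data, now, t3,
                          current_store_a, time_store_a,
                          current_store_b, time_store_b):
    # Locate the data that data store B currently presents for this block.
    if block in current_store_b:
        # Rule 1: block overwritten in data store B since time T4.
        existing = current_store_b[block]
    else:
        later = [(ts, d) for (ts, d) in time_store_a.get(block, []) if ts > t3]
        if later:
            # Rule 3: block of data store A overwritten after time T3; its T3
            # contents are held by the oldest copy-on-write entry after T3.
            existing = min(later)[1]
        else:
            # Rule 2: block of data store A untouched since time T3.
            existing = current_store_a.get(block)

    # Copy-on-write: preserve the existing data in time store B ...
    time_store_b.setdefault(block, []).append((now, existing))
    # ... and record the new data in current store B.  Note that current store A
    # and time store A are never modified by writes directed to data store B.
    current_store_b[block] = new_data
```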
[0139] Upon receiving a read request directed to data store B147 at the current time, the storage management device 38 determines where the data currently resides in the dynamic image by examining current store B148, current store A144, and time store A146. The storage management device 38 will use the data in current store B148 for all blocks of data store B147 that were written after time T4 (i.e., timeline segments (e), (f), and (g)). The data in current store A144 will be used for all blocks that have not been written to since time T3 (i.e., timeline segments (a) and (b)), and the data in time store A146 will be used for all blocks of data store A143 that were written to (in data store A143) between T3 and T4 (timeline segment (c)).
[0140] Any number of additional data stores may also be generated based on a current or prior image of data store A143. For example, an image of data store A143 at time T2 may be created at any time at or after T2; e.g., a data store D may be created at time T3. Creation of an additional data store may be performed sequentially with, in parallel with, or independently of the creation of other data stores that are based on data store A143. In each case, the contents of the additional data store appear to be independent of the contents of the other data stores; that is, its contents are determined by the contents of data store A143 at the time the additional data store is created. Thereafter, read and write commands directed to the additional data store are responded to with data from current store A144, time store A146, and/or the additional data store to which the commands are directed.
[0141] In one embodiment, the storage management device 38 implements a snapshot that allows a user (e.g., a host or system administrator) to substantially instantaneously generate a prior image of the data store. For example, as described in greater detail herein, the architecture of the storage management device 38 provides a detailed index of write commands directed to each data store so that the appropriate data for each block of the data store can be quickly identified and accessed at any time.
[0142] An instantaneous restore may be performed in more than one way. For example, the instantaneous restore performed at time T4 may be a lossless restore of data store A143 for the desired recovery time of T3. In one embodiment, the lossless restore is achieved by copying back into current store A144 the results of the write operations performed between times T3 and T4. In one version of this embodiment, a copy-on-write operation is performed on each block of data store A143 that was written to between time T3 and time T4. At the recovery time, the data of each such block as it existed at time T3 is written back to the respective block of data store A143, and the data in the current store that is overwritten in the process is copied to time store A146. As described herein, the relevant details regarding data written using copy-on-write operations are indexed by the storage management device 38. Thus, it remains possible to later recover and review the operations that were performed on data store A143 between times T3 and T4.
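A minimal sketch of such a lossless restore, using the same illustrative dictionary-based model as the earlier fragments (all names are assumptions of this example), might look as follows.

```python
# Illustrative lossless restore of data store A to its contents at time T3,
# performed at time T4.  Each write-back is itself a copy-on-write, so the
# T3-T4 activity remains reviewable afterwards.

def lossless_restore(current_store_a, time_store_a, t3, t4, now):
    # Blocks of data store A that received writes between T3 and T4.
    touched = [b for b, entries in time_store_a.items()
               if any(t3 < ts <= t4 for (ts, _) in entries)]
    for block in touched:
        # The block's contents at time T3 are held by the oldest
        # copy-on-write entry recorded after T3.
        later = sorted((ts, d) for (ts, d) in time_store_a[block] if ts > t3)
        data_at_t3 = later[0][1]
        # Preserve the data being replaced in time store A (copy-on-write) ...
        time_store_a[block].append((now, current_store_a.get(block)))
        # ... then write the T3 contents back into current store A.
        current_store_a[block] = data_at_t3
```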
[0143] The storage management device 38 may also implement a compression recovery, because lossless instantaneous restore operations increase the amount of data that must be stored in the time store. In a compression recovery, certain selected data are not retained after the recovery. In one version of this embodiment, a write operation, rather than a copy-on-write operation, is performed on the blocks of data store A143 that were updated between T3 and T4. Thus, at the recovery time, the data as it existed at time T3 is written to each respective block of data store A143 that was updated between T3 and T4, without preserving the data being overwritten. In another version of this embodiment, a copy-on-write operation is performed, but the data retained for the period between T3 and T4 is placed at the front of the queue of data to be overwritten once the time store reaches its storage capacity. For example, the data from the period between T3 and T4 may be associated with the earliest portion of the timeline, so that it is replaced first when the specified storage capacity of the data store is reached.
[0144] FIG. 6 also depicts the creation of a third data store (i.e., data store C) from the contents of data store B147; that is, data store C153 is created from a previously created dynamic image. Here the request time is T5 and the recovery time is T7. Again, the recovery time may be a time before the corruption occurred. The operation of creating data store C153 from data store B147 is referred to as "stacking" because it creates a series of virtual data stores, where each data store is based on a prior image of another data store (or data stores).
[0145] In this example, data store C153 is based on a prior image of data store B147, and data store B147 is based on a prior image of data store A143. Thus, data store C153 may initially be provided from data stored in any of current store B148, time store B152, current store A144, and time store A146. The storage management device 38 can present an image of data store C153 to the host 34 based on the following resources: 1) current store B148 will be used for data from blocks that were overwritten between times T4 and T7 but have not been overwritten since time T7 (timeline segments (e) and (f)); 2) time store B152 will be used for data from blocks that have been overwritten since time T7 (timeline segment (g)); 3) current store A144 will be used for data from blocks that have not been overwritten since time T3 (timeline segments (a) and (b)); and 4) time store A146 will be used for data from blocks that were overwritten between times T3 and T4 (timeline segment (c)).
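These four cases can be collapsed into a single resolution routine. The sketch below is illustrative only and reuses the dictionary-based model of the earlier fragments; T3 is data store B's recovery time within data store A, and T7 is data store C's recovery time within data store B. All names are assumptions of this example.

```python
# Illustrative read resolution for data store C, stacked on data stores B and A.

def read_data_store_c(block, t3, t7,
                      current_store_a, time_store_a,
                      current_store_b, time_store_b):
    # Case 2: block of data store B overwritten since T7 -> its T7 contents
    # are held by the oldest time store B entry recorded after T7.
    entry = oldest_entry_after(time_store_b.get(block, []), t7)
    if entry is not None:
        return entry
    # Case 1: block written to data store B (between T4 and T7) but not since T7.
    if block in current_store_b:
        return current_store_b[block]
    # Case 4: block of data store A overwritten between T3 and T4 -> its T3
    # contents are held by the oldest time store A entry recorded after T3.
    entry = oldest_entry_after(time_store_a.get(block, []), t3)
    if entry is not None:
        return entry
    # Case 3: block of data store A untouched since T3.
    return current_store_a.get(block)

def oldest_entry_after(entries, t):
    """Oldest copy-on-write entry made after time t (it holds the data as of t)."""
    later = [(ts, d) for (ts, d) in entries if ts > t]
    return min(later)[1] if later else None
```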
[0146] The current store C154 and the time store C156 will be allocated as described above. Storage management device 38 processes read and write requests directed to data store C153 in a manner similar to the processing described for data store B147. One difference, however, is that in order to locate the contents of data store C153, the number of data stores that must be searched increases to include current store A144, time store A146, current store B148, and time store B152. The process of creating a dynamic image from a previous data store image can be extended as required by the application within the system's storage capacity. For example, a dynamic image may be created from a previous image of data store C153, thereby creating a fourth data store, e.g., data store D. Additionally, the static image may be created from any of the previous images of the data store using the methods described above, such as creating a clone of data store A143 at time T3, and the like.
[0147] FIGS. 7A and 7B provide another illustrative example of the operation of the current store and the time store for a given data store. FIG. 7A shows the contents of the current store, and FIG. 7B shows the contents of the time store associated with the current store of FIG. 7A. A timeline is drawn at the top of each figure to indicate an initial time t0, a first write time t1, a second write time t2, a third write time t3, and a final time t4. The numbers 0-5 appearing on the left side of FIGS. 7A and 7B identify six blocks of the data store. As mentioned, a data store may be comprised of any number of blocks or other storage units. In addition, the data store may be implemented as any type of resource that stores digital data, including a virtual disk, a logical disk, a physical disk, and so forth.
[0148] The data stored at each point in time is shown within a solid box. Each of blocks 0-5 of the current store has a corresponding block in the time store. When a write request is directed to a block, the data being written is shown in a dashed box adjacent to the corresponding block of the current store in FIG. 7A. This represents data that is pending, i.e., data that is transferred to the current store upon completion of the copy-on-write command.
[0149] In operation, at time t0, data a, b, c, and d exist in blocks 0-3 of the current store, respectively. At this time, blocks 4 and 5 do not contain any data. In addition, the time store does not contain any data, because no write requests directed to blocks 0-5 have yet reached the current store. At time t1, data X, Y, and Z are written to blocks 2-4, respectively. A copy-on-write operation is performed on each of blocks 2-4, and the old data present in those blocks is read from the current store and written to the time store, i.e., data c, d, and an empty block are written to blocks 2-4 of the time store, respectively. As shown in the current store at time t2, the newly written data appears in blocks 2-4 once the write operation at time t1 completes. A second write operation is performed at time t2, at which time data 7, 9, and 8 are written to blocks 0, 3, and 5, respectively. Copy-on-write operations are again performed, with the result that old data a, Y, and an empty block are written to blocks 0, 3, and 5 of the time store, respectively. At time t3, a third write operation is performed and data Q is written to block 5. The data 8 that was previously written to block 5 at time t2 is read and written to block 5 of the time store, and the new data Q is written to block 5 of the current store, so that data Q appears in block 5 of the current store at time t4. Because no write operation is performed at time t4, no new entries are made in the time store at time t4.
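The sequence of FIGS. 7A and 7B can be replayed with a toy model of a current store and its time store. The class below is a purely illustrative sketch; its name and structure are assumptions of this example.

```python
# Illustrative replay of the write sequence of FIGS. 7A and 7B.

class ToyDataStore:
    def __init__(self, initial):
        self.current = dict(initial)    # current store: block -> data
        self.time = {}                  # time store: block -> list of (timestamp, old data)

    def write(self, timestamp, updates):
        for block, data in updates.items():
            # Copy-on-write: preserve the old contents (possibly empty) first ...
            self.time.setdefault(block, []).append((timestamp, self.current.get(block)))
            # ... then record the new contents in the current store.
            self.current[block] = data

store = ToyDataStore({0: 'a', 1: 'b', 2: 'c', 3: 'd'})   # state at t0
store.write(1, {2: 'X', 3: 'Y', 4: 'Z'})                 # t1: c, d, <empty> go to the time store
store.write(2, {0: '7', 3: '9', 5: '8'})                 # t2: a, Y, <empty> go to the time store
store.write(3, {5: 'Q'})                                  # t3: 8 goes to the time store
# At t4 the current store holds {0: '7', 1: 'b', 2: 'X', 3: '9', 4: 'Z', 5: 'Q'},
# matching FIG. 7A, and the time store matches FIG. 7B.
```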
[0150] FIG. 8 is based on the sequence of copy-on-write operations performed on the data store as shown in FIGS. 7A and 7B. FIG. 8 demonstrates how a prior image of the current store is generated at request time t4 in order to restore the image representing the data store at recovery time t1. Because no write operations were performed on blocks 0, 1, and 5 at time t0 or time t1, blocks 0, 1, and 5 of the prior image are composed of data from the current store at time t1. Because data was written to blocks 2, 3, and 4 at time t1, the data from the time store is used for blocks 2, 3, and 4 of the prior image at time t1. Thus, the prior image of the data store at time t1 does not reflect the results of changes to the current store that occurred after time t1.
[0151] Referring now to FIG. 9, in one embodiment, the storage management device 238 includes one or more processor modules 278, 278', 278 ", generally 278. Although 3 processor modules 278 are shown in the figures for illustration purposes, there may be any number of processor modules 278.
[0152] Each processor module 278 includes a central processor (CPU) 290, which communicates with each of a target interface 292, a ROM294, a memory 296, and an initiator interface 298. The CPU290 may be implemented in one or more integrated circuits and may include other "glue" logic (not shown) for interfacing with other integrated circuits, e.g., bus interfaces, clocks, and communication interfaces. The CPU290 executes software provided in the ROM294 and in the memory 296; such software may be accessed, for example, through the internal network interface 284 or from the physical store 36.
[0153] The CPU290 also communicates with an internal network interface 284, which connects the processor module 278 to an internal network 286 that allows the processor modules 278 to communicate with one another. The internal network 286 may be implemented as one or more actual networks, and may be any type of network having sufficient capacity to allow the transfer of both control information and data. The internal network 286 may include a shared serial or parallel bus, or some combination thereof. The internal network may be or include any type of physical network that implements a remote direct memory access interface (e.g., InfiniBand, Ethernet, Fibre Channel, SCSI, etc.). In one embodiment, the interface is the Direct Access Provider Library (DAPL).
[0154] In one embodiment, the processor module 278 plugs into a backplane that enables connections for the internal network 286. In one implementation, one or more sets of processor modules 278 are rack-mounted (rack mount) within the storage management device 238, and the internal network 286 also connects each rack (rack) to other racks within the storage management device 238. The distributed processing implemented in the storage management device 238 creates a system whose size (e.g., storage capacity, processing speed, etc.) can be easily scaled up or down to fit the required capacity.
[0155] The target interface 292 provides an interface that enables the processor module 278 itself to appear as one or more target data storage devices. For example, if the target interface 292 is a fibre channel interface, the target interface 292 allows the processor module 278 to present one or more fibre channel devices to a host (not shown). The target interface 292 may implement any suitable networking communication or data storage protocol. The target interface 292 may be implemented with one or more integrated circuits that preferably have direct storage access to portions of the memory 296 for storing received data and data to be transmitted. Typically, target interface 292 will require initialization and programming performed by CPU 290.
[0156] The initiator interface 298 provides an interface that allows the processor module 278 itself to appear as one or more hosts for communicating with physical data stores. For example, if the initiator interface 298 is a fibre channel interface, the initiator interface 298 allows the processor module 278 to communicate with one or more physical storage devices over the fibre channel interface. Initiator interface 298 may implement any suitable networking communication or data storage protocol. The initiator interface 298 may be implemented using one or more integrated circuits that preferably have direct storage access to portions of the memory 296 for storing received data and data to be transmitted.
[0157] The processor modules 278 may be implemented in a fault-tolerant configuration in which two processor modules 278 are each responsible for responding to I/O requests directed to the same units of storage. In a version of this embodiment, fault tolerance may be further enhanced by having multiple pairs of processor modules 278 share responsibility for the I/O requests directed to the storage units of a single physical or logical device (or volume). For example, first and second processor modules 278 may be made responsible for blocks 100-200 of current store A, while third and fourth processor modules 278 may be made responsible for blocks 300-500 of current store A. Fault tolerance may be further improved by locating processor modules 278 that perform the same tasks in separate racks.
[0158] Referring now to FIG. 10, in the functional description of the system elements, again, three processor modules 378, 378', 378 ", generally 378, are shown in the storage management device 338. The number of modules 378 is (again) merely illustrative, and the number of processor modules 378 may be increased or decreased in view of scalability, performance, and cost. The functional elements shown on each processor module 378 may be implemented using hardware and/or software; typically, both hardware and software are used to implement each of these elements.
[0159] In one embodiment, each processor module 378 of the storage management device 338 includes at least one host interface 361, the host interface 361 for communicating with a host, the I/O manager 362, the storage buffers 363, and the physical storage interface 364. In another embodiment, each processor module 378 includes more or less of these functional elements. In various embodiments, the storage management device 338 also includes an internal network 380 (e.g., an internal InfiniBand network, an internal Ethernet network, an internal fibre channel network, and/or an internal SCSI network), the internal network 380 being used to enable communication between functional elements of a single processor module 378 (e.g., the host interface 361, the I/O manager 362, the storage buffer 363, and the physical storage interface 364), to enable communication between any functional element of a first processor module 378 and any functional element of a second processor module 378, to enable communication between one or more components of the same functional element (e.g., to enable communication between the target mode driver 382 and the data classifier 384 of the host interface 361), and to enable communication between a component of one functional element and another functional element (or a component of the other functional element), whether on the same or a different processor module 378.
[0160] In one embodiment, the host interface 361 includes a target mode driver 382, the target mode driver 382 including a target interface 292 (see FIG. 9) and software for communicating with the target interface 292. Functionally, the target mode driver 382 communicates with the host 34 over any type of communication link 40 (e.g., a fibre channel network) as described above. Thus, the target mode driver 382 receives and responds to incoming I/O requests from the host 34.
[0161] In one embodiment, the target mode driver 382 receives an I/O request that includes control information and, in some cases, a data payload; the request may be, for example, a write operation, a read operation, or a request for a history of modifications to a location within the data store, as described below. The target mode driver 382 may obtain the requested data from the I/O manager 362, for example in response to a read operation, and may thereafter pass the requested data to the host 34. In response to a write operation, the target mode driver 382 initially stores the received write operation in a first storage buffer 363 located on the same processor module 378 as the target mode driver 382. In one embodiment, the target mode driver 382 then separates the write operation into its associated control information and data payload, such that both the control information and the separated data payload are initially stored in the first storage buffer 363. In one embodiment, the I/O request is separated into a data payload and a control packet by the host interface 361. The control information may then be sent via the internal network 380 to other components within the storage management device 338. For example, in one embodiment, the target mode driver 382 sends the control information to the data classifier 384. The data payload, or a copy thereof, may likewise be sent via the internal network 380 to other components within the storage management device 338. Ultimately, the data payload will be transferred over the internal network 380 to the appropriate physical store interface 364, as directed by the I/O manager 362. Preferably, the data payload is transferred via hardware direct memory access, without software processing.
[0162] In one embodiment, the target mode driver 382 timestamps the control information before sending the control information to the data classifier 384 and before acknowledging received I/O requests to the host 34. In other words, the target mode driver 382 associates the control information with the time at which the control information is received at the host interface 361. For example, where the target mode driver 382 sends control information to the data classifier 384 in a data packet, the target mode driver 382 may use a field within the data packet to indicate a time at which the control information was received at the host interface 361. Any other method of time stamping the control information may also be used.
[0163] In one embodiment, after the target mode driver 382 separates the data payload of the write operation from the control information of the write operation, and in addition to transmitting the control information to the data classifier 384, the target mode driver 382 replicates the separated data payload to create at least one data payload copy. In one embodiment, the target mode driver 382 then evaluates a first cost equation, as described below, and based on the result, identifies a second storage buffer 363 that is best able to at least temporarily store the first data payload copy. In one embodiment, the first storage buffer 363 and the second storage buffer 363 are different storage buffers 363, for example located in different processor modules 378. Optionally, the target mode driver 382 may then also evaluate a second and/or further cost equation(s), as described below, and based on the results, may identify a third and/or further storage buffer(s) 363 able to store the second and/or further data payload copies. The first, second, third, and any further storage buffers 363 may each be a different storage buffer 363. The target mode driver 382 may then send the first data payload copy to the second storage buffer 363 and, optionally, may send the second and/or further data payload copies to the third and/or further storage buffers 363. In this way, the storage management device 338 can provide redundant storage of the data, whether temporary or permanent.
[0164] In one embodiment, the host interface 361 also includes a data classifier 384. The data classifier 384 is in communication with the target mode driver 382 of the host interface 361 and is also in communication with the plurality of I/O managers 362. The data classifier 384 receives control information for the I/O request from the target mode driver 382, identifies the appropriate processor module 378 to respond, and forwards the control information to the I/O manager 362 of the appropriate processor module 378.
[0165] In one embodiment, the data classifier 384 classifies I/O requests received by the target mode driver 382 at the host interface 361 as being of a particular type (e.g., write operations, read operations, or requests for a modification history). In one embodiment, the data classifier 384 analyzes control information of the received I/O request to classify the I/O request. The data classifier 384 also classifies the control information by comparing incoming I/O requests to generated subscription (subscription) requests, such as those generated by the I/O manager 362 as described below. In one embodiment, the data classifier 384 determines a processing group, a storage identifier (e.g., logical unit), a storage unit identifier, and a length for each I/O request. This information is passed to the appropriate I/O manager 362 along with control information, timestamps, and I/O request types. To allow processing of a large number of I/O requests, the storage buffers 363 temporarily store these packets of information from the data classifier 384 as they are sent to the respective I/O manager 362.
[0166] In more detail, the plurality of I/O managers 362 of the storage management device 338 are all responsible for managing data storage. In one embodiment, each of the plurality of I/O managers 362 subscribes to at least one set of locations within the data store for which it will process control information received from the data classifier 384, via a subscription protocol (e.g., as described below). Thus, when the control information of an I/O request received at the host interface 361 includes an operation to be performed at a first location within the data store, the data classifier 384 may identify a first one of the plurality of I/O managers 362 capable of processing the control information based on the subscriptions of the plurality of I/O managers 362. Further, in one embodiment, if a first one of the plurality of I/O managers 362 fails, the data classifier 384 may also identify a second one of the plurality of I/O managers 362 capable of handling control information, again based on the subscriptions of the plurality of I/O managers 362.
[0167] In one embodiment, after the data classifier 384 receives the control information from the target mode driver 382, the data classifier 384 replicates the control information to create a copy of the control information. In one such embodiment, the data classifier 384 transmits the control information to the first one of the plurality of I/O managers 362 identified as described above and instructs the first I/O manager 362 to process the control information. The data classifier 384 may also send a copy of the control information to a second one of the plurality of I/O managers 362 identified as described above, and may instruct the second I/O manager 362 to temporarily store the copy of the control information instead of processing the copy of the control information. For example, a copy of the control information may be stored in the storage buffer 363 of the processor module 378 with a second of the plurality of I/O managers 362 residing in the storage buffer 363 of the processor module 378. Thus, in one embodiment, the storage management device 338 maintains redundant copies of control information so that in the event of a failure of a first I/O manager 362, the control information can be processed by a second I/O manager 362.
[0168] In one embodiment, the control information of a first I/O request directs the I/O manager 362 to operate at a first location within the data store. The control information of other I/O requests may direct the I/O manager 362 to operate at a second location within the data store that at least partially overlaps the first location. In this case, the I/O manager 362 processes the control information with the earliest timestamp first. Thus, by time-stamping the control information of I/O requests, the target mode driver 382 can effectively ensure that, for I/O requests directed to overlapping locations within the data store, the I/O manager 362 processes the control information of the request for the first location before it processes the control information of other I/O requests having later timestamps.
[0169] Once the I/O manager 362 has received the control information and been instructed by the data classifier 384 to process it, the I/O manager 362 orders and manages the I/O requests and forwards the appropriate instructions to the physical store interface 364. In one embodiment, the I/O manager 362 processes the control information and monitors and indexes the flow of information within the storage management device 338. For example, the I/O manager 362 monitors and indexes the flow of information to and from the other processor modules, the host interface 361, and the physical store interface 364. The I/O manager 362 also manages I/O and ensures that modified units of storage are preserved and remain accessible for the creation of prior images that may be referenced in the future. In addition, the I/O manager 362 tracks the performance (e.g., response time) of the storage management device 338 in responding to I/O requests from the host 34.
[0170] The I/O manager 362 can also execute various optimization routines to provide the host with efficient response times to I/O requests. For example, because the storage management device may be used in very large capacity storage systems 30, including storage systems having terabyte-scale storage capacity, optimization of copy-on-write commands may be desirable. A copy-on-write command may require at least two consecutive operations before the new data is written to a target storage address: (a) reading the existing data from the target storage address and (b) writing the existing data to a new storage address. In one embodiment, the storage management device implements certain optimizations, either alone or in combination. These optimizations generally fall into one of five categories: (i) aggregation; (ii) spanning; (iii) redundant writes; (iv) reordering; and (v) live storage. Each of these optimizations allows for more efficient processing, particularly of copy-on-write operations.
[0171]1) Aggregation. The first optimization is aggregation. The storage management device 338 may aggregate separate copy-on-write commands for contiguous storage units (e.g., storage units in adjacent blocks) and perform them as a single copy-on-write operation. This can be useful because, when adjacent blocks are operated on as a whole, the overhead associated with performing separate physical disk reads and writes for each block is eliminated.
[0172]2) Spanning. The aggregation optimization can be extended further by combining separate copy-on-write commands directed to storage units that are non-contiguous but close to one another into a single copy-on-write command that spans, in addition to all of the target storage units, the storage units that lie between them. For example, where five storage units 100, 200, 300, 400, and 500 are arranged in the order shown relative to one another, copy-on-write commands directed to blocks 100, 300, and 500 may instead be replaced by a single copy-on-write command spanning blocks 100-500. Although additional data is read and operated on, a single spanning operation that includes the additional data can still be significantly faster than three separate disk operations.
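As an illustration of how aggregation and spanning might collapse individual copy-on-write targets into contiguous ranges, consider the following sketch. It is not part of the described device; the max_gap threshold and the (start, end) range representation are assumptions of this example.

```python
# Illustrative merging of copy-on-write targets into contiguous block ranges.

def merge_copy_on_write(blocks, max_gap):
    """Merge block numbers into (start, end) ranges.

    max_gap = 0 aggregates only adjacent blocks; a larger value also spans
    non-contiguous blocks that are close to one another."""
    ranges = []
    for block in sorted(blocks):
        if ranges and block - ranges[-1][1] <= max_gap + 1:
            ranges[-1] = (ranges[-1][0], block)   # extend the current range
        else:
            ranges.append((block, block))         # start a new range
    return ranges

# Aggregation: adjacent blocks collapse into one operation.
print(merge_copy_on_write([100, 101, 102, 300], max_gap=0))   # [(100, 102), (300, 300)]
# Spanning: blocks 100, 300, and 500 become one command covering blocks 100-500.
print(merge_copy_on_write([100, 300, 500], max_gap=200))      # [(100, 500)]
```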
[0173]3) Redundant writes. A redundant write optimization may be achieved by identifying a first storage unit that is likely to be the target of a host write request. Data written to the first block is also written to a second block, and an index tracks the address of each storage location. A subsequent write command directed to the block may then cause one of the two blocks to be overwritten instead of performing a copy-on-write; the unchanged block can thus serve as a historical copy of the block.
[0174]4) Reordering. With the reordering optimization, incoming I/O requests may be reordered so as to maximize the benefits of one or more of the other optimizations (e.g., the aggregation, spanning, and redundant write optimizations).
[0175]5) Live storage. In some cases, efficiency can be improved by keeping data in memory rather than in physical storage. For example, if certain blocks receive a large number of I/O requests (e.g., they are frequently updated), many read/write operations can be saved by keeping the data in memory. In one embodiment, this memory is the memory 296 (FIG. 9) located in the processor module 378.
[0176] The storage buffers 363 may store, at least temporarily, data payloads, data payload copies, control information, and copies of control information that are being processed within the storage management device 338. In one embodiment, a plurality of storage buffers 363 are in communication with one or more of the target mode drivers 382. In such an embodiment, data received by a target mode driver 382, and any copies of that data made by the target mode driver 382, are stored in one or more storage buffers 363 until they are passed to the physical store 36 through the physical store interface 364 or to another processor module 378 via the internal network 380. A storage buffer 363 includes memory 296 (see FIG. 9) that is allocated in such a way as to allow the various devices to transfer data without software processing of the data.
[0177] The physical store interface 364 communicates with the physical store 36 over any type of communication link 42 (e.g., a fibre channel network) as described above and with a plurality of I/O managers 362, one or more host interfaces 361, and a plurality of storage buffers 363 via an internal network 380. For example, in response to a read request, the physical store interface 364 retrieves data stored on the physical store 36, which is ultimately provided to the host interface 361 for communication to the host 34. For write requests, the physical store interface 364 forwards the data payload to the target storage location of the physical store 36.
[0178] After the I/O manager 362 has processed the control information of an I/O request initially received by the target mode driver 382 at the host interface 361, the I/O manager 362 can instruct the physical store interface 364 to communicate with one or more of the physical stores 36. In one embodiment, the I/O manager 362 instructs the physical store interface 364 to read data from the physical store 36. For example, the I/O manager 362 may have processed the control information of a write operation, and the physical store interface 364 is then instructed to read data from the physical store 36 in order to perform a copy-on-write operation. Alternatively, the I/O manager 362 may have processed the control information of a read operation, and the physical store interface 364 is then instructed to read data from a particular location within the physical store 36. When so instructed by the I/O manager 362, the physical store interface 364 reads the data from the physical store 36.
[0179] In another embodiment, the I/O manager 362 processes control information including a write operation of a data payload, but the data payload, which has been previously separated from the control information by the target mode driver 382 as described above, will have been stored in the first storage buffer 363. In such an embodiment, in addition to instructing the physical store interface 364 to communicate with the physical store 36, the I/O manager 362 also instructs the physical store interface 364 to communicate with the first store buffer 363. Thus, as indicated by the I/O manager 362, the physical store interface 364 retrieves the data payload from the first storage buffer 363 and writes the data payload to a location within the physical store 36.
[0180] Once the data payload is securely stored to a location within the physical memory 36, the I/O manager 362 can delete, tag delete, or tag replace one or more data payloads previously (redundantly) stored in the second and/or additional storage buffer(s) 363. Similarly, once the control information has been processed by the I/O manager 362, the I/O manager 362 can delete, tag delete, or tag replace a copy of the control information previously stored in the storage buffer 363 of the processor module 378 on which the second I/O manager 362 is located.
[0181] Referring now to FIG. 11, each processor module 378 (FIG. 10) is responsible for the I/O requests made to a particular portion of the data store. Each I/O manager 362 is responsible for managing and fulfilling the I/O requests directed to the portions of the data store to which its processing module is assigned. In one embodiment, each I/O manager 362 is assigned a contiguous set of blocks of the data store, for example blocks 100-500 of data store A. Each processor module 378 may employ multiple I/O managers 362. The assignment of an I/O manager 362 to the portion of the data store for which it is responsible takes place through a subscription protocol. In one embodiment, the subscription protocol may be implemented by having each of the plurality of I/O managers 362 register, with each of the data classifiers 384, the one or more portions of the data store for which it will handle data operations (e.g., read operations or write operations).
[0182] Each I/O manager 362 may be responsible for multiple current stores and multiple time stores that are managed by the current store controller 472 and the functional storage module 474. In one embodiment, the storage management device 338 maintains a database that associates each I/O manager 362 with contiguous groups of blocks assigned to the respective I/O manager 362. The data classifier 384 associated with the I/O manager 362 employs the database to ensure that each I/O manager only performs the tasks associated with the blocks assigned to it. In one embodiment, the method allows a subset of the total number of I/O managers 362 in the storage management device 338 to service a single time store, while other subsets of I/O managers 362 may service additional time stores. This approach is also scalable in that increasing the number of I/O managers 362 will increase the amount of time storage that the storage management device 338 can efficiently service. Also, the method may be used with a single physical store 36 containing multiple time stores and multiple current stores. Because this approach uniquely identifies each data store, only minimal additional information is required to associate each I/O manager 362 with a particular storage unit(s). In one embodiment, the data store block number, time store block number, and timestamp are the only additional information required.
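The routing implied by the subscription protocol of paragraphs [0181] and [0182] can be pictured with a small sketch: each I/O manager registers the block ranges it services, and a data classifier consults that registry to forward control information. Everything below (the class, its methods, and the identifiers) is an assumption of this illustrative example.

```python
# Illustrative subscription registry used by a data classifier to route control
# information to the responsible I/O manager(s).

class SubscriptionTable:
    def __init__(self):
        self.entries = []    # (data_store, first_block, last_block, io_manager)

    def subscribe(self, io_manager, data_store, first_block, last_block):
        """An I/O manager registers the block range it will service."""
        self.entries.append((data_store, first_block, last_block, io_manager))

    def managers_for(self, data_store, block):
        """All I/O managers subscribed to this block (e.g., a primary and a backup)."""
        return [mgr for (ds, lo, hi, mgr) in self.entries
                if ds == data_store and lo <= block <= hi]

subs = SubscriptionTable()
subs.subscribe("io_manager_1", "data_store_A", 100, 500)
subs.subscribe("io_manager_2", "data_store_A", 100, 500)    # redundant pair
assert subs.managers_for("data_store_A", 250) == ["io_manager_1", "io_manager_2"]
```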
[0183] In one embodiment, the I/O manager 362 maintains a series of tables of control information, each corresponding to a particular window in time. For example, all I/O processed by the I/O manager 362 between 9:00 and 9:05 may be stored in a single table, while I/O occurring between 9:05 and 9:10 is stored in another table. In one version of this embodiment, the tables have a fixed size. A fixed table size allows the processing time of each query to a table to be readily determined, because all tables except the table currently in use are full; thus, the processing time is the same for all tables other than the current table. Although the table size is fixed, the time period covered by each table is variable, because of the variable frequency of write commands and the variable size of the target storage units associated with each command. For example, if the associated I/O manager 362 processes on average 200,000 write commands every 3,000 time units, a table limited to 600,000 entries will fill in 9,000 time units; however, if the associated I/O manager 362 receives 200,000 write commands every 1,000 time units, the same size table will fill in 3,000 time units. In one version of this embodiment, each table entry includes the data store block number, the time store block number, and a timestamp indicating when the associated copy-on-write operation was performed.
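The table rotation described here and in the following paragraphs can be sketched as follows. This is an illustrative model only; the 600,000-entry limit follows the example above, and the class layout, field ordering, and method names are assumptions of this example.

```python
# Illustrative fixed-size control-information tables with a master table index.

TABLE_CAPACITY = 600_000

class ControlInfoTables:
    def __init__(self):
        self.master = []                 # (table_name, first_timestamp, last_timestamp)
        self.sealed = {}                 # table_name -> list of entries
        self._start_new_table()

    def _start_new_table(self):
        self.current_name = "table_%d" % (len(self.master) + 1)
        self.current_table = []          # (data_store_block, time_store_block, timestamp)

    def record(self, data_store_block, time_store_block, timestamp):
        """Record one copy-on-write operation; seal the table when it fills."""
        self.current_table.append((data_store_block, time_store_block, timestamp))
        if len(self.current_table) >= TABLE_CAPACITY:
            self._seal_current_table()

    def _seal_current_table(self):
        first = self.current_table[0][2]
        last = self.current_table[-1][2]
        # The master table indexes each sealed table by the time range it covers.
        self.master.append((self.current_name, first, last))
        self.sealed[self.current_name] = self.current_table
        # (A per-table region bitmap would also be built here; see the later sketch.)
        self._start_new_table()
```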
[0184] When the table is filled, the I/O manager 362 does three things:
[0185] 1) The I/O manager 362 creates a new table for new incoming write operations.
[0186] 2) The I/O manager 362 creates an entry in a separate table (e.g., a master table) that describes and indexes these tables of control information. The master table contains the table name and the time range covered by the table, i.e., from the time the table was created to the time the last entry was recorded in the table. In one embodiment, the master table is local to the I/O manager 362 with which it is associated.
[0187] 3) The I/O manager 362 creates a bitmap representing all of the I/O in the given table. The bitmap has one bit for a given block range, and the block range represented by each bit is adjustable; thus, in one embodiment, bit 0 represents blocks 0-15, bit 1 represents blocks 16-31, and so on. The amount of data represented by each bit is referred to as the region size.
[0188] The region size is also adjustable. The closer the region size is to the average I/O request size, or to the minimum I/O request size, the lower the probability that a bit produces a false positive. In one embodiment, the minimum I/O request size is one sector, or 512 bytes. In operation, if the region size is 128 kilobytes, the first bit will be set when the user writes data to blocks 2-10. However, if the bitmap is later used to determine whether block 85 is referenced in the underlying data, the bitmap will give a false positive indication.
[0189] As the region size decreases, the number of false positives decreases, in practice approaching zero. However, as the region size decreases, more memory and disk space are required to store the bitmap. Conversely, as the region size increases, the number of false positives increases while the memory requirements of the bitmap decrease. In one embodiment, each I/O manager chooses a region size that dynamically balances the false-positive rate against the bitmap size.
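The following sketch (illustrative only; the class name and structure are assumptions, not the implementation) shows the region-bitmap behavior described above, including the false positive for block 85 when the region size is 128 kilobytes and blocks are 512 bytes:

```python
# Minimal sketch of the per-table region bitmap. With 512-byte blocks and a
# 128-kilobyte region, each bit covers 256 blocks, so a write to blocks 2-10
# and a later query for block 85 hit the same bit: the false positive
# described above.
BLOCK_SIZE = 512                      # bytes per block (one sector)

class RegionBitmap:
    def __init__(self, region_size_bytes: int):
        self.blocks_per_bit = region_size_bytes // BLOCK_SIZE
        self.bits = set()             # indices of set bits

    def record_write(self, first_block: int, length: int) -> None:
        for blk in range(first_block, first_block + length):
            self.bits.add(blk // self.blocks_per_bit)

    def may_contain(self, block: int) -> bool:
        """True means the block *might* be in this table (false positives possible)."""
        return (block // self.blocks_per_bit) in self.bits

bm = RegionBitmap(region_size_bytes=128 * 1024)   # 256 blocks per bit
bm.record_write(first_block=2, length=9)          # user writes blocks 2-10
print(bm.may_contain(85))                         # True -> false positive for block 85
```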
[0190] In one embodiment, when a table reaches capacity and a new table is started, the impact of the operations required by the I/O manager to close, or "seal," the full table is minimized because the table transition is performed asynchronously with respect to the continuing I/O stream.
[0191] When a particular recovery time is needed to generate a time-based data store (e.g., data store B), the I/O manager 362 performs three general levels of operation.
[0192] 1) The I/O manager 362 first identifies the tables that are involved. If the user requests a recovery time of T-500, the I/O manager 362 scans the master table for the tables of control information that include I/O operations occurring between T-500 and the request time. The I/O manager then retrieves the bitmap of each control-information table that includes qualifying I/O operations.
[0193] 2) The I/O manager 362 then creates a master bitmap by OR'ing all of the retrieved bitmaps together, and saves the individual bitmaps and the master bitmap. Once the OR operation is complete, the master bitmap can be used to evaluate each potential read request to determine whether the requested block was involved in a prior write operation (i.e., one between T-500 and the request time). If a block was not involved in a write operation during that period, the data for that block is taken from the current store. Retrieving and presenting data from the current store is essentially a real-time operation. If the region bit is set in the master bitmap, the I/O manager 362 scans the individual bitmaps from oldest to newest to determine which has the bit set for the region, and then scans the underlying table of that bitmap for the location of the I/O operation in the time store. These operations are slow compared with retrieving data from the current store, but they proceed concurrently with the rest of the system's operation.
[0194] 3) The I/O manager 362 begins creating a region map whereby a copy of the blocks described in each individual control-information table is stored in memory. When this operation is complete, the latency of read requests that must go to the time store to obtain data is reduced, because such requests are redirected to the in-memory map and rarely, if ever, require any additional table scans.
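For illustration, the following sketch (an assumed data layout, not the actual code) outlines the three-step lookup just described: selecting the control-information tables that cover the recovery window, OR'ing their bitmaps into a master bitmap, and routing each read either to the current store or to a table scan:

```python
# Minimal sketch of the recovery-time lookup. master_table is assumed to be a
# list, ordered oldest to newest, of tuples:
#   (start_time, end_time, bitmap_bits: set, entries)
# where each entry is (data_store_block, time_store_block, timestamp).
def build_master_bitmap(master_table, recovery_time, request_time):
    selected = [t for t in master_table
                if t[1] >= recovery_time and t[0] <= request_time]
    master_bits = set()
    for _, _, bits, _ in selected:        # the "OR" of all retrieved bitmaps
        master_bits |= bits
    return master_bits, selected

def locate_block(block, blocks_per_bit, master_bits, selected_tables):
    if (block // blocks_per_bit) not in master_bits:
        return ("current_store", block)   # fast path: block never overwritten
    # Slow path: scan the tables oldest to newest for the copy-on-write record.
    for _, _, _, entries in selected_tables:
        for data_block, time_store_block, _ts in entries:
            if data_block == block:
                return ("time_store", time_store_block)
    return ("current_store", block)       # the bitmap bit was a false positive
```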
[0195] The response time of the storage management device 338 is reduced by the foregoing method because the I/O manager 362 begins servicing requests as soon as the first step is complete. In most applications, the current store will provide the majority of the data required to generate the time-based data store, since the image will most often be generated for a relatively recent point in time, e.g., one minute, one hour, or one day ago. The amount of data that typically changes during such periods is small compared with the entire data store. Each table may contain 500,000 to 5,000,000 records, yet each remains searchable in a fixed time. A master table having only a few thousand entries may be used in applications that support 2 gigabytes of physical storage 36.
[0196] Referring to FIG. 11, the current store controller 472 processes requests directed to the device/storage-unit combinations to which it has subscribed. Each current store controller 472 receives the resulting control information, which is sent from the host interface 361 (FIG. 10) to the I/O manager 462 over the control plane 568 (FIG. 12). The current store controller 472 creates a work order based on this control information to ensure that the data associated with the request is written to a logical unit and that the old data currently present at the target location is copied and saved elsewhere by the storage management device 538.
[0197] Similarly, the time store controller 476 processes requests directed to the device/storage-unit combinations to which it has subscribed. Each subscription is registered with the data classifier 384 of the processor module 378.
[0198] The I/O manager 362 also includes an I/O router 470. The I/O router 470 is a software module responsible for moving data as directed by the current store controller 372 and the time store controller 376.
[0199] Although only one each of the I/O router 470, the current store controller 472, the functional storage 474, and the time store controller 476 is shown, the I/O manager 362 may include one or more of each of these elements. Furthermore, these elements may communicate in configurations other than the one shown in FIG. 11. For example, in one embodiment, the I/O manager 462 includes a plurality of time store controllers 476.
[0200] Referring now to FIG. 12, in another embodiment and a more abstract representation, the storage management device 538 includes a data plane 566 and a control plane 568 over which the various modules communicate with one another. The storage management device 538 includes a plurality of host interfaces 561, I/O managers 562, and physical storage interfaces 564. Although, as shown in the preceding figures, each of these components is located on a particular processor module, they may be viewed collectively as sets of components that work together to share the load, for both efficiency and fault tolerance.
[0201] The host interfaces 561 and the physical storage interfaces 564 communicate data with one another over the data plane 566, which is implemented using direct memory access and the internal network 380 (FIG. 10), as described above. On the control plane 568, control information (e.g., control packets, metadata packets) is passed between the host interfaces 561 and the I/O managers 562, and between the I/O managers 562 and the physical storage interfaces 564. The control plane 568 is implemented using intra-processor communication mechanisms and the internal network 380 (FIG. 10). Data payloads pass between the host interfaces 561 and the physical storage interfaces 564 over the data plane 566.
[0202] The optimizations described above are achieved, in part, through the queuing system employed by the storage management device 338. The queuing system organizes the control information (e.g., control packets, metadata packets) that is processed by the I/O manager 362. The control information is first placed on an incoming queue, where the I/O manager 362 queues the control information in the order it was received.
[0203] In one embodiment, control packets are combined, reordered, and/or strategically delayed in order to process the packets more efficiently. Referring again to FIG. 10, the I/O manager 362 identifies and tracks idempotent groups of control packets, i.e., groups of control packets that are independent of one another. Generally, idempotent groups can be processed more efficiently than other groups of packets, e.g., they can be processed more quickly. For example, if a first control packet directed to blocks 0-15 arrives at time T0 and a second control packet directed to blocks 8-31 arrives at time T5, the I/O manager 362 includes all operations from T0 to T4 in one idempotent group and begins another group at time T5 (provided that no other control packets in the interval from T0 to T5 overlap). In this example, the processing, grouping, and execution order are selected so that the T5 operation cannot occur before the T0 operation. If the T5 operation were performed first, the T0 operation's prior image would include a portion of the T5 payload (i.e., blocks 8-15), and the T5 operation's prior image would fail to capture the data written by the T0 operation, even though that data was present at time T1.
[0204] The storage management device 338 has many opportunities to create customized groups of control packets that improve processing efficiency because, for example, operations can be divided into "worker groups," where each worker group can run in a threaded, independent, and simultaneous manner. Determining that certain blocks are not idempotent, as described above, forces the I/O manager 362 to ensure that all operations referencing blocks 0-31 occur in the same worker group as the T0 and T5 operations, but operations referencing other block ranges may still be reordered. The I/O manager 362 thus continuously identifies, analyzes, and manages idempotent relationships across multiple queues using advanced queuing theory.
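The following sketch (illustrative; the grouping rule is a simplification of the queuing behavior described above, and all names are assumptions) shows how a stream of control packets might be split into idempotent groups whenever a packet's block range overlaps a range already in the current group:

```python
# Minimal sketch of idempotent grouping: a new group is started as soon as a
# packet's block range overlaps a range already in the current group, so the
# T0 (blocks 0-15) and T5 (blocks 8-31) operations land in different groups.
def overlaps(a, b):
    (a0, a1), (b0, b1) = a, b
    return a0 <= b1 and b0 <= a1

def group_packets(packets):
    """packets: list of (time, (first_block, last_block)); returns list of groups."""
    groups, current = [], []
    for t, rng in packets:
        if any(overlaps(rng, r) for _, r in current):
            groups.append(current)        # dependency found: seal the group
            current = []
        current.append((t, rng))
    if current:
        groups.append(current)
    return groups

stream = [(0, (0, 15)), (1, (100, 103)), (3, (200, 207)), (5, (8, 31))]
print(group_packets(stream))
# [[(0, (0, 15)), (1, (100, 103)), (3, (200, 207))], [(5, (8, 31))]]
```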
[0205] The system allows a user to create a new dynamic or static data store B that is an image of the primary data store A at a previous point in time, e.g., at T-500. The target mode driver 382 creates a target device representation on the first communication link 40, which allows the host 34 to issue commands to the new data store B. The I/O manager 362 uses the functional storage 474 to build a map of all blocks of data store B that cannot be satisfied from current store A, that is, the blocks that have been overwritten in current store A since the recovery time T-500. This map is updated continuously as the ongoing I/O stream directed to the primary data store A proceeds. For example, each time a user modifies a block of data store A, the target block in current store A no longer contains the data it held before time T-500, and the map records the location in time store A to which the prior contents of that block were copied. In this way, an I/O request directed to data store B locates the correct block contents. The map must be updated in step with the updates to current store A and time store A, so that I/O requests directed to data store B never identify the wrong block as the data source. For example, when a new block is written to data store A, the map is updated with the location in the time store of the previous contents of data store A. The storage management device 538 employs methods to ensure that I/O requests directed to data store B subsequently locate the correct data in a timely manner.
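By way of illustration, the following sketch (assumed structure only, not the patented implementation) shows the kind of block map that functional storage might maintain for data store B: the first post-recovery-time overwrite of a block records where the prior contents now reside in the time store, and reads of unmapped blocks fall through to current store A:

```python
# Minimal sketch of the data store B block map described above.
class DataStoreB:
    def __init__(self):
        self.block_map = {}                     # A-block -> time-store location

    def on_write_to_a(self, block, time_store_location):
        # Called in step with the copy-on-write so reads of B never see the new
        # contents of current store A for this block. Only the first post-T-500
        # overwrite matters: it preserved the block's contents as of recovery time.
        self.block_map.setdefault(block, time_store_location)

    def resolve_read(self, block):
        if block in self.block_map:
            return ("time_store", self.block_map[block])
        return ("current_store_A", block)

b = DataStoreB()
b.on_write_to_a(block=42, time_store_location=9001)   # block 42 changed after T-500
print(b.resolve_read(42))   # ('time_store', 9001)
print(b.resolve_read(7))    # ('current_store_A', 7) -- unchanged since T-500
```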
[0206] Modification history requests
[0207] In general, in another aspect, the invention features systems, methods, and articles of manufacture for providing a modification history for a location within a data store. In brief overview, in one embodiment of this aspect of the invention, a first computing device (e.g., a host as described above) specifies, to a second computing device (e.g., a storage management device as described above), a location (e.g., an address range) within a data store managed by the second computing device. The first computing device then requests, from the second computing device, a list of times at which at least a portion of the data stored at the specified location was modified. In one embodiment, the second computing device responds with a list of times at which portions of the data stored at the location were modified, and optionally identifies which portions of the location were modified at those times. Generally, if some portion of the data stored at the location has been modified, it will have been modified as a result of a write operation directed to that portion of the data store.
[0208] In one embodiment, a request for the modification history of a location within a data store is received by the second computing device in-band, i.e., the request comes from the first computing device via the same communication protocol that the first computing device uses to communicate data commands (e.g., read operations and write operations). In another embodiment, the request is received by the second computing device out-of-band. For example, the request is received via a communication protocol different from the one used by the first computing device to communicate data commands, via a different channel (e.g., via a user interface, such as a graphical user interface or a command line, on the console of a computing device other than the first computing device, such as, for example, the second computing device or another computing device such as an administrator's computing device or a computing device located at a third-party control center), or via some combination thereof.
[0209] This aspect of the invention can be useful, for example, when a user (e.g., a system administrator) notices a problem with data stored in the data store. The problem may be, for example, that data has been corrupted by a software or hardware malfunction or, as another example, that data has been overwritten by an application because of an administrator error. Upon determining the location(s) relevant to the problem, the administrator can query the device to determine the times at which the location(s) were last modified. Using this information, the administrator can then request that the data storage device present a prior image of the data store at a time just before each indicated time. In this way, the user is likely to identify the most recent available prior image in which the corrupted data is still intact.
[0210] Some applications and operating systems, for example, upon detecting certain errors in a data store, provide information about the particular data store location at which the error was detected, in order to facilitate debugging. When such location information is provided directly by the application, the query described above can be made using that location information. As another example, some applications and operating systems report errors associated with a particular file. Typically, operating system and/or file system tools can then be used to determine the data store locations allocated to those files by the operating system and/or file system. If the data store presented to an application (or operating system, device, etc.) is virtualized, the locations reported by the application (or operating system, device, etc.) may need to be translated (e.g., de-virtualized) to identify the corresponding locations in the data store as presented by the storage management device.
[0211] In one exemplary embodiment, a user of the data store is notified of a problem encountered by an application, such as a database application. The user determines the location(s) of the problem directly from the application, or indirectly using information provided by the application or operating system. For example, the user may make this determination by analyzing an application-specific or operating-system-maintained error log, using a software-based tool to facilitate the de-virtualization of I/O errors. The user then directs a query to the storage device to determine the times at which the location(s) were last modified. The query may be made, for example, from within the application, using a software-based tool otherwise provided on the user's computer, or directly to the storage device using a control panel, console, or other tool. The user receives the modification history (via the tool, etc.) and then requests that the storage device present one or more prior images (e.g., one at a time, or all together) at respective times just prior to the reported modification times. The user can then examine each prior image to identify the most recent available prior image in which the problem does not exist and, for example, copy data from that prior image into the data store, begin using the prior image, or take other action.
[0212] FIG. 13 illustrates one embodiment of a storage system 630 that can provide a modification history in accordance with this aspect of the invention. The storage system 630 includes a host 634, a storage management device 638, and physical storage 636. The host 634 and the storage management device 638 communicate with each other via a first communication link 640. The storage management device 638 and the physical store 636 communicate with each other via a second communication link 642. In general, the host 634, the storage management device 638, the physical storage 636, and the first and second communication links 640, 642 may possess the capabilities of, and may be implemented as, the host, the storage management device, the physical storage, and the first and second communication links, respectively, with the additional functionality described herein. It should be understood that other implementations are possible.
[0213] In one embodiment, the host 634 includes at least one host receiver 681 and one host transmitter 683. The host receiver 681 and the host transmitter 683 can each be implemented in any form, method, or manner useful for receiving and transmitting communications, such as, for example, requests, commands, and responses, respectively. In one embodiment, the host receiver 681 and the host transmitter 683 are implemented as software modules having hardware interfaces where the software modules are capable of interpreting communications, or the necessary portions thereof. In another embodiment, the host receiver 681 and the host transmitter 683 are implemented as a single host transceiver (not shown). The host 634 uses a host receiver 681 and a host transmitter 683 to communicate with the storage management device 638 over the first communication link 640.
[0214] In one embodiment, the storage management device 638 includes at least one storage management device receiver 687, a determination module 689, and a storage management device transmitter 691. Again, the storage management device receiver 687 and the storage management device transmitter 691 can each be implemented in any form, method, or manner useful for receiving and transmitting communications, such as, for example, requests, commands, and responses, respectively. For example, like the host receiver 681 and the host transmitter 683, the storage management device receiver 687 and the storage management device transmitter 691 can also be implemented as software modules having hardware interfaces where the software modules are capable of interpreting communications, or the necessary portions thereof. In one embodiment, the storage management device receiver 687 and the storage management device transmitter 691 are implemented as a single storage management device transceiver (not shown). The storage management device 638 communicates with the host 634 over the first communication link 640 and/or with the physical storage 636 over the second communication link 642 using the storage management device receiver 687 and the storage management device transmitter 691.
[0215] For its part, the determination module 689 may be implemented in any form, method, or manner that is capable of performing the functions described below. For example, the determination module 689 may be implemented as a software module and/or program, and/or as a hardware device such as, for example, an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA). In one embodiment, the determination module 689 is implemented as part of the I/O manager 362 (see FIG. 10) described above.
[0216] In one embodiment, the storage management device 638 further includes at least one data store 643 having an associated current store 644 and time store 646. For example, data associated with one or both of the current store 644 and the time store 646 may be stored in the memory of the storage management device 638. Data associated with one or both of the current store 644 and the time store 646 may also be stored in the physical store 636, where the data may be stored directly, virtualized, and so forth. The storage management device 638 keeps track of the data in the current store 644 and the time store 646. For example, the storage management device 638 reads and writes data from the memory and/or the physical store 636 and maintains the time store 646 using indices and pointers to the data. Again, the data store 643, its current store 644, and its time store 646 may possess the capabilities of, and may be implemented as, the data store, current store, and time store described above, respectively, with the additional functionality described herein. In yet another embodiment, as described above, the storage management device 638 includes more than one data store, such as, for example, two, three, or any number of data stores.
[0217] As previously described, when the storage management device 638 receives a write operation directed to the data store 643 from the host 634, it keeps a record of the write operation. In one embodiment, the storage management device 638 employs a copy-on-write procedure and updates a history index. For example, after receiving a write operation, but before performing it, the storage management device 638 copies any old data in the data store 643 that is about to be overwritten by the new data contained in the write operation. The storage management device 638 saves the "old data" at a new destination within the data store 643 and updates the history index. In one embodiment, for each write operation, the storage management device 638 records a timestamp indicating when the old data was overwritten, the address range within the data store 643 at which the old data was overwritten, and the new address range within the data store 643 at which the old data is now stored. The storage management device 638 thereby maintains an index that can be consulted, as described below, in response to requests for the modification history of a location within the data store 643.
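The following sketch (illustrative; the record layout is an assumption consistent with the description above, not the actual index format) shows a history index that records, for each copy-on-write, the timestamp, the overwritten address range, and the new address range where the old data is preserved, and that can be queried for the times at which a given range was modified:

```python
# Minimal sketch of the history index updated on each copy-on-write.
import time
from dataclasses import dataclass, field

@dataclass
class HistoryEntry:
    timestamp: float
    overwritten_lba: int
    length: int
    preserved_lba: int          # where the old data now lives

@dataclass
class HistoryIndex:
    entries: list = field(default_factory=list)

    def record_copy_on_write(self, overwritten_lba, length, preserved_lba):
        self.entries.append(
            HistoryEntry(time.time(), overwritten_lba, length, preserved_lba))

    def modification_times(self, lba, length):
        """Times at which any part of (lba, length) was overwritten."""
        end = lba + length
        return [e.timestamp for e in self.entries
                if e.overwritten_lba < end and lba < e.overwritten_lba + e.length]
```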
[0218] Although described with respect to copy-on-write operations, it should be understood that the principles just described are applicable to any data storage system in which a log or index of changes is recorded. For example, if the actual writes to the data store are logged, instead of or in addition to the data that was overwritten, the system can still provide the information described above about when a storage location was modified, and that information can be determined from the log or index of changes. Likewise, in some cases only some, but not all, changes to the data store may be recorded, and in such cases the data store can provide only the modification information it has available.
[0219] Referring now to FIG. 14A, in an overview of one embodiment of a method 700 for providing a modification history for a location within a data store, for example using the exemplary storage system 630 of FIG. 13, the storage management device 638 receives, at step 704, a request for the modification history of a location within the data store 643. The storage management device 638 then determines, at step 708, at least one time at which at least a portion of the data stored at the location specified in the received request was modified. Then, at step 712, the storage management device 638 transmits, in response to the received request, at least one of the times determined at step 708. Optionally, at step 710, the storage management device 638 also identifies, for each time determined at step 708, the address range within the data store 643 at which data was modified at that time. At step 714, the storage management device 638 may optionally also transmit, in response to the received request, the address ranges identified at step 710.
[0220] In more detail, at step 704, the host 634 sends a request for the modification history of a location within the data store 643, via its transmitter 683, over the first communication link 640. The request may be communicated in any form or manner useful for making the request. In one embodiment, for example, the request is communicated in the form of a data packet. The request is received at the receiver 687 of the storage management device 638. In one embodiment, the location specified in the request is an address range within the data store 643. The address range may be specified, for example, by an LBA and a length. In one embodiment, the LBA specifies the beginning of the address range and the length specifies its extent. For example, in one embodiment, the address space of the data store 643 is divided into blocks (e.g., sectors), where each block is 512 bytes long. In this case, the LBA designates a particular 512-byte block (i.e., the 512-byte block at the beginning of the address range), and the length designates how many 512-byte blocks are included in the address range. For example, where the host 634 requests the modification history of an address range in the data store 643 that starts at byte 8192 and is 4096 bytes long, the request includes an LBA of 16 and a length of 8.
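As a quick illustration of the arithmetic just described (assuming 512-byte blocks, as in the example above), the byte range starting at byte 8192 and 4096 bytes long maps to an LBA of 16 and a length of 8:

```python
# Minimal sketch of converting a byte range into the (LBA, length) pair.
BLOCK = 512

def to_lba_range(start_byte: int, num_bytes: int) -> tuple[int, int]:
    assert start_byte % BLOCK == 0 and num_bytes % BLOCK == 0
    return start_byte // BLOCK, num_bytes // BLOCK

# A range starting at byte 8192 and 4096 bytes long -> LBA 16, length 8.
assert to_lba_range(8192, 4096) == (16, 8)
```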
[0221] After the storage management device 638 receives the request for the modification history of the location within the data store 643, the determination module 689 of the storage management device 638 determines, at step 708, one or more times at which at least a portion of the data stored at the location was modified. In one embodiment, for example, the determination module 689 parses the previously described history index, which lists the modifications made to (e.g., the write operations performed on) the data store 643. The index may be stored, for example, as part of the time store 646 of the data store 643. The determination module 689 then determines which of the listed modifications were made to data in an address range that at least partially overlaps the address range of the requested location, and notes the time(s) at which each such modification was made. There may, however, be situations in which the data at the address range of the requested location has never been modified. In that case, the storage management device 638 sends a negative response (i.e., a response indicating that the data at the address range of the requested location has not been modified) at step 712 (described below).
[0222] In general, one or more subsets, intersecting sets, supersets, and/or the full set of the data stored at the location within the data store 643 may have been modified at one or more times before the request for the modification history is received. For example, the request received by the storage management device 638 may be a request for the modification history of a location having the address range (LBA 0, length 64). Before the request is received, data stored at the address range (LBA 0, length 8) (i.e., a subset of the location), data stored at the address range (LBA 62, length 16) (i.e., a set intersecting the location), data stored at the address range (LBA 0, length 128) (i.e., a superset of the location), and/or data stored at the address range (LBA 0, length 64) (i.e., the full set of the location) may each have been modified at one or more times. In one embodiment, after determining at step 708 the times at which these sets (and/or any other sets that at least partially overlap the address range of the requested location) were modified, the determination module 689 of the storage management device 638 also identifies, at step 710, the address ranges of these previously modified sets.
[0223] At step 712, for example in the embodiment of FIG. 13, the storage management device 638 transmits, via its transmitter 691 and over the first communication link 640, the one or more determined times at which at least a portion of the data stored at the location was modified. Optionally, at step 714, the storage management device 638 may additionally transmit, via its transmitter 691 and over the first communication link 640, the one or more identified address ranges of the sets modified at the one or more determined times. The one or more determined times and/or the one or more identified address ranges may be communicated in any form or manner useful for providing such information; for example, the information is communicated in the form of one or more data packets. In one embodiment, the host 634 receives these determined times and/or identified address ranges at its receiver 681. Optionally, the transmitter 691 may also transmit the data that was modified.
[0224] In one embodiment, the storage management device 638 sends the modification information in a single packet. For example, a single transmitted packet identifies each set of address ranges that was modified and lists, for each set, the time at which it was modified. In another embodiment, the storage management device 638 sends the determined times and the identified address ranges separately, e.g., in separate packets, and provides the host 634 with additional information associating the determined times with the identified address ranges. In yet another embodiment, the storage management device 638 also transmits to the host 634 the data that was stored at the identified address ranges before it was modified at the determined times. In doing so, the storage management device may identify which determined time and/or identified address range corresponds to a given piece of data that was later modified.
[0225] FIG. 14B depicts one embodiment of a method 700', which is a variation of the method 700 of FIG. 14A for providing a modification history for a location within a data store, again using the exemplary storage system 630 of FIG. 13. Generally, the steps of method 700' are performed in the same or a similar manner as the corresponding steps of method 700 described above, except as set forth below.
[0226] In one embodiment, as in method 700, the storage management device 638 receives, at step 704', a request for the modification history of a location within the data store 643. In this embodiment, however, the request is for a list of times from each of which all of the data stored at the location specified in the request, rather than only some portion of that data, was modified. Accordingly, the storage management device 638 determines, at step 708', at least one time from which all of the data stored at the location was modified, and transmits, at step 712', at least one determined time in response to the received request. Optionally, at step 714', the storage management device 638 also sends, in response to the received request, the address range within the data store 643 at which all of the data was modified from the at least one determined time. If sent, that address range is the same as the location specified in the request for the modification history.
[0227] In the embodiment of method 700' just described, all of the data stored at the location specified in the request may have been modified, but it need not all have been modified at the same time in order to satisfy the request. Stated another way, at least a portion of the data stored at the specified location may have been modified at a time different from (i.e., later than) a time determined by the storage management device 638 at step 708' of the method 700'. Suppose, for example, that all of the data stored at the specified location was modified at a first time T1; that a first portion of the data, but not all of it, was modified at a second time T2; that a second portion of the data, but not all of it, was modified at a third time T3 (where the first and second portions together make up all of the data stored at the specified location); and that all of the data stored at the location was again modified at a fourth time T4 (where T1, T2, T3, and T4 occur in chronological order). In that case, the storage management device 638 would determine at step 708' that the times from which all of the data stored at the location was modified are T1, T2, and T4.
[0228] The ability to request a list of times from each of which all of the data stored at a particular location, and not just some portion of it, was modified, as just described for method 700', is particularly useful when the user knows that all of the data stored at that location is corrupted and needs to be replaced (for example, when the user knows that an entire JPEG file is corrupted). Knowing the times determined by the storage management device 638 at step 708', the user can then request that the storage management device 638 produce images of the location at times just prior to the determined times. The user can thereby identify the most recent time at which all of the data was intact (i.e., not corrupted) and can choose to restore the data at that location to the data that appeared there at that time.
[0229] FIG. 15 depicts an illustrative embodiment of a request 800 for the modification history of a location within a data store (e.g., the data store 643), which may be sent by a host (e.g., the host 634) to a storage management device (e.g., the storage management device 638) in accordance with the invention. In one embodiment, the request 800 takes the form of a data packet 804, as illustrated. The data packet 804 may include at least a portion of an I/O command, which may be in a standard I/O command format, such as a SCSI command format.
[0230] In one embodiment, the data packet 804 includes 16 bytes of request data. In byte 0, an operation code identifies the type of request to be executed (here, providing the modification history of a location within the data store 643). For example, the operation code may correspond to a request for at least one time at which at least a portion of the data stored at a location within the data store 643 was modified, or to a request for a list of times from each of which all of the data stored at a location within the data store 643, and not only some portion of it, was modified. An exemplary operation code is C1h, a code within the range assigned to vendor-specific requests in the SCSI protocol.
[0231] The three most significant bits of byte 1 (i.e., bits 5-7) are reserved for future use. Optionally, the remaining five least significant bits of byte 1 (i.e., bits 0-4) specify a service action field (i.e., a field containing a code value that identifies a function to be performed under the more general request specified by the operation code of byte 0). Alternatively, in another embodiment, bits 0-4 of byte 1 are also reserved for future use.
[0232] Bytes 2-9 carry a logical block address (LBA), which identifies the first storage unit (i.e., the first block) of the location for which the modification history is requested. Bytes 10-13 carry a length, which indicates the number of storage units, including the first storage unit identified by the LBA, that sequentially make up the location within the data store 643. Together, the LBA and the length specify an address range.
[0233] Byte 14 is reserved for future use. For example, byte 14 may be used as a relative check field to indicate whether the one or more times to be returned by the storage management device 638 are relative or absolute. If, for example, the relative check field is 0, the one or more times returned by the storage management device 638 are relative to the current time; in other words, a 0 in the relative check field indicates that the times to be returned are elapsed times measured from the current time. If, on the other hand, the relative check field is non-zero, the one or more times returned by the storage management device 638 are specified absolutely, i.e., without reference to another time.
[0234] Byte 15 is the control field of the data packet 804. For example, in one particular embodiment in which the data packet 804 is implemented in a typical SCSI command format, bit 0 of byte 15 may be used (e.g., may be set) to request task continuation across two or more commands (i.e., to link successive commands), bit 1 of byte 15 may provide a way to request an interrupt between linked commands, bit 2 of byte 15 may be used to specify whether auto contingent allegiance should be established under certain conditions, bits 3-5 of byte 15 may be reserved, and bits 6-7 may be vendor-specific bits.
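The following sketch packs a request in the 16-byte layout described above (operation code C1h in byte 0, the service action bits in byte 1, an 8-byte LBA in bytes 2-9, a 4-byte length in bytes 10-13, the relative check field in byte 14, and the control field in byte 15); the big-endian byte order is an assumption consistent with SCSI conventions rather than something the text specifies:

```python
# Minimal sketch of building the 16-byte request 800 / data packet 804.
import struct

def pack_request(lba: int, length: int, service_action: int = 0,
                 relative: int = 0, control: int = 0) -> bytes:
    return struct.pack(
        ">BBQIBB",              # opcode, byte 1, 8-byte LBA, 4-byte length,
        0xC1,                   #   relative check byte, control byte
        service_action & 0x1F,  # bits 0-4 of byte 1
        lba,                    # bytes 2-9
        length,                 # bytes 10-13
        relative,               # byte 14
        control,                # byte 15
    )

pkt = pack_request(lba=16, length=8)
assert len(pkt) == 16 and pkt[0] == 0xC1
```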
[0235] FIG. 16 depicts an illustrative embodiment of a response 900, in particular a response 900 to the request 800 for the modification history of a location within the data store 643, which may be transmitted by the storage management device 638 to the host 634. In one embodiment, as illustrated, the response 900 takes the form of a data packet 904. The data packet 904 may include at least a portion of an I/O response, which may be in a standard I/O response format, such as a SCSI response format.
[0236] In one embodiment, the data packet 904 includes at least 30 bytes of response data, as illustrated, and may include additional bytes of response data, as described below. FIG. 16 identifies each bit that may be included in an exemplary byte of the response data. Bytes 0-1 are reserved for future use.
[0237] Bytes 10-13 carry an LBA that identifies the first storage unit (i.e., the first block) of a set covering at least a portion of the location specified in the request 800. In other words, the LBA in bytes 10-13 identifies the first storage unit of, for example, a subset of the location specified in the request 800, a set intersecting that location, a superset of that location, or the full set of that location. Bytes 14-21 carry a length that indicates the number of storage units, including the first storage unit identified by the LBA in bytes 10-13, that sequentially make up the set. Together, the LBA and the length specify the address range of the set. As this information indicates, the data stored at the identified address range of the set was modified before the time at which the request 800 was received by the storage management device 638. Finally, bytes 22-29 carry a determined change time, which indicates a time at which the data stored at the address range of the set identified in bytes 10-21 was modified.
[0238] Together, bytes 10-29 (i.e., the LBA, the length, and the determined change time) make up a tuple. The data packet 904 may include any number of tuples (e.g., one, two, or more). Bytes 30-n of the data packet 904 carry repetitions of this byte group, one per additional tuple. In one embodiment, the number of tuples contained in the data packet 904 is, or corresponds to, the number of times at which at least a portion of the data stored at the location specified in the request 800 was modified, based on the information available to the storage device. Bytes 2-9 carry an indicator of the number of tuples contained in the data packet 904.
[0239] In one embodiment, the determined change times represented in bytes 22-29 are relative times; alternatively, in another embodiment, they are absolute times. In one embodiment, each tuple may, for example, include an additional byte that serves as a relative check field indicating whether the determined change time contained in that tuple is relative or absolute. Alternatively, in another embodiment, all of the determined change times contained in the n tuples of the data packet 904 are either all relative or all absolute, with no variation from one tuple to the next. In such an embodiment, for example, one of the reserved bytes 0-1 may be used as a relative check field indicating whether all of the determined change times contained in the n tuples of the data packet 904 are relative or absolute. As above, if, for example, the relative check field is 0, the determined change times are relative to the current time; if, on the other hand, the relative check field is non-zero, the determined change times returned by the storage management device 638 are specified absolutely, i.e., without reference to another time.
[0240] In one embodiment, if the determined change time contained in a tuple is relative, the actual modification time of the data stored at the address range of the set specified by that tuple is calculated by subtracting the determined change time from the generation time of the response 900; in such an embodiment, the response 900 may be time-stamped. If, on the other hand, the determined change time contained in a tuple is absolute, the actual modification time of the data stored at the address range of the set specified by that tuple is simply the determined change time.
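The following sketch parses a response laid out as described above: bytes 2-9 give the number of tuples, and each 20-byte tuple carries a 4-byte LBA, an 8-byte length, and an 8-byte determined change time; the big-endian byte order and exact field widths are assumptions consistent with the text rather than a specified format:

```python
# Minimal sketch of unpacking the response 900 / data packet 904.
import struct

def parse_response(pkt: bytes):
    (count,) = struct.unpack_from(">Q", pkt, 2)      # bytes 2-9: number of tuples
    tuples = []
    for i in range(count):
        offset = 10 + 20 * i                         # first tuple starts at byte 10
        lba, length, change_time = struct.unpack_from(">IQQ", pkt, offset)
        tuples.append((lba, length, change_time))
    return tuples
```

If the relative check field indicates relative times, a caller would subtract each returned change time from the generation time of the response to obtain the actual modification time, as described above.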
[0241] FIGS. 17-20 present an example of how a modification history for a location within the data store 643 may be obtained. FIG. 17 depicts a timeline 1000 for this example. The timeline 1000 shows a different write operation directed to the data store 643 at each of times T1, T2, T3, T4, and T5. Each write operation is denoted "Wr(LBA, length, data)", where (LBA, length) is the address range to which the data is written. Thus, at time T1, data is written to the address range (LBA 0, length 8); at time T2, to the address range (LBA 62, length 16); at time T3, to the address range (LBA 100, length 35); at time T4, to the address range (LBA 0, length 64); and at time T5, to the address range (LBA 0, length 128).
[0242] FIG. 18 depicts an exemplary embodiment of a history index 1100 used in this example. As described above, after receiving a particular write operation, but before performing it, the storage management device 638 copies the data stored at the address range specified by the write operation and saves it at a new destination. The storage management device 638 then performs the write operation and updates the history index 1100 as described above. For example, after performing the write operation at time T1, the storage management device 638 records, as shown in the second row of the history index 1100, the time T1 at which the write operation was performed, the address range (LBA 0, length 8) to which the data was written, and the new address range (LBA 1000, length 8) at which the data that was stored at the address range (LBA 0, length 8) just prior to time T1 is now stored. As shown in FIG. 18, the history index 1100 is similarly updated after each of the write operations performed at times T2, T3, T4, and T5.
[0243] Continuing this example, at some time after time T5, the host 634 requests from the storage management device 638 a modification history for a location within the data store 643. For example, referring now to FIG. 19, the host 634 sends to the storage management device 638 a data packet 1204, which takes the form of the data packet 804 described with reference to FIG. 15. In this example, the host 634 requests at least one time at which at least a portion of the data stored at the address range (LBA 0, length 64) was modified. Accordingly, the operation code in byte 0 of the data packet 1204 corresponds to that request, bytes 2-9 of the data packet 1204 are set to indicate an LBA of 0, and bytes 10-13 of the data packet 1204 are set to indicate a length of 64.
[0244] After processing this request for the modification history of the address range (LBA 0, length 64) within the data store 643 (e.g., after parsing the history index 1100, which lists the write operations performed on the data store 643), the storage management device 638 responds to the host 634. For example, referring now to FIG. 20, the storage management device 638 sends to the host 634 a data packet 1304, which takes the form of the data packet 904 described with reference to FIG. 16. In this example, the data packet 1304 includes four tuples, as specified by the indicator in bytes 2-9 of the data packet 1304. Referring now to FIGS. 18 and 20, bytes 10-29 (i.e., the first tuple of the data packet 1304) indicate that the address range (LBA 0, length 8) (i.e., a subset of the requested address range (LBA 0, length 64)) was modified at time T1; bytes 30-49 (i.e., the second tuple) indicate that the address range (LBA 62, length 16) (i.e., a set intersecting the requested address range) was modified at time T2; bytes 50-69 (i.e., the third tuple) indicate that the address range (LBA 0, length 64) (i.e., the full set of the requested address range) was modified at time T4; and bytes 70-89 (i.e., the fourth tuple) indicate that the address range (LBA 0, length 128) (i.e., a superset of the requested address range) was modified at time T5. By receiving the data packet 1304, the host 634 is thereby provided with the times at which at least a portion of the data stored at the address range (LBA 0, length 64) within the data store 643 was modified, and with the corresponding address ranges that were modified at those times.
[0245] Note also that, because the write operation occurring at time T3 of the timeline 1000 is directed to an address range (LBA 100, length 35) that does not overlap the requested address range (LBA 0, length 64), the data packet 1304 does not include any information about that write operation.
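The example of FIGS. 17-20 can be checked with a few lines of code (illustrative only; the data below is taken from the timeline above): only the writes whose ranges overlap (LBA 0, length 64) are reported, so the T3 write to (LBA 100, length 35) is absent from the response:

```python
# Writes from the timeline of FIG. 17: (time, LBA, length).
writes = [("T1", 0, 8), ("T2", 62, 16), ("T3", 100, 35),
          ("T4", 0, 64), ("T5", 0, 128)]

def overlapping(writes, lba, length):
    """Return the writes whose address range overlaps (lba, length)."""
    end = lba + length
    return [(t, wl, wlen) for t, wl, wlen in writes if wl < end and lba < wl + wlen]

print(overlapping(writes, lba=0, length=64))
# [('T1', 0, 8), ('T2', 62, 16), ('T4', 0, 64), ('T5', 0, 128)]
```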
[0246] Storage buffer selection
[0247] In general, in another aspect, the invention relates to methods and apparatus for optimally selecting one or more storage buffers for the storage of data. In brief overview, in one embodiment of this aspect of the invention, a first computing device (e.g., a storage management device as described above) receives data that requires temporary or permanent storage. For example, the first computing device receives, from a second computing device (e.g., a host as described above), a write operation that includes a data payload requiring temporary or permanent storage. The first computing device initially stores the received data in a first storage buffer and then optimally identifies one or more additional storage buffers within the first computing device in which redundant copies of the received data are to be stored. A storage buffer may be located, for example, in one of several processor modules of the first computing device.
[0248] In one embodiment of this aspect of the invention, the first computing device evaluates one or more cost equations to optimally identify the one or more additional storage buffers in which to redundantly store copies of the received data. Further, in one embodiment, the first computing device stores a first copy of the received data in a first optimally identified additional storage buffer, and may also store second and further copies of the received data in second and further optimally identified additional storage buffers. The first computing device thereby provides redundant storage capacity.
[0249] FIG. 21 illustrates one embodiment of a storage management device 1438 in accordance with this aspect of the invention, which optimally identifies one or more storage buffers. In general, the storage management device 1438 may possess the capabilities of, and may be implemented as, the storage management devices described above, with the additional functionality described herein. It should be appreciated that other implementations are also possible.
[0250] In one embodiment, the storage management device 1438 includes a plurality of processor modules, for example a first processor module 1478 and at least one second processor module, here three second processor modules 1478', 1478'', 1478''' (collectively, 1478'). The first processor module 1478 and the three second processor modules 1478' depicted in the storage management device 1438 of FIG. 21 are, however, merely illustrative. More generally, the storage management device 1438 may include any number of processor modules 1478, 1478', and that number may be increased or decreased based on considerations such as scalability, performance, and cost. Again, generally speaking, the processor modules 1478, 1478' may possess the capabilities of, and may be implemented as, the processor modules described above (e.g., the processor modules 378 described with respect to FIG. 10), with the additional functionality described herein.
[0251] In one embodiment, the storage management device 1438 is a device for storing data (e.g., for temporarily storing data). Thus, in one such embodiment, the storage management device 1438 includes a plurality of storage buffers 1463', 1463'', 1463''' (generally, 1463) for storing data. In one embodiment, for example as illustrated in FIG. 21, each processor module 1478, 1478' of the storage management device 1438 includes at least one storage buffer 1463. In another embodiment, some, but not all, of the processor modules 1478, 1478' of the storage management device 1438 include a storage buffer 1463. In yet another embodiment, the storage management device 1438 includes one or more storage buffers 1463 that are located on the storage management device 1438 independently and are not part of any processor module 1478, 1478'. In still another embodiment, a single processor module 1478, 1478' may include two or more storage buffers 1463. In general, the storage buffers 1463 may possess the capabilities of, and may be implemented as, the storage buffers described above (e.g., the storage buffers 363 described with respect to FIG. 10), with the additional functionality described herein. For example, a storage buffer 1463 may be contained within the memory 296 (see FIG. 9) of a processor module 1478, 1478'. In one embodiment, the entire memory 296 forms the storage buffer 1463. In another embodiment, a smaller but contiguous block within the memory 296 forms the storage buffer 1463. In yet another embodiment, several separate blocks within the memory 296 are linked, for example by pointers, to form the storage buffer 1463. The address space within the memory 296 that makes up the storage buffer 1463 may be static, or, alternatively, it may be allocated dynamically at runtime.
[0252] In one embodiment, at least one processor module (e.g., the first processor module 1478 and/or the at least one second processor module 1478') of the storage management device 1438 includes at least one receiver 1493, one transmitter 1495, one evaluator 1497, and one data operator 1499. The receiver 1493 and the transmitter 1495 may each be implemented in any form, method, or manner useful for receiving and transmitting communications, such as, for example, requests, commands, and responses. In one embodiment, the receiver 1493 and the transmitter 1495 are implemented as software modules with hardware interfaces, where the software modules are capable of interpreting the communications, or the necessary portions thereof. In another embodiment, the receiver 1493 and the transmitter 1495 are implemented as a single transceiver (not shown). The processor modules 1478, 1478' use the receiver 1493 and the transmitter 1495 to communicate with one or more of the other processor modules 1478, 1478' and/or with one or more computing devices (not shown) other than the storage management device 1438. The receiver 1493 and the transmitter 1495 may be implemented as multiple devices for different protocols, such as, for example, the target mode driver 382 of FIG. 10, a transceiver associated with the internal network 380 of FIG. 10, or some combination thereof.
[0253] For their part, the evaluator 1497 and/or the data operator 1499 may be implemented in any form, method, or manner that is capable of performing the functions described below. For example, the evaluator 1497 and/or the data operator 1499 may be implemented as software modules and/or programs running on a microprocessor, and/or as hardware devices such as, for example, an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA). In one embodiment, the evaluator 1497 and the data operator 1499 are implemented as part of the host interface 361 described above, such as part of the target mode driver 382 (see FIG. 10).
[0254] Referring now to FIG. 22, in an overview of one embodiment of a method 1500 for storing data, for example using the exemplary storage management device 1438 of FIG. 21, data for storage is received, at step 1504, by one processor module 1478, 1478' of the plurality of processor modules 1478, 1478' of the storage management device 1438, say the first processor module 1478. The first processor module 1478 then stores a first instance of the received data (e.g., the received data itself) in a first storage buffer 1463 of the first processor module 1478 at step 1508, and evaluates a first cost equation at step 1512 to identify, from among the plurality of storage buffers 1463, a second storage buffer 1463 in which a second instance of the received data (e.g., a copy of the received data) is optimally stored. Optionally, at step 1516, the first processor module 1478 evaluates a second cost equation to identify, from among the plurality of storage buffers 1463, a third storage buffer 1463 in which a third instance (e.g., another copy) of the received data is optimally stored. Again optionally, at step 1520, the second instance of the received data is stored in the second storage buffer 1463 and the third instance of the received data is stored in the third storage buffer 1463. It should further be appreciated that, at steps 1516 and 1520, any number of additional cost equations (e.g., second, third, fourth, and fifth cost equations, etc.) may be evaluated to identify any number of storage buffers 1463 (e.g., third, fourth, fifth, and sixth storage buffers 1463, etc.) in which any number of instances of the received data (e.g., third, fourth, fifth, and sixth instances of the received data, etc.) are optimally stored. Advantageously, by optimally storing the second and further instances of the received data in second and further storage buffers 1463 chosen from among the plurality of storage buffers 1463, the received data can be redundantly stored quickly and efficiently, improving fault tolerance, and all instances of the received data can be accessed quickly and efficiently without overloading the storage management device 1438.
[0255] In more detail, in one embodiment, the receiver 1493 of the first processor module 1478 receives, at step 1504, a write operation that includes a data payload. The receiver 1493 of the first processor module 1478 may receive the write operation, for example, over a network (not shown) from a computing device (not shown) other than the storage management device 1438. At step 1508, the received write operation is initially stored in the first (and, in some embodiments, only) storage buffer 1463 of the first processor module 1478. In one embodiment, after the first processor module 1478 receives the write operation and has stored it in its first storage buffer 1463, the data operator 1499 of the first processor module 1478 separates the data payload from the rest of the write operation, so that a first instance of the data payload is created and stored independently in the first storage buffer 1463 of the first processor module 1478. In one embodiment, the write operation includes at least some control information in addition to the data payload; in such an embodiment, the data operator 1499 of the first processor module 1478 separates the data payload from this control information. After separating the data payload from the rest of the write operation, the data operator 1499 of the first processor module 1478 then replicates the first instance of the data payload to create a second instance of the data payload, and optionally further instances of the data payload.
[0256] At step 1512, the evaluator 1497 of the first processor module 1478 evaluates the first cost equation to identify, from among the plurality of storage buffers 1463, the second storage buffer 1463, which is different from the first storage buffer 1463 in the first processor module 1478 in which the first instance of the data payload was initially stored and which is the storage buffer in which the second instance of the data payload is optimally stored. In one embodiment, the evaluator 1497 identifies a second storage buffer 1463 located on a second processor module 1478'. In one such embodiment, because the second processor module 1478' is a different processor module than the first processor module 1478, storing the second instance of the data payload in the second storage buffer 1463 may prevent the data payload from being lost in the event the first processor module 1478 fails.
[0257] In evaluating the first cost equation at step 1512, the evaluator 1497 of the first processor module 1478 may consider various factors. For example, in one embodiment, for each of the plurality of storage buffers 1463 in the storage management device 1438 that is different from the first storage buffer 1463 in the first processor module 1478 in which the first instance of the data payload was originally stored, the evaluator 1497 of the first processor module 1478 assigns a value to the physical distance from the first processor module 1478 to that storage buffer 1463 in the storage management device 1438. In one such embodiment, the storage buffer 1463 that is physically closest to the first processor module 1478 is identified by the evaluator 1497 as the second storage buffer 1463 in which the second instance of the data payload is optimally stored. For example, in another embodiment, for each of the plurality of storage buffers 1463 in the storage management device 1438 that is different from the first storage buffer 1463 in the first processor module 1478 in which the first instance of the data payload was originally stored, the evaluator 1497 of the first processor module 1478 assigns a value to the available capacity of that storage buffer 1463. In one such embodiment, the storage buffer 1463 having the largest available capacity is identified by the evaluator 1497 as the second storage buffer 1463 in which the second instance of the data payload is optimally stored.
[0258] In yet another embodiment, in evaluating the first cost equation at step 1512, and for each of one or more second processor modules 1478 'that include a storage buffer 1463 (which must be different from the first storage buffer 1463 in the first processor module 1478), the evaluator 1497 of the first processor module 1478 assigns a value to the load present in that second processor module 1478'. In one embodiment, the load in question is the input/output load between the second processor module 1478' in question and a device other than the storage management device 1438 (e.g., a host as described above). Alternatively, in another embodiment, the load in question is, for example, the interconnection load of requests, commands, and responses between the second processor module 1478 'in question and at least one other processor module 1478, 1478'. As such, the storage buffer 1463 of the second processor module 1478' having the lowest load value is identified by the evaluator 1497 as the second storage buffer 1463 in which the second instance of the data payload is optimally stored.
[0259] In some cases, the storage management device 1438 is implemented such that one or more of the plurality of storage buffers 1463 are accessible only to some subset of the plurality of processor modules 1478, 1478'. For example, in such a storage management device 1438 (not shown) that includes processor modules A, B, C, and D having storage buffers W, X, Y, and Z, respectively, it may be the case that only processor modules A, B, and C have access to storage buffer W, only processor modules B and C have access to storage buffer X, only processor modules A and C have access to storage buffer Y, and only processor modules A and D have access to storage buffer Z. Thus, in another embodiment, the evaluator 1497 evaluates the first cost equation at step 1512 to identify a second storage buffer 1463 for storing the second instance of the data payload such that, when the second storage buffer 1463 is considered together with the first storage buffer 1463 (in which the first instance of the data payload was initially stored), the number of processor modules 1478 that can access the first instance and/or the second instance of the data payload is maximized. Maximizing the number of processor modules 1478 that can access the first and/or second instances of the data payload may maximize processing flexibility and device efficiency in the event that the storage buffer 1463 storing one instance of the data payload, and/or the processor module 1478 on which that storage buffer 1463 is located, fails. In one implementation of this embodiment, for each of the plurality of storage buffers 1463 in the storage management device 1438 that is different from the first storage buffer 1463 in the first processor module 1478 in which the first instance of the data payload was initially stored, the evaluator 1497 of the first processor module 1478 assigns a value to the number of processor modules 1478, 1478' in the storage management device 1438 that would be able to access at least one of the first and second instances of the data payload if the second instance of the data payload were stored in that storage buffer 1463. In one such embodiment, the storage buffer 1463 that, if the second instance of the data payload were stored there, would maximize the number of processor modules 1478 that can access the first and/or second instances of the data payload is identified by the evaluator 1497 as the second storage buffer 1463 in which the second instance of the data payload is optimally stored.
[0260] In yet another embodiment, to determine the second storage buffer 1463 in which the second instance of the data payload is optimally stored, the evaluator 1497 of the first processor module 1478 considers all of the factors described above, or some subset thereof, and assigns a weight to each factor it considers. In one such embodiment, the second storage buffer 1463 in which the second instance of the data payload is stored is the storage buffer 1463 that exhibits the best weighted combination of the factors under consideration. In practice, the weight of each factor may be varied to suit a particular application.
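By way of illustration only, the following Python sketch shows one way such a weighted cost equation might be expressed. The factor names, the weights, the CandidateBuffer structure, and the cost_score and pick_best_buffer functions are hypothetical and are not part of the embodiments described above; this is a minimal sketch assuming that each factor has already been reduced to a numeric value for each candidate buffer.

    from dataclasses import dataclass

    @dataclass
    class CandidateBuffer:
        # Hypothetical summary of one storage buffer other than the first buffer.
        buffer_id: int
        distance: float          # physical distance from the first processor module
        free_capacity: float     # available capacity of the buffer
        module_load: float       # I/O or interconnect load on the buffer's processor module
        reachable_modules: int   # processor modules able to access the first and/or second instance

    # Illustrative weights; in practice each weight may be tuned per application, and the
    # weights for particular buffers may be pre-adjusted to make those buffers less desirable.
    WEIGHTS = {"distance": -1.0, "free_capacity": 2.0, "module_load": -1.5, "reachable_modules": 3.0}

    def cost_score(buf: CandidateBuffer) -> float:
        # A higher score means the buffer is a better home for the second instance.
        return (WEIGHTS["distance"] * buf.distance
                + WEIGHTS["free_capacity"] * buf.free_capacity
                + WEIGHTS["module_load"] * buf.module_load
                + WEIGHTS["reachable_modules"] * buf.reachable_modules)

    def pick_best_buffer(candidates: list[CandidateBuffer]) -> CandidateBuffer:
        # Evaluate the cost equation for every candidate buffer and identify the buffer
        # exhibiting the best weighted combination of the factors under consideration.
        return max(candidates, key=cost_score)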
[0261] Additionally, in another embodiment, for one or more of the plurality of storage buffers 1463, the weights of one or more of the factors considered for those storage buffers 1463 may be pre-adjusted so as to make it less desirable to store a copy of the data payload therein. This may be done, for example, to artificially limit the amount of data stored in those storage buffers 1463, to control/limit requests to those particular storage buffers 1463, and/or to set an upper limit on their performance and therefore on the performance of the storage management device 1438.
[0262] In one embodiment, the storage management device 1438 stores more than one copy of the received data payload. Thus, in such an embodiment, the evaluator 1497 of the first processor module 1478 evaluates the second cost equation, and optionally the third, fourth, and fifth cost equations, etc., at step 1516. Evaluation of the second cost equation identifies, from among the plurality of storage buffers 1463, a third storage buffer 1463 that is different from the first and second storage buffers 1463 (e.g., the first, second, and third storage buffers may each be located on a different processor module 1478, 1478') and in which a third instance of the data payload is optimally stored. In one embodiment, the second cost equation evaluated by the evaluator 1497 of the first processor module 1478 is the same as the first cost equation described above, except that neither the first storage buffer 1463 nor the second storage buffer 1463 (the second storage buffer 1463 having been identified by evaluation of the first cost equation) is considered by the evaluator 1497. Alternatively, in another embodiment, the second cost equation is different from the first cost equation. For example, the factors considered in each of the first and second cost equations may be the same, but the weight assigned to each factor considered may be different. Alternatively, as another example, the factors considered in one cost equation may be some subset of the factors considered in another cost equation.
[0263] In yet another embodiment, only the first cost equation is evaluated, and the third instance of the data payload is stored in any storage buffer 1463 other than the first storage buffer 1463 in the first processor module 1478 in which the first instance of the data payload was initially stored and other than the second storage buffer 1463 identified in evaluating the first cost equation.
[0264] In one embodiment, at step 1520, the second, third, and/or further instances of the data payload are stored in the second, third, and/or further storage buffers 1463 identified at steps 1512 and/or 1516, respectively. To enable this to occur, the transmitter 1495 of the first processor module 1478 transmits the second, third, and/or additional instances of the data payload to the second, third, and/or additional storage buffers 1463, respectively. Thus, the data payload of the received write operation is redundantly stored in one or more storage buffers 1463 of the storage management device 1438.
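The overall flow of steps 1508 through 1520 might then be summarized as in the following sketch, which reuses the hypothetical CandidateBuffer, cost_score, and pick_best_buffer names from the sketch above and is again illustrative rather than a literal description of the embodiments.

    def store_redundantly(write_operation: dict,
                          local_buffer: list,
                          candidates: list,
                          remote_buffers: dict,
                          copies: int = 2) -> list:
        # Step 1508: separate the data payload from the control information and store
        # a first instance of the payload in the first processor module's own buffer.
        payload = write_operation["payload"]
        local_buffer.append(payload)

        # Steps 1512/1516: evaluate a cost equation once per additional copy, excluding
        # buffers already chosen, to decide where each further instance is best stored.
        chosen, remaining = [], list(candidates)
        for _ in range(copies):
            best = pick_best_buffer(remaining)
            remaining.remove(best)
            chosen.append(best.buffer_id)
            # Step 1520: transmit an instance of the payload to the chosen buffer
            # (modeled here as appending to a per-buffer list).
            remote_buffers[best.buffer_id].append(payload)
        return chosen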
[0265] Clock synchronization
[0266] In general, in another aspect, the invention relates to a method and apparatus for synchronizing internal clocks of multiple processor modules. In brief overview, in one embodiment of this aspect of the invention, a multiprocessor system (e.g., a storage management device as described above) includes a plurality of processor modules, wherein each of the plurality of processor modules includes its own internal clock. Synchronization between the internal clocks of the multiple processor modules may be achieved by assigning one of the processor modules as a master processor module having a master internal clock for the multiprocessor system, and by having each of the other processor modules (assigned as slave processor modules) in the multiprocessor system periodically compare its internal clock to the master internal clock and correct its internal clock if necessary. In one embodiment, the slave processor modules correct their internal clocks without ever causing them to go backwards in time.
[0267] FIG. 23 illustrates one embodiment of a multiprocessor system 1638 that maintains a substantially consistent running clock in accordance with this aspect of the invention (e.g., a storage management device that, in general, possesses the capabilities of, and may be implemented as, the storage management device described above, with the additional functionality described herein). The multiprocessor system 1638 includes multiple processor modules 1678, 1678', 1678", 1678*, wherein each of the processor modules includes its respective internal clock 1675, 1675', 1675", 1675*. Again, the four processor modules 1678, 1678', 1678", 1678* depicted in the multiprocessor system 1638 of FIG. 23 are merely illustrative, and more generally, the multiprocessor system 1638 may include any number or type of processor modules.
[0268] The internal clock(s) of one or more of the processor modules of the multiprocessor system 1638 may "drift" from the internal clocks of the other processor modules, for example because of temperature differences that arise when one processor module warms relative to the other processor modules. For example, it may be the case that the internal clock 1675" of the processor module 1678" starts running faster than, and drifts away from, the other internal clocks 1675, 1675', 1675* of the multiprocessor system 1638. Thus, in order to synchronize the internal clocks 1675, 1675', 1675", 1675* of the multiprocessor system 1638, and thereby maintain a reliable running clock for the multiprocessor system 1638, the internal clock 1675" is corrected, e.g., as described herein in accordance with this aspect of the invention.
[0269] In one embodiment of this aspect of the invention, a first processor module, e.g., as illustrated, the processor module 1678, is assigned as the master processor module of the multiprocessor system 1638. The master processor module 1678 includes the master internal clock 1675 for the multiprocessor system 1638. In one such embodiment, each of the other processor modules (i.e., at least one other processor module) 1678', 1678", 1678* is assigned as a slave processor module of the multiprocessor system 1638. Each slave processor module 1678', 1678", 1678* (generally 1678') includes its respective slave processor module internal clock 1675', 1675", 1675* (generally 1675'). In one embodiment, the slave processor modules 1678' periodically compare their internal clocks to the master internal clock 1675 and correct their internal clocks, if necessary, in accordance with a method that will be described below.
[0270] Referring now to FIG. 24, in one embodiment of a method 1700 for maintaining a substantially consistent running clock for the multiprocessor system 1638, a slave processor module 1678' synchronizes the slave processor module internal clock 1675' with the master internal clock 1675 by iteratively performing steps 1704, 1708, 1712, 1716, and, if necessary, 1720 of the method 1700. Optionally, step 1710 may also be performed after steps 1704 and 1708, but before steps 1712, 1716, and 1720. In one embodiment, the iterations of the method 1700 through steps 1704, 1708, 1710 (optionally), 1712, 1716, and, if necessary, 1720 are performed periodically by the slave processor module 1678', for example every fraction of a second (e.g., every half second) or other amount of time. Further, in some embodiments, the slave processor module 1678' initializes the slave processor module internal clock 1675' at step 1702 before iteratively performing steps 1704, 1708, 1710 (optionally), 1712, 1716, and, if necessary, 1720.
[0271] In one embodiment, to initialize the slave processor module internal clock 1675' at step 1702, the slave processor module 1678' requests the current time of the master internal clock 1675 and, after a certain period of time, receives the current time of the master internal clock 1675. In one embodiment, if the period between the slave processor module's request for, and receipt of, the current time of the master internal clock 1675 is less than a first predetermined amount of time, the slave processor module 1678' initializes the slave processor module internal clock 1675' to the sum of the received current time of the master internal clock 1675 and half of the period between the slave processor module's request for, and receipt of, the current time of the master internal clock 1675. Otherwise, if the period between the slave processor module's request for, and receipt of, the current time of the master internal clock 1675 is greater than the first predetermined amount of time, the slave processor module 1678' discards the received current time of the master internal clock 1675 and requests a new current time of the master internal clock 1675. In some embodiments, the slave processor module 1678' continues to discard the received current time of the master internal clock 1675 and request a new current time of the master internal clock 1675 until it receives a current time of the master internal clock 1675 within the first predetermined amount of time. The slave processor module 1678' then initializes the slave processor module internal clock 1675' as described above.
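As a non-authoritative sketch of the initialization of step 1702, assuming hypothetical hooks request_master_time(), local_clock(), and set_local_clock() for communicating with the master processor module and manipulating the slave clock, the logic might read:

    FIRST_PREDETERMINED = 30e-6   # seconds; configurable, e.g. roughly 26 to 35 microseconds

    def initialize_slave_clock(request_master_time, local_clock, set_local_clock):
        while True:
            t_request = local_clock()           # time at which the request is sent
            master_now = request_master_time()  # current time of the master internal clock
            t_receive = local_clock()           # time at which the response arrives
            period = t_receive - t_request
            if period < FIRST_PREDETERMINED:
                # Initialize to the received master time plus half the request/receipt period.
                set_local_clock(master_now + period / 2.0)
                return
            # Otherwise discard the received master time and request a new current time.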
[0272] In one embodiment, the first predetermined amount of time is pre-stored in the memory 296 (see FIG. 9) of the slave processor module 1678'. Further, the first predetermined amount of time is configurable based on the hardware layout of the multiprocessor system 1638. In one embodiment, the first predetermined amount of time is set to a specific time that falls between about 26 microseconds and about 35 microseconds.
[0273] In an alternative embodiment, rather than initializing the slave processor module internal clock 1675' as described above, step 1702 is not performed and the slave processor module 1678' instead computes, as described below, the offset between the slave processor module internal clock 1675' and the master internal clock 1675.
[0274] In summary, to synchronize the slave processor module internal clock 1675' with the master internal clock 1675, the slave processor module 1678' first requests, at step 1704 and at a first time according to the slave processor module internal clock 1675', the current time according to the master internal clock 1675. The request may be communicated in any form or manner useful for making a request. In one embodiment, for example, the request is communicated in the form of a data packet. The slave processor module 1678' also records the first time, according to the slave processor module internal clock 1675', at which the request was made. At some later time, at step 1708, the slave processor module 1678' receives, at a second time according to the slave processor module internal clock 1675', the current time according to the master internal clock 1675. The current time according to the master internal clock 1675 may be transmitted to, and received by, the slave processor module 1678' in any form or manner useful for communicating such information. For example, the current time according to the master internal clock 1675 may be transmitted to, and received by, the slave processor module 1678' in a data packet. Again, in a manner similar to step 1704, the slave processor module 1678' records the second time, according to the slave processor module internal clock 1675', at which the current time according to the master internal clock 1675 was received.
[0275] Optionally, after completing steps 1704 and 1708, but before performing steps 1712, 1716, and, if necessary, 1720, the slave processor module 1678' determines, at step 1710, whether the difference between the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704) and the second time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1708) is less than a second predetermined amount of time. In one such embodiment, as illustrated in FIG. 24, steps 1712, 1716, and, if necessary, 1720 are performed only if the slave processor module 1678' determines that the difference between the first time according to the slave processor module internal clock 1675' and the second time according to the slave processor module internal clock 1675' is less than the second predetermined amount of time. Otherwise, the slave processor module 1678' returns to step 1704. By doing so, the slave processor module 1678' eliminates from consideration any current time of the master internal clock 1675 that was received after an anomalously long delay, and thereby prevents erroneous clock synchronization.
[0276] In a similar fashion to the first predetermined amount of time described above with respect to step 1702, the second predetermined amount of time may be pre-stored in the memory 296 (see FIG. 9) of the slave processor module 1678' and may be configurable based on the hardware layout of the multiprocessor system 1638. In one embodiment, the second predetermined amount of time, like the first predetermined amount of time, is set to a specific time that falls between about 26 microseconds and about 35 microseconds.
[0277] Following the completion of steps 1704, 1708, and, optionally, 1710, the slave processor module 1678' calculates an expected time at step 1712 by using at least the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704) and the second time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1708). Optionally, in some embodiments, the slave processor module 1678' also uses an offset in calculating the expected time, for example as described below. In one embodiment, the calculated expected time indicates what the slave processor module 1678' expects to receive from the master processor module 1678 in response to the slave processor module's request for the current time according to the master internal clock 1675. In other words, in one embodiment, the slave processor module 1678' assumes that the master internal clock 1675 and the slave processor module internal clock 1675' run at the same speed. On that assumption, the slave processor module 1678' expects to be able to calculate the current time according to the master internal clock 1675 based on the request time (recorded by the slave processor module 1678' at step 1704), the response time (recorded by the slave processor module 1678' at step 1708), and, optionally, any previously determined offset between the slave processor module internal clock 1675' and the master internal clock 1675 (as described below).
[0278] At step 1716, the slave processor module 1678' determines whether the expected time differs from the received current time according to the master internal clock 1675. If so, then, contrary to the slave processor module's assumption at step 1712, the slave processor module internal clock 1675' and the master internal clock 1675 run at different speeds (i.e., the slave processor module internal clock 1675' drifts away from the master internal clock 1675). Optionally, in one embodiment, the slave processor module 1678', in performing step 1716, determines whether the expected time differs from the received current time according to the master internal clock 1675 by more than a third predetermined amount of time. In such an embodiment, the slave processor module 1678' performs step 1720 only when the slave processor module 1678' determines that the expected time differs from the received current time according to the master internal clock 1675 by more than the third predetermined amount of time. Otherwise, as illustrated in FIG. 24, the slave processor module 1678' returns to step 1704. By doing so, the slave processor module 1678' does not correct for minor, often negligible, deviations between the slave processor module internal clock 1675' and the master internal clock 1675.
[0279] Again, the third predetermined amount of time may be pre-stored in the memory 296 (see FIG. 9) of the slave processor module 1678' and may be configurable. A lower third predetermined amount of time results in tighter synchronization between the slave processor module internal clock 1675' and the master internal clock 1675. In one embodiment, the third predetermined amount of time is set to about 5 microseconds.
[0280] Upon determining at step 1716 that the expected time differs from the received current time according to the master internal clock 1675, or alternatively, that the expected time differs from the received current time according to the master internal clock 1675 by more than the third predetermined amount of time, the slave processor module 1678' corrects the slave processor module internal clock 1675' at step 1720. In one embodiment, correction is achieved by effectively "slowing down" or "speeding up" the slave processor module internal clock 1675', as described further below, although other correction techniques may also be used. Having completed step 1720, the slave processor module 1678' then returns to step 1704 to begin the next iteration through the steps of the method 1700. If, on the other hand, the expected time does not differ from the received current time according to the master internal clock 1675, or alternatively, differs from the received current time according to the master internal clock 1675 by no more than the third predetermined amount of time, the slave processor module 1678' does not perform step 1720, but instead returns from step 1716 to step 1704 to begin the next iteration through the steps of the method 1700.
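One iteration of steps 1704 through 1720 might be sketched as follows; the threshold values and the slow_down/speed_up hooks are illustrative assumptions rather than the literal implementation, and the offset term is explained below.

    SECOND_PREDETERMINED = 30e-6   # filter for anomalously long round trips (step 1710)
    THIRD_PREDETERMINED = 5e-6     # smallest deviation worth correcting (step 1716)

    def sync_iteration(request_master_time, local_clock, slow_down, speed_up, offset=0.0):
        t1 = local_clock()                   # step 1704: first time, request sent
        master_now = request_master_time()   # current time per the master internal clock
        t2 = local_clock()                   # step 1708: second time, response received

        if t2 - t1 >= SECOND_PREDETERMINED:  # optional step 1710: discard slow replies
            return

        # Step 1712: the time the slave expects the master to report if both clocks
        # ran at the same speed (request time plus half the round trip, less the offset).
        expected = t1 + (t2 - t1) / 2.0 - offset

        drift = master_now - expected        # step 1716
        if abs(drift) <= THIRD_PREDETERMINED:
            return                           # negligible deviation: no correction
        if drift < 0:
            slow_down()                      # step 1720: slave clock has run fast
        else:
            speed_up()                       # step 1720: slave clock has run slow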
[0281] In general, in a multiprocessor system, such as the multiprocessor system 1638 depicted in FIG. 23, the internal clocks of any two processor modules, say the master processor module 1678 and the slave processor module 1678', even though they may not drift apart from each other, are not completely synchronized in time, but rather differ from each other by some amount at a given point in time. In one embodiment, rather than initializing the slave processor module internal clock 1675 ' in step 1702 as described above, the slave processor module 1678 ' instead calculates the difference or offset between the slave processor module internal clock 1675 ' and the master internal clock 1675. This offset is calculated at a point in time during the first iteration through the steps of the method 1700, and is thereafter used by the slave processor module 1678 'to correct the slave processor module internal clock 1675'.
[0282] Thus, in one such embodiment, in the first iteration through the steps of the method 1700, the slave processor module 1678' calculates the offset after steps 1704, 1708, and, optionally, 1710 have been completed, but before steps 1712, 1716, and, optionally, 1720 are performed. For example, in one embodiment, the slave processor module 1678' calculates the offset by subtracting the received current time according to the master internal clock 1675 (received by the slave processor module 1678' at step 1708) from half of the sum of the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704) and the second time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1708). In effect, in such an embodiment, the slave processor module 1678' assumes that the time it takes for its request for the current time according to the master internal clock 1675 to reach the master processor module 1678 is equal to the time it takes for the response from the master processor module 1678 to travel back to the slave processor module 1678'. Thus, if, in such an embodiment, the time according to the internal clock 1675' of the slave processor module 1678' were exactly equal to the time according to the master internal clock 1675 of the master processor module 1678, then half of the sum of the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704) and the second time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1708) would be equal to the received current time according to the master internal clock 1675 (received by the slave processor module 1678' at step 1708). If this is not the case in practice, the internal clock 1675' of the slave processor module 1678' is offset from the master internal clock 1675.
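Expressed as a short sketch under the same hypothetical naming, the offset computed during the first iteration would be:

    def compute_offset(t1: float, t2: float, master_now: float) -> float:
        # t1: first time per the slave clock (recorded at step 1704)
        # t2: second time per the slave clock (recorded at step 1708)
        # master_now: current time per the master internal clock (received at step 1708)
        # Assumes the request and the response each take half of the round trip;
        # a positive result means the slave clock is ahead of the master clock.
        return (t1 + t2) / 2.0 - master_now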
[0283] Further, in another such embodiment, after the calculation of the offset is complete, the slave processor module 1678' then uses the offset to calculate the expected time at step 1712 in the first iteration through the steps of the method 1700 and at step 1712 in subsequent iterations through the steps of the method 1700. In one embodiment, as the slave processor module 1678' iterates through the steps of the method 1700, it does not calculate the offset again after the first iteration through the steps of the method 1700.
[0284] In another embodiment of the method 1700 (in which the slave processor module calculates the offset), the slave processor module 1678' does not adjust the slave processor module internal clock 1675' so that its time is exactly equal to the time according to the master internal clock 1675; rather, the slave processor module 1678' corrects the slave processor module internal clock 1675' at step 1720, as explained below, so that the offset does not drift. In other words, the slave processor module 1678' attempts to keep the slave processor module internal clock 1675' offset from the master internal clock 1675 by a constant amount. In one such embodiment, the target mode driver 382 (see FIG. 10) of each slave processor module 1678' timestamps the control information of a received I/O request with the time at which the I/O request was received according to the slave processor module internal clock 1675', plus or minus the offset calculated for that slave processor module internal clock. Thus, in such an embodiment, each slave processor module 1678' in the multiprocessor system 1638 timestamps received I/O requests with a time substantially equal to the time at which the I/O request was received according to the master internal clock 1675. Note, however, that due to the clock drift phenomenon described herein, the time with which a received I/O request is timestamped may not exactly equal the time at which the I/O request was received according to the master internal clock 1675. This latter issue is handled by the multiprocessor system 1638 as described below, and it does not affect the normal operation of the multiprocessor system 1638.
[0285] In more detail, in one embodiment, for each iteration through the steps of the method 1700, the slave processor module 1678', in calculating the expected time at step 1712, first calculates the round trip time for the iteration by subtracting the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704) from the second time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1708). The slave processor module 1678' may additionally store the calculated round trip time for each iteration through the steps of the method 1700, for example in its memory 296 (see FIG. 9). Thus, in any current iteration through the steps of the method 1700 after the first iteration through the steps of the method 1700, the slave processor module 1678' may calculate an average round trip time by using the round trip time calculated for the current iteration through the steps of the method 1700 and the round trip times of one or more previous iterations through the steps of the method 1700.
[0286] In one embodiment, the average round trip time calculated by the slave processor module 1678' is simply the average of the round trip time of the then-current iteration through the steps of the method 1700 and the round trip times of all previous iterations through the steps of the method 1700. In another embodiment, the average round trip time calculated by the slave processor module 1678' is a moving average of the round trip time of the then-current iteration through the steps of the method 1700 and the round trip times of one or more of the most recent previous iterations through the steps of the method 1700. In yet another embodiment, the average round trip time calculated by the slave processor module 1678' is a weighted moving average round trip time.
[0287] In one embodiment, on the first and each subsequent iteration through the steps of the method 1700, the slave processor module 1678' calculates the expected time at step 1712 by calculating the sum of the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704 of the current iteration) and half of the round trip time for that iteration through the steps of the method 1700, and optionally subtracting the offset therefrom. In another embodiment, on an iteration through the steps of the method 1700 subsequent to the first iteration through the steps of the method 1700, the slave processor module 1678' calculates the expected time at step 1712 by calculating the sum of the first time according to the slave processor module internal clock 1675' (recorded by the slave processor module 1678' at step 1704 of that iteration) and half of the calculated average round trip time (e.g., as described above), and optionally subtracting the offset therefrom.
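A sketch of the round trip bookkeeping and the resulting expected-time calculation, again with hypothetical names and an arbitrary window size, might read:

    from collections import deque

    class RoundTripTracker:
        # Keeps a moving average of round trip times over recent iterations (illustrative).
        def __init__(self, window: int = 8):
            self.samples = deque(maxlen=window)

        def record(self, t1: float, t2: float) -> float:
            rtt = t2 - t1                    # second time minus first time
            self.samples.append(rtt)
            return rtt

        def average(self) -> float:
            return sum(self.samples) / max(len(self.samples), 1)

    def expected_time(t1: float, round_trip: float, offset: float = 0.0) -> float:
        # First time per the slave clock plus half the (average) round trip, less the offset.
        return t1 + round_trip / 2.0 - offset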
[0288] Once the slave processor module 1678' has calculated the expected time, it then determines at step 1716 whether the expected time differs from the received current time according to the master internal clock 1675, or alternatively, whether the difference between the expected time and the received current time according to the master internal clock 1675 exceeds the third predetermined amount of time. In one embodiment, to make this determination, the slave processor module 1678' first calculates a drift value for each iteration through the steps of the method 1700 by subtracting the expected time (calculated by the slave processor module 1678' at step 1712 of the iteration) from the received current time according to the master internal clock 1675 (received by the slave processor module 1678' at step 1708 of the iteration). In addition, the slave processor module 1678' may store the drift value calculated for each iteration through the steps of the method 1700 in, for example, its memory 296 (see FIG. 9). Thus, as previously described, in any current iteration through the steps of the method 1700 subsequent to the first iteration through the steps of the method 1700, the slave processor module 1678' may calculate an average drift value by using the drift value calculated for the current iteration through the steps of the method 1700 and the drift values of one or more previous iterations through the steps of the method 1700.
[0289] In one embodiment, the average drift value calculated by the slave processor module 1678' is simply the average of the drift value of the then-current iteration through the steps of the method 1700 and the drift values of all previous iterations through the steps of the method 1700. In another embodiment, the average drift value calculated by the slave processor module 1678' is a moving average of the drift value of the then-current iteration through the steps of the method 1700 and the drift values of one or more of the most recent previous iterations through the steps of the method 1700. In yet another embodiment, the average drift value calculated by the slave processor module 1678' is a weighted moving average drift value.
[0290] In one embodiment, on the first and each subsequent iteration through the steps of the method 1700, when the drift value for that iteration is non-zero, the slave processor module 1678' determines at step 1716 that the expected time differs from the received current time according to the master internal clock 1675 (received by the slave processor module 1678' at step 1708 of the current iteration). In another embodiment, on an iteration through the steps of the method 1700 subsequent to the first iteration through the steps of the method 1700, when the calculated average drift value (e.g., as described above) is non-zero, the slave processor module 1678' determines that the expected time differs from the received current time according to the master internal clock 1675 (received by the slave processor module 1678' at step 1708 of that iteration).
[0291] Upon determining that the expected time differs from the received current time according to the master internal clock 1675, or alternatively, upon determining that the expected time differs from the received current time according to the master internal clock 1675 by more than the third predetermined amount of time, the slave processor module 1678' corrects the slave processor module internal clock 1675' at step 1720. In one embodiment, in the event that the expected time exceeds the received current time according to the master internal clock 1675 (or, alternatively, exceeds the received current time according to the master internal clock 1675 by more than the third predetermined amount of time), meaning that the slave processor module internal clock 1675' has run faster than the master internal clock 1675, the slave processor module 1678' corrects the slave processor module internal clock 1675' by slowing the slave processor module internal clock 1675' down. In another embodiment, in the event that the received current time according to the master internal clock 1675 exceeds the expected time (or, alternatively, exceeds the expected time by more than the third predetermined amount of time), meaning that the slave processor module internal clock 1675' has run slower than the master internal clock 1675, the slave processor module 1678' corrects the slave processor module internal clock 1675' by speeding the slave processor module internal clock 1675' up.
[0292] In one embodiment, the multiprocessor system 1638 includes a free-running counter that is incremented with each execution of a single CPU instruction, and the slave processor module 1678' is configured to implement the slave processor module internal clock 1675' by calibrating the count of the free-running counter to microseconds. The slave processor module 1678' may, for example, initially be configured to assume that 2800 counts of the free-running counter equal 1 microsecond (e.g., the slave processor module 1678' may initially be configured to assume that 1 microsecond equals the time required to execute 2800 CPU instructions, as might be the case with a 2.8 GHz (gigahertz) CPU clock and a CPU executing one instruction per clock cycle). Thus, in one embodiment, to slow down the slave processor module internal clock 1675', the slave processor module 1678' increases the number of counts of the free-running counter that it assumes make up a given time interval, without affecting the free-running counter itself. Likewise, to speed up the slave processor module internal clock 1675', the slave processor module 1678' may decrease the number of counts of the free-running counter that it assumes make up a given time interval, again without affecting the free-running counter itself. Importantly, in some such embodiments, the slave processor module 1678' corrects the slave processor module internal clock 1675' in such a way that it never goes backwards in time. More specifically, the slave processor module internal clock 1675' continually advances in time, whether it is being slowed down or sped up for correction as described above.
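The correction itself might be sketched as a re-calibration of the assumed counts-per-microsecond, which keeps the derived clock moving forward in time; the adjustment step size and the hook for reading the counter are illustrative assumptions.

    class SlaveClock:
        # Illustrative clock derived from a free-running instruction counter.
        def __init__(self, read_counter, counts_per_microsecond: float = 2800.0):
            # 2800 counts per microsecond corresponds, for example, to a 2.8 GHz CPU
            # executing one instruction per clock cycle.
            self.read_counter = read_counter
            self.counts_per_us = counts_per_microsecond
            self.last_count = read_counter()
            self.time_us = 0.0

        def now(self) -> float:
            # Convert counter ticks elapsed since the last reading into microseconds.
            count = self.read_counter()
            self.time_us += (count - self.last_count) / self.counts_per_us
            self.last_count = count
            return self.time_us              # always advances; never runs backwards

        def slow_down(self, step: float = 1.0) -> None:
            # Assume more counts per microsecond, so the derived clock advances more slowly.
            self.counts_per_us += step

        def speed_up(self, step: float = 1.0) -> None:
            # Assume fewer counts per microsecond, so the derived clock advances faster.
            self.counts_per_us -= step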
[0293] FIG. 25 depicts an exemplary graph 1800 of time according to the slave processor module internal clock 1675' versus time according to the master internal clock 1675. In this exemplary graph, for simplicity of explanation, the offset is assumed to be zero, even though it would not necessarily be zero if it were actually calculated as described above. Ideally, then, as represented by line 1804, the time according to the slave processor module internal clock 1675' is always equal to the time according to the master internal clock 1675. In practice, however, the slave processor module 1678' may drift relative to the master processor module 1678 (e.g., due to temperature changes), such that the slave processor module internal clock 1675' runs faster than the master internal clock 1675 (as represented by segments 1808 and 1812). Alternatively, the master processor module 1678 may drift relative to the slave processor module 1678' (e.g., due to temperature changes), such that the master internal clock 1675 runs faster than the slave processor module internal clock 1675' (as represented by line segment 1816). Accordingly, the slave processor module 1678' corrects the slave processor module internal clock 1675' in accordance with the method 1700 described above, either to "slow down" the slave processor module internal clock 1675' relative to the master internal clock 1675 (as represented by the exemplary line segment 1816), or alternatively, to "speed up" the slave processor module internal clock 1675' relative to the master internal clock 1675 (as represented by the exemplary line segment 1812). As described, the slave processor module 1678' corrects the slave processor module internal clock 1675' in such a way that it never goes backwards in time.
[0294] In another embodiment, the multiprocessor system 1638 of FIG. 23 is a server in a network (not shown). Thus, a processor module, say the slave processor module 1678', may receive one or more write operations from another computing device (e.g., a host) in the network. In such an embodiment, the slave processor module 1678' may determine, at step 1716 of an iteration through the steps of the method 1700, that the expected time differs from the received current time according to the master internal clock 1675 (received by the slave processor module 1678' at step 1708 of that iteration) by less than a specified amount of time 1820, which is shown on the graph 1800 of FIG. 25 and which is greater than the previously described third predetermined amount of time. In this case, the slave processor module 1678' acknowledges a received write operation before the write operation is actually completed. Alternatively, the slave processor module 1678' may determine that the expected time differs from the received current time according to the master internal clock 1675 by more than the specified amount of time 1820. In this case, the slave processor module 1678' refrains from acknowledging received write operations until, through correction of the slave processor module internal clock 1675' as described above with reference to the method 1700, the expected time is again determined to differ from the received current time according to the master internal clock 1675 by less than the specified amount of time 1820. Likewise, in the latter case, all other processor modules in the multiprocessor system 1638 may simultaneously refrain from acknowledging received write operations until the expected time, as calculated by the slave processor module 1678', is again determined to differ from the received current time according to the master internal clock 1675 by less than the specified amount of time 1820. In these embodiments, the most extreme case in which the multiprocessor system 1638 will still acknowledge received write operations occurs when the internal clock of a first slave processor module runs faster than the master internal clock 1675 and drifts in the positive direction by the specified amount of time 1820, while the internal clock of a second slave processor module runs slower than the master internal clock 1675 and drifts in the negative direction by the specified amount of time 1820.
[0295] In one embodiment, the specified amount of time 1820 is one-half of the minimum amount of time in which a host in the network can request the multiprocessor system 1638 to process a first write operation, thereafter receive an acknowledgement of that request from the multiprocessor system 1638, and thereafter request the multiprocessor system 1638 to process a second write operation. In such an embodiment, assuming the extreme case described above, the host may send a first write operation to the first slave processor module, whose internal clock has drifted in the positive direction by the specified amount of time 1820, thereafter receive an acknowledgement of the first write operation from the multiprocessor system 1638, and thereafter immediately send a second write operation to the second slave processor module, whose internal clock has drifted in the negative direction by the specified amount of time 1820, and still be assured that the target mode driver 382 (see FIG. 10) of the second slave processor module will timestamp the received second write operation with a time that is later than the time with which the target mode driver 382 (see FIG. 10) of the first slave processor module timestamped the received first write operation. Alternatively, in still other embodiments, the specified amount of time may be set to any amount of time that ensures that received write operations are processed by the multiprocessor system 1638 in the correct order.
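The acknowledgement gating described above could be sketched as follows; the minimum host turnaround value is purely hypothetical, and the expected and master times are those produced in a given iteration of the method 1700.

    MIN_HOST_TURNAROUND = 100e-6                   # hypothetical request/ack/request minimum
    SPECIFIED_AMOUNT = MIN_HOST_TURNAROUND / 2.0   # the specified amount of time 1820

    def may_acknowledge_write(expected: float, master_now: float) -> bool:
        # Acknowledge a received write operation only while the deviation between the
        # expected time and the received master time is within the specified amount 1820.
        return abs(master_now - expected) < SPECIFIED_AMOUNT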
[0296] In yet another embodiment, where the multiprocessor system 1638 includes a free-running counter and the master processor module 1678 is configured to implement the master internal clock 1675 by calibrating the count of the free-running counter to microseconds, as described above for the slave processor module internal clock 1675', the master processor module 1678 maintains a calibration table that relates the master internal clock 1675 to a real-world clock. In one embodiment, as described above, 2800 counts of the free-running counter equal 1 microsecond on the real-world clock. In one such embodiment, when the multiprocessor system 1638 presents a time to a user at a host in the network, the calibration table at the master processor module 1678 is first consulted to convert the running time maintained by the multiprocessor system 1638 to real-world time.
[0297] Map generation and use
[0298] In general, additional aspects of the invention relate to systems, methods, and articles of manufacture that generate an image of a data store at a specified past time by using a map (e.g., a time map) of locations of data stored in the data store at the past time. The mapping allows the data storage system to quickly and efficiently determine the location of data stored in the data store at a past time without searching through an entire index of records regarding past data locations.
[0299] In brief overview, in one embodiment of the present invention, a data storage system includes a storage management device including a receiver for receiving a specification of a past time, and an I/O processor for processing I/O requests directed to one or more target storage units in a data store. As previously described, in one embodiment, a storage unit is a single-byte or multi-byte group of blocks of the data store. The storage management device also includes an indexing module that records write requests processed by the I/O processor. The indexing module includes a memory that stores a record for each write request, which may include: 1) an identification of the target storage unit; 2) the location of the data previously stored in the target storage unit; and 3) a write time, which represents the time at which the write request was received. In addition, the storage management device includes a mapping module that uses one or more of the records to generate a map of the locations of the data stored in the target storage units at the specified past time. An image generation module included in the storage management device presents an image of the data store at the past time based, at least in part, on the map generated by the mapping module.
[0300] FIG. 26 illustrates a storage management device 1938 according to an embodiment of this aspect of the invention. The storage management device 1938 may be integrated in the data storage systems described herein, e.g., the data storage systems described with reference to FIGS. 1, 4, 5, and 13. As one example, the storage management device 1938 may communicate with a host and physical memory, providing the host with access to data stored in the physical memory. In addition, various methods may be used to organize and present the data stored in the physical memory to the host. For example, the storage management device 1938 may present one or more volumes, including logical volumes, to the host. Also, as previously discussed, the storage management device 1938 may provide the host with access to one or more current stores and one or more time stores associated with a plurality of data stores. In addition, as previously described, the image presented to the host may be a fixed or dynamic image. The storage management device 1938 may also implement the additional functionality attributed to the storage management device in previously described aspects and embodiments.
[0301] In one embodiment, the storage management device 1938 includes a receiver 1961, a processor 1962, an indexing module 1995, a mapping module 1997, and an image generation module 1999, all in communication with each other. Each of these units may be implemented in software, hardware, or some combination of both. The receiver 1961, for example, can be implemented as part of one or more of the host interfaces 361 of FIG. 10. The receiver 1961, in one embodiment, is implemented in the target mode driver 382 of FIG. 10. The receiver 1961 communicates with the host and receives a specification of a past time. The past time is part of a request by the host that the storage management device present an image of a data store at that past time. The request may also include an identification of the particular data store, and sometimes a logical block address and a length.
[0302] In one embodiment, the request for an image of the data store at a past time is received by the receiver 1961 in-band, that is, from the host via the same communication protocol that the host uses to communicate data commands (e.g., read requests and write requests). In another embodiment, the receiver 1961 receives the request out-of-band. For example, the receiver 1961 receives the request via a communication protocol different from the one that the host uses to communicate data commands, via a different channel (e.g., via a user interface, a physical interface, or a command line console other than that of the host, such as a system administrator interface), or via some combination thereof.
[0303] The processor 1962 processes I/O requests directed to one or more target storage units. The processor 1962 may be implemented in one of the elements previously described herein. For example, the processor 1962 may be implemented in one or more of the elements shown in the processor module 378 of FIG. 10. In one embodiment, the processor 1962 is implemented in the I/O manager 362 shown in FIG. 10. The processor 1962 processes I/O requests directed to units of storage (e.g., logical blocks) in the data store. The unit of storage targeted by a read or write request is also referred to as the target storage unit.
[0304] As previously described, write requests are often directed to multiple storage units. In one embodiment, before performing a write request, the storage management device 1938 performs a copy-on-write operation in which the data stored in the target storage unit is copied before it is overwritten. The copied data (i.e., the past data) is then moved to another location by the storage management device 1938. As described previously, an actual copy of the data may not be performed upon the occurrence of a write operation in certain cases because, for example, the data to be overwritten is already stored elsewhere, or because the data is temporarily stored in memory before being written, or because the data is not moved but a pointer to the data is modified instead. For example, in one embodiment, each write request directed to a target storage unit may cause data to be written to both the current store and the time store. Thus, for an immediately subsequent write operation directed to the same target storage unit, it is not necessary to perform an actual copy-on-write because the past data is already stored in the time store. Thus, copy-on-write operations herein may mean actual copying, and may also include optimizations that achieve the effect of a copy-on-write. As previously mentioned, the examples described below generally present the operation of the storage management device 1938 as if a copy-on-write were always performed, with the understanding that optimizations can be used in practice.
[0305] The storage management device 1938 also includes an indexing module 1995 that stores records of the locations of past data in the storage management system to facilitate, among other uses, the subsequent retrieval of past data for presenting an image of the data store at a past time. The indexing module 1995 can also be implemented in software, hardware, or some combination thereof, and, for example, in one of the elements previously described herein. For example, in one embodiment, the indexing module 1995 is implemented in one or more of the I/O managers 362 of FIG. 10. The indexing module 1995 includes a memory 1996 for storing the location records. In one version of this embodiment, the memory 1996 is part of the indexing module 1995. In another version, the memory is not an integral part of the indexing module 1995, but is located elsewhere within the storage management device 1938, e.g., elsewhere in the processor module 378 of FIG. 10. Functionally, the indexing module 1995 records the write requests processed by the I/O processor 1962 and stores a record in the memory 1996 for each write request processed. The record includes an identification of the target storage unit, the location of the data previously stored in the target storage unit, and a write time indicating the time at which the corresponding write command was received. Each write request may be directed to a single unit of storage, e.g., a block, or to multiple units of storage. In either case, the records stored by the indexing module provide a mechanism by which the data stored in each storage unit at a specified past time can be located. In one embodiment, the write time is the time at which the storage management device 1938 received the write command.
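Purely as an illustration of the kind of record the indexing module might keep for each write request (the field names below are hypothetical, not those of the embodiments):

    from dataclasses import dataclass

    @dataclass
    class WriteRecord:
        # One index record per processed write request (illustrative field names).
        request_id: int         # e.g. a write request identifier such as 1287
        target_lun: int         # LUN to which the write was directed
        target_lba: int         # first logical block address written
        length: int             # number of blocks written
        past_data_lun: int      # LUN now holding the copied (past) data
        past_data_lba: int      # logical block address at which the past data begins
        past_data_length: int   # number of blocks the past data occupies
        write_time: float       # time at which the write request was received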
[0306] The storage management device 1938 also includes a mapping module 1997 that uses the records stored by the indexing module 1995 to generate a map of the current locations of the past data that occupied the storage units of the data store at a specified past time. This mapping functionality allows past images of the data store to be generated rapidly. The mapping module 1997 can be implemented in one or more of the elements shown in the processor module 378 of FIG. 10. For example, in one embodiment, the mapping module 1997 is implemented in one or more of the I/O managers 362 shown in FIG. 10. Functionally, the mapping module 1997 creates a list of pointers to locations in the storage management system, e.g., to the locations in physical memory where the data that was present at the specified past time now resides. Once the map is created, it may be stored by the storage management device 1938 in a location from which it can be quickly accessed in the future to again present the image of the data store at the past time. In one embodiment, for example, one or more of the I/O managers 362 of FIGS. 10 and 11 manage the map.
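Given such records, one way to build a map of past-data locations for a specified past time is sketched below; it reuses the hypothetical WriteRecord above and assumes, for simplicity, that the past-data blocks of each record are laid out contiguously and in the same order as the target blocks.

    def build_time_map(records, past_time):
        # Map (target_lun, lba) -> (lun, lba) holding the data that occupied that block
        # at past_time; blocks absent from the map are read from the current store.
        time_map = {}
        # Only writes received after the requested past time displaced data that was
        # present at that time; the earliest such write per block points at where the
        # displaced data now lives.
        later = sorted((r for r in records if r.write_time > past_time),
                       key=lambda r: r.write_time)
        for rec in later:
            for i in range(rec.length):
                key = (rec.target_lun, rec.target_lba + i)
                if key not in time_map:      # keep the earliest displacement per block
                    time_map[key] = (rec.past_data_lun, rec.past_data_lba + i)
        return time_map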
[0307] The mapping may be dynamic, e.g., it may be updated as additional write requests are processed by the processor 1962. In general, such updates are necessary to ensure that the mapping remains accurate when a copy-on-write operation is performed after the time the mapping was generated. The dynamic nature of the mapping will be further explained with reference to fig. 27 and 28.
[0308] The storage management device 1938 also includes an image generation module 1999 that presents an image of the data store at a past time based, at least in part, on the mapping generated by the mapping module 1997. The image generation module 1999 can also be implemented in one or more elements shown in the processor module 378 of FIG. 10. For example, in one embodiment, the image generation module 1999 is implemented in the host interface 361 shown in FIG. 10.
[0309] The receiver 1961, the processor 1962, the indexing module 1995, the mapping module 1997, and the image generation module 1999 can be implemented in a distributed architecture (as shown in FIG. 10). In this approach, each processor module 378 is responsible for processing and indexing write commands directed to specific units of storage in one or more data stores. Thus, the indexing module 1995 contained in each processor module 378 stores a record for each write command directed to the storage units for which that indexing module 1995 is responsible. When an image of the data store at a specified past time is requested, each mapping module 1997 generates a map, at the specified past time, for the portion of the data store for which it is responsible, if any. The map is generated using the records stored in the corresponding indexing module 1995. Based at least in part on this map, the image generation module 1999 in each processor module 378 then presents the portion of the image (if any) of the data store for which it is responsible. In one embodiment, the indexing module 1995, the mapping module 1997, and the image generation module 1999 included in each processor module 378 are responsible for a common portion of the data store (e.g., the same storage units).
[0310] The methods described above also allow the storage management device 1938 to include built-in redundancy that increases the reliability of the data storage system. For example, two separate processor modules 378 may be designated to perform the receiving, processing, indexing, mapping, and image generation operations described above for the same storage unit. In one embodiment, the first processor module 378 is used as the primary processing module while the second processor module 378' serves as a backup, e.g., taking over if a problem occurs with the first processor module 378.
[0311] FIG. 27 illustrates a record index 2009 for a small set of write requests directed to a data store, processed by the processor 1962, and recorded by the indexing module 1995. Index 2009 includes four records 2010, 2010', 2010", and 2010*, identified by the unique write request identifiers 1287, 1288, 1290, and 1291, respectively. Each record 2010 identifies the target Logical Unit (LUN) to which the associated write command is directed, i.e., a target LUN identification. In addition, each record includes the location(s) of the storage units on the target LUN, the location of the past data that was overwritten, and the time at which the write command was received by the storage management device 1938. In the embodiment shown in FIG. 27, the location of the storage units is indicated by the Logical Block Address (LBA) and length (i.e., the number of LBAs comprising the target storage units) associated with the write request. Although each record 2010 in FIG. 27 includes a target LUN identification, this identification may be excluded from records where the index itself is limited to a single LUN. Also, in FIG. 27, a LUN identification is contained in the location of the past data for each record 2010. The target LUN and the LUN where the past data is stored are different in each record 2010 shown in FIG. 27. For example, each write request 1287, 1288, 1290, and 1291 of FIG. 27 is associated with a target LUN identified as LUN 2502, while the past data associated with write requests 1287, 1288, and 1291 is stored in LUN 2500, and the past data associated with write request 1290 is stored in LUN 2501. Although these examples present copy-on-write operations in which different LUNs are used to store the new data and the past data, in practice new data and past data may be stored on the same LUN. When the target LUN is also used to store past data, all LUN identifications can be excluded from each individual record, e.g., where the index itself is limited to records of a single LUN.
[0312] As for the position value in index 2009, the first value on the left in the "new data" column is the logical block address (i.e., the memory cell) from which the corresponding write operation begins. The second value, the item to the right in the "new data" column, is the length, i.e., the number of memory locations to which a write operation is directed. In the embodiment shown in FIG. 27, the leftmost entry in the "past data" column is the LUN identification of the LUN to which the past data was written. The central entry appearing in the "past data" column is the logical block address at which the past data began to be stored due to the associated copy-on-write operation. The rightmost entry appearing in the "past data" column is the number of memory locations that past data occupies when copied and written to that location. Thus, the index 2009 provides sufficient information to allow the system to identify the specific locations of newly written data and past data associated with each record 2010.
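As a rough illustration of the record layout just described (a sketch only; the field names are illustrative and not drawn from the patent), the records of index 2009 can be modeled as follows, using the values given in the text for write requests 1287 and 1288:

```python
from dataclasses import dataclass

@dataclass
class IndexRecord:
    """One record 2010: where the new data landed and where the displaced past data went."""
    request_id: int    # unique write request identifier, e.g., 1287
    target_lun: int    # LUN receiving the new data
    new_lba: int       # starting LBA of the write on the target LUN
    length: int        # number of blocks written
    past_lun: int      # LUN holding the copied-out past data
    past_lba: int      # starting LBA of the past-data copy
    write_time: int    # time the write command was received

# The first two records of FIG. 27, using the values stated in the text; the records for
# write requests 1290 and 1291 would follow the same pattern.
index_2009 = [
    IndexRecord(1287, 2502, 0, 17, 2500, 64, 6100),
    IndexRecord(1288, 2502, 16, 5, 2500, 85, 6117),
]
```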
[0313] In one embodiment, the storage unit is a specific 512 byte block, which is part of the LUN, so the length indicates how many 512 byte blocks the write request will operate on. For example, write request 1287 occurs at time (t) 6100. It is directed to the target storage unit in LUN2502, starting at LBA0 and 17 blocks in length. The past data stored at blocks 0-16 is copied and rewritten to blocks 64-80 (i.e., locations 64, 17) of LUN 2500. It should be appreciated that other block lengths may be used.
[0314] Similarly, write request 1288 results in data in blocks 16-20 of LUN2502 being copied to locations 85-89 of LUN 2500. After write request 1288 executes, block 16 has been the target of two write operations at t 6100 and t 6117, while each of blocks 0-15 and 17-20 has been the target of a single write operation. Write request 1290 is the write request for the next record. After its execution, the data in blocks 6-9 of LUN2502 is copied and written to blocks 37-40 of LUN2501, and new data is written to blocks 6-9 of LUN 2502. Here, blocks 6-9 and 16 have been the target of two write operations, while each of blocks 0-5, 10-15, and 17-20 has been the target of a single write operation. Write request 1291 is processed after write request 1290 is processed. As a result of write request 1291, the data in blocks 7-10 is written as past data to blocks 46-49 of LUN2500, while new data is stored in blocks 7-10 of LUN 2502. After write request 1291 executes, blocks 7-9 have been the target of three write operations, blocks 6, 10, and 16 have been the target of two write operations, and blocks 0-5, 11-15, and 17-20 have each been the target of a single write operation.
[0315] FIG. 28 illustrates two simplified exemplary mappings 2100, 2101 generated by a mapping module 1997 from a record 2010, the record 2010 being stored in an index 2009 by an indexing module 1995. The mapping demonstrates how the information provided by the record 2010 is used by the mapping module 1997 to map the location of data stored in the data store at a specified past time. For ease of explanation, the mapping is directed to 20 storage units in a data store. The storage management device 1938 may be used with any size data store, or any number of data stores, and thus it should be appreciated that a data management system employing the storage management device 1938 would not be limited to a single data store having 20 storage units as in this illustrative example.
[0316] Generally, the maps 2100, 2101 are generated for a specified past time and are shown as they stand at their respective generation times. To accurately reflect write requests that occur after the initial generation of a map, the map may be regenerated or modified after its initial generation. Here, the term "initial generation time" refers to the time when the map was initially created. The term "generation time" refers to the point in time when the map is updated after the initial generation time. Map 2100 is a view of the map at initial generation time t 6127. The map 2100 is created in response to the receiver 1961 receiving a request for an image of the data store at a specified past time t 6106. In the approach shown in FIG. 28, the maps 2100, 2101 include only entries for units of storage that have been the subject of write requests since the specified past time. Data in other storage units can be located without the mapping because such data is still present in the storage unit (i.e., in the current store) to which it was originally written. While not limited to this approach, such an implementation is advantageous because it allows for faster map generation and, therefore, faster image generation.
[0317] In one embodiment, the specification of past time is provided from the host at the time of the request and is received by the receiver 1961 substantially simultaneously. In one version of this embodiment, the mapping module 1997 begins generating the map 2100 substantially simultaneously with the receipt of the request by the receiver 1961.
[0318] Referring to the times stored in the index 2009 of FIG. 27, write request 1287 occurred before the specified past time (t 6106). Its location record 2010 is not of interest in generating the map 2100 because, for example, the past data associated with write request 1287 had already been overwritten by the specified past time. However, a record is employed in the mapping for each write request that occurs after the specified past time and before the initial generation time (and, in the case of updated mappings, before the map generation time). For example, each of write requests 1288 and 1290 occurred after the specified past time and before the initial generation time. Thus, the mapping module 1997 will use the records 2010 associated with write requests 1288 and 1290 to generate map 2100. Write requests that occur after the generation time, of course, do not yet exist when the map 2100 is generated. This is true, for example, where the map 2100 is generated substantially concurrently with the request, since in such a case the write request has not yet occurred. However, as described in more detail below, the mapping module 1997 can update the existing map (yielding, e.g., map 2101) to reflect the processing of write requests (and associated copy-on-write operations) that occur after the initial generation time of the map.
[0319] In FIG. 28, the map 2100 includes a pointer to the location of the past data for each storage unit that has been the subject of a write request since the specified past time t 6106. Thus, blocks 6-9 are mapped to blocks 37-40 of LUN 2501, and blocks 16-20 are mapped to blocks 85-89 of LUN 2500. The mapping module 1997 uses the information stored in the index 2009 (FIG. 27) to generate the mapping. In the embodiment shown in FIG. 28, the storage units of blocks 0-5 and blocks 10-15 are not included in the map because those storage units have not been the target of a write command since the specified past time t 6106, and their data is therefore still available directly from the current store.
[0320] The second map 2101 illustrates how, in general, the map may change over time to reflect the handling of write requests after the initial generation time. For example, if storage units that are the target of subsequent write requests have not previously been mapped, pointers for those storage units are added to the map. In this example, map 2101 has a generation time t 6131, which reflects write request 1291. Write request 1291 operates on blocks 7, 8, 9, and 10 of LUN 2502. Block 10 thus provides an example of a location to which the mapping update applies: a pointer for block 10 is added, as required by write request 1291. The pointer reflects the fact that the data stored in block 10 at the specified past time t 6106 has been moved and is now stored in block 49 of LUN 2500. The remainder of map 2101, including the mappings of blocks 6-9, remains unchanged from the first map 2100. The mapping of blocks 6-9 remains unchanged because, although a copy-on-write operation affecting blocks 7-9 (as part of write request 1291) is performed at time t 6130, it does not affect the location of the data that was stored in blocks 6-9 at the specified past time t 6106. That data is still stored in blocks 37-40 of LUN 2501.
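To make the map-generation and map-update steps concrete, the following sketch (assuming the IndexRecord layout from the earlier sketch; the function names are illustrative) builds a block-to-past-location map for a specified past time and shows how a later copy-on-write is folded in without disturbing existing entries:

```python
def build_map(index, past_time, generation_time):
    """Map each block overwritten since past_time to the location of its past data."""
    mapping = {}  # target block -> (past LUN, past LBA)
    for rec in sorted(index, key=lambda r: r.write_time):
        if past_time < rec.write_time <= generation_time:
            for offset in range(rec.length):
                # The first copy-on-write after past_time preserved the data that was
                # current at past_time; later copies of the same block are ignored.
                mapping.setdefault(rec.new_lba + offset,
                                   (rec.past_lun, rec.past_lba + offset))
    return mapping

def update_map(mapping, rec, past_time):
    """Fold in a copy-on-write indexed after the map's initial generation time."""
    if rec.write_time > past_time:
        for offset in range(rec.length):
            mapping.setdefault(rec.new_lba + offset,
                               (rec.past_lun, rec.past_lba + offset))
```

Blocks absent from the resulting map are simply read from the current store, as noted above.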
[0321] The maps 2100 and 2101 may be stored in any structure that allows for efficient retrieval of mapped data ranges. In one embodiment, maps 2100 and 2101 are stored in a binary tree to allow for quick identification of the blocks contained in the maps and to locate the data source (current location) of the memory locations that have been overwritten since the specified past time. In another embodiment, the mapping is stored in a B + tree. In versions of each of these embodiments, each node of the lookup tree includes a pointer to the data source of the range. Databases, files, and other structures may also be used to store the mappings.
[0322] For ease of explanation, the second map 2101 is considered to be generated at t 6131. It should be appreciated, however, that the map 2101 need not all be newly generated. The map 2101 may be newly generated, but it may also be the result of an update or modification to the map 2100. Thus, map 2100 and map 2101 may exist independently in parallel, or map 2101 may replace map 2100. Further, the storage management device 1938 may automatically generate a mapping update in response to a write request that is indexed after the initial generation time. In addition, the foregoing description with respect to FIGS. 26-28 describes the use of a single target LUN for storing data contained in a data store. Again, it should be understood that in some implementations the data store may include data on multiple LUNs (which are the targets of write requests), store past data, or a combination thereof. Additionally, the data store may include a time store and a current store, each including data stored on a plurality of LUNs.
[0323] In one embodiment, the storage management device 1938 begins generating the mapping when a request for an image of the data store at a past time is received. However, generating the mapping may be time consuming, so in one embodiment, the storage management device 1938 uses the mapping to respond to requests for storage units already contained in the mapping and searches the index 2009 for the locations of storage units not yet contained in the mapping. If a storage unit is covered by an index record 2010, that information is added to the map for future reference. If a storage unit is not covered by the index, a marker indicating as much is also added to the map.
[0324] When the mapping is complete, and thus all of the appropriate index records 2010 have been added to the mapping, the storage management device 1938 no longer needs to consult the index 2009 and may reference only the mapping. Likewise, explicit entries in the map indicating that data is still in the current store may be removed from the map to make it more efficient.
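One way to realize the lazily completed map described in these paragraphs is sketched below (illustrative only, reusing the IndexRecord layout assumed earlier): lookups prefer the map, fall back to the index while the map is still filling in, and cache the answer, including an explicit "still in the current store" marker that can later be dropped once the map is complete:

```python
IN_CURRENT_STORE = None  # marker: the block's past-time data is still in the current store

def locate_block(block, mapping, index, past_time):
    """Resolve where the data that occupied `block` at `past_time` now lives."""
    if block in mapping:
        return mapping[block]
    for rec in sorted(index, key=lambda r: r.write_time):
        if rec.write_time > past_time and rec.new_lba <= block < rec.new_lba + rec.length:
            mapping[block] = (rec.past_lun, rec.past_lba + (block - rec.new_lba))
            return mapping[block]
    mapping[block] = IN_CURRENT_STORE   # explicit marker, removable once the map is complete
    return IN_CURRENT_STORE
```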
[0325] In another embodiment, a flag or other indicator is used to identify the complete mapping. In one version of this embodiment, the index 2009 is not used as a source of data locations to be used to generate an image until the map 2100 is created. Once the map 2100 is complete, it is used as a source of data locations to be used to generate the image, and the index is no longer used. In a version of this embodiment, memory locations not included in the map are not marked.
[0326]System for processing I/O requests
[0327] In general, additional aspects of the invention relate to systems and methods for processing I/O requests. In brief overview, in one embodiment of the invention, a system processes I/O requests directed to at least one logical unit of storage. The system includes an operation memory for storing a plurality of ordered sets of operations, each set associated with an I/O request. The system also includes a processor in communication with the operations memory for arranging the operations stored in the operations memory into a first queue or a second queue. The first queue and the second queue are in communication with the processor. The first queue queues the operation based on the identification of the target logical unit. The second queue queues operations based on the operation type.
[0328] Generally, in one embodiment, the first operation associated with a request is placed on a queue associated with one or more LUNs or a portion of a LUN. The operation remains queued on the LUN queue until no other operations belonging to requests directed to overlapping storage units are being processed (an overlap arises, e.g., where two requests are directed to one or more of the same storage units). In other words, in this embodiment, an operation is taken off the LUN queue and processed only if no operations directed to overlapping storage units are in process at the time. The first and remaining operations associated with the request may then be processed without concern for overlap with other operations. Operations for multiple requests may be processed, for example, in batches to improve efficiency. The remaining operations may be placed in order on operation-specific queues to facilitate such batch processing. Thus, the two types of queues described facilitate the processing of requests without address conflicts.
[0329] FIG. 29 illustrates a system for processing I/O requests in accordance with this aspect of the invention. The host 2234 communicates with the physical storage 2236 through a storage management device 2238. Physical storage 2236 may include one or more Logical Units (LUNs), e.g., LUN 1 through LUN X. Data stored in these LUNs may be presented to the host 2234 through the storage management device 2238. The storage management device 2238 communicates with the host 2234 via a first communication link 2240. The storage management device 2238 communicates with the physical storage 2236 via a second communication link 2242. As with the previously described aspects, the first communication link 2240 may be any type of data communication link, such as a LAN, a storage network, or a bus including fibre channel and Small Computer System Interface (SCSI). Ethernet (e.g., gigabit ethernet) and wireless communications are other possibilities that may be used for the first communication link 2240. In one embodiment, the storage management device communicates the SCSI protocol at the logical layer and can communicate using one or more of various physical layers, including a SCSI bus, fibre channel 2, or iSCSI over Ethernet. On the communication link 2240, the storage management device 2238 acts as if it were the physical storage 2236 in responding to I/O requests from the host 2234. I/O requests from the host 2234 may include read and write requests directed to a storage unit.
[0330] Upon receiving an I/O request from a host 2234, the storage management device 2238 generates an ordered set of operations that are processed in order to perform the I/O request. In one embodiment, for example, a write request directed to a unit of storage results in an ordered set of five operations, including: 1) reading existing data stored in a target storage unit; 2) writing the existing data to another location; 3) establishing an index for the operation performed in step 2; 4) writing the new data to the target storage unit; and 5) releasing the write request, e.g., generating a positive acknowledgement or confirmation that the write request is completed. Another example is a read request, which results in an ordered set of two operations. The first operation is to read the data stored in the target storage unit and the second is to release the read request. In other embodiments, the I/O requests described above are modified to include additional operations that are advantageous in certain system configurations. For example, as described above, a write request may include an operation directed to updating a time map. In other embodiments, the number of operations associated with an I/O request may be reduced, or the operations reordered, as part of an optimization.
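A minimal sketch of how an I/O request might be expanded into such an ordered set follows (the operation names are illustrative, not the patent's internal identifiers):

```python
WRITE_SEQUENCE = [
    "read_existing",   # 1) read the existing data from the target storage unit
    "write_existing",  # 2) write the existing data to another location
    "index",           # 3) record the location produced by step 2
    "write_new",       # 4) write the new data to the target storage unit
    "release",         # 5) acknowledge completion of the write request
]
READ_SEQUENCE = ["read", "release"]

def ordered_operations(request_id, request_type):
    """Tag each operation with its request and position, e.g., ('1286-4', 'write_new')."""
    sequence = WRITE_SEQUENCE if request_type == "write" else READ_SEQUENCE
    return [(f"{request_id}-{position}", op)
            for position, op in enumerate(sequence, start=1)]

print(ordered_operations(1286, "write")[3])   # ('1286-4', 'write_new')
```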
[0331] The hardware and software architecture of the storage management device 2238 is advantageous for efficiently processing an ordered set of operations. The storage management device 2238 includes an operating memory 2296, a processor 2262, a LUN queue 2221, and an operation-type queue 2222 that communicate with each other through an internal network 2280. In one embodiment, the LUN queue 2221 comprises a separate queue for each respective LUN contained in the physical memory 2236, e.g., LUN 1 through LUN X. Operation-type queue 2222 comprises a separate queue for organizing operations based on the type of operation to be queued. For example, an indexing queue is used to store index operations from multiple ordered sets. Additionally, the operation-type queues are not dedicated to a single LUN; thus, indexing queues and other operation-type queues may store operations directed to multiple LUNs. Functionally, in one embodiment, the first operation in each ordered set of operations is queued in the appropriate LUN queue. Operations subsequent to the first operation in each ordered set of operations are not queued in the LUN queue. Instead, subsequent operations are queued in an operation-type queue.
[0332] FIG. 30 illustrates a general process employed by one embodiment of the present system. At step 2304, the storage management device 2238 receives an I/O request from the host 2234. For example, in one embodiment, the host interface 361 (FIG. 10) receives the I/O request. At step 2305, the storage management device 2238 generates an ordered set of operations associated with the I/O request. Then, at step 2306, the first operation from the ordered set of operations is placed into the LUN queue, which is responsible for the LUN targeted by the received I/O request. The first operation is taken off the queue and processed. At step 2307, subsequent operations in the ordered set are processed. In one embodiment, the performance of these steps may be accomplished with the embodiments previously described herein. For example, each step may typically be performed in the processor module 378 (fig. 10). More specifically, in one version of this embodiment, the I/O manager 362 performs step 2305 and produces an ordered set of operations, and the LUN queues and operation-type queues may be implemented in memory 296 (FIG. 9), which memory 296 may or may not be included in the I/O manager 362. In one embodiment, operations in the ordered set after the first operation are stored in memory, while the first operation is stored in the LUN queue. Once the first operation is processed, a second operation from the ordered set is pulled from memory and placed into an operation-type queue of an operation type corresponding to the second operation. Once the second operation is processed, a third operation from the ordered set is pulled from memory and stored in the operation-type queue for the operation type corresponding thereto. For each operation associated with an I/O request, the steps of pulling an operation from operation memory 2296, storing it in the appropriate queue, processing the operation, and pulling subsequent operations in the ordered set into the appropriate queue are repeated until all operations resulting from the I/O request are completed.
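The division of labor between the LUN queues, the operation memory, and the operation-type queues can be sketched as follows (a simplified model; the structure and names are illustrative, not the patent's implementation):

```python
from collections import defaultdict, deque

lun_queues = defaultdict(deque)       # one queue per LUN (or portion of a LUN)
op_type_queues = defaultdict(deque)   # one queue per operation type, shared across LUNs
operation_memory = {}                 # request id -> operations not yet moved to a queue

def accept_request(request_id, lun, ordered_ops):
    """Queue the first operation on the LUN queue; hold the rest in operation memory."""
    first, *rest = ordered_ops
    lun_queues[lun].append((request_id, first))
    operation_memory[request_id] = deque(rest)

def advance(request_id):
    """After an operation completes, pull the next one into its operation-type queue."""
    remaining = operation_memory.get(request_id)
    if remaining:
        op = remaining.popleft()
        op_type_queues[op].append((request_id, op))
    # when `remaining` is exhausted, the release operation has been queued and processed

# Example: write request 1286 directed to LUN 2502 (operation names as in FIG. 32).
accept_request(1286, 2502, ["read_existing", "write_existing", "index", "write_new", "release"])
advance(1286)   # after 1286-1 is processed, 1286-2 lands on the write-existing-data queue
```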
[0333] Referring now to FIG. 31, a table 2407 of entries 2410 corresponding to I/O requests is illustrated. Each entry includes the time the storage management device 2238 received the I/O request, an identification of the target LUN (e.g., LUN #), the logical block addresses (or other storage units) affected by the I/O request (e.g., the target storage units), the I/O request type, and the ordered set of operations resulting from the I/O request. The storage management device 2238 is capable of handling a tremendous number of I/O requests associated with data storage systems of 1 gigabyte or greater. However, the illustrative table presents a small set of information for purposes of explanation. The entries in table 2407 cover a period of time, at least from t 6100 to t 6130. Two types of I/O requests are contained in table 2407, namely, a read request (1290) and write requests (1286, 1287, 1288, 1289, and 1291). However, the present system may handle various other I/O requests, such as requests for a modification history. In addition, the I/O requests have been directed to two different LUNs, namely LUN 2502 and LUN 2503, in the time period covered by table 2407.
[0334] Table 2407 includes the ordered set of operations associated with each I/O request. The sets of operations appear, in the order they are processed, in the columns numbered 1-5. For example, I/O request 1288 is a write request that includes five ordered operations: 1) reading the existing data in the target storage unit; 2) writing the existing data to another location; 3) establishing an index for the operation performed in step 2; 4) writing the new data to the target storage unit; and 5) releasing the write request. In another embodiment, the write request includes a different set of ordered operations. For example, in a system using time maps, a write request may include six ordered operations: 1) reading the existing data in the target storage unit; 2) writing the existing data to another location; 3) establishing an index for the operation performed in step 2; 4) writing the new data to the target storage unit; 5) updating one or more time maps as necessary; and 6) releasing the write request. Further, the number of ordered operations for an I/O request type may be expanded by dividing one or more of the ordered operations into sub-operations. For example, operation 5 of the immediately preceding ordered set may be divided into one operation directed to determining whether a time map has been previously generated and another operation directed to updating the map. Additionally, the steps may be performed out of order, e.g., as described herein with reference to optimization.
[0335] FIG. 32 provides a simplified diagram that will now be used, together with the simplified example of FIG. 31, to explain the operation of the storage management device 2238, which includes LUN queues and operation-type queues. The data in the table of FIG. 32 corresponds to the information in the table 2407 of FIG. 31. The information in the leftmost column indicates when the storage management device 2238 received the associated I/O request. The columns labeled LUN 2502 and LUN 2503 represent two LUN queues. The right half of FIG. 32 depicts the operation-type queues. Four types of operation-type queues are shown: 1) queues for operations that write existing data from a target storage unit to another location (these queues are also referred to as "write-existing-data" queues); 2) an index queue for queuing operations that record locations resulting from completion of previous write operations; 3) a write-new-data queue for queuing operations that write new data to the target storage unit; and 4) a release queue for queuing operations indicating that the previous operations in an ordered set have completed.
[0336] The contents of the queues represent the individual operations from the ordered sets of operations displayed in table 2407. Each operation is identified by the I/O request that generated the operation and, to the right of the hyphen, by the position the operation occupies in its ordered set. Thus, the fourth operation (i.e., the write-new-data operation) in the ordered set of operations resulting from I/O request 1286 is represented in FIG. 32 as 1286-4. As another example, the first operation in the ordered set of operations resulting from I/O request 1288 is represented as 1288-1.
[0337] At time t 6100, I/O request 1286 is received by the storage management device 2238. The I/O request 1286 produces an ordered set of operations (i.e., 1286-1, 1286-2, 1286-3, 1286-4, and 1286-5), which corresponds to the set of operations shown in FIG. 31 at t 6100. Operations from the ordered set are stored in an operations memory 2296. Starting with the first operation in the ordered set, each operation of the ordered set is moved into a queue, one at a time, and processed. Thus, at t 6100, operation 1286-1 is placed into the LUN 2502 queue and operations 1286-2, 1286-3, 1286-4, and 1286-5 are stored in the operation memory 2296. The first operation (operation 1286-1) is stored in the LUN 2502 queue because I/O request 1286 is directed to LUN 2502.
[0338] In FIG. 32, the processing state of the storage management device 2238 is next shown at t 6119. By this time, the storage management device 2238 has received two additional I/O requests, namely 1287 and 1288 (at t 6114 and t 6117, respectively). Also, operation 1286-1 (i.e., reading the data present in the target storage unit) has been processed. Thus, operation 1286-2 has been identified and stored in the write-existing-data queue. Because operation 1286-1 was processed, it is no longer stored in the LUN 2502 queue. However, both requests 1287 and 1288 are directed to LUN 2502. Thus, the LUN 2502 queue now includes the first operation from each of these two pending I/O requests. These two operations will be performed in the order in which they were received by the storage management device 2238, i.e., 1287-1 followed by 1288-1, as long as no requests directed to overlapping storage units are in process.
[0339] The storage management device 2238 may include lookup trees, algorithms, and other systems and methods described in greater detail herein to efficiently and accurately process I/O requests. In one embodiment, the storage management device 2238 uses an overlap detection process to determine whether a newly received I/O request targets any storage units that are also targets of one or more I/O requests currently being processed. If so, the first operation in the ordered set of the newly received I/O request will be held in the appropriate LUN queue until all operations of the previous I/O request are processed. However, as here, where newly received I/O requests (i.e., 1287 and 1288) do not target any of the same units of storage as previously received I/O request(s) (e.g., 1286), the storage management device 2238 may process the operations after the first operation in the multiple ordered sets (e.g., 1286, 1287, and 1288) together. To facilitate the previously described processing, the storage management device 2238 can include systems and methods described in more detail herein to batch process operations queued in the operation-type queues. Thus, operations may be kept in an operation-type queue until they are joined in the queue by other operations of the same type, thereby increasing the overall processing speed and efficiency of the storage management device 2238.
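A simple way to express the hold rule described here, ignoring the tree-based optimizations covered later, is to keep the block sets of in-process requests and release a LUN-queue operation only when its request's blocks are disjoint from all of them (an illustrative sketch; the block set below contains only the overlapping blocks named in the text):

```python
in_process = {}   # request id -> set of blocks targeted by a request currently in process

def may_start(request_blocks):
    """The first operation of a request leaves its LUN queue only when no in-process
    request touches any of the same storage units."""
    return all(request_blocks.isdisjoint(blocks) for blocks in in_process.values())

in_process[1289] = {26, 27, 28}          # blocks shared by requests 1289 and 1290
print(may_start({26, 27, 28}))           # False -> hold operation 1290-1 on the LUN queue
print(may_start({0, 1, 2}))              # True  -> a non-overlapping request may proceed
```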
[0340] By time t 6122, the storage management device 2238 has processed operation 1286-2 (write existing data), determined that requests 1286, 1287, and 1288 are directed to non-overlapping portions of the target LUN 2502, sequentially processed operations 1287-1 and 1288-1, and received two more I/O requests (i.e., 1289 and 1290). The first operation from each newly received I/O request (i.e., 1289-1 and 1290-1) is stored in the LUN 2502 queue. As operations 1287-1 and 1288-1 are processed, they are removed from the LUN queue. Operation 1286-2 has been removed from the write-existing-data queue, and operation 1286-3 has been pulled from the operation memory 2296 and stored in the index queue. Similarly, operations 1287-2 and 1288-2 have been pulled from the operation memory 2296 and stored in the write-existing-data queue.
[0341] The queue view at t 6124 demonstrates a simplified example of the batch processing method described above. Between t 6122 and t 6124, operations 1287-2 and 1288-2 are removed from the write-existing queue and processed together. As a result, operations 1287-3 and 1288-3 are popped from operation memory 2296 and stored in the index queue where they join operation 1286-3, which has not yet been processed. With respect to operations in the LUN queue, operation 1289-1 is processed, and thus, operation 1289-2 is popped from operation memory 2296 and stored in the write-existing data queue. However, because there is an overlap in the units of storage targeted by I/O requests 1289 and 1290 (i.e., blocks 26-28 as listed in FIG. 31), operation 1290-1 will not be processed until all operations of I/O request 1289 have been processed. Meanwhile, operation 1290-1 is held in the LUN 2502 queue, and operations 1290-2, 1290-3, 1290-4, and 1290-5 will be held in operation memory 2296.
[0342] The three operations in the index queue (i.e., 1286-3, 1287-3, and 1288-3) are now processed together. After the three index operations are completed, the corresponding write-new-data operations (i.e., 1286-4, 1287-4, and 1288-4, respectively) are ejected from the operations memory 2296 and stored in the write-new-data queue at t 6125. And at t 6125, an I/O request 1291 directed to LUN2503 is received by the storage management device 2238. The first operation from the ordered set resulting from request 1291 is stored in the LUN2503 queue. Further, at t 6125, no other operations in the queue are directed to LUN 2503; thus, operation 1291-1 is stored as the first operation in the LUN2503 queue. Subsequent operations (i.e., 1291-2, 1291-3, 1291-4, and 1291-5) are stored in the operations memory 2296. At this time, each of the two LUN queues shown in FIG. 32 includes a single operation. Although operation 1291-1 was received at a later time, it can be processed before operation 1290-1 because there were no operations in the LUN2503 queue before 1291-1, and in this example, no operations for LUN2503 were processed. In contrast, operation 1290-1 will remain in the queue until all operations associated with I/O request 1289 are completed (i.e., 1289-2, 1289-3, 1289-4, and 1289-5).
[0343] By time t 6127, operation 1291-1 has been processed, as has each operation that was stored in an operation-type queue at t 6125. As a result of this processing, operations 1286-5, 1287-5, and 1288-5 are popped from operation memory 2296 and moved to the release queue. At this time, the operations associated with I/O requests 1286, 1287, and 1288 are no longer stored in operation memory 2296. Also, operation 1289-4 is popped from the operation memory and stored in the write-new-data queue, and operation 1291-2 is popped from the operation memory and stored in the write-existing-data queue. It should be appreciated from this example that the operation-type queues can be used to service multiple LUNs. For example, operation 1291-2 can be processed (including batch processing) with operations directed to LUN 2502, or any other combination of LUNs, that are being serviced by the storage management device 2238.
[0344] By time t 6129, the first of the example I/O requests have been completed. Release operations 1286-5, 1287-5, and 1288-5 are processed together. Each release operation provides a system acknowledgement that the associated I/O request is completed. Once a release operation is processed, the corresponding I/O request is complete, and neither the LUN queues nor the operation-type queues store any operations associated with the completed I/O request. Thus, at t 6129, the operation-type queues include only operation 1291-3 in the index queue and operation 1289-5 in the release queue. Because processing of I/O request 1289 is incomplete, operation 1290-1 remains in the LUN 2502 queue.
[0345] Referring now to FIG. 33, in a functional description of the system elements, the storage management device 2538 includes an operation generator 2525. Operation generator 2525 receives the I/O request originating from host 2534. As described previously, for each I/O request, the set of ordered operations is determined by the I/O request type. In one embodiment, operation generator 2525 determines the I/O request type upon receipt of the I/O request. Based on the I/O request type, operation generator 2525 extracts an ordered set of operations from each I/O request received from host 2534. In one embodiment, the operation generator 2525 is included in the processing module 378 (FIG. 10) of the storage management device 2538. In a version of this embodiment, the operation generator is included in the target mode driver 382 of fig. 10. The storage management device 2538 also includes an operations pool 2524 that stores each operation fetched before the operation is moved to the queue. In one embodiment, the operations pool 2524 is contained in operations memory 2296. In a version of this embodiment, the operating memory is contained in the buffer 363 of fig. 10.
[0346] The storage management device 2538 includes both the LUN queuing module 2521 and the operation-type queuing module 2522. The LUN queuing module 2521 receives the first operation from each ordered set of operations from the operations pool 2524 and stores it in the queue of the appropriate LUN for processing. In the embodiment shown in FIG. 33, the LUN queuing module 2521 includes a processing management module 2526. In one embodiment, in general, the processing management module 2526 manages the processing of operations stored in the LUN queues. More specifically, the processing management module 2526 ensures that operations stored in the LUN queues are processed in such a way that when subsequent operations in the ordered set are pulled to the operation-type queues, they are idempotent with respect to any other operations stored in the operation-type queues. The processes used by the processing management module 2526 are described in more detail elsewhere herein. However, in one embodiment, the processing management module 2526 employs a search tree data structure to organize the order of execution of the operations stored in the LUN queues 2221. In another embodiment, the processing management module employs a fairness algorithm to ensure that operations directed to LUNs that receive a low volume of I/O requests are processed in a timely manner. In a version of this embodiment, the processing management module 2526 monitors the amount of time each pending operation is stored in the LUN queue.
[0347] The operation-type queuing module 2522 receives operations subsequent to the first operation in each ordered set from the operation pool 2524 and stores them in the appropriate operation-type queues. The operation-type queuing module also includes a batching module 2528. A batching module 2528 may be used to optimize the processing of operations stored in the operation-type queues. For example, two pending operations directed to adjacent memory locations may be processed in a single batch, thereby reducing the number of read and write operations that must be performed by the physical memory. Thus, to increase overall processing speed, batch processing may include delaying the processing of pending operations until a larger batch is obtained.
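As a sketch of the kind of optimization a batching module might perform (illustrative only; the patent does not prescribe this particular algorithm), adjacent or overlapping block ranges queued for the same operation type can be coalesced so that one physical-storage access services several operations:

```python
def coalesce(ranges):
    """Merge adjacent or overlapping (start, length) block ranges so one larger
    physical-storage access can service several queued operations of the same type."""
    merged = []
    for start, length in sorted(ranges):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            prev_start, prev_len = merged[-1]
            merged[-1] = (prev_start, max(prev_len, start + length - prev_start))
        else:
            merged.append((start, length))
    return merged

print(coalesce([(0, 17), (16, 5)]))   # [(0, 21)] -> one access instead of two
```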
[0348] The storage management device 2538 also includes an indexing module 2523. The indexing module 2523 generates a location record for the data that was moved due to the copy-on-write operation. The indexing module 2523 may be included in the I/O manager 362 of FIG. 10. In one embodiment, an index queue (e.g., as shown in FIG. 32) stores operations that result in the creation of records in the indexing module 2523.
[0349]Overlap detection
[0350] A storage management device implemented in accordance with at least some aspects of the disclosed technology may improve the performance of an enterprise information technology infrastructure by efficiently processing I/O requests directed to particular logical storage units and/or portions thereof from host processors within the enterprise. In contrast to conventional storage interactions that require a host processor (or processing thread) to wait to complete an I/O request to a storage device, the disclosed techniques enable a storage management device to acknowledge completion of an I/O request to a host processor, where at least some such I/O requests are not actually completed, but are queued (based on, for example, their receipt time and their target logical storage units), and where the queuing order of the respective operations has been optimized to minimize the number of disk accesses performed and thus improve the performance of the enterprise storage system.
[0351] As a non-limiting example of such optimization, in response to receiving a write request directed to a particular location in a logical storage unit and a subsequent read request directed to the same (or partially overlapping) location, a storage management device incorporating at least some aspects of the disclosed technology may determine that there is overlap between the requests and suppress the read request from executing until after the write request is completed. As another example, if overlap is detected, by using data in temporary storage, reads may be serviced before writes are completed, e.g., stored data may be subsequently read from RAM (rather than from a relatively low speed disk), thereby reducing the total number of disk accesses.
[0352] Also, in some I/O request processing, such as described elsewhere herein, the processing of I/O requests may be enhanced by limiting the parallel processing of I/O requests (e.g., rather than as part of a special optimization) to I/O requests directed to non-overlapping units of storage (e.g., blocks). Such processing may thus be improved by efficiently determining whether there are I/O requests directed to overlapping units of storage without, for example, reviewing all pending I/O requests and using this information to determine whether an I/O request should be processed or queued. Thus, in addition to enabling optimization as described above, efficiently providing information resources (such as lists, databases, tree structures, linked lists, or other resources) regarding locations targeted by pending I/O requests may allow the storage management system to process I/O requests more efficiently because the storage management system may limit parallel processing to I/O requests targeting non-overlapping units of storage.
[0353] Referring now to FIG. 34, an illustrative storage management device (not shown) may include one or more software processes 2602 (e.g., scheduler software processes), the software processes 2602 receiving and storing I/O requests 2604, 2606 in request queues 2608, 2610 associated with a particular logical storage unit 2612, 2614, or portion thereof, such requests 2604, 2606 targeting the logical storage unit 2612, 2614, or portion thereof. I/O requests 2604 within a particular request queue 2608 are preferably organized to ensure that the requests 2604 are processed (or placed within the queue 2608) in the order in which they were received (e.g., I/O request 1 2604' received at time T1 is placed before I/O request 2 2604" received at a subsequent time T2). The request queues 2608, 2610 may also preferably be configured to store requests 2604, 2606 associated with a particular logical storage unit 2612, 2614, or portion thereof. I/O requests 2604 in a particular queue 2608 are directed to various overlapping and/or non-overlapping address ranges in the logical storage unit 2612. For example, the address range (address 0 to address 15) 2616' associated with I/O request 2604' directed to logical storage unit 2612 may overlap with another address range (address 8 to address 11) 2616* associated with another I/O request 2604*. Similarly, the address range (address 0 to address 15) 2616' associated with I/O request 2604' may be different from the address range (address 16 to address 32) 2616" associated with another I/O request 2604", and thus the former does not overlap the latter.
[0354] The queued I/O requests 2604, 2606 may further be associated with one or more sequences of operations 2618 that specify an order in which certain operations 2620 should be performed to fulfill the respective I/O requests 2604, 2606. The scheduler software process 2602 may organize the operations 2620 associated with the queued I/O requests 2604, 2606 in the corresponding operation queue 2622 and may further perform such queued operations 2620 in a manner that optimizes the performance of the storage devices associated with the targeted logical storage units 2612, 2614 (such as by, for example, minimizing the number of disk accesses to such storage devices). To ensure that operations 2620 queued in one or more operation queues 2622 execute in a manner that is consistent with the time that the respective I/O request 2604, 2606 was received and that results in performance optimization, the scheduler software process 2602 may search the queued data structure 2624 (e.g., a binary tree and/or other type of tree data structure) to determine whether the operations 2620 are associated with non-overlapping address ranges (e.g., 2616' and 2616") or whether one or more operations 2620 are associated with overlapping address ranges (e.g., 2616' and 2616*). If the address ranges 2616 overlap, the scheduler software process 2602 splits one or more nodes 2626 within the binary tree 2624 such that each node 2626 is associated with a non-overlapping address range.
[0355] In one illustrative embodiment, each node 2626 in the binary tree data structure 2624 that may be searched by the scheduler software process 2602 in accordance with at least some aspects of the disclosed technique may include: an identifier of logical storage unit 2612, a pointer to a list (e.g., a linked list) of I/O requests 2604, a pointer and/or identifier to one or more sequences of operations 2618, a pointer and/or identifier to a particular operation 2620 within a sequence of operations 2618, a pointer to a non-overlapping range of addresses 2616 within logical storage unit 2612, a pointer to a parent node (if a parent node exists, otherwise null), and/or a pointer to a child node (if a child node exists, otherwise null). The data and pointers associated with each node are used not only to form interrelationships within the tree data structure 2624, but also to facilitate the searching and retrieval of relevant data by the scheduler software process 2602 when determining whether a particular I/O request 2604 and/or associated operation 2620 is directed to overlapping/non-overlapping address ranges 2616 within the logical storage unit 2612 or portions thereof.
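Restating the node contents as a data structure (a sketch with illustrative field names; the actual node layout is whatever the implementation chooses):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One node 2626: a non-overlapping address range within a logical storage unit."""
    lun: int                                            # identifier of the logical storage unit
    start: int                                          # first address of the range (inclusive)
    end: int                                            # last address of the range (inclusive)
    requests: List[int] = field(default_factory=list)   # I/O requests touching this range
    operations: List[str] = field(default_factory=list) # references into sequences of operations
    parent: Optional["Node"] = None                     # parent node, or None at the root
    left: Optional["Node"] = None                       # child covering smaller addresses
    right: Optional["Node"] = None                      # child covering larger addresses
```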
[0356] In one illustrative operation, and referring now to FIG. 35, a scheduler software process 2602 of a storage management device (not shown) receives I/O requests 2604, 2606 directed to one or more logical units of storage 2612, 2614, or portions thereof, from one or more hosts. The scheduler software process 2602 establishes a request queue 2608 for each logical storage unit 2612, if such request queue 2608 does not already exist, and the scheduler software process 2602 stores the I/O requests 2604 (or a flag associated therewith) according to the time at which the I/O request 2604 targeting such logical storage unit 2612 was received (2702). The scheduler software process 2602 evaluates the queued I/O requests 2604 to obtain and/or form the data and pointers discussed above for forming the nodes 2626 of the queued data structure 2624 such that each node 2626 is associated with a non-overlapping address range 2616 (2704).
[0357] By way of non-limiting example, the scheduler software process 2602 may extract and/or form identifiers and/or pointers associated with one or more of the logical storage units 2612, the queued I/O requests 2604, the operations 2620 and sequences of operations 2618 associated with the I/O requests 2604, the address ranges 2616 specified by the I/O requests 2604, and/or otherwise obtain any other information necessary or desirable for forming the nodes 2626 of the binary tree data structure 2624. If two or more queued I/O requests 2604 are directed to overlapping address ranges 2616, the scheduler software process 2602 can form nodes 2626 that correspond to non-overlapping address ranges. For example, if a first I/O request 2604' is aligned with an address range 2616' (addresses 0 through 15) of a first logical unit 2612 and a second I/O request 2604* is aligned with an overlapping address range 2616* (addresses 8 through 11), then the scheduler 2602, for example, may form three nodes with associated address ranges that do not overlap, i.e., the first node may be associated with addresses 0 through 7 (and further associated with the first I/O request 2604'), the second node may be associated with addresses 8 through 11 (and further associated with the first and second I/O requests 2604', 2604*), and the third node may be associated with addresses 12 through 15 (and further associated with the first I/O request 2604'). In this manner, the scheduler 2602 ensures that each node corresponds to a distinct, non-overlapping range of addresses within the logical storage unit, regardless of whether the I/O requests specify overlapping or non-overlapping ranges of addresses. Once the nodes 2626 are formed, the scheduler 2602 arranges the nodes into a data structure 2624 (e.g., a binary tree) using, for example, parent and/or child pointers to other nodes, which data structure 2624 may, but need not, reflect substantially adjacent address ranges 2616 within the logical store 2612 (2706).
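The splitting step itself reduces to partitioning two inclusive address ranges into non-overlapping pieces; a sketch (illustrative, not the patent's code) that reproduces the 0-7 / 8-11 / 12-15 split from the example:

```python
def split_ranges(existing, new):
    """Split two inclusive address ranges into non-overlapping pieces.

    For the example in the text, an existing range (0, 15) and a new request
    against (8, 11) yield (0, 7), (8, 11), and (12, 15).
    """
    (a, b), (c, d) = existing, new
    points = sorted({a, b + 1, c, d + 1})
    pieces = []
    for lo, hi in zip(points, points[1:]):
        # keep only pieces that lie inside at least one of the two ranges
        if a <= lo <= b or c <= lo <= d:
            pieces.append((lo, hi - 1))
    return pieces

print(split_ranges((0, 15), (8, 11)))   # [(0, 7), (8, 11), (12, 15)]
```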
[0358] The scheduler 2602 may perform the operations 2620 (2708) associated with an I/O request 2604 by first searching the binary tree 2624 to verify that no earlier-received I/O requests with overlapping address ranges are contained within its nodes 2626. By queuing operations in the operation queues as described above, the execution of operations associated with a request can be staged. For example, a write operation associated with an I/O request may be performed, while another write operation directed to the same or an overlapping address specified in a subsequently-occurring I/O request may be performed after the first write operation is completed, such that processing of the two requests occurs in an orderly manner.
[0359] In one embodiment, the operations 2620 queued by the scheduler 2602 are based on one or more batches of I/O requests 2604 received during a particular time period. In another embodiment, the queuing of operations 2620 by the scheduler 2602 may occur substantially in real time as the I/O requests are received. In yet another embodiment, the scheduler 2602 may initially queue the operations 2620 in batch mode and then reorder the operation queues 2622 based on I/O requests 2604 received substantially in real time. Regardless of the particular queuing method implemented, the scheduler 2602 can maintain and update the binary tree data structure 2624 by adding, removing, and/or splitting nodes within the binary tree data structure as corresponding I/O requests 2604 are added, processed, and/or removed. For example, the scheduler 2602 may remove one or more nodes 2626 from the binary tree 2624 if the corresponding I/O request is complete and such nodes are not otherwise associated with other I/O requests that have not yet been executed (2710). If a new I/O request is received and not directed to addresses that overlap those already in the binary tree 2624, the scheduler 2602 can expand the binary tree 2624 by forming new nodes corresponding to the non-overlapping addresses of the new I/O request and can add such new nodes to the binary tree 2624, which can (but need not) subsequently cause a rearrangement of operations within the operation queue 2622. If a new I/O request is received and directed at addresses that overlap those already in the binary tree 2624, the scheduler 2602 can split one or more existing nodes 2626 in the binary tree into multiple nodes to ensure that each node 2626 in the binary tree 2624 contains non-overlapping addresses (note that splitting a node is faster than creating a new node and incorporating the new node into the binary tree 2624) (2714).
[0360] In this manner, the binary tree 2624 remains substantially current and can support the queuing operations being performed by the scheduler 2602, particularly determining whether a newly received I/O request is associated with addresses that overlap the addresses of operations 2620 already queued in one or more operation queues 2622. When a new I/O request is received, the scheduler 2602 may quickly search the nodes 2626 of the binary tree 2624 to determine whether there is any overlap in the address range specified by the new I/O request relative to the address ranges associated with existing and/or queued requests and/or operations. As previously discussed, operations associated with a newly received I/O request and having non-overlapping addresses relative to addresses in the binary tree 2624 may be queued without undue concern that such operations will be performed out of order, while overlapping addresses require more careful consideration to ensure that the operations are performed in the correct order to avoid data corruption issues.
[0361] Referring now to the exemplary embodiment of the binary tree illustrated in fig. 36A, the scheduler software process 2602 may form the first node 2802 (i.e., node 0) of the binary tree data structure 2624 (fig. 34) by, for example, associating information related to the I/O request showing the earliest receive time (i.e., I/O request 0) with the first node 2802. As above, the associated information may include the following: an identifier 2804 of a logical storage unit targeted by the I/O request, one or more pointers 2806 to one or more I/O requests, one or more pointers 2808 to operations and/or sequences of operations associated with the I/O request, and/or one or more pointers 2810 to non-overlapping address ranges associated with the I/O request. The node 2802 may also include a pointer 2812 to a parent node if such parent node exists (otherwise, a pointer to null), and pointers 2814, 2816 to one or more child nodes if such child nodes exist (otherwise, a pointer to null). One of the child pointers 2814 may then be redirected to a child node associated with a smaller address range, while the other child pointer 2816 may be redirected to a child node associated with a larger address range.
[0362] Referring now to FIG. 36B, the scheduler 2602 may extend the binary tree by, for example, forming a new node 2818 associated with another subsequently received I/O request (i.e., I/O request 1) that aligns with an address range 2820 (i.e., addresses 16-32) that does not overlap with the address range (i.e., addresses 0-15) of the existing node 2802. To maintain clarity of the figure, FIGS. 36B-36D do not repeat all information associated with the described nodes (previously described in connection with node 2802 in FIG. 36A), but those skilled in the art will recognize that similar information will exist for each such node.
[0363] Referring now to FIG. 36C, the scheduler 2602 may expand the binary tree by splitting one or more existing nodes 2802 in response to receiving a new I/O request directed to an address range (i.e., addresses 8-11) that overlaps the address ranges (i.e., addresses 0-15) associated with one or more such existing nodes 2802, where each resulting node in the binary tree is organized such that it is associated with a non-overlapping address range. For example, node 0 2802 of FIG. 36B, originally associated with addresses 0-15, may be split into two additional nodes 2822, 2824 (i.e., nodes 2 and 3), the address ranges of the nodes 2822, 2824 (i.e., addresses 0-7 and 8-11, respectively) not overlapping the updated address range of node 0 2802 (i.e., addresses 12-15). Pointers, identifiers, and/or other information associated with each of the nodes 2802, 2818, 2822, 2824 can be updated as needed to reflect the updated tree structure. For example, the address range pointer 2810 in node 0 2802 may be modified to point to address range 12-15 within a particular logical storage unit, the address range pointer 2826 of node 2 2822 may be formed and aligned with address range 0-7 within the logical storage unit, the I/O request pointer 2828 of node 2 2822 may be formed and directed to I/O request 0, the address range pointer 2830 in node 3 2824 may be formed and aligned with address range 8-11 within the logical storage unit, and the two I/O request pointers 2832 of node 3 2824 may be formed and directed to I/O requests 0 and 2 (since both requests are aligned with addresses 8-11). Similarly, other node information, such as pointers and identifiers directed to associated sequences of operations and/or parent or child nodes, may be updated to form an updated binary tree data structure.
[0364] Referring now to FIG. 36D, the scheduler 2602 may modify the binary tree by removing one or more nodes when the corresponding I/O request is complete. For example, when I/O request 0 completes, node 0 2802 and node 2 2822 of FIG. 36C may be removed from the binary tree because such nodes do not reference any other I/O requests (i.e., their I/O request pointers 2806, 2828 are aligned only with I/O request 0). The remaining nodes 2818, 2824 in the binary tree may be reorganized to reflect the new tree hierarchy, and their associated information may be similarly updated to reflect their independence from the removed nodes 2802, 2822. For example, the I/O request pointer 2832 of node 3 2824 may be updated to point only to I/O request 2 and not to I/O request 0 because I/O request 0 has been fulfilled, and the parent and child pointers of nodes 1 and 3 2818, 2824 may be modified to reflect the new binary tree hierarchy.
[0365] While the embodiment discussed above in relation to fig. 36A-36D is relatively simplistic to maintain clarity of the disclosure, one skilled in the art will recognize that the disclosed techniques may be applied to a large number of I/O requests that may exhibit various types of interaction affecting a plurality of logical storage units, where each such logical storage unit (or portion thereof) includes a set of nodes arranged in a distinctly different binary tree. As previously discussed, these binary trees enable one or more schedulers 2602 to quickly search the address range pointers of the binary tree for the address range specified by the newly received I/O request to determine if any pending I/O requests whose operation may be in-process or queued for processing overlap with the address range of the newly received I/O request. The scheduler may thus use the search results to quickly determine whether it is possible to begin performing the operation associated with the request. This efficiency is beneficial to performance for a large number of requests. The disclosed techniques may also be used for other types of queued data structures and/or other types of commands/requests.
[0366] Checkpoints
[0367] In one embodiment, the storage management device may be used to set up checkpoints for copy-on-write operation sequences, and these checkpoints are useful for real-time recovery from storage management device failures. For example, in a storage management device designed with redundancy, there may be one primary processing module assigned to process I/O operations directed to a particular data store, and one or more secondary processing modules that can complete the processing of any I/O operations left in process by the primary processing module upon detection of an error or failure in the primary processing module. Embodiments of the disclosed technology provide such a secondary processing module, upon taking over, with information useful for successfully completing the primary processing module's outstanding I/O operations. At the same time, embodiments of the disclosed technology facilitate the use of these checkpoints in a manner that is integrated with the storage of other transactional information, lightweight, and easy to communicate.
[0368] In addition, embodiments of the disclosed technology allow the primary processing module to employ processing optimizations, because the secondary processing module does not need to know about any optimizations attempted by the primary processing module in order to successfully replace it in the event of a failure; the secondary processing module can instead use the disclosed checkpoint information to determine what processing it still needs to complete for any outstanding I/O operations. This is particularly advantageous in large systems with multiple data stores, where there may be thousands, tens of thousands, or more outstanding I/O transactions at any given time.
[0369] In one illustrative embodiment and referring to FIGS. 37 and 38, the storage management device 2938 may intercept/receive an I/O request 2904 (e.g., a write request, a read request, etc.) from a host 2934 that targets a particular current store 2944 (3002), and the storage management device 2938 may, in response to the request, identify a particular type of operation sequence associated with the I/O request 2904 from a number of such operation sequence types 2910 (e.g., a write request sequence 2912, a read request sequence 2914, etc.) (3004). By way of non-limiting example, an exemplary write request sequence 2912 may include the operations discussed below with respect to blocks 3006-3010 and 3014-3018 of FIG. 38.
[0370] The storage management device 2938 parses the write request 2904 to extract the identifier 2916 of the storage device, and a location 2918 (including, for example, a particular starting address and data length) within the current store 2944 to which the current data specified by and/or included with the write request will be written. The storage management device 2938 reads the data 2920 (referred to herein as "original data") stored at location 2918 within the current store 2944 (3006) and copies such data 2920 to a destination location 2922 (3008) in the time store 2946 associated with the selected storage device. Transaction information 2926 associated with the write request 2904 is recorded in one or more data structures, files, and/or databases (not shown) and may include, for example, a device identifier 2916 associated with the current store 2944 and/or the time store 2946, a write request identifier 2928 uniquely identifying the write request 2904, the locations 2918, 2922 within the current store 2944 and the time store 2946 affected by the write request 2904, a time 2930 at which the write request 2904 was received, and/or other types of information associated with the write request 2904 (3010). The transaction information 2926 may be recorded before, after, or at the same time that the data 2920 is copied to the destination location 2922.
[0371] If the original data 2920 was not successfully copied to the destination location 2922, and/or if the transaction information 2926 was not properly recorded, the storage management device 2938 generates an error message that may be communicated to a user of the storage management device 2938 and/or other entities or software processes associated therewith (3012). Otherwise, upon successfully copying the data 2920 and recording the transaction information 2926, the storage management device 2938 generates an indicator 2932 (referred to herein as an "index checkpoint") confirming that the data copy and transaction information recording operations have completed successfully; the index checkpoint 2932 is subsequently stored or recorded, e.g., as part of the transaction information 2926 (3014).
[0372] After generating and storing the index checkpoint 2932, the storage management device 2938 writes the current data (also referred to as "payload data") specified by the write request 2904 to the appropriate location 2918 within the current store 2944 (3016). If the current data is not successfully written, an error message may be generated (3012). Otherwise, the storage management device 2938 generates an indicator 2933 (referred to herein as a "release checkpoint") confirming that the current data has been successfully written to the desired location 2918 in the current store 2944; the release checkpoint 2933 is subsequently stored/recorded as part of the transaction information 2926 (3018). An index checkpoint 2932, a release checkpoint 2933, and/or other transaction information 2926 may be generated for each write request and/or other type of storage transaction event, and may thus be used to recover from storage transaction failures (e.g., power failures, hardware failures, data corruption events, etc.) with a granularity that enables data recovery, storage command queue regeneration/synchronization, and/or storage system reconfiguration to occur, for example, at a time substantially coincident with the moment just prior to the occurrence of the storage transaction failure.
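By way of non-limiting illustration, the following Python sketch walks through the checkpointed copy-on-write sequence of blocks 3006-3018. The request, store, and log objects and their methods are hypothetical stand-ins for the current store 2944, the time store 2946, and the recorded transaction information 2926; error handling (block 3012) is omitted.

    def checkpointed_copy_on_write(request, current_store, time_store, transaction_log, now):
        # Blocks 3006-3008: read the original data and copy it into the time store.
        original = current_store.read(request.address, request.length)
        destination = time_store.allocate(request.length)
        time_store.write(destination, original)

        # Block 3010: record the transaction information for this write request.
        record = {
            "request_id": request.id,
            "current_store_location": (request.address, request.length),
            "time_store_location": destination,
            "received_at": now,
        }
        transaction_log.append(record)

        # Block 3014: index checkpoint, confirming the copy and the recording succeeded.
        record["index_checkpoint"] = True

        # Block 3016: write the payload data to the current store.
        current_store.write(request.address, request.payload)

        # Block 3018: release checkpoint, confirming the whole sequence completed.
        record["release_checkpoint"] = True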
[0373] The index and release checkpoints 2932, 2933 may be used to enhance the fault tolerance of a storage system, particularly with respect to hardware and/or power failures that may affect processor modules or other types of devices writing to and/or reading from storage units. For example, a fault-tolerant system that includes one primary processor module and one or more backup processor modules may benefit from the disclosed techniques in situations where the primary processor module fails and one of the backup processor modules assumes primary control over interactions affecting one or more storage units, by ensuring that the stored command/operation queues within the backup processor module are substantially equal to those of the primary processor module at a point in time just prior to, or coincident with, its failure. In this way, the standby processor module may assume its responsibilities without having to re-execute commands or repeat other operations that were already completed by the primary processor module prior to its failure and that therefore need not be passed to the standby processor module. The disclosed techniques may also be used to replicate a history of queued I/O requests and/or associated operations for analysis or other purposes.
[0374] In one illustrative embodiment and referring now to FIG. 39, a standby processor module (not shown) may include one or more request queues 3102, the request queues 3102 containing, for example, I/O requests 3104 that are received at a particular time and that target a particular address and/or address range 3106 of one or more logical storage units 3108. I/O requests 3104 in a particular request queue 3102' may, but need not, be organized to operate on data stored at addresses in a particular logical storage unit 3108', while I/O requests in other request queues 3102'' may be organized to operate on data stored at addresses in a different logical storage unit 3108''. The standby processor module may also include one or more operation type queues 3110, which may, for example, contain operations associated with the I/O requests 3104 in one or more of the request queues 3102. Each operation queue 3110 may, but need not, contain only operations of a particular type. A first illustrative operation queue 3110i may contain a plurality of operations, where one or more of such operations are associated with an I/O request 3104'' (corresponding to, for example, a write request) and include reading original data from a first address range 3106'' of a logical storage unit 3108' associated with the current store 2944 (FIG. 37). A second illustrative operation queue 3110ii may contain a plurality of operations, where one or more of such operations are associated with the I/O request 3104'' and include copying the original data from the first address range 3106'' of the current store 2944 to a location in the time store 2946. A third illustrative operation queue 3110iii may contain a plurality of operations, where one or more of such operations are associated with an I/O request 3104' and include logging transaction information 2926 (FIG. 37). A fourth illustrative operation queue 3110iv may contain a plurality of operations, where one or more of such operations are associated with the I/O request 3104' and include generating an index checkpoint 2932. A fifth illustrative operation queue 3110v contains a plurality of operations, where one or more of such operations are associated with the I/O request 3104' and include writing payload data to an address range 3106' of the logical storage unit 3108'. A sixth illustrative operation queue 3110vi contains a plurality of operations, where one or more of such operations are associated with the I/O request 3104' and include confirming that the payload data was successfully written to the address range 3106' of the logical storage unit 3108'. A seventh illustrative operation queue 3110vii may contain a plurality of operations, where one or more of such operations are associated with the I/O request 3104' and include generating a release checkpoint 2933.
[0375] In one illustrative recovery process that uses the index checkpoint 2932 and/or the release checkpoint 2933 to recover from a hardware/power failure, and referring now to FIGS. 39 and 40, the storage management device 2938, a storage system administrator, and/or another type of entity assigned the task of monitoring and/or recovering from such failures may detect error messages and/or other types of error flags that indicate a hardware failure and/or a power failure. To ensure that the contents of the request queues 3102 and the operation queues 3110 of the standby processor module coincide with the contents of the corresponding queues of the primary processor module at the moment of its failure, the storage management device 2938 may evaluate each I/O request 3104 in its request queues 3102, based at least in part on the corresponding index and/or release checkpoints 2932, 2933, to determine whether such I/O request 3104 was fulfilled or partially fulfilled by the primary processor module prior to its failure. Upon making such a determination, the storage management device 2938 may modify the request queues 3102 and/or the operation queues 3110 of the standby processor module to substantially conform to the I/O requests and associated operations queued at the primary processor module prior to its failure.
[0376] For example, the storage management device 2938 may search the standby processor module's request queue 3102 to identify one or more I/O requests 3104 that had been queued prior to the failure of the primary processor module (3202). For each identified I/O request, the storage management device 2938 may determine whether an associated index checkpoint 2932 exists by, for example, searching for such an index checkpoint 2932 in a data structure, file, database, and/or other type of data repository communicatively coupled to the storage management device 2938 (3204). In one embodiment, the checkpoint is recorded, along with other information about the write request, in a database that stores the location of the overwritten data and the other information described above.
[0377] If the associated index checkpoint 2932 is not located (indicating, in the case of a copy-on-write request, that the original data was not successfully copied from the current store 2944 to a location within the time store 2946), the storage management device 2938 may queue the full set of operations associated with the I/O request 3104 within one or more operation queues 3110 of the standby processor module for subsequent execution (3206). Otherwise, the storage management device 2938 may determine whether an associated release checkpoint 2933 exists by, for example, searching for such a release checkpoint 2933 in the data repository described above (3208). If an associated release checkpoint 2933 is not located, the storage management device 2938 may queue a subset of the operations associated with the I/O request 3104 within one or more of the operation queues 3110 of the standby processor module (3210). For example, where the I/O request corresponds to a copy-on-write operation sequence, the subset of queued operations may include writing the payload data specified by the I/O request to a particular location within the logical storage unit 3108, confirming that the payload data was successfully written, and/or generating a release checkpoint associated with such request. Otherwise, if the associated release checkpoint 2933 is located (indicating that the primary processor module fully fulfilled the I/O request prior to its failure), the storage management device 2938 may remove the operations associated with such I/O request from the operation queues 3110 of the standby processor module (3212).
[0378] The above method may be repeated for each I/O request 3104 in the standby processor module's request queue 3102, thus making the standby processor module's queues consistent with the corresponding queues of the now-failed primary processor module. In this manner, the request and operation queues 3102, 3110 of the standby processor module are cleared of stale requests and operations, thereby minimizing, and perhaps entirely eliminating, the unnecessary and/or otherwise undesirable operations that would otherwise need to be performed due to inconsistencies between the queues of the primary and standby processor modules in the event of a hardware/power failure. Once the queues 3102, 3110 of the standby processor module have been cleared of unneeded operations and requests and/or loaded with needed operations, as discussed above, the remaining operation sequences in such operation queues 3110 may be performed according to the sequence of I/O requests in the request queue 3102. At this point, the hardware/power failure recovery work has been completed and the standby processor module may resume normal queuing operations.
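By way of non-limiting illustration, the following Python sketch captures the decision made for each queued I/O request in blocks 3202-3212. The queue and log objects and their methods are hypothetical; the checkpoint fields correspond to the index checkpoint 2932 and release checkpoint 2933 discussed above.

    def reconcile_standby_queues(request_queue, operation_queues, transaction_log):
        for request in list(request_queue):
            record = transaction_log.find(request.id)
            if record is None or not record.get("index_checkpoint"):
                # No index checkpoint: the original data may not have been preserved,
                # so queue the full copy-on-write operation sequence (block 3206).
                operation_queues.enqueue_full_sequence(request)
            elif not record.get("release_checkpoint"):
                # Index checkpoint but no release checkpoint: only the payload write
                # and its confirmation remain to be done (block 3210).
                operation_queues.enqueue_payload_write(request)
            else:
                # Both checkpoints present: the primary completed this request before
                # failing, so remove any stale operations for it (block 3212).
                operation_queues.remove(request)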
[0379] Those skilled in the art will recognize that the above-described methods are merely illustrative, and that a variety of similar methods can be performed to produce substantially the same results. For example, the presence of the associated release checkpoint 2933 may be determined prior to determining the presence of the associated index checkpoint 2932.
[0380] Write request recording to enable map generation
[0381] In general, in another aspect, the invention relates to methods and apparatus for recording write requests directed to a data store having a current store and a time store associated therewith, and to methods and apparatus for enabling generation of at least a portion of a time map of at least a portion of the data store (e.g., of the current store or some sub-portion thereof) for a past time. As described above, the time map is a map generated at the present time that identifies the current locations of the data that was stored in at least a portion of the data store at a specified past point in time.
[0382] As also described above, in one embodiment, a time map is generated by a computing device (e.g., a storage management device as described above) when, for example, a user requests, at the present time, an image of at least a portion of a data store (e.g., of the current store or some sub-portion thereof) as it existed at a past time. By generating the time map, the computing device avoids searching through the entire index for the location of old data on every request for data covered by the map or a portion thereof. Instead, by referencing the time map, the computing device can quickly and efficiently determine the locations of data stored in at least a portion of the data store at the past time and, therefore, quickly and efficiently respond to user requests. Thus, system efficiency is improved and user satisfaction is increased.
[0383] While generating the time map increases the speed at which data stored in at least a portion of the data store at a past time may be accessed, the present aspect of the invention relates to methods and apparatus for recording write requests directed to the data store in a way that increases the speed at which the time map itself may be generated. In addition, the present aspect of the invention facilitates rapid presentation of data stored in the data store at a past time, even while the time map is still being generated.
[0384] In one embodiment, the computing device begins generating the time map upon receiving a request for a prior image. If, before the time map is completed, the user requests data covered by a portion of the image and the location of the data has not been added to the time map, then the system can search for the data quickly enough to provide a reasonable response time, even though the response will not be as fast as if the time map were completed and used. As described herein, instead of searching through the entire index for locations of past data, only one or more portions of the index need be searched in response to a user request for data covered by the portion of the image. The work done in generating the response (e.g., determining the location of the data) may also be stored in a time map, thereby increasing the overall efficiency of the system.
[0385] Thus, in one embodiment, the time map is generated upon receipt of a request to create a prior image, e.g., as a background process. If a request for data is directed to the prior image while the location of the requested data is not yet indicated by the time map (e.g., the time map has not yet been fully generated), the techniques described herein may be used to identify the location of the requested data and respond to the user's request for the data. The time map is then updated with the location of the requested data.
[0386] In brief overview, in one embodiment of this aspect of the invention, a first computing device (e.g., a storage management device as described above) receives a plurality of write requests from a second computing device (e.g., a host as described above). The first computing device stores a record of these write requests. In one embodiment, at least a first database table and a second database table are used to record information about the write requests and to track any changes caused by the write requests on the data store. More specifically, for each write request received, the first computing device records a write request entry in a first database table. The write request entry contains information about the received write request. Further, each time a write request entry is recorded in a first database table, the first computing device updates a record in the second database table, if necessary. The data contained in the records of the second database table represents, in aggregated form, the write requests directed to the data store. In one embodiment, for example, the data contained in the records of the second database table specifies the particular locations in the data store that were overwritten as a result of the write requests being performed.
[0387] According to one feature of this aspect of the invention, the first computing device is able to quickly and efficiently interpret the data stored in the records of the second database table to determine which particular storage units have been overwritten. Further, in one embodiment, given a particular point in time in the past and tasked with generating the time map, the first computing device can interpret the data stored in the records of the second database table to identify a subset of the plurality of first database tables to search for write request entries relevant to the generation of the time map. In other words, in one embodiment, the invention does not require the first computing device to search through all of the first database tables and through all of the write request entries when generating the time map. Thus, overall efficiency is improved and rapid generation of the time map is made possible.
[0388] Additionally, in another embodiment, if, before the time map is completed, a user requests data stored in the data store at a past time whose current location is not yet indicated by the time map (but would be indicated once the time map is completed), then the first computing device can still quickly and efficiently identify the location of the data, without searching all of the first database tables, and can respond to the user. In addition, the work done in generating the response may be used to complete the time map.
[0389] FIG. 41 illustrates one embodiment of a storage management device 3338 that records write requests directed to a data store and enables generation of at least a portion of a time map of at least a portion of the data store (e.g., of the current store of the data store or some sub-portion thereof) for a past time. In general, the storage management device 3338 may have the capabilities of, and may be implemented as, the storage management devices described above, with the additional functionality described herein. It should be appreciated that other implementations are possible.
[0390] In one embodiment, the storage management device 3338 uses at least one first database table 3350, but typically a plurality of first database tables 3350, for recording a plurality of write request entries. The storage management device 3338 also uses a second database table 3352, which includes at least one record for each first database table 3350 used by the storage management device 3338. In addition, the storage management device 3338 includes an update module 3354 operable to update at least one record in the second database table 3352 each time a write request entry is recorded in a first database table. As previously described, the storage management device 3338 also manages at least one data store 3343 with which a current store 3344 and a time store 3346 are associated.
[0391] Optionally, the storage management device 3338 may also include an identification module 3356, a search module 3358, a time map generation module 3360, and an I/O module 3362. In response to a request for data stored in at least a portion of the data store 3343 at a past time (e.g., in the current store 3344 or some sub-portion thereof), the storage management device 3338 may use the identification module 3356 to interpret one or more records in the second database table 3352 and thereby identify one or more first database tables 3350 to search for relevant write request entries. The storage management device 3338 may then perform such a search using the search module 3358 and, when the relevant write request entries have been found, may use the time map generation module 3360 to generate at least a portion of a time map of at least a portion of the data store for the past time. In addition, the storage management device 3338 may use the I/O module 3362 to respond to read requests for data stored at a past time in at least one designated storage unit located within the data store 3343.
[0392] The first and second database tables 3350 and 3352 may be implemented in any form, method, or manner useful for recording write request entries and records, respectively. In one embodiment, for example, the first database table 3350 and/or the second database table 3352 are implemented as spreadsheets. Alternatively, the first database table 3350 and/or the second database table 3352 may be implemented as separate files, bitmaps, arrays, trees, indexes, accounts, or any other manner useful for organizing data in text or table format.
[0393] For their part, the update module 3354, the identification module 3356, the search module 3358, the time map generation module 3360, and the I/O module 3362 may be implemented in any form, way, or manner capable of performing the functions described below. For example, the update module 3354, the identification module 3356, the search module 3358, the time map generation module 3360, and/or the I/O module 3362 may be implemented as software modules or programs running on a microprocessor and/or as hardware devices such as, for example, an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
[0394] The data store 3343 may have the capabilities of the data stores described above, and may be implemented with the current store and time store described above, with the additional functionality described herein. For example, data associated with one or both of the current store 3344 and the time store 3346 may be stored in the memory or in the physical storage (not shown) of the storage management device 3338, and such data may be stored directly, virtualized, and so on.
[0395] Typically, the storage management device 3338 receives a plurality of write requests from one or more other computing devices, such as, for example, the hosts described above. The write requests are directed to the data store 3343. In a particular embodiment, the write requests are directed to the current store 3344 of the data store 3343. In one such embodiment, each time the storage management device 3338 receives a request to write new data to one or more specified blocks of the current store 3344, the storage management device 3338 performs a copy-on-write operation as previously described. In other words, the storage management device 3338 copies the existing data stored in the specified blocks of the current store 3344, writes the existing data to another location, such as within the time store 3346, and then writes the new data to the specified blocks of the current store 3344. As part of this copy-on-write operation, information about the write request, including the new location of the overwritten data, may be recorded in a first database table 3350. The second database table 3352 is then updated to reflect the execution of the write request and the recording of the associated write request information in the first database table 3350.
[0396] Referring now to FIG. 42, in a brief overview of one embodiment of a method 3400, for example using the example storage management device of FIG. 41, for recording write requests directed to a data store, the storage management device 3338 records a write request entry in at least one first database table 3350 after each write request is performed (e.g., after each copy-on-write operation described above) (step 3404). The storage management device 3338 also maintains at least one record in the second database table 3352 for each first database table 3350 (step 3408) and updates at least one record in the second database table 3352 each time a write request entry is recorded in the first database table 3350 (step 3412), such as by using the update module 3354.
[0397] In one embodiment, when constructing a time map or otherwise determining the location of data stored in a particular storage unit, and typically at a later time than steps 3404, 3408 and 3412, the storage management device 3338 uses the identification module 3356 to interpret one or more records in the second database table 3352 to identify at least one first database table 3350 to be searched (step 3416), and uses the search module 3358 to search the at least one identified first database table 3350 (step 3420). The storage management device 3338 then uses the time map generation module 3360 to generate at least a portion of a time map of at least a portion of the data store 3343 at a past time (e.g., of the current store 3344 or some sub-portion thereof) (step 3424) and/or uses the I/O module 3362 to respond to requests for data stored at a past time in at least one designated storage unit located within at least a portion of the data store.
[0398] In more detail, and referring now to fig. 42 and 43, in one embodiment, after the storage management device 3338 receives a write request directed to the data store 3343, the storage management device 3338 records a write request entry 3504 in a first database table 3350 at step 3404. Each write request entry 3504 includes information about a write request. For example, the write request entry 3504 may include an identification of at least one unit of storage located within the data store 3343 (e.g., within the current store 3344) to which the write request is directed and/or a time at which the write request was received by the storage management device 3338.
[0399] In one embodiment, each received write request causes the execution of a copy-on-write operation as described above. In such an embodiment, each write request causes previous data previously stored in at least one storage location located within data store 3343 (e.g., within current store 3344) to be copied to a new location, such as within time store 3346 of data store 3343. The data contained in the write request is then written to at least one storage location located within the data store 3343 (e.g., within the current store 3344) from which the previous data was copied. Thus, the write request entry 3504 may also include a new location (e.g., a location within the time store 3346) to which previous data was copied.
[0400] As illustrated in FIG. 43, when the storage management device 3338 receives more than one write request directed to the data store 3343, the storage management device 3338 records a plurality of write request entries 3504 in a first database table 3350. In one embodiment, the storage management device 3338 records all of the write request entries 3504 in a single first database table 3350, e.g., the first database table 3350₁, until a maximum number of write request entries 3504 is reached. Typically, the maximum number of write request entries 3504 for the first database table 3350₁ is set for efficiency, or is based on the storage capacity allocated to the first database table 3350₁. Once the number of write request entries 3504 in the first database table 3350₁ reaches the maximum, the storage management device 3338 employs a new first database table 3350₂ and records a write request entry 3504 therein whenever a write request is received. Again, when the write request entries 3504 recorded in the first database table 3350₂ reach the maximum, the storage management device 3338 employs a new first database table 3350₃ (not shown), and so on.
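A minimal Python sketch of this rollover behavior follows; the in-memory list representation and the maximum entry count are illustrative assumptions rather than the patented implementation.

    MAX_ENTRIES = 4096    # assumed per-table limit, set for efficiency or allocated capacity

    class WriteRequestLog:
        def __init__(self):
            self.first_tables = [[]]            # list of "first database tables"

        def record(self, write_request_entry):
            if len(self.first_tables[-1]) >= MAX_ENTRIES:
                self.first_tables.append([])    # employ a new first database table
            self.first_tables[-1].append(write_request_entry)
            return len(self.first_tables) - 1   # index of the table that received the entry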
[0401] At step 3408, the storage management device 3338 maintains at least one record 3508 in the second database table 3352 for each first database table 3350. Referring to FIG. 43, at least a portion of the data store 3343 (e.g., the current store 3344 of the data store 3343 or some sub-portion thereof) may be conceptually organized by the storage management device 3338 into a number m of "buckets," where m > 1, and each of the m buckets relates to a fixed number of storage units located within the at least a portion of the data store 3343. In one such embodiment, for each first database table 3350, the storage management device 3338 maintains a record 3508 in the second database table 3352, as illustrated, for each of the m buckets. Alternatively, in another embodiment, the storage management device 3338 does not divide at least a portion of the data store 3343 into buckets. In such an embodiment (not shown), the storage management device 3338 maintains a single record 3508 in the second database table 3352 for each of the first database tables 3350.
[0402] Still referring to FIG. 43, each record 3508 includes a plurality of bit entries, and each bit entry is either set (i.e., "1") or reset (i.e., "0"). Further, in one embodiment, as illustrated by the vertical alignment in FIG. 43 for purposes of explanation, each bit entry in a record 3508 corresponds to at least one unit of storage located within at least a portion of the data store 3343.
[0403] Initially, in one embodiment, when a first database table 3350 is empty (i.e., when no write request entries 3504 have been recorded in that first database table 3350), all bit entries in each record 3508 associated with that first database table 3350 are reset (i.e., "0"). Thereafter, each time the storage management device 3338 records a write request entry 3504 in that first database table 3350, the storage management device 3338 updates, at step 3412, at least one record 3508 (associated with that first database table 3350) in the second database table 3352. In one embodiment, the storage management device 3338, using the update module 3354, updates the at least one record 3508 by setting each reset bit entry in the at least one record 3508 that corresponds to a unit of storage, located within the at least a portion of the data store 3343, that is overwritten by the write request associated with that write request entry. Thus, each bit entry that is set (i.e., "1") in a record 3508 associated with a first database table 3350 indicates that at least one unit of storage located within at least a portion of the data store 3343 corresponding to the bit entry has been overwritten at least once during the formation of that first database table 3350. On the other hand, each bit entry that is reset (i.e., "0") in a record 3508 associated with a first database table 3350 indicates that the at least one unit of storage located within at least a portion of the data store 3343 corresponding to the bit entry has not been overwritten during the formation of that first database table 3350. In this manner, the data (i.e., the bit entries) of one or more records 3508 in the second database table 3352 represent the effect of the write requests on the state of at least a portion of the data store 3343 (i.e., the data identify the at least one unit of storage within the at least a portion of the data store 3343 that was overwritten by a write request).
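The following Python sketch shows how such a per-bucket record might be updated each time a write request entry is recorded. The bucket size and the use of an integer bitmask keyed by (table, bucket) are illustrative assumptions.

    UNITS_PER_BUCKET = 5    # matches the five bit entries shown in FIG. 43 for explanation

    def update_second_table_record(records, table_index, unit_index):
        """Set the bit for the overwritten storage unit in the record for (table, bucket)."""
        bucket, offset = divmod(unit_index, UNITS_PER_BUCKET)
        key = (table_index, bucket)
        records[key] = records.get(key, 0) | (1 << offset)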
[0404] Those skilled in the art will recognize that the five bit entries of each record 3508 illustrated in FIG. 43 are merely illustrative and are used for the purpose of explaining the present aspect of the invention. In practice, each record 3508 can include, for example, one or more bytes of bit entries or one or more words (of any length) of bit entries. Further, while the data of each record 3508 is illustrated in FIG. 43 in a binary representation, each record 3508 could alternatively store its data in a decimal, hexadecimal, or other representation. Further, each record 3508 can include, in addition to the bit entries representing the effect of the write requests on the state of at least a portion of the data store 3343, an identifier identifying the first database table 3350 with which that record 3508 is associated.
[0405] As just described, after storing and indexing the data, for example using the database tables 3350, 3352 as above, the storage management device 3338 may efficiently determine whether the write request entries 3504 of a first database table 3350 are associated with writes to particular units of storage in the data store 3343. Thus, in response to a request, e.g., from a user, for data stored in at least a portion of the data store 3343 at a past time (e.g., in the current store 3344 or some sub-portion thereof), the identification module 3356 of the storage management device 3338 first identifies, at step 3416, at least one first database table 3350 to search for relevant write request entries 3504. In one embodiment, to identify which first database table(s) 3350 to search, the identification module 3356 of the storage management device 3338 determines which storage units located within at least a portion of the data store 3343 have been overwritten. In one such embodiment, the identification module 3356 of the storage management device 3338 determines, for each unit of storage located within at least a portion of the data store 3343 (the unit of storage having a corresponding bit entry), whether at least one record 3508 in the second database table 3352 has a bit entry for that unit of storage that is set (i.e., "1").
[0406] More specifically, in one embodiment, for each particular unit of storage within at least a portion of the data store 3343, the identification module 3356 of the storage management device 3338 performs a Boolean logical OR operation on the bit entries in each record 3508 corresponding to that particular unit of storage. For ease of explanation, and still referring to FIG. 43, when the storage management device 3338 has employed more than one first database table 3350, this may be visualized as performing a Boolean logical OR operation on the columns of data in the vertically aligned records 3508. If the Boolean logical OR operation returns a "1" for a particular column, then the particular unit of storage corresponding to that column has been overwritten and there are one or more write request entries 3504 in the at least one first database table 3350 that are associated with one or more write requests directed to that particular unit of storage. Otherwise, if the Boolean logical OR operation returns a "0" for a particular column, then the particular unit of storage corresponding to that column has not been overwritten at any time covered by the records 3508 in the second database table 3352.
[0407] For example, taking the exemplary data in the three records for bucket 1 (i.e., records 1,1; 2,1; and n,1) of at least a portion of the data store 3343 illustrated in FIG. 43, the Boolean logical OR operation previously described (i.e., 10010 OR 10010 OR 01010) is performed on the vertically aligned bit entries of these records, thereby generating the result 11010. The result indicates that the first, second, and fourth units of storage located within at least a portion of the data store 3343 represented in FIG. 43 have been overwritten at some point in time, and that for each of these units of storage, at least one record 3508 for bucket 1 has a bit entry corresponding to that unit of storage that is set (i.e., "1"). The result also indicates that the third and fifth units of storage represented in FIG. 43, located within that portion of the data store 3343, were not overwritten at any point in time covered by the data, and that for each of these units of storage, every bit entry in the records 3508 for bucket 1 corresponding to that unit of storage is reset (i.e., "0").
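The Boolean OR of the example records can be reproduced with a few lines of Python; the bit strings below are read left to right, matching the layout of FIG. 43.

    # Records 1,1; 2,1; and n,1 for bucket 1, written as bit strings read left to right.
    records_for_bucket_1 = [0b10010, 0b10010, 0b01010]

    overwritten = 0
    for record in records_for_bucket_1:
        overwritten |= record

    assert overwritten == 0b11010    # first, second, and fourth storage units were overwritten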
[0408] The identification module 3356 of the storage management device 3338 identifies those one or more records 3508 upon determining, for a particular unit of storage located within at least a portion of the data store 3343, that at least one record 3508 has a bit entry for the particular unit of storage that is set (i.e., "1"), the record 3508 having the bit entry for the particular unit of storage that is set. The identification module 3356 then also identifies one or more first database tables 3350 for which those identified records 3508 are maintained. In one embodiment, to accomplish these steps, the identification module 3356 of the storage management device 3338 first scans only the relevant bit entries to determine which has been set to "1". Returning to our example, which includes the three records 3508 (i.e., record 1, record 2, 1, and record n, 1) of bucket 1 of at least a portion of the data store 3343 illustrated in FIG. 43, the identification module 3356 of the storage management device 3338 scans the bit entries of those records corresponding to first, second, and fourth units of storage that are located within at least a portion of the data store 3343 illustrated in FIG. 43. However, the identification module 3356 of the storage management device 3338 does not need to, and does not scan, the bit entries of these records corresponding to the third and fifth locations within at least a portion of the data store 3343 illustrated in FIG. 43, since the identification module 3356 knows that they are all reset (i.e., "0") due to the execution of the Boolean logic OR operation described previously.
[0409] With those bit entries of the records 3508 so scanned, the identification module 3356 of the storage management device 3338 would then, in accordance with the present invention, identify the following first database tables 3350 to search for the write request entries 3504 related to write requests directed to the first, second, and fourth units of storage located within at least a portion of the data store 3343 illustrated in FIG. 43:

    Storage unit     First database tables to be searched
    First            3350₁, 3350₂
    Second           3350ₙ
    Fourth           3350₁, 3350₂, 3350ₙ
[0410] After the identification module 3356 has identified, at step 3416, the one or more first database tables 3350 to be searched, the search module 3358 of the storage management device 3338 searches those identified first database tables 3350 (step 3420). It should be recalled here that the storage management device 3338 will have received a request, e.g., from a user, for data that was stored in at least a portion of the data store 3343 (e.g., in the current store 3344 or in some sub-portion thereof) at a past time. Thus, in one embodiment, for each at least one unit of storage located within at least a portion of the data store 3343 that has a corresponding set bit entry in a record 3508 (e.g., returning to our example above, for each of the first, second, and fourth units of storage in at least a portion of the data store 3343 illustrated in FIG. 43), the search module 3358 of the storage management device 3338 performs the following steps. First, the search module 3358 searches the write request entries 3504 of the first database tables 3350 identified by the identification module 3356, as described above. The search module 3358 then determines from those write request entries 3504 a first time after the past time at which the previous data stored in the at least one unit of storage was copied to a new location (such as within the time store 3346 of the data store 3343), as a result of the copy-on-write operation described previously, and was overwritten at the at least one unit of storage. After determining that first time, the search module 3358 then determines from the write request entries 3504 the new location, such as a new location within the time store 3346, to which the previous data was copied at that first time. The previous data is currently stored at that new location. The new location is used to generate at least a portion of a time map of at least a portion of the data store 3343 for the past time and/or to respond to a user read request for data stored in at least a portion of the data store 3343 at the past time, each as described below.
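By way of non-limiting illustration, the following Python sketch expresses this per-unit search over the write request entries of the identified tables; the entry field names and their layout are hypothetical.

    def find_past_data_location(write_request_entries, unit, past_time):
        """Return where the data stored in `unit` at `past_time` now lives, or None if
        the unit was not overwritten after `past_time` (data is still in the current store)."""
        later = [e for e in write_request_entries
                 if e["unit"] == unit and e["time"] > past_time]
        if not later:
            return None
        first_overwrite = min(later, key=lambda e: e["time"])
        return first_overwrite["time_store_location"]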
[0411] Of course, in some embodiments, even if a unit of storage located within at least a portion of the data store 3343 has a corresponding set bit entry in a record 3508, the search module 3358 will not be able to determine a first time after the past time at which previous data stored at that unit of storage was copied to a new location (e.g., within the time store 3346) and overwritten at that unit of storage. One example of a situation in which the search module 3358 will not be able to make this determination is where a unit of storage located within at least a portion of the data store 3343 was overwritten only at a time before the past time, rather than after the past time. In this case, the data stored in that unit of storage at the past time will not have been copied to a new location, but will instead still be stored in that unit of storage at the present time.
[0412] As explained above, in the event that the Boolean logical OR operation returns a "0" for a particular column of vertically aligned bit entries in FIG. 43, the particular unit of storage in the at least a portion of the data store 3343 corresponding to that column was not overwritten at any time covered by the records in the second database table 3352. Thus, in this case, the data stored in that particular unit of storage at the past time would likewise not have been copied to a new location, but would instead still be stored in that unit of storage at the present time.
[0413] In one embodiment, after the search module 3358 has identified, for each storage unit located within at least a portion of the data store 3343, the location at which the data stored at that storage unit at the past time is currently stored (whether it is still in that storage unit or whether it is located in a new location, such as within the time store 3346, as explained), the time map generation module 3360 of the storage management device 3338 generates at least a portion of the time map of at least a portion of the data store 3343 at the past time at step 3424. In one embodiment, the time map generation module 3360 generates the time map by mapping each storage unit located within at least a portion of the data store 3343 to a location at which data stored in the storage unit at a past time is currently stored. The mapping may be, for example, as simple as recording in a database, for each storage unit located within at least a portion of the data store 3343, an identification of where data stored in the storage unit at a past time is currently stored.
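A short Python sketch of this mapping step follows, reusing the find_past_data_location helper sketched above; a None result means the unit was never overwritten after the past time, so the data is read from the current store itself.

    def build_time_map(storage_units, write_request_entries, past_time):
        time_map = {}
        for unit in storage_units:
            location = find_past_data_location(write_request_entries, unit, past_time)
            # Fall back to the current store when the unit was not overwritten after past_time.
            time_map[unit] = location if location is not None else ("current_store", unit)
        return time_map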
[0414] In another embodiment, the storage management device 3338 receives a read request from a host, such as described above, for data stored at a past time in at least one specified unit of storage within at least a portion of the data store 3343. In one embodiment, the read request is received after the time map generation module 3360 of the storage management device 3338 has begun generating a time map for the same past time, but before it has completed the time map. In this case, if the portion of the time map completed so far covers the at least one unit of storage specified in the read request, then the I/O module 3362 of the storage management device 3338 determines the location of the data from the time map at step 3428 (which, as explained, may be the specified unit of storage within at least a portion of the data store 3343 if the requested data was not overwritten, or may be a new location within, for example, the time store 3346 if the requested data has been overwritten). Alternatively, if, in this case, the portion of the time map completed so far does not cover the at least one unit of storage specified in the read request, or if, in other embodiments, the storage management device 3338 is, for example, not configured to generate a time map, or only generates or begins to generate a time map for a past time that is different from the past time specified in the read request, then the storage management device 3338 performs steps 3416 and 3420 of the method 3400 described above. However, in so performing steps 3416 and 3420 of the method 3400, the storage management device 3338 need not perform the previously described operations for each storage unit located within at least a portion of the data store 3343. Rather, the storage management device 3338 need only perform the previously described operations of steps 3416 and 3420 of the method 3400 for each unit of storage specified in the read request. In other words, the storage management device 3338 need only determine the new location(s) to which the data previously stored in each unit of storage specified in the read request was copied and is now located.
[0415] After determining the new location, the I/O module 3362 of the storage management device 3338, in response to the read request, reads the data from the new location and sends it to the requestor, e.g., a host as described above, at step 3428. Further, in the case where the time map generation module 3360 of the storage management device 3338 has begun generating a time map for a past time that is the same as the past time specified in the read request, but the time map has not yet been completed when the read request is received, and where the portion of the time map completed so far does not cover the at least one unit of storage specified in the read request, the work that the storage management device 3338 completed in generating a response to the read request (i.e., performing steps 3416 and 3420 of the method 3400 to determine the new location(s) to which the data previously stored in each unit of storage specified in the read request was copied and is now located) may be used by the time map generation module 3360 of the storage management device 3338 to complete the time map.
[0416] One skilled in the art will recognize that the implementation of the method 3400 described above may be changed or modified in various ways while still employing the principles described, without affecting the results of the method. For example, in one embodiment, each bit entry in a record 3508 that is set may be represented by a "0" as opposed to a "1", while each bit entry that is reset may be represented by a "1" as opposed to a "0". In such an embodiment, in determining whether at least one record 3508 has a set bit entry for a particular unit of storage located within at least a portion of the data store 3343, the identification module 3356 performs a Boolean logical AND operation, as opposed to the Boolean logical OR operation described above, on the bit entries of each record 3508 corresponding to that particular unit of storage. In this case, if the Boolean logical AND operation returns a "0" for a particular column, then the particular unit of storage corresponding to that column has been overwritten and there are one or more write request entries 3504 in the at least one first database table 3350 that are associated with one or more write requests directed to that particular unit of storage. Otherwise, if the Boolean logical AND operation returns a "1" for a particular column, then the particular unit of storage corresponding to that column has not been overwritten at any time covered by the records 3508 in the second database table 3352. Further, as another example, each bit entry may be used to represent any number of storage units, so long as the convention is applied consistently when the data is written and read.
[0417] The invention may be provided as one or more modules of one or more computer-readable programs embodied on or in one or more articles of manufacture. The article of manufacture may be, by way of non-limiting example, a floppy disk, a hard disk, a CD-ROM (compact disc read-only memory), a flash memory card, a PROM (programmable read-only memory), a RAM (random access memory), a ROM (read-only memory), or a magnetic tape. In general, the computer-readable programs may be implemented in any programming language. Some examples of languages that may be used include C, C++, and Java. The software programs may be stored on or in one or more articles of manufacture as object code.
[0418] Variations, modifications, and other implementations of what is described herein will occur to those of ordinary skill in the art without departing from the spirit and the scope of the invention as claimed. Accordingly, the invention is to be defined not by the preceding illustrative description but instead by the spirit and scope of the following claims.
[0419] The claims are as follows:

Claims (22)

1. A method of checkpointing a copy-on-write operation sequence, the method comprising:
(a) receiving a first write request identifying payload data to be written starting from a first address of a first data store;
(b) reading original data associated with the first address of the first data store;
(c) copying the original data to a second data store starting at a second address;
(d) recording transaction information associated with the first write request, the transaction information including a flag associated with the second address;
(e) generating a first checkpoint confirming that transaction information was successfully recorded and that the original data was successfully copied to the second data store;
(f) writing the payload data to the first data memory starting at the first address;
(g) in response to successfully writing the payload data to the first data store, confirming successful completion of the copy-on-write operation sequence; and
(h) generating a second checkpoint confirming successful completion of the copy-on-write operation sequence.
2. The method of claim 1, wherein the transaction information further comprises a flag associated with a time at which the first write request was received.
3. The method of claim 1, wherein recording the transaction information comprises storing a first address, a second address, and a time at which the first write request was received in at least one data structure.
4. The method of claim 1, further comprising:
based on at least one of the first and second checkpoints, a representation of at least one storage unit as it existed prior to a storage transaction failure is formed.
5. The method of claim 4, wherein the storage transaction failure corresponds to at least one of a hardware failure and a power failure.
6. The method of claim 1, further comprising:
removing information associated with the second write request from at least one queue of a processor module based on finding the second checkpoint in the transaction information.
7. The method of claim 1, further comprising:
loading information associated with the first write request into at least one queue of a processor module based on the first checkpoint being found in the transaction information.
8. A method of recovering from a failure associated with a copy-on-write sequence of operations, the method comprising:
identifying at least one write request queued prior to a failure, the write request corresponding to a plurality of operations in a copy-on-write operation sequence;
determining whether a first checkpoint is formed in response to completion of a first portion of the sequence of operations;
determining whether a second checkpoint is formed in response to completion of at least a second portion of the copy-on-write operation sequence; and
processing the queued write requests to recover, at least in part, from the failure based on at least one of the first and second checkpoints, the write request processing comprising:
queuing a plurality of operations in the sequence of copy-on-write operations for execution when the first checkpoint cannot be located,
queuing a subset of the plurality of operations for execution when the first checkpoint is located without locating the second checkpoint, the subset of queued operations including operations associated with the second portion of the copy-on-write operation sequence instead of the first portion, and
removing the plurality of operations in the copy-on-write operation sequence from at least one operation queue when the first and second checkpoints are located.
9. The method of claim 8, wherein the failure corresponds to a hardware failure associated with a processor module.
10. The method of claim 8, wherein the first portion of the copy-on-write operation sequence comprises at least one operation associated with copying original data stored at a location within a first data store to a second data store, the location within the first data store being identified by the queued write requests.
11. The method of claim 8, wherein the second portion of the copy-on-write operation sequence comprises at least one operation associated with writing payload data identified by the queued write request to a location within the first data store.
12. A method of recovering from a storage transaction failure, the method comprising:
receiving a write request identifying payload data to be written starting from a first address of a first data store;
copying original data associated with the first address of the first data store to a second data store beginning at a second address;
recording transaction information associated with the write request, the transaction information including flags associated with the first and second addresses and a time at which the write request was received;
generating a first indicator confirming the recording of the transaction information;
writing the payload data to the first data memory starting at the first address; and
in response to a storage transaction failure, using the first indicator and at least some of the transaction information to recover at least partially therefrom.
13. The method of claim 12, further comprising: generating a second indicator confirming that the payload data is written to the first data store.
14. The method of claim 12, wherein the first indicator further confirms that the original data associated with the first address of the first data store was successfully copied to the second data store.
15. The method of claim 12, further comprising: forming, based on at least some of the copied original data in the second data store, a representation of at least one memory location as it existed prior to the storage transaction failure.
16. The method of claim 12, wherein the storage transaction failure corresponds to at least one of a hardware failure and a power failure of a primary processor module.
17. The method of claim 16, further comprising:
generating a second indicator confirming that the payload data is written to the first data store; and
removing, based on the second indicator being found in the transaction information, information associated with the second write request from at least one queue of a standby processor module, wherein the standby processor module assumes the role of the primary processor module in response to the storage transaction failure.
18. The method of claim 17, further comprising: loading information associated with a third write request into the at least one queue of the standby processor module based on the first indicator being found in the transaction information.
19. A method of recovering from a storage transaction failure, the method comprising:
detecting a storage transaction failure occurring at time T1;
identifying a first indicator associated with a first write request, the first write request received prior to time T1, the first indicator confirming that data originally stored starting at a first address specified by the first write request has been copied to a second address; and
forming, based on at least some of the copied data, a representation of at least one memory location as it existed prior to the storage transaction failure.
20. The method of claim 19, wherein the storage transaction failure corresponds to at least one of a hardware failure and a power failure of a primary processor module.
21. The method of claim 20, further comprising:
identifying a second indicator confirming that payload data specified by the first write request was written to a data store beginning at the first address;
removing, based on the second indicator, information associated with the second write request from at least one queue of a standby processor module, wherein the standby processor module assumes the role of the primary processor module in response to the storage transaction failure.
22. The method of claim 21, further comprising:
loading information associated with the first write request into at least one queue of the standby processor module based on the first indicator.
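
The independent claims above (claims 1, 8, 12, and 19) walk through the same two-checkpoint copy-on-write flow: record transaction information, confirm the copy of the original data with a first checkpoint or indicator, confirm the payload write with a second, and at recovery decide per queued request whether to redo everything, redo only the payload write, or discard the entry. The Python sketch below is a minimal illustration of that flow, not the patented implementation; every name in it (WriteRequest, TimeStore, TransactionLog, ProcessorModule, handle_write, recover) is a hypothetical stand-in introduced here.

# A minimal sketch, assuming hypothetical names throughout; it only
# illustrates the checkpointed copy-on-write and recovery flow that the
# claims above describe, not the patented implementation.
from dataclasses import dataclass, field
import time


@dataclass
class WriteRequest:
    payload: bytes
    first_address: int           # target location in the first (primary) data store
    second_address: int = -1     # where the original data was copied in the second store
    received_at: float = field(default_factory=time.time)
    cow_done: bool = False       # first checkpoint: original data copied to the second store
    write_done: bool = False     # second checkpoint: payload written to the first store


class TimeStore:
    """In-memory stand-in for the second data store that holds copied originals."""

    def __init__(self):
        self.data = bytearray()

    def allocate(self, size: int) -> int:
        addr = len(self.data)
        self.data.extend(b"\x00" * size)
        return addr

    def write(self, addr: int, chunk: bytes) -> None:
        self.data[addr:addr + len(chunk)] = chunk


class TransactionLog:
    """Stand-in for durable transaction information (addresses, time, checkpoints)."""

    def __init__(self):
        self.records = {}

    def record(self, req: WriteRequest) -> None:
        self.records[id(req)] = req

    def pending_requests(self):
        return list(self.records.values())


class ProcessorModule:
    def __init__(self, primary: bytearray, time_store: TimeStore, log: TransactionLog):
        self.primary = primary       # first data store
        self.time_store = time_store
        self.log = log
        self.queue = []              # operations waiting to be executed

    def handle_write(self, req: WriteRequest) -> None:
        """Copy-on-write sequence with two checkpoints (claims 1 and 12)."""
        size = len(req.payload)
        # copy the original data to the second store before overwriting it
        req.second_address = self.time_store.allocate(size)
        self.time_store.write(req.second_address,
                              bytes(self.primary[req.first_address:req.first_address + size]))
        req.cow_done = True          # first checkpoint
        self.log.record(req)
        # write the payload to the first store
        self.primary[req.first_address:req.first_address + size] = req.payload
        req.write_done = True        # second checkpoint
        self.log.record(req)

    def recover(self) -> None:
        """Requeue or discard queued work after a failure (claim 8)."""
        for req in self.log.pending_requests():
            if not req.cow_done:
                # no first checkpoint: redo the entire copy-on-write sequence
                self.queue.append(("copy_and_write", req))
            elif not req.write_done:
                # first checkpoint only: the copy is safe, redo just the payload write
                self.queue.append(("write_payload", req))
            # both checkpoints present: the request completed, nothing to requeue

On takeover after a failure of the primary processor module, a standby module could run recover() against the shared transaction log to rebuild its operation queues, which is the behavior claims 6, 7, 17, 18, 21, and 22 describe in claim language.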
HK08102498.5A  2004-08-24  2005-08-24  Recovering from storage transaction failures using checkpoints  HK1113206A (en)

Applications Claiming Priority (1)

Application Number  Priority Date  Filing Date  Title
US10/924,560  2004-08-24

Publications (1)

Publication Number  Publication Date
HK1113206A (en)  2008-09-26

Similar Documents

Publication  Publication Date  Title
CN100517321C (en)  Image data storage device write time mapping
EP1789879B1 (en)  Recovering from storage transaction failures using checkpoints
US7239581B2 (en)  Systems and methods for synchronizing the internal clocks of a plurality of processor modules
US7631120B2 (en)  Methods and apparatus for optimally selecting a storage buffer for the storage of data
US7725760B2 (en)  Data storage system
US7904428B2 (en)  Methods and apparatus for recording write requests directed to a data store
US7577807B2 (en)  Methods and devices for restoring a portion of a data store
US7730222B2 (en)  Processing storage-related I/O requests using binary tree data structures
EP1789884B1 (en)  Systems and methods for providing a modification history for a location within a data store
US7827362B2 (en)  Systems, apparatus, and methods for processing I/O requests
EP1671231B1 (en)  Systems and methods for time dependent data storage and recovery
HK1113206A (en)  Recovering from storage transaction failures using checkpoints
HK1113205A (en)  Systems and methods for providing a modification history for a location within a data store
