
Apparatus, system, and method for data storage using progressive RAID

Info

Publication number
HK1142971A
Authority
HK
Hong Kong
Prior art keywords
data
storage
parity
storage devices
module
Application number
HK10109328.2A
Other languages
Chinese (zh)
Inventor
David Flynn
Jonathan Thatcher
Michael Zappe
David Atkinson
Original Assignee
Fusion-Io, Inc.
Application filed by Fusion-Io, Inc.
Publication of HK1142971A


Description

Apparatus, system, and method for storing data using progressive RAID
Cross Reference to Related Applications
This application is a continuation-in-part of, and claims priority to: U.S. provisional patent application entitled "Elemental Blade System", filed on December 6, 2006 by David Flynn et al. (application number 60/873,111); and U.S. provisional patent application entitled "Apparatus, System, and Method for Object-Oriented Solid-State Storage", filed on September 22, 2007 by David Flynn et al. (application number 60/974,470). The above applications are incorporated herein by reference.
Technical Field
The present invention relates to data storage, and more particularly to storing data using a progressive RAID system.
Background
Redundant arrays of independent drives ("RAID") may be constructed in many ways to achieve different purposes. As used below, a drive is a mass storage device that stores data. The drive or storage device may be solid-state memory, a hard disk drive ("HDD"), tape storage, an optical drive, or any other mass storage device known to those skilled in the art. In one embodiment, the drive comprises a portion of a mass storage device that is accessed as a virtual volume. In another embodiment, the drive comprises two or more data storage devices that are accessed together as a virtual volume, similar to a RAID or a just-a-bunch-of-disks/drives ("JBOD") configuration, or that are organized in a storage area network ("SAN"). Drives are typically accessed by a storage controller as a single unit or virtual volume. In a preferred embodiment, the storage controller comprises a solid-state storage controller. Those skilled in the art will recognize other forms of mass storage device that may be constructed into a RAID as drives. In the embodiments described below, the terms drive and storage device are used interchangeably.
Traditionally, different RAID configurations are referred to as RAID levels. One basic RAID configuration is RAID level 1 ("RAID 1"), which produces a mirror copy of one or more storage devices. RAID 1 has the advantage that a full copy of the data on the primary storage devices is available in the mirror copy, so data can be read relatively quickly from either the primary or the mirror drives. RAID 1 also provides a backup copy of the data in the event of a primary storage device failure. RAID 1 has the disadvantage that writing is relatively slow, because the data must be written twice.
Another RAID configuration is RAID level 0 ("RAID 0"). In RAID 0, data written to the RAID is divided into N data segments corresponding to the N storage devices in the storage device set; the N data segments form a "stripe". Because multiple storage devices storing N data segments in parallel are faster than a single storage device storing data comprising all N data segments, striping the data across multiple storage devices improves write performance. Reading data is relatively slow, however, because the data may be distributed among several storage devices, and the combined access time of several storage devices is typically greater than the time to read the data from one storage device containing all of the desired data. Furthermore, RAID 0 provides no failure protection.
A common RAID configuration is RAID level 5 ("RAID 5"), which stripes N data segments across N storage devices and stores a parity data segment on an (N+1)th storage device. RAID 5 is also fault tolerant, in that the RAID can tolerate the failure of a single storage device: if a storage device fails, the missing data segment of a stripe can be regenerated from the other available data segments of the stripe and the parity data segment computed for that stripe. RAID 5 typically uses less storage space than RAID 1, because each storage device of the RAID set need not store a full copy of the data, only a data segment or a parity data segment of each stripe. RAID 5 is similar to RAID 0 in that writing data is relatively fast and reading data is relatively slow, but because the parity data segment of each stripe must be computed from the N data segments of the stripe, data is typically written more slowly to a conventional RAID 5 than to a RAID 0.
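By way of illustration, the following sketch (not part of the original specification) shows the RAID 5 parity relationship described above, assuming simple byte-wise XOR parity; the function names and the toy three-segment stripe are illustrative only.
```python
from functools import reduce

def xor_parity(segments: list[bytes]) -> bytes:
    """Compute a parity segment as the byte-wise XOR of equal-length data segments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*segments))

def recover_segment(surviving: list[bytes], parity: bytes) -> bytes:
    """Regenerate one missing data segment from the surviving segments and the parity."""
    return xor_parity(surviving + [parity])

data = [b"AAAA", b"BBBB", b"CCCC"]           # N = 3 data segments of one stripe
parity = xor_parity(data)                    # stored on the (N+1)th storage device
lost = data[1]                               # suppose the second storage device fails
restored = recover_segment([data[0], data[2]], parity)
assert restored == lost
```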
Another common RAID configuration is RAID level 6 ("RAID 6"), which includes dual distributed parity. In RAID 6, two storage devices of the set are assigned as parity-mirror devices. The two parity data segments of each stripe are computed differently, so that the loss of any two storage devices in the storage device set can be recovered using the remaining available data segments and/or parity data segments. The performance advantages and disadvantages of RAID 6 are essentially the same as those of RAID 5.
Multiple RAID levels may also be combined to increase failure protection where high reliability is required. For example, two storage device sets each configured as RAID 5 may be mirrored in a RAID 1 configuration; the resulting configuration may be referred to as RAID 51. If RAID 6 is used for each mirrored set, the configuration may be referred to as RAID 61. Multiple-RAID configurations typically have the same performance issues as the underlying RAID groups.
Summary of the Invention
As is apparent from the foregoing discussion, there is a need for an apparatus, system, and method for progressive RAID that provides failure protection, writes data faster than conventional fault-tolerant RAID levels (e.g., RAID 1, RAID 5, RAID 6, etc.), and reads data faster than conventional striped RAID levels (e.g., RAID 0, RAID 5, RAID 6, etc.). Advantageously, such an apparatus, system, and method would gain the advantages of a mirrored (RAID 1) system by writing the N data segments to a parity-mirror storage device and computing the parity data segments only as needed (e.g., before or as part of a storage consolidation operation).
The present invention has been developed in response to the present state of the art, and in particular, in response to the problems and needs in the art that have not yet been fully solved by currently available data management systems. Accordingly, the present invention has been developed to provide an apparatus, system, and method for reliably and efficiently storing data using progressive RAID that overcome many or all of the above-discussed shortcomings in the art.
An apparatus for progressive RAID has a plurality of modules including a storage request receiving module, a striping module, a parity-mirroring module, and a parity progression module. The storage request receiving module receives a data storage request. The data includes data of a file or data of an object. The striping module calculates a stripe shape for the data. The stripe shape includes one or more stripes, each stripe including N data segments. The striping module also writes N data segments of the stripe to the N storage devices, wherein each of the N data segments is written to a different storage device in the set of storage devices assigned to the stripe.
The parity-mirror module writes the set of N data segments of the stripe to one or more parity-mirror storage devices in the storage device set, where a parity-mirror storage device is a storage device other than the N storage devices. The parity progression module computes one or more parity data segments for the stripe in response to a storage consolidation operation. The one or more parity data segments are computed from the N data segments stored on the one or more parity-mirror storage devices. The parity progression module also stores the parity data segments on the one or more parity-mirror storage devices. The storage consolidation operation consolidates data in order to recover at least one of storage space and data on at least one of the one or more parity-mirror storage devices.
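To make the interplay of these modules concrete, the following is a minimal sketch of the write path and the later parity progression, assuming byte-wise XOR parity; the class and method names are illustrative and are not taken from the specification.
```python
from dataclasses import dataclass, field
from functools import reduce

@dataclass
class Device:
    """Illustrative storage device: maps a stripe id (and slot) to stored bytes."""
    blocks: dict = field(default_factory=dict)

@dataclass
class ProgressiveStripeSet:
    """Sketch of a progressive RAID storage device set (names are illustrative)."""
    data_devices: list          # the N storage devices assigned to the stripe
    parity_mirror: Device       # a parity-mirror device other than the N devices

    def write_stripe(self, stripe_id: int, segments: list) -> None:
        # Striping module: each of the N data segments goes to a different device.
        for slot, (device, segment) in enumerate(zip(self.data_devices, segments)):
            device.blocks[(stripe_id, slot)] = segment
        # Parity-mirror module: the full set of N segments is copied, unreduced,
        # to the parity-mirror device, so the write completes without computing parity.
        self.parity_mirror.blocks[stripe_id] = list(segments)

    def consolidate(self, stripe_id: int) -> None:
        # Parity progression module: during a storage consolidation operation the N
        # mirrored segments are reduced to a single parity segment, recovering space.
        segments = self.parity_mirror.blocks[stripe_id]
        parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*segments))
        self.parity_mirror.blocks[stripe_id] = parity

# Usage: the write completes quickly; parity is computed later during consolidation.
devices = [Device() for _ in range(3)]
stripe_set = ProgressiveStripeSet(data_devices=devices, parity_mirror=Device())
stripe_set.write_stripe(0, [b"AAAA", b"BBBB", b"CCCC"])
stripe_set.consolidate(0)
```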
In one embodiment, the apparatus may include a parity replacement module that, for each stripe, alternates which storage devices in the storage device set are allocated as the one or more parity-mirror storage devices for the stripe. In another embodiment, the storage consolidation operation runs autonomously with respect to the storage operations of the storage request receiving module, the striping module, and the parity-mirror module.
In one embodiment, the storage device set is a first storage device set, and the apparatus includes a mirror set module that generates one or more storage device sets in addition to the first storage device set, where each of the one or more additional storage device sets has an associated striping module that writes the N data segments to that storage device set. In another embodiment, each of the one or more additional storage device sets includes an associated parity-mirror module that stores the set of N data segments. In yet another embodiment, the apparatus includes a parity progression module that computes the one or more parity data segments.
In one embodiment, the apparatus includes an update module that updates a data segment by receiving an updated data segment, where the updated data segment corresponds to an existing data segment among the N data segments stored on the N storage devices. The update module copies the updated data segment to the storage device of the stripe that stores the existing data segment and also to the one or more parity-mirror storage devices of the stripe. The update module replaces the existing data segment stored on one of the N storage devices with the updated data segment and, when the parity progression module has not yet generated the one or more parity data segments on the one or more parity-mirror storage devices, also replaces the existing data segment stored on the one or more parity-mirror storage devices with the updated data segment.
In one embodiment of the apparatus, the storage device set is a first storage device set, and the apparatus includes a mirror repair module that restores a data segment stored on a storage device of the first storage device set when that storage device is unavailable. The data segment is recovered from a mirror storage device containing a copy of the data segment, where the mirror storage device is a storage device of one or more storage device sets that store copies of the N data segments. In another embodiment, the mirror repair module further includes a direct client response module that sends the requested data segment from the mirror storage device to a client.
In one embodiment, an apparatus includes a pre-consolidation recovery module that recovers data segments stored in storage devices of a storage set in response to a request to read the data segments when the storage devices are unavailable, and that recovers the data segments from a parity-mirror storage device before a parity progression module generates one or more parity data segments in one or more parity-mirror storage devices.
In another embodiment, the post-consolidation repair module recovers data segments stored in storage devices of the storage set when the storage devices are unavailable and recovers the data segments using one or more parity data segments stored in one or more parity-mirrored storage devices after the parity progression module generates the one or more parity data segments in response to the storage consolidation operation.
In one embodiment of the apparatus, the data reconstruction module stores the recovered data segment to the replacement storage device in a reconstruction operation when the recovered data segment matches an unavailable data segment stored in the unavailable storage device. The unavailable storage device is one of the N storage devices. The rebuild operation is to restore the data segment to the replacement storage device to match the data segment previously stored in the unavailable storage device. If the matching data segment is in the parity-mirror storage device, the reconstruction operation may recover the recovered data segment from the matching data segment stored in the parity-mirror storage device.
If the recovered data segment is not on the one or more parity-mirror storage devices, the data segment may be recovered from a mirror storage device containing a copy of the unavailable data segment, where the mirror storage device is a storage device of one or more storage device sets that store copies of the N data segments. If the recovered data segment is on neither the one or more parity-mirror storage devices nor a mirror storage device, the data segment may be recovered from a regenerated data segment generated from the one or more parity data segments and the available ones of the N data segments.
In one embodiment, the parity rebuild module replaces the recovered parity data segment in the storage device in a parity rebuild operation. The recovered parity data segment matches an unavailable parity data segment stored in the unavailable parity-mirror storage device. The unavailable parity-mirror storage device is one of the one or more parity-mirror storage devices. The parity rebuild operation is used to restore the parity data segments to the replacement storage device to match the parity data segments previously stored in the unavailable parity-mirror storage device.
In the parity rebuild operation, the recovered parity data segment may be regenerated from a matching parity data segment stored on a parity-mirror storage device of a second storage device set that stores a mirror copy of the stripe. If the N data segments are available on the N storage devices, the recovered parity data segment may be regenerated from the N data segments stored on the N storage devices. If one or more of the N data segments on the N storage devices are unavailable and a matching parity data segment on the second storage device set is unavailable, the recovered parity data segment may be regenerated using one or more storage device sets that store copies of the N data segments. More generally, the recovered parity data segment may be regenerated from the available data segments and non-matching parity data segments, regardless of where they are located among the one or more storage device sets.
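The recovery alternatives described in the preceding paragraphs can be read as an ordered choice of sources. The sketch below is one hedged reading of that ordering, assuming XOR parity; the parameter names, the callable used to model a mirror set, and the exact priority are illustrative assumptions rather than the specification's implementation.
```python
from functools import reduce
from typing import Callable, Optional, Sequence

def xor_reduce(chunks: Sequence[bytes]) -> bytes:
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def recover_data_segment(
    missing_index: int,
    consolidated: bool,
    parity_mirror_copy: Optional[Sequence[bytes]],            # unreduced N segments, if still present
    read_from_mirror_set: Optional[Callable[[int], bytes]],   # mirror-set lookup, if one exists
    surviving_segments: Sequence[bytes],                      # the available data segments of the stripe
    parity: bytes,
) -> bytes:
    """Illustrative ordering of the recovery sources described above."""
    if not consolidated and parity_mirror_copy is not None:
        # Pre-consolidation: the parity-mirror device still holds a plain copy of the segment.
        return parity_mirror_copy[missing_index]
    if read_from_mirror_set is not None:
        # A second storage device set stores a mirror copy of the stripe.
        return read_from_mirror_set(missing_index)
    # Post-consolidation: regenerate the segment from the parity data segment
    # and the surviving data segments (byte-wise XOR parity assumed).
    return xor_reduce(list(surviving_segments) + [parity])
```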
In another embodiment, the N storage devices include N solid-state storage devices, each having a solid-state storage controller. In another embodiment, at least one of the following operations is performed on one of a storage device of the storage device set, a client, and a third-party RAID management device: receiving the data storage request, calculating the stripe shape and writing the N data segments to the N storage devices, writing the set of N data segments to the parity-mirror storage devices, and calculating the parity data segments.
Another apparatus is presented for updating data in a progressive RAID group. The apparatus may include an update receiving module, an update copy module, and a parity update module. The update receiving module receives an updated data segment, where the updated data segment corresponds to an existing data segment of an existing stripe. The data of a file or of an object is divided into one or more stripes, where each stripe includes N data segments and one or more parity data segments. The N data segments are stored on the storage devices of a storage device set assigned to the stripe, and the parity data segments are each generated from the N data segments of the stripe and are each stored on one or more parity-mirror storage devices assigned to the stripe.
The storage device set includes the one or more parity-mirror storage devices, and the existing stripe includes N existing data segments and one or more existing parity data segments. The update copy module copies the updated data segment to the storage device that stores the corresponding existing data segment and also to the one or more parity-mirror storage devices of the existing stripe. The parity update module calculates one or more updated parity data segments for the one or more parity-mirror storage devices of the existing stripe in response to a storage consolidation operation. The storage consolidation operation recovers at least one of storage space and data on the one or more parity-mirror storage devices using the one or more updated parity data segments.
In one embodiment of the apparatus, the updated parity data segment is calculated from the existing parity data segment, the updated data segment, and the existing data segment. In another embodiment, the existing data segment remains intact until it is read to generate the updated parity data segment; the existing data segment is copied to a data-mirror storage device in response to the storage device (among the N storage devices) that stores the existing data segment receiving a copy of the updated data segment; and/or the existing data segment is copied to the data-mirror storage device in response to a storage consolidation operation.
In another embodiment, the updated parity data segment is calculated from the existing parity data segment, the updated data segment, and a delta data segment, where the delta data segment is generated from the difference between the updated data segment and the existing data segment. In yet another embodiment, the delta data segment is stored on the storage device that stores the existing data segment until it is read to generate the updated parity data segment; the delta data segment is copied to a data-mirror storage device in response to the storage device that stores the existing data segment receiving a copy of the updated data segment; and/or the delta data segment is copied to the data-mirror storage device in response to a storage consolidation operation. In one embodiment, the storage consolidation operation runs autonomously with respect to the operations of the update receiving module and the update copy module.
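Assuming byte-wise XOR parity, the two update strategies described above (recomputing from the existing parity, existing segment, and updated segment, or using a delta data segment derived from the difference between the updated and existing segments) yield the same result, and under that assumption the delta combined with the existing parity suffices. The following sketch illustrates this; the function names are illustrative and the XOR assumption is not stated in the specification.
```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def updated_parity_from_segments(old_parity: bytes, old_seg: bytes, new_seg: bytes) -> bytes:
    """Recompute parity from the existing parity, existing segment, and updated segment."""
    return xor_bytes(xor_bytes(old_parity, old_seg), new_seg)

def delta_segment(old_seg: bytes, new_seg: bytes) -> bytes:
    """Delta generated from the difference between the updated and existing segment."""
    return xor_bytes(old_seg, new_seg)

def updated_parity_from_delta(old_parity: bytes, delta: bytes) -> bytes:
    """Recompute parity from the existing parity and the delta segment."""
    return xor_bytes(old_parity, delta)

# Under XOR parity both paths agree:
p, d_old, d_new = b"\x0f\x0f", b"\x01\x02", b"\x03\x04"
assert updated_parity_from_segments(p, d_old, d_new) == \
       updated_parity_from_delta(p, delta_segment(d_old, d_new))
```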
A system of the present invention is also presented for storing data reliably and with high performance. The system includes a set of storage devices assigned to a stripe. The set of storage devices includes N storage devices and one or more parity-mirror storage devices other than the N storage devices. The system also includes a storage request receiving module, a striping module, a parity-mirror module, and a parity progression module.
The storage request receiving module receives a data storage request. The data includes data of a file or data of an object. The striping module calculates a stripe shape for the data, where the stripe shape includes one or more stripes, each stripe including a set of N data segments, and writes the N data segments of the stripe to the N storage devices, where each of the N data segments is written to a different storage device of the storage device set. The parity-mirror module writes the set of N data segments of the stripe to each of the one or more parity-mirror storage devices.
The parity progression module computes one or more parity data segments for the stripe in response to a storage consolidation operation. The one or more parity data segments are computed from the N data segments stored on the one or more parity-mirror storage devices. The parity progression module further stores the parity data segments on each of the one or more parity-mirror storage devices, where the storage consolidation operation runs autonomously with respect to the storage operations of the storage request receiving module, the striping module, and the parity-mirror module. The storage consolidation operation recovers at least one of storage space and data on the one or more parity-mirror storage devices.
In further embodiments, the system substantially includes the modules and embodiments described above with respect to the apparatus. In one embodiment, the system includes one or more servers that include the N storage devices and the one or more parity-mirror storage devices. In another embodiment, the system includes one or more clients within the one or more servers, where the storage request receiving module receives the request from at least one of the one or more clients.
A method of the present invention is also presented for storing data reliably and with high performance. The method in the disclosed embodiments substantially includes the steps necessary to carry out the functions presented above with respect to the operation of the described apparatus and system. In one embodiment, the method includes receiving a data storage request, where the data includes data of a file or data of an object. The method includes calculating a stripe shape for the data, where the stripe shape includes one or more stripes, each stripe including a set of N data segments. The method includes writing the N data segments to N storage devices, where each of the N data segments is written to a different storage device in a set of storage devices assigned to the stripe.
The method includes writing the set of N data segments of the stripe to one or more parity-mirror storage devices in the storage device set, where the one or more parity-mirror storage devices are storage devices other than the N storage devices. The method includes computing one or more parity data segments for the stripe in response to a storage consolidation operation and storing the parity data segments on the one or more parity-mirror storage devices, where the parity data segments are computed from the N data segments stored on the one or more parity-mirror storage devices. The storage consolidation operation runs autonomously with respect to receiving the storage request, writing the N data segments to the N storage devices, and writing the N data segments to the one or more parity-mirror storage devices, and it recovers at least one of storage space and data on the one or more parity-mirror storage devices.
Another method of the present invention is also presented for reliably and efficiently storing data. The method includes receiving an updated data segment, where the updated data segment corresponds to an existing data segment of an existing stripe. The data of a file or of an object is divided into one or more stripes, where each stripe includes N data segments and one or more parity data segments. The N data segments are stored on the storage devices of a storage device set assigned to the stripe. Each parity data segment is generated from the N data segments of the stripe and is stored on one or more parity-mirror storage devices assigned to the stripe. The storage device set includes the one or more parity-mirror storage devices, and the existing stripe includes N existing data segments and one or more existing parity data segments.
The method includes copying the updated data segment to the storage device that stores the corresponding existing data segment and to the one or more parity-mirror storage devices of the existing stripe. The method includes computing one or more updated parity data segments for the one or more parity-mirror storage devices of the existing stripe in response to a storage consolidation operation. The storage consolidation operation recovers at least one of storage space and data on the one or more parity-mirror storage devices using the one or more updated parity data segments.
Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are present in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a particular feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the invention. Thus, discussion of the features and advantages, and similar language, throughout this specification may, but does not necessarily, refer to the same embodiment.
Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize that the invention may be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.
These features and advantages of the present invention will become more fully apparent from the following description and appended claims, or may be learned by the practice of the invention as set forth hereinafter.
Drawings
In order that the advantages of the invention will be readily understood, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments that are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
FIG. 1A is a schematic block diagram illustrating one embodiment of a system for data management within a solid-state storage device in accordance with the present invention;
FIG. 1B is a schematic block diagram illustrating one embodiment of a system for object management within a storage device in accordance with the present invention;
FIG. 1C is a schematic block diagram illustrating one embodiment of a system for a storage area network within a server in accordance with the present invention;
FIG. 2A is a schematic block diagram illustrating one embodiment of an apparatus for object management within a storage device in accordance with the present invention;
FIG. 2B is a schematic block diagram illustrating one embodiment of a solid-state storage device controller within a solid-state storage device in accordance with the present invention;
FIG. 3 is a schematic block diagram illustrating one embodiment of a solid-state storage controller having a write data pipe and a read data pipe in accordance with the present invention;
FIG. 4A is a schematic block diagram illustrating one embodiment of a bank interleave controller for use within a solid state storage controller in accordance with the present invention;
FIG. 4B is a schematic block diagram illustrating an alternative embodiment of a bank interleave controller for use within a solid state storage controller in accordance with the present invention;
FIG. 5A is a schematic flow chart diagram illustrating one embodiment of a method for managing data using a data pipeline in a solid state storage device in accordance with the present invention;
FIG. 5B is a schematic flow chart diagram illustrating one embodiment of a method for an in-server SAN in accordance with the present invention;
FIG. 6 is a schematic flow chart diagram illustrating another embodiment of a method for managing data using a data pipeline in a solid state storage device in accordance with the present invention;
FIG. 7 is a schematic flow chart diagram illustrating one embodiment of a method for managing data in a solid-state storage device using bank interleaving in accordance with the present invention;
FIG. 8 is a schematic block diagram illustrating one embodiment of an apparatus for garbage collection within a solid state storage device in accordance with the present invention;
FIG. 9 is a schematic flow chart diagram illustrating one embodiment of a method for garbage collection within a solid state storage device in accordance with the present invention;
FIG. 10 is a schematic block diagram illustrating one embodiment of a system for progressive RAID in accordance with the present invention;
FIG. 11 is a schematic block diagram illustrating one embodiment of an apparatus for progressive RAID in accordance with the present invention;
FIG. 12 is a schematic block diagram illustrating one embodiment of an apparatus for updating data segments using progressive RAID in accordance with the present invention;
FIG. 13 is a schematic flow chart diagram illustrating one embodiment of a method for managing data using a progressive RAID process in accordance with the present invention; and
FIG. 14 is a schematic flow chart diagram illustrating one embodiment of a method for updating data segments using a progressive RAID process in accordance with the present invention.
Detailed Description
Many of the functional units described in this specification have been labeled as modules, in order to more particularly emphasize their operational independence. For example, a module may be implemented as a hardware circuit comprising custom VLSI circuits, gate arrays, or off-the-shelf semiconductors such as logic chips, transistors, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.
Modules may also be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be organized as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate commands stored in different locations which, when joined logically together, comprise the module and achieve the stated purpose for the module.
Indeed, a module of executable code may be one or many instructions, and may even be distributed over several different code segments, among different programs, and across multiple memory devices. Similarly, operational data may be identified and illustrated herein within modules, and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals in a system or network. When the module or portions of the module are implemented in software, the software portions are stored on one or more computer-readable media.
Reference throughout this specification to "one embodiment," "an embodiment," or similar language means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases "in one embodiment," "in an embodiment," and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
Reference to a signal bearing medium may take any form capable of generating a signal, causing a signal to be generated, or causing execution of a program of machine-readable instructions on a digital processing apparatus. The signal bearing medium may be embodied by: transmission lines, compact disks, digital video disks, magnetic tapes, bernoulli drives, diskettes, punch cards, flash memory, integrated circuits, or other digital processing apparatus storage devices.
Furthermore, the described features, structures, or characteristics of the invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided, such as examples of programming, software modules, user selections, network transactions, database queries, database structures, hardware modules, hardware circuits, hardware chips, etc., to provide a thorough understanding of embodiments of the invention. However, one skilled in the relevant art will recognize that: the invention may be practiced without one or more of the specific details of the specific embodiment or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
The accompanying schematic flow chart diagrams are generally set forth as logical flow chart diagrams. In this regard, the depicted order and labeled steps are indicative of one embodiment of the presented method. Other steps and methods may be conceived that are equivalent in function, logic, or effect to one or more steps, or portions thereof, of the illustrated method. Additionally, the format and symbols employed are provided to explain the logical steps of the method and are understood not to limit the scope of the method. Although various arrow types and line types may be employed in the flow chart diagrams, they are understood not to limit the scope of the corresponding method. Indeed, some arrows or other connectors may be used to indicate only the logical flow of the method. For instance, an arrow may indicate a waiting or monitoring period of unspecified duration between enumerated steps of the depicted method. Additionally, the order of the steps of a particular method may or may not strictly adhere to the order of the corresponding steps shown.
Solid state storage system
FIG. 1A is a schematic block diagram illustrating one embodiment of a system 100 for data management within a solid-state storage device in accordance with the present invention. The system 100 includes a solid-state storage device 102, a solid-state storage controller 104, a write data pipe 106, a read data pipe 108, solid-state storage 110, a computer 112, a client 114, and a computer network 116, which are described below.
The system 100 includes at least one solid-state storage device 102. In another embodiment, the system 100 includes two or more solid-state storage devices 102, and each solid-state storage device 102 may include non-volatile, solid-state memory 110, such as nano-random access memory ("nano-RAM" or "NRAM"), magnetoresistive RAM ("MRAM"), dynamic RAM ("DRAM"), phase change RAM ("PRAM"), flash memory, and so forth. The solid-state storage device 102 is described in more detail in conjunction with FIGS. 2 and 3. The solid-state storage device 102 is depicted within a computer 112 that is connected to a client 114 through a computer network 116. In one embodiment, the solid-state storage device 102 is internal to the computer 112 and is connected using a system bus, such as a peripheral component interconnect express ("PCI-e") bus, a serial advanced technology attachment ("serial ATA") bus, or the like. In another embodiment, the solid-state storage device 102 is external to the computer 112 and is connected via a universal serial bus ("USB"), an Institute of Electrical and Electronics Engineers ("IEEE") 1394 bus ("FireWire"), or the like. In other embodiments, the solid-state storage device 102 is connected to the computer 112 using a peripheral component interconnect ("PCI") express bus with an external electrical or optical bus extension, or using a bus networking solution such as InfiniBand or PCI Express Advanced Switching ("PCIe-AS") or similar technologies.
In various embodiments, the solid state storage device 102 may be in the form of a dual in-line memory module ("DIMM"), a daughter card, or a micro-module. In another embodiment, the solid state storage device 102 is a component located within a rack-mounted blade. In another embodiment, the solid-state storage device 102 is contained within a package that is directly integrated into a highly integrated device (e.g., motherboard, laptop, graphics processor). In another embodiment, the separate elements comprising the solid state storage device 102 are integrated directly onto the advanced integration apparatus without going through an intermediate package.
The solid-state storage device 102 includes one or more solid-state storage controllers 104, each solid-state storage controller 104 may include a write data pipe 106 and a read data pipe 108, and each solid-state storage controller 104 further includes a solid-state memory 110, which will be described in detail below in conjunction with fig. 2 and 3.
The system 100 includes one or more computers 112 connected to the solid state storage device 102. The computer 112 may be a mainframe, server, storage area network ("SAN") storage controller, workstation, personal computer, laptop, handheld computer, supercomputer, computer cluster, network switch, router or device, database or storage device, data collection or data collection system, diagnostic system, test system, robot, portable electronic device, wireless device, or the like. In another embodiment, the computer 112 may be a client and the solid state storage device 102 operates autonomously to respond to data requests sent from the computer 112. In such an embodiment, the computer 112 and the solid state storage device 102 may be connected in the following manner: a computer network, system bus, or other communication means suitable for connecting between the computer 112 and the autonomous solid state storage device 102.
In one embodiment, the system 100 includes one or more clients 114 connected to one or more computers 112 through one or more computer networks 116. The client 114 may be a host, a server, a storage controller for a SAN, a workstation, a personal computer, a portable computer, a handheld computer, a supercomputer, a cluster of computers, a network switch, a router or device, a database or storage device, a data collection or data collection system, a diagnostic system, a test system, a robot, a portable electronic device, a wireless device, or the like. Computer network 116 may include the Internet, a wide area network ("WAN"), a metropolitan area network ("MAN"), a local area network ("LAN"), a token ring network, a wireless network, a fibre channel network, a SAN, a network attached storage ("NAS"), ESCON, or the like, or any combination of networks. The computer network 116 may also include networks from the IEEE802 family of network technologies, such as Ethernet, token Ring, WiFi, WiMax, and the like.
The computer network 116 may include servers, switches, routers, cabling, radios, and other equipment used to facilitate networking of the computers 112 and clients 114. In one embodiment, the system 100 includes multiple computers 112 that communicate as peers over a computer network 116. In another embodiment, the system 100 includes multiple solid-state storage devices 102 that communicate as peers over a computer network 116. Those skilled in the art will recognize other computer networks 116 comprising one or more computer networks 116 and related equipment, with single or redundant connections between one or more clients 114 or other computers with one or more solid-state storage devices 102, or between one or more solid-state storage devices 102 and one or more computers 112 with one or more solid-state storage devices 102. In one embodiment, the system 100 includes two or more solid-state storage devices 102 connected through the computer network 116 to a client 114 without a computer 112.
Storage controller managed objects
FIG. 1B is a schematic block diagram illustrating one embodiment of a system 101 for object management within a storage device in accordance with the present invention. The system 101 includes one or more storage devices 150 (each storage device 150 having a storage controller 152 and one or more data storage devices 154) and one or more requesting devices 155. The storage devices 150 are networked together and connected to the one or more requesting devices 155. The requesting device 155 sends object requests to a storage device 150a. An object request may be a request to create an object, a request to write data to an object, a request to read data from an object, a request to delete an object, a request to check an object, a request to copy an object, and the like. Those skilled in the art will recognize other object requests.
In one embodiment, the storage controller 152 and the data storage device 154 are separate devices. In another embodiment, the storage controller 152 and the data storage device 154 are integrated onto one storage device 150. In another embodiment, the data storage device 154 is a solid-state memory 110 and the storage controller is a solid-state storage device controller 202. In other embodiments, the data storage device 154 may be a hard disk drive, an optical drive, a tape storage, or similar storage device. In another embodiment, the storage device 150 may include two or more different types of data storage devices 154.
In one embodiment, the data storage device 154 is a solid-state memory 110 and is arranged as an array of solid-state storage elements 216, 218, 220. In another embodiment, the solid-state storage 110 is arranged within two or more banks (banks) 214 a-n. The solid-state memory 110 is described in more detail below in conjunction with FIG. 2B.
The storage devices 150a-n may be networked together and may operate as distributed storage devices. The storage device 150a connected to the requesting device 155 controls object requests directed to the distributed storage devices. In one embodiment, the storage devices 150 and associated storage controllers 152 manage objects and appear to the requesting device 155 as a distributed object file system; one example of such a system is a parallel object file system. In another embodiment, the storage devices 150 and associated storage controllers 152 manage objects and appear to the requesting device 155 as distributed object file servers; one example of such a server is a parallel object file server. In these and other embodiments, the requesting device 155 may exclusively manage objects or may participate in managing objects in conjunction with the storage devices 150; this typically does not limit the ability of the storage devices 150 to fully manage objects for other clients 114. In the degenerate case, each distributed storage device, distributed object file system, and distributed object file server can operate independently as a single device. The networked storage devices 150a-n may operate as distributed storage devices, distributed object file systems, distributed object file servers, or any combination of these functions, configured for one or more requesting devices 155. For example, the storage devices 150 may be configured to operate as distributed storage devices for a first requesting device 155a, while operating as distributed storage devices and a distributed object file system for a second requesting device 155b. Where the system 101 includes one storage device 150a, the storage controller 152a of the storage device 150a manages objects and appears to the requesting device 155 as an object file system or an object file server.
In one embodiment, where the storage devices 150 are networked together as distributed storage devices, the storage devices 150 serve as a redundant array of independent drives ("RAID") managed by one or more distributed storage controllers 152. For example, a request to write a data segment of an object results in the data segment being striped across the data storage devices 154a-n as a stripe with one or more parity stripes, according to the RAID level. One benefit of this arrangement is that the object management system can continue to be available in the event of a failure of an individual storage device 150, whether the failure is of a storage controller 152, a data storage device 154, or another component of the storage device 150.
When redundant networks are used to interconnect the storage device 150 and the requesting device 155, the object management system can continue to be used in the event of a network failure (as long as one of the networks is still running). The system 101 having one storage device 150a may also include a plurality of data storage devices 154a, and the storage controller 152a of the storage device 150a may operate as a RAID controller and partition data segments among the data storage devices 154a of the storage device 150a, and the storage controller 152a of the storage device 150a may include parity stripes according to RAID levels.
In one embodiment, where one or more of the storage devices 150a-n is a solid state storage device 102 having a solid state storage device controller 202 and solid state memory 110, the solid state storage device 102 may be configured as a DIMM configuration, daughter card, mini module, or the like, and remain within the computer 112. The computer 112 may be a server or similar device having solid state storage devices 102, the solid state storage devices 102 being networked together and operating as a distributed RAID controller. Advantageously, the storage device 102 may employ a PCI-e, PCIe-AS, Infiniband or other high performance bus, switched bus, network bus, or network connection, and may provide a very compact, high performance RAID storage system in which individual or distributed solid state storage controllers 202 autonomously stripe data segments among the solid state memories 110 a-n.
In one embodiment, the same network that requesting device 155 uses to communicate with storage device 150 may be used by peer storage device 150a to communicate with peer storage devices 150b-n to implement RAID functionality. In another embodiment, a separate network may be used between storage devices 150 for RAID purposes. In another embodiment, the requesting device 155 may participate in a RAID process by sending a redundancy request to the storage device 150. For example, the requesting device 155 may send a first object write request to the first storage device 150a and a second object write request with the same data segment to the second storage device 150b to implement simple mirroring.
With the ability to handle objects within the storage devices 102, the storage controller 152 is uniquely able to store one data segment or object using one RAID level while another data segment or object is stored using a different RAID level or without RAID striping. These multiple RAID groupings may be associated with multiple partitions within the storage devices 150. RAID 0, RAID 1, RAID 5, RAID 6, and compound RAID types 10, 50, and 60 may be supported simultaneously across a variety of RAID groups comprising the data storage devices 154a-n. Those skilled in the art will recognize other RAID types and configurations that may also be supported simultaneously.
Moreover, because the storage controllers 152 operate autonomously as RAID controllers, they are able to perform progressive RAID and can convert objects, or portions of objects, striped across the data storage devices 154 at one RAID level to another RAID level without the requesting device 155 being affected by, participating in, or even detecting the change in RAID level. In a preferred embodiment, a change from one RAID configuration level to another may be carried out autonomously on an object or even per-packet basis and may be initiated by a distributed RAID control module running in one of the storage devices 150 or storage controllers 152. Typically, RAID progression is a transition from a higher-performance, lower-storage-efficiency configuration (e.g., RAID 1) to a lower-performance, higher-storage-efficiency configuration (e.g., RAID 5), where the transition is initiated dynamically based on read frequency; however, progression from RAID 5 to RAID 1 is also possible. Other processes for initiating RAID progression may be configured, or progression may be requested by a client or an external agent (e.g., a storage system management server). Those skilled in the art will recognize additional features and advantages of a storage device 102 with a storage controller 152 that autonomously manages objects.
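As one hedged illustration of how RAID progression might be initiated dynamically based on read frequency, the following sketch selects a target RAID level per object from its recent read rate; the thresholds, hysteresis, and function name are assumptions for illustration and are not taken from the specification.
```python
def choose_raid_level(reads_per_hour: float, current_level: str,
                      promote_threshold: float = 100.0,
                      demote_threshold: float = 10.0) -> str:
    """Illustrative read-frequency policy for autonomous RAID progression.

    Hot objects stay mirrored (RAID 1) for read performance; cold objects are
    progressed to parity protection (RAID 5) for storage efficiency. The
    thresholds and hysteresis here are assumptions, not values from the patent.
    """
    if current_level == "RAID5" and reads_per_hour > promote_threshold:
        return "RAID1"   # progression back toward the higher-performance configuration
    if current_level == "RAID1" and reads_per_hour < demote_threshold:
        return "RAID5"   # progression toward the storage-efficient configuration
    return current_level

# Usage: a cold mirrored object is progressed to RAID 5.
assert choose_raid_level(reads_per_hour=2.0, current_level="RAID1") == "RAID5"
```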
Solid state storage device with in-server SAN
FIG. 1C is a schematic block diagram illustrating one embodiment of a system 103 for a storage area network ("SAN") within a server in accordance with the present invention. The system 103 includes a computer 112, typically configured as a server ("server 112"). Each server 112 includes one or more storage devices 150, wherein both the server 112 and the storage devices 150 are connected to a shared network interface 156. Each storage device 150 includes a storage controller 152 and a corresponding data storage device 154. The system 103 includes clients 114, 114a, 114b, which are located either internally or externally to the server 112. Clients 114, 114a, 114b may communicate with each server 112 and each storage device 150 over one or more computer networks 116, which is substantially the same as described above.
The storage device 150 includes a DAS module 158, a NAS module 160, a storage communication module 162, an in-server SAN module 164, a common interface module 166, a proxy module 170, a virtual bus module 172, a front-end RAID module 174, and a back-end RAID module 176, which are described below. Although the modules 158-176 are shown within the storage device 150, all or a portion of the modules 158-176 may be located within the storage device 150, within the server 112, within the storage controller 152, or elsewhere.
A server 112 used with an in-server SAN is a computer that functions as a server. The server 112 includes at least one server function (e.g., a file server function), but may also include other server functions. The server 112 may be part of a server farm, or it may serve other servers. In other embodiments, the server 112 may be a personal computer, a workstation, or another computer that houses storage devices 150. The server 112 may access one or more storage devices 150 in the server 112, with the storage devices 150 acting as direct attached storage ("DAS"), SAN attached storage, or network attached storage ("NAS"). A storage device 150 that is part of an in-server SAN or NAS may be located inside or outside the server 112.
In one embodiment, the in-server SAN appliance includes a DAS module 158, the DAS module 158 configuring at least a portion of at least one data storage device 154 as a DAS device, the data storage device 154 controlled by a storage controller 152 in the server 112, the DAS device connected to the server 112 to transmit storage requests from at least one client 114 to the server 112. In one embodiment, the first data storage device 154a is configured as a DAS of the first server 112a and is also configured as an intra-server SAN storage device for the first server 112 a. In another embodiment, the first data storage device 154a is partitioned, one partition being a DAS and the other partition being an in-server SAN. In another embodiment, at least a portion of the storage space in the first data storage device 154a is configured as a DAS of the first server 112a, and the portion of the storage space in the first data storage device 154a is further configured as an intra-server SAN of the first server 112 a.
In another embodiment, the in-server SAN device includes a NAS module 160, the NAS module 160 configuring the storage controller 152 as a NAS device for at least one client 114 and transmitting file requests from the client 114. The storage controller 152 may also be configured as an intra-server SAN device for the first server 112 a. The storage device 150 may be connected directly to the computer network 116 through the shared network interface 156, independent of the server 112 housing the storage device 150.
In one basic form, an apparatus for an intra-server SAN includes a first storage controller 152a within a first server 112a, wherein the first storage controller 152a controls at least one storage device 154 a. The first server 112a includes a network interface 156, the network interface 156 being shared by the first server 112a and the first storage controller 152 a. The intra-server SAN appliance includes a storage communication module 162 that facilitates communication between the first storage controller 152a and at least one device external to the first server 112a such that communication between the first storage controller 152a and the external device is independent of the first server 112 a. The storage communication module 162 may allow the first storage controller 152a to independently access the network interface 156a for external communication. In one embodiment, the storage communication module 162 accesses a switch in the network interface 156a to conduct network traffic directly between the first storage controller 152a and external devices.
The in-server SAN device also includes an in-server SAN module 164 that communicates storage requests using a network protocol and/or a bus protocol. The in-server SAN module 164 transmits storage requests received from internal or external clients 114, 114a independently of the first server 112 a.
In one embodiment, the device external to the first server 112a is a second storage controller 152b. The second storage controller 152b controls at least one data storage device 154b. The in-server SAN module 164 transmits storage requests between the first storage controller 152a and the second storage controller 152b via the network interface 156a, independently of the first server 112a. The second storage controller 152b may be located within a second server 112b or within another device.
In another embodiment, the device external to the first server 112a is a client 114 and the storage request is initiated by an external client 114, wherein the first storage controller 152a is configured as at least a portion of a SAN and the in-server SAN module 164 is independent of the first server 112a and transmits the storage request through the network interface 156 a. The external client 114 may be the second server 112b or may be external to the second server 112 b. In one embodiment, the in-server SAN module 164 may transmit storage requests from the external client 114 even if the first server 112a is unavailable.
In another embodiment, the client 114a that initiates the storage request is within a first server 112a, wherein the first storage controller 152a is configured as at least a portion of a SAN and the in-server SAN module 164 transmits the storage request over the one or more network interfaces 156a and a system bus.
A conventional SAN configuration allows a server 112 to access a remote storage device as if the storage device were direct attached storage ("DAS") located within the server 112, so that the storage device appears as a block storage device. Storage devices connected as a traditional SAN typically require a SAN protocol, such as Fibre Channel, Internet Small Computer System Interface ("iSCSI"), HyperSCSI, Fibre Connectivity ("FICON"), Advanced Technology Attachment ("ATA") over Ethernet, and the like. An in-server SAN includes storage controllers 152 within the servers 112, while still allowing networking between the storage controller 152a and remote storage controllers 152b, or with external clients 114, using network protocols and/or bus protocols.
Typically the SAN protocol is a form of network protocol, and more network protocols are emerging, such as InfiniBand, which allow the storage controller 152a and associated data storage device 154a to be configured as a SAN and to communicate with an external client 114 or a second storage controller 152b. In another embodiment, the first storage controller 152a may communicate with the external client 114 or the second storage controller 152b using Ethernet.
The storage controller 152 may also communicate with an internal storage controller 152 or a client 114a via a bus. For example, the storage controller 152 may communicate over a PCI Express ("PCI-e") bus, which may support PCI Express input/output virtualization ("PCIe-IOV"). Other emerging bus protocols allow a system bus to extend outward beyond the computer or server 112 and allow the storage controller 152a to be configured as a SAN. One such bus protocol is PCIe-AS. The present invention is not limited to SAN protocols, but may utilize existing network and bus protocols to transfer storage requests. External devices, in the form of clients 114 or external storage controllers 152b, may communicate through an extended system bus or the computer network 116. Storage requests, as used herein, include requests to write data, read data, erase data, query data, and the like, and may include object data requests, metadata requests, management requests, and data block requests.
Conventional servers 112 typically have a root complex that is used to control access to devices within the server 112. Typically, the root complex of the server 112 controls the network interface 156, so that the server 112 controls all communications through the network interface 156. However, in the preferred embodiment of an in-server SAN device, the storage controllers 152 have independent access to the network interface 156, so that a client 114 can communicate directly with one or more of the storage controllers 152a in the first server 112 (forming a SAN), or one or more storage controllers 152a may be networked directly to a second storage controller 152b or other remote storage controllers 152 to form a SAN. In a preferred embodiment, a device remote from the first server 112a may access the first server 112a or the first storage controller 152a via a single shared network address. In one embodiment, the intra-server SAN device includes a common interface module 166 that configures the network interface 156, the storage controller 152, and the server 112 so that the server 112 and the storage controller 152 may be accessed using a shared network address.
In another embodiment, the server 112 includes two or more network interfaces 156. For example, the server 112 may communicate through one network interface 156 while the storage device 150 communicates through another network interface 156. In another example, the server 112 includes multiple storage devices 150, each having its own network interface 156. Those skilled in the art will recognize other configurations of a server 112 with one or more storage devices 150 and one or more network interfaces 156, where the one or more storage devices 150 access the network interfaces 156 independently of the server 112. Those skilled in the art will also recognize how such configurations can be extended to support network redundancy and improved availability.
Advantageously, an in-server SAN device greatly reduces the complexity and expense of a conventional SAN. For example, a typical SAN requires storage controllers 152 and associated data storage devices 154 external to the server 112. These require additional rack space as well as cables, switches, and the like. The cables, switches, and other overhead required to configure a conventional SAN take up space, reduce bandwidth, and increase cost. The intra-server SAN device enables the storage controller 152 and associated storage devices 154 to fit within the form factor of the server 112, thereby reducing space requirements and cost. The intra-server SAN may also be connected via internal and external high-speed data buses using relatively fast communication paths.
In one embodiment, the storage device 150 is a solid state storage device 102, the storage controller 152 is a solid state storage controller 104, and the data storage device 154 is a solid state memory 110. The advantage of the described embodiments is that the solid-state storage device 102 described above is fast. Furthermore, the solid state storage device 102 may be configured in a DIMM that can be conveniently mounted into the server 112 and requires little space.
One or more internal clients 114a in the server 112 may also be connected to the computer network 116 through a server network interface 156, the connection of the clients typically being controlled by the server 112. This has several benefits. The client 114 may directly access the storage device 150 locally or remotely, and may initiate local or remote direct memory access ("DMA" "RDMA") of data that is transferred between the memory of the client 114a and the storage device 150.
In another embodiment, clients 114, 114a, internal or external to server 112, may act as file servers for clients 114 via one or more networks 116, while employing local attached storage devices 150 (e.g., DAS devices), network attached storage devices 150, network attached solid state storage 102 devices (that are part of an intra-server SAN, an external SAN, and a hybrid SAN). The storage device 150 may be affiliated with the DAS, SAN within the server, SAN, NAS, etc. simultaneously or in any combination thereof. Further, each storage device 150 may be partitioned in such a way that a first partition makes the storage device 150 available as a DAS, a second partition makes the storage device 150 available as an element of an intra-server SAN, a third partition makes the storage device 150 available as a NAS, a fourth partition makes the storage device 150 available as an element of a SAN, and so on. Also, the partitioning of the storage device 150 may be consistent with security and access control requirements. Those skilled in the art will recognize that any number of combinations and permutations of the following devices, including storage devices, virtual storage devices, storage networks, virtual storage networks, private memories, shared memories, parallel file systems, parallel object file systems, block storage devices, object storage devices, network appliances, and the like, may be constructed and supported.
Further, by connecting directly to the computer network 116, the storage devices 150 can communicate with each other and can function as an intra-server SAN. The storage device 150 (e.g., a SAN) is accessible to clients 114a within the server 112 and to clients 114 connected via the computer network 116. By moving the storage devices 150 into the servers 112 and providing the capability to configure the storage devices 150 as SANs, the server 112/storage device 150 combination reduces the need for dedicated storage controllers, fibre channel networks, and other devices in conventional SANs. An advantage of the in-server SAN system 103 is that it enables the storage devices 150 to share common resources, such as power, cooling resources, management resources, and physical space, with the clients 114 and computers 112. For example, the storage device 150 may fill empty slots of the server 112, providing performance, reliability, and utility of a SAN or NAS. Those skilled in the art will recognize other features and advantages of the in-server SAN system 103.
In another configuration, multiple intra-server SAN storage devices 150a are configured in a single server 112a infrastructure. In one embodiment, the server 112a includes one or more internal blade server clients 114a (interconnected using PCI-extended IOV) without an external network interface 156, external clients 114, 114b, or external storage device 150 b.
Further, the intra-server SAN storage devices 150 may communicate with equivalent storage devices 150 in the computers 112 via one or more computer networks 116 (see fig. 1A), or may be directly connected to the computer networks 116 without the computers 112, forming a hybrid SAN having the full capabilities of both SAN and intra-server SAN. The advantage of this flexibility is to simplify scalability and mobility between the many possible solid-state storage network facilities. Those skilled in the art will recognize other combinations, configurations, implementations, and architectures to position and interconnect the solid state controllers 104.
The network interface 156a may be controlled by a single agent running on the server 112a, in which case a link establishment module 168 running in the agent establishes communication paths, through the network interface 156a, between the internal client 114a and storage device 150a/first storage controller 152a on one side and the external storage devices 150b and clients 114, 114b on the other. In the preferred embodiment, once a communication path is established, an individual internal storage device 150a can establish and manage its own command queues with the internal client 114a and can transfer commands and data directly to and from the external storage devices 150b and clients 114, 114b through the network interface 156a using RDMA, either independently of the agent or by proxy through the agent. In one embodiment, the link establishment module 168 establishes the communication paths during an initialization process (e.g., startup or hardware initialization).
In another embodiment, the proxy module 170 sends at least a portion of the commands used to communicate the storage request through the first server 112a while communicating at least data (and possibly other commands) associated with the storage request between the first storage controller and the external storage device independent of the first server. In another embodiment, the proxy module 170 sends commands or data on behalf of the internal storage device 150a and the client 114 a.
In one embodiment, one or more servers (e.g., blade servers) operate within the first server 112a, and the apparatus further includes a virtual bus module 172 that allows the one or more servers in the first server 112a to independently access one or more storage controllers 152a over separate virtual buses. The virtual bus module 172 may be implemented using a higher-level bus protocol, such as PCIe-IOV. An IOV-enabled network interface 156a may allow the one or more servers and the one or more storage controllers to independently control one or more network interfaces 156a.
In various embodiments, the intra-server SAN apparatus allows two or more storage devices 150 to be configured as a RAID. In one embodiment, the apparatus includes a front-end RAID module 174 that configures two or more storage controllers 152 as a RAID. Where a storage request from a client 114, 114a comprises a data storage request, the front-end RAID module 174 services the storage request by writing data to the RAID consistent with the RAID level being implemented. A second storage controller 152 may be located within the first server 112a or external to the first server 112a. The front-end RAID module 174 allows the RAIDing of the storage controllers 152 to be transparent to the client 114, 114a that sent the storage request. Striping and parity information may be managed by a storage controller 152 designated as the master controller or by the client 114, 114a.
In another embodiment, the intra-server SAN device includes a back-end RAID module 176 that configures two or more data storage devices 154 controlled by a storage controller 152 as a RAID. Where a client storage request comprises a data storage request, the back-end RAID module 176 services the storage request by writing data to the RAID consistent with the RAID level implemented, such that the client 114, 114a accesses the RAID-configured data storage devices 154 as if they were a single data storage device 154 controlled by the first storage controller 152. This implementation allows the RAIDing of the data storage devices 154 controlled by a storage controller 152 to be transparent to any client 114, 114a accessing the data storage devices 154. In another embodiment, both front-end RAID and back-end RAID may be implemented to form a multi-level RAID. Those skilled in the art will recognize other ways to RAID the storage controllers 152 consistent with the solid-state storage controller 104 and associated solid-state storage 110 described herein.
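The specification does not spell out the striping arithmetic that a front-end or back-end RAID module would perform. As a rough, non-authoritative illustration of what "writing data to the RAID consistent with the RAID level implemented" can involve, the following minimal sketch shows a hypothetical RAID-5-style stripe write with rotating XOR parity across a set of storage devices; the names write_stripe and xor_blocks and the dictionary-backed devices are assumptions made for this example only.

```python
# Illustrative sketch only: a simplified RAID-5-style stripe write such as a
# front-end or back-end RAID module might perform. Device, stripe, and parity
# layout are hypothetical and are not taken from the specification.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (simple parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def write_stripe(devices, stripe_index, data_blocks):
    """Write N-1 data blocks plus one parity block across N devices.

    The parity position rotates with the stripe index, as in RAID 5,
    so that no single device holds all of the parity.
    """
    n = len(devices)
    assert len(data_blocks) == n - 1, "one device per stripe holds parity"
    parity = xor_blocks(data_blocks)
    parity_pos = stripe_index % n
    remaining = iter(data_blocks)
    for pos, device in enumerate(devices):
        block = parity if pos == parity_pos else next(remaining)
        device[stripe_index] = block          # each device modeled as a dict

# Example: three "devices", two data blocks plus one parity block per stripe.
devices = [{}, {}, {}]
write_stripe(devices, 0, [b"AAAA", b"BBBB"])
write_stripe(devices, 1, [b"CCCC", b"DDDD"])

# If device 0 is lost, its block of stripe 0 can be rebuilt from the others,
# which is what lets the RAIDing remain transparent to the requesting client.
rebuilt = xor_blocks([devices[1][0], devices[2][0]])
assert rebuilt == devices[0][0]
```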
Apparatus for storing objects managed by controller
FIG. 2A is a schematic block diagram illustrating one embodiment of an apparatus 200 for object management within a storage device in accordance with the present invention. The apparatus 200 includes a storage controller 152, the storage controller 152 having: an object request receiver module 260, a parsing module 262, a command execution module 264, an object index module 266, an object request queuing module 268, a packetizer 302 having a message module 270, and an object index reconstruction module 272, which are described below.
The storage controller 152 is generally similar to the storage controller 152 described for the system 102 in FIG. 1B and may be the solid-state storage device controller 202 described with respect to FIG. 2. The apparatus 200 includes an object request receiver module 260 that receives object requests from one or more requesting devices 155. For example, for a store-object-data request, the storage controller 152 stores the data segments in the form of data packets in the data storage device 154 coupled to the storage controller 152. An object request typically directs the storage controller to manage an object whose data segments are, or will be, stored in one or more object data packets. The object request may request that the storage controller 152 create an object that will later be filled with data by subsequent object requests, which may utilize local or remote direct memory access ("DMA"/"RDMA") transfers.
In one embodiment, the object request is a write request to write all or a portion of an object to a previously created object. In one example, the write request is for a data segment of an object. The other data segments of the object may be written to the storage device 150 or to other storage devices 150. In another example, the write request is for an entire object. In another example, the object request is to read data from a data segment managed by the storage controller 152. In yet another embodiment, the object request is a delete request to delete a data segment or object.
Advantageously, the storage controller 152 can accept write requests that do not merely write new objects or add data to existing objects. For example, the write request received by the object request receiver module 260 may include: a request to add data before data stored by the storage controller 152, a request to insert data in stored data, or a request to replace a segment of data. The object index maintained by the storage controller 152 provides the flexibility required for these complex write operations that are not available within other storage controllers, but are currently only available outside of the storage controllers within servers and other computer file systems.
The apparatus 200 includes a parsing module 262 that parses the object request into one or more commands. Typically, the parsing module 262 also parses the object request into one or more buffers. For example, one or more commands in the object request may be parsed into a command buffer. In general, the parsing module 262 prepares the object request so that the information in the object request can be understood and executed by the storage controller 152. Those skilled in the art will recognize other functions of a parsing module 262 that parses an object request into one or more commands.
The apparatus 200 includes a command execution module 264 that executes the commands parsed from the object request. In one embodiment, the command execution module 264 executes one command. In another embodiment, the command execution module 264 executes multiple commands. Typically, the command execution module 264 interprets a command (e.g., a write command) parsed from the object request and then creates, queues, and executes subcommands. For example, a write command parsed from an object request may direct the storage controller 152 to store multiple data segments. The object request may also include required attributes (e.g., encryption, compression, etc.). The command execution module 264 may direct the storage controller 152 to compress the data segments, encrypt the data segments, create one or more data packets with an associated header for each data packet, encrypt the data packets with a media encryption key, add error-correcting code, and store the data packets at a specified location. Storing the data packets at a specified location, as well as the other subcommands, may themselves be broken down into further, lower-level subcommands. Those skilled in the art will recognize other ways in which the command execution module 264 can execute one or more commands parsed from an object request.
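As a hedged sketch of how a write command might be decomposed into ordered subcommands of the kind listed above (compress, encrypt, packetize, add error-correcting code, store), consider the following Python outline. The stage functions are placeholders (zlib compression, an XOR "encryption", and a CRC32 footer standing in for ECC), and the function names are illustrative assumptions, not the controller's actual algorithms.

```python
# Illustrative sketch: decomposing a parsed write command into ordered
# subcommands (compress, encrypt, packetize, append ECC, store). The stage
# functions are placeholders, not the controller's actual algorithms.
import zlib

def compress(segment: bytes) -> bytes:
    return zlib.compress(segment)

def encrypt(data: bytes, key: int) -> bytes:
    # Placeholder "media encryption": XOR with a single repeating key byte.
    return bytes(b ^ key for b in data)

def packetize(data: bytes, packet_size: int = 32):
    return [data[i:i + packet_size] for i in range(0, len(data), packet_size)]

def add_ecc(packet: bytes) -> bytes:
    # Placeholder error-correcting code: a CRC32 footer.
    return packet + zlib.crc32(packet).to_bytes(4, "little")

def execute_write(segment: bytes, key: int, storage: list) -> list:
    """Run the subcommand chain and return the locations written."""
    data = encrypt(compress(segment), key)
    locations = []
    for packet in packetize(data):
        storage.append(add_ecc(packet))      # "store at a specified location"
        locations.append(len(storage) - 1)   # location = index in this sketch
    return locations

storage = []
locations = execute_write(b"example object data segment" * 4, key=0x5A, storage=storage)
print(locations)
```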
The apparatus 200 includes an object index module 266, the object index module 266 creating an object entry in the object index in response to the storage controller 152 creating the object or storing the object data segment. Typically, the storage controller 152 creates a data packet from the data segment and, when the data segment is stored, the location where the data packet is stored is specified. Object metadata received with a data segment or as part of an object request may be stored in a similar manner.
The object index module 266 creates an object entry in the object index when the data packets are stored and the physical addresses of the data packets are assigned. The object entry includes a mapping between the logical identifier of the object and one or more physical addresses corresponding to where the storage controller 152 stored the one or more data packets and any object metadata packets. In another embodiment, the entry in the object index is created before the data packets of the object are stored. For example, if the storage controller 152 determines in advance the physical addresses where the data packets are to be stored, the object index module 266 may create the entry in the object index in advance.
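The following minimal sketch illustrates the kind of mapping an object index entry establishes between a logical object identifier and the physical addresses of its data and metadata packets. The class and field names (ObjectIndex, ObjectIndexEntry, record_packet) are assumptions for illustration, not structures defined by the specification.

```python
# Illustrative sketch of an object index entry: a mapping from a logical
# object identifier to the physical addresses of its data packets and any
# metadata packets. Field names are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ObjectIndexEntry:
    object_id: int
    data_packet_addrs: List[int] = field(default_factory=list)
    metadata_packet_addrs: List[int] = field(default_factory=list)

class ObjectIndex:
    def __init__(self):
        self._entries = {}

    def record_packet(self, object_id: int, physical_addr: int, is_metadata=False):
        entry = self._entries.setdefault(object_id, ObjectIndexEntry(object_id))
        target = entry.metadata_packet_addrs if is_metadata else entry.data_packet_addrs
        target.append(physical_addr)

    def lookup(self, object_id: int) -> ObjectIndexEntry:
        return self._entries[object_id]

index = ObjectIndex()
index.record_packet(object_id=42, physical_addr=0x1000)
index.record_packet(object_id=42, physical_addr=0x1200)
print(index.lookup(42).data_packet_addrs)   # [4096, 4608]
```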
Generally, where an object request or group of object requests results in an object or data segment being modified (possibly during a read-modify-write operation), the object index module 266 updates an entry in the object index to reflect the modified object. In one embodiment, a new object is created and a new entry is created in the object index for the modified object. Typically, where only a portion of an object is modified, the resulting object includes the modified packets as well as some packets that remain unchanged. In this case, the new entry includes a mapping to the unchanged packets, in the same locations where they were originally written, and a mapping to the modified packets written to a new location.
In another embodiment, the object request receiver module 260 receives an object request that includes a command to erase a data block or other object element. In this case the storage controller 152 may store at least one packet, such as an erase packet, with information including a reference to the object, the relationship to the object, and the size of the data block erased. Additionally, it may further indicate that the erased object element is to be treated as filled with zeros. Thus, the erase object request can be used to emulate actual memory or storage that has been erased and that actually has the appropriate portion of the memory/storage filled with zeros in its cells.
Advantageously, creating an object index with entries that indicate mappings between data segments and object metadata allows the storage controller 152 to handle and manage objects autonomously. This capability allows for a great flexibility in storing data in the storage device 150. Once the index entry for the object is created, the storage controller 152 may efficiently handle subsequent object requests with respect to the object.
In one embodiment, the storage controller 152 includes an object request queuing module 268 that queues one or more object requests received by the object request receiver module 260 prior to parsing by the parsing module 262. The object request queuing module 268 allows flexibility between when an object request is received and when it is queued for execution.
In another embodiment, the storage controller 152 includes a packetizer 302 that creates one or more data packets from the one or more data segments, wherein the data packets are sized for storage within the data storage device 154. The packetizer 302 is described in more detail below in conjunction with FIG. 3. In one embodiment, the packetizer 302 includes a message module 270 that creates a header for each packet. The packet header includes a packet identifier and a packet length. The package identifier associates the package with the object for which the package was generated.
In one embodiment, each packet includes a self-contained packet identifier, in that the packet identifier contains sufficient information to determine the object, and the relationship of the object elements contained within the packet to that object. However, a more efficient, preferred embodiment is to store the packets in containers.
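As an illustration of a self-describing packet header of the kind described above, the sketch below packs an object identifier, a sequence number, and the payload length in front of each payload. The exact field widths and byte order are assumptions chosen for the example, not the header format defined by the controller.

```python
# Illustrative sketch of a self-describing packet header carrying an object
# identifier, a sequence number, and the packet length; the field layout is
# an assumption, not the header format defined by the controller.
import struct

HEADER_FMT = "<QIH"          # object id (8 bytes), sequence (4), length (2)
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def make_packet(object_id: int, sequence: int, payload: bytes) -> bytes:
    header = struct.pack(HEADER_FMT, object_id, sequence, len(payload))
    return header + payload

def parse_packet(packet: bytes):
    object_id, sequence, length = struct.unpack_from(HEADER_FMT, packet)
    payload = packet[HEADER_SIZE:HEADER_SIZE + length]
    return object_id, sequence, payload

pkt = make_packet(object_id=42, sequence=0, payload=b"first data segment")
print(parse_packet(pkt))     # (42, 0, b'first data segment')
```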
A container is a data structure that facilitates more efficient storage of data packets and helps establish relationships between an object and the data packets, metadata packets, and other packets related to the object that are stored within the container. Note that the storage controller 152 typically treats object metadata received as part of an object in a manner similar to the data segments. In general, a "packet" may refer to a data packet containing data, a metadata packet containing metadata, or another packet of another packet type. An object may be stored in one or more containers, and a container typically includes packets for no more than one unique object. An object may be distributed among multiple containers. A container is typically stored within a single logical erase block (storage division) and is typically not split between logical erase blocks.

In one example, a container may be split across two or more logical/virtual pages. A container is identified by a container label that associates the container with an object. A container may contain zero to many packets, and the packets within the container are typically from one object. A packet may be of many object element types, including object attribute elements, object data elements, object index elements, and the like. Hybrid packets may be created that include more than one object element type. Each packet may contain zero to many elements of the same element type. Each packet within a container typically contains a unique identifier that identifies its relationship to the object.
Each package is associated with a container. In a preferred embodiment, the containers are limited to erase blocks so that a container packet can be found at or near the beginning of each erase block. This helps to limit data loss to the extent of erase blocks with corrupted headers. In such an embodiment, if the object index is not available and the packet header within the erase block is corrupted, the contents from the corrupted packet header to the end of the erase block may be lost, as there may be no reliable mechanism to determine the location of the subsequent packet. In another embodiment, a more reliable approach is to employ containers that are limited to the boundaries of the pages. This implementation requires more header overhead. In another embodiment, the container may flow across page and erase block boundaries. This method requires less header overhead, but if the header is damaged, more data may be lost. For these embodiments, it is contemplated that some type of RAID may be used to further ensure data integrity.
In one embodiment, the apparatus 200 includes an object index reconstruction module 272 that reconstructs the entries in the object index using information from the packet headers stored in the data storage device 154. In one embodiment, the object index reconstruction module 272 reconstructs the entries of the object index by reading each packet header to determine the object to which the packet belongs, and the sequence information to determine where in the object the data or metadata belongs. The object index reconstruction module 272 uses the physical address information of each packet, along with its timestamp or sequence information, to create a mapping between the physical addresses of the packets and the object identifier and data segment sequence. The object index reconstruction module 272 uses the timestamp or sequence information to replay the sequence of index changes and thereby generally reconstructs the most recent state.
In another embodiment, the object index reconstruction module 272 locates packets using packet header information along with container packet information to identify the physical location, the object identifier, and the sequence number of each packet in order to reconstruct the entries in the object index. In one embodiment, erase blocks are time stamped or given an erase block sequence number when packets are written, and the timestamp or sequence information of an erase block is used together with information from the container headers and packet headers to reconstruct the object index. In another embodiment, timestamp or sequence information is written to an erase block when the erase block is recovered.
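The reconstruction idea can be illustrated with a small sketch: scan the stored packets, read the object identifier and sequence number from each header, and let the most recently written packet win for each (object, sequence) pair. For simplicity this sketch assumes that a higher physical address means a later write; the actual module would rely on the timestamp or erase block sequence information described above.

```python
# Illustrative sketch of rebuilding object index entries by scanning stored
# packets: each packet is modeled as (physical_addr, object_id, sequence).
# This models the reconstruction idea only, not the controller's scan format.
from collections import defaultdict

def rebuild_index(scanned_packets):
    """scanned_packets: iterable of (physical_addr, object_id, sequence)."""
    index = defaultdict(dict)          # object_id -> {sequence: physical_addr}
    for physical_addr, object_id, sequence in scanned_packets:
        existing = index[object_id].get(sequence)
        # Assumption for this sketch: a higher physical address means a later
        # write, so it supersedes an earlier packet for the same sequence,
        # reproducing the most recent state of the object.
        if existing is None or physical_addr > existing:
            index[object_id][sequence] = physical_addr
    return {obj: [addr for _, addr in sorted(seqs.items())]
            for obj, seqs in index.items()}

packets = [
    (0x1000, 42, 0),   # original packet for sequence 0 of object 42
    (0x1200, 42, 1),
    (0x2000, 42, 0),   # rewritten sequence 0 at a newer location
]
print(rebuild_index(packets))   # {42: [8192, 4608]}
```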
If the object index is stored in volatile memory, an error, power loss, or other factor that causes the storage controller 152 to shut down without saving the object index could be a problem if the object index could not be reconstructed. The object index reconstruction module 272 allows the object index to be stored in volatile memory with the advantages of volatile memory (e.g., fast access). The object index reconstruction module 272 also allows the object index to be reconstructed quickly and autonomously, without relying on a device external to the storage device 150.
In one embodiment, the object index in volatile memory is periodically stored within the data storage device 154. In a particular example, the object index, or "index metadata," is periodically stored in the solid-state storage 110. In another embodiment, the index metadata is stored in solid-state storage 110n, separate from the solid-state storage 110a-110n-1 that stores the packets. The index metadata is managed independently of the data and object metadata transmitted from a requesting device 155 and managed by the storage controller 152/solid-state storage controller 202. Managing and storing index metadata separately from other data and metadata of an object allows efficient data flow without the storage controller 152/solid-state storage device controller 202 unnecessarily processing object metadata.
In one embodiment, where the object request received by the object request receiver module 260 comprises a write request, the storage controller 152 receives one or more object data segments from the memory of the requesting device 155 via a local or remote direct memory access ("DMA", "RDMA") operation. In the preferred example, the storage controller 152 reads data from the memory of the requesting device 155 in one or more DMA or RDMA operations. In another example, the requesting device 155 writes the data segment to the memory controller 152 in one or more DMA or RDMA operations. In another embodiment, where the object request comprises a read request, the storage controller 152 transfers one or more data segments of the object to the memory of the requesting device 155 in one or more DMA or RDMA operations. In the preferred example, the storage controller 152 writes data to the memory of the requesting device 155 in one or more DMA or RDMA operations. In another example, the requesting device reads data from the memory controller 152 in one or more DMA or RDMA operations. In another embodiment, the storage controller 152 reads the set of object command requests from the memory of the requesting device 155 in one or more DMA or RDMA operations. In another example, the requesting device 155 writes a set of object command requests to the memory controller 152 in one or more DMA or RDMA operations.
In one embodiment, the storage controller 152 emulates block storage, and an object communicated between the requesting device 155 and the storage controller 152 includes one or more data blocks. In one embodiment, the requesting device 155 includes a driver so that the storage device 150 appears as a block storage device. For example, the requesting device 155 may send a block of data of a particular size along with the physical address where the requesting device 155 expects the data to be stored. The storage controller 152 receives the data block and uses the physical block address transmitted with the data block, or a transformation of the physical block address, as the object identifier. The storage controller 152 then stores the data block as an object or as a data segment of an object by packetizing the data block and storing the data packets. The object index module 266 then creates an entry in the object index using the physical-block-based object identifier and the actual physical location where the storage controller 152 stored the data packets comprising the data from the data block.
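A minimal sketch of the block-emulation idea, under the assumption that the block address is used directly as the object identifier and that an append-only list stands in for the storage media, might look as follows; the class and method names are illustrative only.

```python
# Illustrative sketch of block emulation: the block address sent by the
# requesting device is used (here, directly) as the object identifier, and
# the index maps it to wherever the controller actually stored the packet.
class BlockEmulatingStore:
    def __init__(self):
        self.object_index = {}   # object id (block address) -> physical location
        self.log = []            # append-only packet log standing in for the media

    def write_block(self, block_addr: int, data: bytes):
        self.log.append(data)                    # packetize + store (simplified)
        physical_location = len(self.log) - 1
        self.object_index[block_addr] = physical_location

    def read_block(self, block_addr: int) -> bytes:
        return self.log[self.object_index[block_addr]]

store = BlockEmulatingStore()
store.write_block(0x7F00, b"block written to a 'physical' address")
store.write_block(0x7F00, b"same block rewritten; index now points elsewhere")
print(store.read_block(0x7F00))
```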
In another embodiment, the storage controller 152 simulates block storage by receiving a block object. The block object may include one or more data blocks in a block structure. In one embodiment, the storage controller 152 processes the block object as any other object. In another embodiment, an object may represent an entire block device, a partition of a block device, or some other logical or physical sub-element of a block device, including tracks, sectors, channels, and the like. Of particular note is remapping a block device RAID group to an object that supports a different RAID construction (e.g., progressive RAID). Those skilled in the art will recognize other ways to map traditional or future block devices to objects.
Solid state storage device
FIG. 2B is a schematic block diagram illustrating one embodiment 201 of a solid-state storage device controller 202 within a solid-state storage device 102, the solid-state storage device controller 202 including a write data pipe 106 and a read data pipe 108 in accordance with the present invention. The solid-state storage device controller 202 may include several solid-state storage controllers 0-N 104a-n, each controlling solid-state storage 110. In the depicted embodiment, two solid-state controllers are shown: solid-state controller 0 104a and solid-state controller N 104n, and each of them controls solid-state storage 110a-n. In the depicted embodiment, solid-state storage controller 0 104a controls a data channel so that the attached solid-state storage 110a stores data. Solid-state storage controller N 104n controls an index metadata channel associated with the stored data, so that the associated solid-state storage 110n stores index metadata. In an alternative embodiment, the solid-state storage device controller 202 includes a single solid-state controller 104a with a single solid-state memory 110a. In another embodiment, there are a large number of solid-state storage controllers 104a-n and associated solid-state memories 110a-n. In one embodiment, one or more solid-state controllers 104a-104n-1 (connected to their associated solid-state storage 110a-110n-1) control data, while at least one solid-state storage controller 104n (connected to its associated solid-state storage 110n) controls index metadata.
In one embodiment, at least one solid-state controller 104 is a field-programmable gate array ("FPGA") and the controller functions are programmed into the FPGA. In a particular embodiment, the FPGA is a Xilinx FPGA. In another embodiment, the solid-state storage controller 104 comprises components specifically designed for the solid-state storage controller 104, such as an application-specific integrated circuit ("ASIC") or custom logic solution. Each solid-state storage controller 104 generally includes a write data pipe 106 and a read data pipe 108, both of which are further described in conjunction with FIG. 3. In another embodiment, at least one solid-state storage controller 104 is made up of a combination of FPGA, ASIC, and custom logic components.
Solid-state memory
The solid-state memory 110 is an array of non-volatile solid-state storage elements 216, 218, 220 arranged in a memory bank 214 and accessed in parallel over a bidirectional storage input output (I/O) bus 210. In one embodiment, the storage I/O bus 210 is capable of unidirectional communication at any one time. For example, when data is written to the solid-state memory 110, the data cannot be read from the solid-state memory 110. In another embodiment, data may flow bi-directionally at the same time. However, bidirectional (as used herein with respect to a data bus) refers to a data path in which data flows in only one direction at a time, but when data flowing on a bidirectional data bus is blocked, data may flow in the opposite direction on the bidirectional bus.
Solid state storage elements (such as SSS 0.0 216a) are typically configured as chips (packages of one or more dies) or as dies on a circuit board. As described, the solid-state storage element (e.g., 216a) operates independently or semi-independently of other solid-state storage elements (e.g., 218a), even though these elements are packaged together in a chip package, a stack of chip packages, or some other packaging element. As depicted, a column of solid state storage elements 216, 218, 220 is designated as a bank 214. As depicted, there may be "n" banks 214a-n and each bank may have "m" solid-state storage elements 216a-m, 218a-m, 220a-m, thereby forming an n x m array of solid-state storage elements 216, 218, 220 in the solid-state memory 110. In one embodiment, the solid-state storage 110a includes 20 solid-state storage elements 216, 218, 220 per bank 214 (there are 8 banks 214), and the solid-state storage 110n includes two solid-state storage elements 216, 218 per bank 214 (only one bank 214). In one embodiment, each solid state storage element 216, 218, 220 is comprised of a single-level cell ("SLC") device. In another embodiment, each solid-state storage element 216, 218, 220 is comprised of a multi-level cell ("MLC") device.
In one embodiment, solid state storage elements for multiple banks sharing a common row of storage I/O bus 210a (e.g., 216b, 218b, 220b) are packaged together. In one embodiment, each chip of the solid state storage elements 216, 218, 220 may have one or more dies, with one or more chips being vertically stacked and each die being independently accessible. In another embodiment, each die of a solid state storage element (e.g., SSS 0.0 216a) may have one or more dummy die, each chip may have one or more die, and some or all of the one or more dies are stacked vertically and each dummy die may be accessed independently.
In one embodiment, each group has four stacks with two chips stacked vertically to form 8 storage elements (e.g., SSS 0.0-SSS 0.8) 216a-220a, each of which is located in a separate bank 214a-n. In another embodiment, 20 storage elements (e.g., SSS 0.0-SSS 20.0) 216 form a virtual bank 214a, so each of the eight virtual banks has 20 storage elements (e.g., SSS 0.0-SSS 20.8) 216, 218, 220. Data is sent to the solid-state storage 110 over the storage I/O bus 210 and to all storage elements of a particular group of storage elements (SSS 0.0-SSS 0.8) 216a, 218a, 220a. The storage control bus 212a is used to select a particular bank (e.g., bank-0 214a) so that data received over the storage I/O bus 210, which is connected to all banks 214, is written only to the selected bank 214a.
In a preferred embodiment, the storage I/O bus 210 is comprised of one or more independent I/O buses (including 210a.a-m, 210n.a-m, "IIOBa-m"), wherein the solid state storage elements within each row share one of the independent I/O buses that access each solid state storage element 216, 218, 220 in parallel, thereby enabling access to all of the banks 214 simultaneously. For example, one channel of the storage I/O bus 210 may access the first solid-state storage elements 216a, 218a, 220a of each bank 214a-n simultaneously. The second channel of the storage I/O bus 210 may simultaneously access the second solid-state storage elements 216b, 218b, 220b of each bank 214 a-n. Each row of solid state storage elements 216, 218, 220 is accessed simultaneously. In one embodiment, where the solid state storage elements 216, 218, 220 are multi-tiered (physically stacked), all physical tiers of the solid state storage elements 216, 218, 220 are accessed simultaneously. As used herein, "simultaneous" also includes near-simultaneous access, where devices are accessed at slightly different time intervals to avoid switching noise. In this case, simultaneous is used to distinguish from sequential or serial access, in which commands and/or data are sent separately and sequentially.
Typically, the memory control bus 212 is employed to independently select the banks 214 a-n. In one embodiment, the bank 214 is selected using chip enable or chip select. The storage control bus 212 may select one of the multiple layers of solid state storage elements 216, 218, 220 when both chip select and chip enable are available. In other embodiments, the storage control bus 212 uses other commands to individually select one of the multiple layers of solid state storage elements 216, 218, 220. The solid state storage elements 216, 218, 220 may also be selected through a combination of control and address information transmitted over the storage I/O bus 210 and the storage control bus 212.
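As a rough model of the selection scheme described above, the sketch below treats the storage elements as an n x m array: data for a row of elements is presented on every channel of the storage I/O bus, but only the bank selected over the storage control bus latches the write. The array sizes and function names are assumptions for illustration, not parameters of the actual design.

```python
# Illustrative model of bank selection: the storage control bus selects one
# bank, and each channel of the storage I/O bus reaches the element in that
# bank's corresponding row, so a row of elements is written in parallel.
N_BANKS, M_ELEMENTS = 8, 20

# elements[bank][channel] stands in for one solid-state storage element.
elements = [[bytearray() for _ in range(M_ELEMENTS)] for _ in range(N_BANKS)]

def write_row(selected_bank: int, per_channel_data: list):
    """Write one unit of data per I/O channel into the selected bank only."""
    assert len(per_channel_data) == M_ELEMENTS
    for channel, data in enumerate(per_channel_data):
        # Every bank sees the channel data, but only the selected bank
        # (chip enable / chip select) actually latches the write.
        elements[selected_bank][channel].extend(data)

write_row(selected_bank=0, per_channel_data=[bytes([ch]) * 4 for ch in range(M_ELEMENTS)])
print(elements[0][3], elements[1][3])   # only bank 0 received data
```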
In one embodiment, each solid-state storage element 216, 218, 220 is partitioned into erase blocks, and each erase block is partitioned into pages. A typical page has a capacity of 2048 bytes ("2 kB"). In one example, a solid-state storage element (e.g., SSS 0.0) includes two registers and can program two pages, so that the dual-register solid-state storage element 216, 218, 220 has a capacity of 4 kB. A bank 214 of 20 solid-state storage elements 216, 218, 220 would then have an 80 kB capacity of pages accessed with the same address going out over the channels of the storage I/O bus 210.
This group of pages, of size 80 kB, in a bank 214 of solid-state storage elements 216, 218, 220 may be called a virtual page. Similarly, the erase blocks of each storage element 216a-m of a bank 214a may be grouped to form a virtual erase block. In a preferred embodiment, an erase block of pages within a solid-state storage element 216, 218, 220 is erased when an erase command is received within the solid-state storage element 216, 218, 220. However, the size and number of erase blocks, pages, planes, or other logical and physical divisions within the solid-state storage elements 216, 218, 220 are expected to change as technology advances, and it is expected that many embodiments consistent with new configurations are possible and are consistent with the general description herein.
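The capacity figures above can be checked with a short calculation using the example values from the text (2 kB pages, dual-register elements, 20 elements per bank); these are the text's example numbers, not fixed limits of the design.

```python
# Worked numbers for the virtual-page capacity described above, using the
# example figures from the text. These are example values only.
PAGE_BYTES = 2 * 1024          # one page per register
REGISTERS_PER_ELEMENT = 2      # a dual-register element programs two pages
ELEMENTS_PER_BANK = 20

element_capacity = PAGE_BYTES * REGISTERS_PER_ELEMENT     # 4 kB per element
virtual_page = element_capacity * ELEMENTS_PER_BANK        # 80 kB per bank
print(element_capacity // 1024, "kB per element;", virtual_page // 1024, "kB virtual page")
```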
Typically, when a packet is written to a particular location within a solid-state storage element 216, 218, 220, where the packet is intended to be written to a location within a particular page, which is specific to a particular erase block of a particular element of a particular bank, a physical address is sent on the storage I/O bus 210 and is followed by the packet. The physical address contains enough information for the solid-state storage element 216, 218, 220 to direct the packet to the designated location within the page. Because all storage elements in a row of storage elements (e.g., SSS 0.0-SSS 0.N 216a, 218a, 220a) are accessed simultaneously over the appropriate bus within the storage I/O bus 210a.a, in order to reach the proper page and to avoid writing the packet to similarly addressed pages in the row of storage elements (SSS 0.0-SSS 0.N 216a, 218a, 220a), the storage control bus 212 simultaneously selects the bank 214a that includes the solid-state storage element SSS 0.0 216a with the correct page to which the packet is to be written.
Similarly, a read command transmitted on the storage I/O bus 210 requires a command transmitted simultaneously on the storage control bus 212 to select the appropriate page within the individual bank 214a and bank 214. In the preferred embodiment, the read command reads an entire page, and because there are many solid state storage elements 216, 218, 220 in parallel within the memory bank 214, the read command reads an entire virtual page. However, the read command may be split into sub-commands, as will be explained below in connection with bank interleaving. Virtual pages may also be accessed in write operations.
An erase block erase command may be sent over the storage I/O bus 210 with a particular erase block address to erase a particular erase block. Typically, erase block erase commands may be sent over the parallel paths of the storage I/O bus 210 to erase a virtual erase block, each command with a particular erase block address to erase a particular erase block. Simultaneously, a particular bank (e.g., bank-0 214a) is selected over the storage control bus 212 to prevent erasure of similarly addressed erase blocks in all of the banks (banks 1-N 214b-n). Other commands may also be sent to a particular location using a combination of the storage I/O bus 210 and the storage control bus 212. Those skilled in the art will recognize other ways to select a particular storage location using the bidirectional storage I/O bus 210 and the storage control bus 212.
In one embodiment, packets are written sequentially to the solid-state memory 110. For example, packets are streamed to the storage write buffers of a bank 214a of storage elements 216, and when the buffers are full, the packets are programmed to a designated virtual page. Packets then refill the storage write buffers, and when the buffers are again full, the packets are written to the next virtual page. This process, virtual page by virtual page, typically continues until a virtual erase block is filled. In another embodiment, as the process continues virtual erase block by virtual erase block, the data flow may continue across virtual erase block boundaries.
In a read-modify-write operation, the data packets associated with an object are located and read in a read operation. Data segments of the modified object that have been changed are not written back to the locations from which they were read. Instead, the modified data segments are again converted to data packets and then written to the next available location in the virtual page currently being written. The object index entries for the respective data packets are modified to point to the packets that contain the modified data segments. The entry or entries in the object index for data packets associated with the same object that have not been modified will include pointers to the original locations of the unmodified data packets. Thus, if the original object is maintained (e.g., a previous version of the object is kept), the original object will have pointers in the object index to all data packets as originally written. The new object will have pointers in the object index to some of the original data packets and pointers to the modified data packets in the virtual page that is currently being written.
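The read-modify-write behaviour can be sketched with a small append-only model: modified packets are appended at the next available location and only their index entries are repointed, while unmodified packets keep their original locations. The class and method names below are assumptions for illustration, not part of the specification.

```python
# Illustrative sketch of the read-modify-write behaviour described above:
# modified packets are appended at the next available location and only the
# index entries for those packets are repointed; unmodified packets keep
# their original locations.
class AppendOnlyObjectStore:
    def __init__(self):
        self.log = []                     # append-only packet log
        self.index = {}                   # object_id -> {sequence: log position}

    def write(self, object_id: int, sequence: int, packet: bytes):
        self.log.append(packet)
        self.index.setdefault(object_id, {})[sequence] = len(self.log) - 1

    def read(self, object_id: int, sequence: int) -> bytes:
        return self.log[self.index[object_id][sequence]]

    def modify(self, object_id: int, sequence: int, change):
        old = self.read(object_id, sequence)          # read
        self.write(object_id, sequence, change(old))  # modify + write to a new location

store = AppendOnlyObjectStore()
store.write(7, 0, b"segment-0")
store.write(7, 1, b"segment-1")
store.modify(7, 1, lambda p: p.upper())
print(store.index[7])          # sequence 0 unchanged, sequence 1 repointed
print(store.read(7, 1))        # b'SEGMENT-1'
```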
In a copy operation, the object index includes an entry for the original object mapped to a number of packets stored in the solid-state storage 110. When a copy is made, a new object is created and a new entry is created in the object index mapping the new object to the original packets. The new object is also written to the solid-state storage 110, with its location mapped to the new entry in the object index. The new object packets may be used to identify the packets within the original object that are referenced, in case changes have been made in the original object that have not been propagated to the copy, and in case the object index is lost or corrupted.
Advantageously, writing packets sequentially facilitates a more even use of the solid-state storage 110 and allows the solid-state storage device controller 202 to monitor storage hot spots and level the usage of the various virtual pages within the solid-state storage 110. Writing packets sequentially also facilitates a powerful, efficient garbage collection system, which is described in detail below. Those skilled in the art will recognize other benefits of storing data packets sequentially.
Solid-state storage device controller
In various embodiments, the solid-state storage device controller 202 may also include a data bus 204, a local bus 206, a buffer controller 208, buffers 0-N 222a-n, a host controller 224, a direct memory access ("DMA") controller 226, a memory controller 228, a dynamic memory array 230, a static random access memory array 232, a management controller 234, a management bus 236, a bridge 238 to connect to a system bus 240, and miscellaneous logic 242, which will be described below. In other embodiments, the system bus 240 is connected to one or more network interface cards ("NICs") 244, some of which may include a remote DMA ("RDMA") controller 246, one or more central processing units ("CPUs") 248, one or more external memory controllers 250 and associated external memory arrays 252, one or more storage controllers 254, a peer controller 256, and a special purpose processor 258, as will be described below. The components 244-258 connected to the system bus 240 may be located within the computing system 112 or may be other devices.
In general, the solid-state storage controller 104 is in data communication with the solid-state storage 110 via a storage I/O bus 210. In a typical embodiment, the solid-state storage is arranged in banks 214, and each bank 214 includes multiple storage elements 216, 218, 220 accessed in parallel, so that the storage I/O bus 210 is an array of buses, one for each row of storage elements 216, 218, 220 within the banks 214. As used herein, the term "storage I/O bus" may refer to one storage I/O bus 210 or an array of independent data buses 204. In a preferred embodiment, each storage I/O bus 210 accessing a row of storage elements (e.g., 216a, 218a, 220a) may include a logical-to-physical mapping for the storage divisions (e.g., erase blocks) accessed in that row of storage elements 216a, 218a, 220a. This mapping allows a logical address mapped to the physical address of a storage division to be remapped to a different storage division if the first storage division fails, partially fails, is inaccessible, or has some other problem. Remapping is explained further in relation to the remapping module 314 of FIG. 3.
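A minimal sketch of the logical-to-physical remapping described above might keep a table from logical erase blocks to physical erase blocks plus a pool of spares, retiring a failed physical block and pointing its logical block at a spare. The table layout and method names are assumptions for illustration, not the remapping module's actual design.

```python
# Illustrative sketch of remapping a logical erase block to a spare physical
# erase block when the original fails. Table layout and failure handling are
# assumptions for illustration only.
class EraseBlockMap:
    def __init__(self, n_logical: int, n_physical: int):
        assert n_physical >= n_logical
        self.map = {lb: lb for lb in range(n_logical)}     # identity mapping at start
        self.spares = list(range(n_logical, n_physical))   # unused physical blocks
        self.retired = set()

    def physical(self, logical_block: int) -> int:
        return self.map[logical_block]

    def remap_failed(self, logical_block: int) -> int:
        """Retire the failed physical block and point the logical block at a spare."""
        self.retired.add(self.map[logical_block])
        self.map[logical_block] = self.spares.pop(0)
        return self.map[logical_block]

ebm = EraseBlockMap(n_logical=4, n_physical=6)
print(ebm.physical(2))        # 2
ebm.remap_failed(2)
print(ebm.physical(2))        # 4 (first spare); physical block 2 is retired
```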
Data may also be transferred from a requesting device 155 to the solid-state storage controller 104 through the system bus 240, bridge 238, local bus 206, buffer(s) 222, and finally over the data bus 204. The data bus 204 is typically coupled to one or more buffers 222a-n controlled by the buffer controller 208. The buffer controller 208 generally controls the transfer of data from the local bus 206 to the buffers 222 and, through the data bus 204, to the pipe input buffer 306 and output buffer 330. To account for clock domain differences, prevent data collisions, and the like, the buffer controller 208 typically controls how data from a requesting device is temporarily stored in a buffer 222 and thereafter transferred to the data bus 204 (or vice versa). The buffer controller 208 is typically used in conjunction with the main controller 224 to coordinate the flow of data. When data arrives, it arrives on the system bus 240 and is passed to the local bus 206 through the bridge 238.
Typically, data is transferred from the local bus 206 to one or more data buffers 222 under the control of the main controller 224 and buffer controller 208. The data then flows from the buffer 222 through the solid state controller 104 to the data bus 204 and to the solid state memory 110 (e.g., NAND flash or other storage medium). In a preferred embodiment, data is delivered with associated out-of-band metadata ("object metadata") that arrives with the data using one or more data channels, including one or more solid-state storage controllers 104a-104n-1 and associated solid-state storage 110a-110n-1, while at least one channel (solid-state storage controller 104n, solid-state storage 110n) is used for in-band metadata (e.g., index information and other metadata generated internally by solid-state storage device 102).
The local bus 206 is typically a bidirectional bus or set of buses that allows data and commands to be communicated between devices within the solid-state storage device controller 202, and between devices within the solid-state storage device 102 and the devices 244-258 connected to the system bus 240. The bridge 238 facilitates communication between the local bus 206 and the system bus 240. Those skilled in the art will recognize other embodiments, such as ring structures or switched star configurations, and other functions of the buses 240, 206, 204 and the bridge 238.
The system bus 240 is typically a bus of the computer or other device in which the solid-state storage device 102 is installed or to which it is attached. In one embodiment, the system bus 240 may be a PCI-e bus, a Serial Advanced Technology Attachment ("serial ATA") bus, a parallel ATA bus, or the like. In another embodiment, the system bus 240 is an external bus such as small computer system interface ("SCSI"), FireWire, Fibre Channel, USB, PCIe-AS, or the like. The solid-state storage device 102 may be packaged to fit internally to a device or as an externally attached device.
The solid-state storage device controller 202 includes a main controller 224 that controls higher-level functions within the solid-state storage device 102. In various embodiments, the master controller 224 controls data flow by interpreting object requests and other requests, directing the creation of indexes that map object identifiers associated with data to the physical addresses of the associated data, coordinating DMA requests, and the like. The main controller 224 controls many of the functions described herein, in whole or in part.
In one embodiment, the main controller 224 is an embedded controller. In another embodiment, the main controller 224 employs local memory, such as a dynamic memory array 230 (dynamic random access memory, "DRAM"), a static memory array 232 (static random access memory, "SRAM"), or the like. In one embodiment, the local memory is controlled using the main controller 224. In another embodiment, the main controller accesses the local memory via the memory controller 228. In another embodiment, the main controller runs a Linux server and may support various common server interfaces, such as the World Wide Web, HyperText Markup Language ("HTML"), and the like. In another embodiment, the main controller 224 employs a nano-processor. The main controller 224 may be constructed using programmable or standard logic, or any combination of the controller types listed above. Those skilled in the art will recognize many embodiments of a main controller.
In one embodiment, where the storage device 152/solid state storage device controller 202 manages multiple data storage devices/solid state memories 110a-n, the master controller 224 distributes the workload among the internal controllers (e.g., solid state storage controllers 104 a-n). For example, the master controller 224 may partition an object to be written to data storage devices (e.g., solid state memories 110a-n) such that each attached data storage device stores a portion of the object. This feature is a performance enhancement that allows faster storage and access to objects. In one embodiment, the main controller 224 is implemented using an FPGA. In another embodiment, firmware located within host controller 224 may be updated via management bus 236, system bus 240 connected to NIC244 via a network, or other device connected to system bus 240.
In one embodiment, the master controller 224, which manages objects, emulates block storage such that a computer 112 or other device connected to the storage device 152/solid-state storage device 102 views the storage device 152/solid-state storage device 102 as a block storage device and sends data to specific physical addresses in the storage device 152/solid-state storage device 102. The master controller 224 then divides up the blocks and stores the data blocks as it would store objects. The master controller 224 then maps the blocks and the physical addresses sent with the blocks to the actual locations determined by the master controller 224. The mapping is stored in the object index. Typically, for block emulation, a block device application program interface ("API") is provided in the computer 112, client 114, or other device wishing to use the storage device 152/solid-state storage device 102 as a block storage device.
In another embodiment, main controller 224 operates in conjunction with NIC controller 244 and embedded RDMA controller 246 to provide on-time RDMA data and command set transfers. NIC controller 244 may be hidden behind a non-transparent port to enable use of custom drivers. Likewise, a driver on client 114 may access computer network 118 through an I/O storage driver that employs standard stack APIs and operates in conjunction with NIC 244.
In one embodiment, the master controller 224 is also a redundant array of independent drives ("RAID") controller. Where the data storage device/solid-state storage device 102 is networked with one or more other data storage devices/solid-state storage devices 102, the master controller 224 may be a RAID controller for single-tier RAID, multi-tier RAID, progressive RAID, and the like. The master controller 224 also allows some objects to be stored in a RAID array while other objects are stored without RAID. In another embodiment, the master controller 224 may be a distributed RAID controller element. In another embodiment, the master controller 224 may comprise many RAID, distributed RAID, and other functions as described elsewhere.
In one embodiment, master controller 224 operates in conjunction with a single or multiple network managers (e.g., switches) to establish routing, balance bandwidth usage, failover, and the like. In another embodiment, the main controller 224 operates in conjunction with integrated application specific logic (via the local bus 206) and associated driver software. In another embodiment, the main controller 224 operates in conjunction with an attached special purpose processor 258 or logic device (via the external system bus 240) and associated driver software. In another embodiment, the master controller 224 operates in conjunction with remote dedicated logic devices (via the computer network 118) and associated driver software. In another embodiment, the main controller 224 operates in conjunction with the local bus 206 or an external bus attached to a hard disk drive ("HDD") storage controller.
In one embodiment, the master controller 224 communicates with one or more storage controllers 254, where the storage device/solid-state storage device 102 may appear as a storage device connected through a SCSI bus, Internet SCSI ("iSCSI"), fibre channel, or the like. Meanwhile, the storage device/solid-state storage device 102 may autonomously manage objects and may appear as an object file system or a distributed object file system. The master controller 224 may also be accessed through a peer controller 256 and/or a dedicated processor 258.
In another embodiment, the master controller 224 operates in conjunction with an autonomous integrated management controller to periodically validate FPGA code and/or controller software, to validate FPGA code while running (reset) and/or to validate controller software during power-up (reset), to support external reset requests, to support reset requests due to watchdog timeouts, and to support voltage, current, power, temperature, and other environmental measurements and threshold interrupt settings. In another embodiment, the main controller 224 manages garbage collection to free up erase blocks for reuse. In another embodiment, the master controller 224 manages wear leveling. In another embodiment, the master controller 224 allows the data storage device/solid-state storage device 102 to be partitioned into multiple virtual devices and allows partition-based media encryption. In yet another embodiment, the master controller 224 supports a solid-state storage controller 104 with advanced, multi-bit ECC correction. Those skilled in the art will recognize other features and functionality of the master controller 224 located within the storage controller 152 (or more specifically within the solid-state storage device 102).
In one embodiment, the solid-state storage device controller 202 includes a memory controller 228, the memory controller 228 controlling a dynamic random access memory array 230 and/or a static random access memory array 232. As described above, the memory controller 228 may be used independently of the master controller 224 or integrated with the master controller 224. The memory controller 228 typically controls volatile memory of some type, such as DRAM (dynamic random access memory array 230) and SRAM (static random access memory array 232). In other examples, the memory controller 228 also controls other memory types, such as electrically erasable programmable read-only memory ("EEPROM") and the like. In other embodiments, the memory controller 228 controls two or more memory types, and the memory controller 228 may include more than one controller. Typically, the memory controller 228 controls as much SRAM 232 as is feasible, and the SRAM 232 is supplemented by the DRAM 230.
In one embodiment, the object index is stored in the memory 230, 232 and periodically unloaded into a channel of the solid-state memory 110n or other non-volatile memory. One skilled in the art will recognize other uses and configurations for memory controller 228, dynamic memory array 230, and static memory array 232.
In one embodiment, the solid-state storage device controller 202 includes a DMA controller 226, the DMA controller 226 controlling DMA operations between: a storage device/solid state storage device 102, one or more external memory controllers 250, an associated external memory array 252, and a CPU 248. It should be noted that the external memory controller 250 and the external memory array 252 are referred to as external because they are external to the storage device/solid state storage device 102. In addition, DMA controller 226 may also control RDMA operations for the requesting device through NIC244 and associated RDMA controller 246. DMA and RDMA are described in detail below.
In one embodiment, the solid-state storage device controller 202 includes a management controller 234 connected to a management bus 236. The management controller 234 generally manages the environmental metrics and status of the storage device/solid-state storage device 102. The management controller 234 may monitor device temperature, fan speed, power supply settings, etc. via the management bus 236. The management controller may support electrically erasable programmable read-only memory ("EEPROM") to store FPGA code and controller software. Typically, the management bus 236 is connected to different components within the storage device/solid state storage device 102. The management controller 234 may communicate alarms, interrupts, etc. over the local bus 206 or may include a separate connection to the system bus 240 or other bus. In one embodiment, the management bus 236 is an inter-integrated circuit ("I2C") bus. Those skilled in the art will recognize other functions and uses of the management controller 234 coupled to the components of the storage/solid state storage device 102 through the management bus 236.
In one embodiment, the solid-state storage device controller 202 includes miscellaneous logic blocks 242, which miscellaneous logic blocks 242 may be customized for exclusive use. In general, when the solid state device controller 202 or the master controller 224 is configured using an FPGA or other configurable controller, custom logic may be included based on specific applications, user requirements, storage requirements, and the like.
Data pipeline
FIG. 3 is a schematic block diagram illustrating one embodiment 300 of a solid-state storage controller 104 with a write data pipe 106 and a read data pipe 108 in a solid-state storage device 102 in accordance with the present invention. The embodiment 300 includes a data bus 204, a local bus 206, and a buffer controller 208, which are substantially similar to the devices described with respect to the solid-state storage device controller 202 in FIG. 2. The write data pipe 106 includes a packetizer 302 and an error correction code ("ECC") generator 304. In other embodiments, the write data pipe 106 includes an input buffer 306, a write synchronization buffer 308, a write program module 310, a compression module 312, an encryption module 314, a garbage collector bypass 316 (partially located within the read data pipe), a media encryption module 318, and a write buffer 320. The read data pipe 108 includes a read synchronization buffer 328, an ECC correction module 322, an unpacker 324, an alignment module 326, and an output buffer 330. In another embodiment, the read data pipe 108 may include a media decryption module 332, a portion of the garbage collector bypass 316, a decryption module 334, a decompression module 336, and a read program module 338. The solid-state storage controller 104 may also include control and status registers 340 and control queues 342, a bank interleave controller 344, a synchronization buffer 346, a storage bus controller 348, and a multiplexer ("MUX") 350. The components of the solid-state controller 104 and the associated write data pipe 106 and read data pipe 108 are described below. In other embodiments, synchronous solid-state memory 110 may be employed and the synchronization buffers 308, 328 may not be used.
Write data pipeline
The write data pipe 106 includes a packetizer 302 that receives, directly or indirectly through another stage of the write data pipe 106, data or metadata segments to be written to solid state memory and creates one or more packets sized for the solid state memory 110. The data or metadata segment is typically part of an object, but may also include the entire object. In another embodiment, the data segment is a portion of a data block, but may also include the entire data block. Typically, the objects are received from a computer 112, client 114, or other computer or device and are transferred to the solid state storage device 102 in the form of data segments that flow to the solid state storage device 102 or computer 112. A data segment may also be referred to by another name (e.g., a data wrapper), and references herein to a data segment include all or a portion of an object or data block.
Each object is stored as one or more packages. Each object may have one or more container packages. Each packet contains a header. The packet header may include a header type field. The type field may include data, object properties, metadata, data segment delimiters (multi-pack), object structure, object connections, and the like. The packet header may also include information about the size of the packet (e.g., the number of bytes of data within the packet). The length of the packet may be determined by the packet type. One example might be to use an offset value of the packet header to determine the location of a data segment within an object. Those skilled in the art will recognize other information contained within the header added to the data by packetizer 302 and other information added to the data packet.
Each packet includes a header and possibly data from the data or metadata segment. The header of each packet includes relevant information for associating the packet with the object to which the packet belongs. For example, the packet header may include an object identifier and an offset value that indicates the data segment, object, or data block from which the packet was formed. The packet header may also include a logical address that the storage bus controller 348 uses to store the packet. The header may also include information about the size of the packet (e.g., the number of bytes in the packet). The packet header may also include a sequence number that identifies where the data segment belongs relative to other packets within the object when reconstructing the data segment or object. The packet header may include a header type field. The type field may include data, object properties, metadata, data segment delimiters (multi-pack), object structure, object connections, and the like. Those skilled in the art will recognize other information contained in the header that is added to the data by the packetizer 302 and other information added to the data packet.
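The header fields described above can be pictured with a short sketch. The following Python fragment is a minimal, hypothetical model of how the packetizer 302 might split a data segment into packets and attach a header carrying an object identifier, offset, length, type, and sequence number; the field names, sizes, and packet payload size are illustrative assumptions rather than the actual on-media format.

    import struct
    from dataclasses import dataclass

    # Hypothetical header layout: object id, offset, length, type, sequence number.
    HEADER_FMT = ">QQIHH"     # big-endian: two uint64, one uint32, two uint16
    PACKET_DATA_SIZE = 512    # illustrative packet payload size

    @dataclass
    class Packet:
        object_id: int
        offset: int
        length: int
        ptype: int            # e.g. 0 = data, 1 = object attribute, 2 = metadata
        seq: int
        payload: bytes

        def to_bytes(self):
            header = struct.pack(HEADER_FMT, self.object_id, self.offset,
                                 self.length, self.ptype, self.seq)
            return header + self.payload

    def packetize(object_id, segment, base_offset=0):
        """Split a data segment into fixed-size packets, each with its own header."""
        for seq, start in enumerate(range(0, len(segment), PACKET_DATA_SIZE)):
            chunk = segment[start:start + PACKET_DATA_SIZE]
            yield Packet(object_id, base_offset + start, len(chunk), 0, seq, chunk)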
The write data pipe 106 includes an ECC generator 304 that generates one or more error correction codes ("ECCs") for one or more packets received from the packetizer 302. The ECC generator 304 typically employs an error correction algorithm to generate the ECC, which is stored with the packet. The ECC stored with the packet is typically used to detect and correct errors introduced during transmission and storage. In one embodiment, packets stream into the ECC generator 304 as unencoded blocks of length N. A syndrome of length S is calculated, appended, and output as an encoded block of length N + S. The values of N and S depend on the characteristics of the algorithm, which is selected to achieve specific performance, efficiency, and robustness metrics. In the preferred embodiment, there is no fixed relationship between ECC blocks and packets; a packet may include more than one ECC block; an ECC block may include more than one packet; a first packet may terminate anywhere within an ECC block, and a second packet may begin where the first packet within the same ECC block terminates. In a preferred embodiment, the ECC algorithm is not dynamically modified. In a preferred embodiment, the ECC stored with the data packet is robust enough to correct errors in more than two bits.
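The relationship between packets and ECC blocks described above, in which unencoded blocks of length N receive an appended syndrome of length S and packets need not align with block boundaries, can be sketched as follows. The block sizes are illustrative assumptions, and the syndrome shown is a placeholder checksum rather than a real error-correcting code.

    import hashlib

    N = 480   # illustrative unencoded ECC block length
    S = 32    # illustrative syndrome length appended to each block

    def ecc_encode_stream(packet_stream):
        """Carve a packet stream into N-byte blocks and append an S-byte syndrome
        to each, producing encoded blocks of length N + S.  Packets are not
        aligned to block boundaries: one packet may span several ECC blocks and
        one ECC block may hold parts of several packets."""
        encoded = bytearray()
        for start in range(0, len(packet_stream), N):
            block = packet_stream[start:start + N].ljust(N, b"\x00")
            # Placeholder for a real ECC syndrome (e.g. BCH or Reed-Solomon parity).
            syndrome = hashlib.sha256(block).digest()[:S]
            encoded += block + syndrome
        return bytes(encoded)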
Advantageously, employing a robust ECC algorithm that allows correction of more than one bit, or even more than two bits, allows the lifetime of the solid-state memory 110 to be extended. For example, if flash memory is used as the storage medium in the solid-state memory 110, the flash memory may be written approximately 100,000 times without error per erase cycle. This lifetime can be extended by a robust ECC algorithm. Having the ECC generator 304 and corresponding ECC correction module 322 on board, the solid-state storage device 102 can correct errors internally and has a longer useful life than if a less robust ECC algorithm (e.g., single-bit error correction) were employed. However, in other embodiments, the ECC generator 304 may employ a less robust algorithm and may correct single-bit or double-bit errors. In another embodiment, the solid-state memory 110 may include less reliable memory, such as multi-level cell ("MLC") flash memory, to increase capacity; such memory may not be sufficiently reliable without a robust ECC algorithm.
In one embodiment, the write data pipe 106 includes an input buffer 306 that receives a data segment to be written to the solid-state memory 110 and stores the incoming data segment until the next stage of the write data pipe 106, such as the packetizer 302 (or another stage of a more complex write data pipe 106), is ready to process the next data segment. By using an appropriately sized data buffer, the input buffer 306 generally accommodates a difference between the rate at which the write data pipe 106 receives data segments and the rate at which it processes them. The input buffer 306 also allows the data bus 204 to transfer data to the write data pipe 106 at a rate greater than the write data pipe 106 can support, thereby improving the efficiency at which the data bus 204 operates. Typically, when the write data pipe 106 does not include an input buffer 306, the buffering function is performed elsewhere, such as in the solid-state storage device 102 but outside the write data pipe 106, in the computer 112, such as within a network interface card ("NIC"), or on another device, for example when remote direct memory access ("RDMA") is used.
In another embodiment, the write data pipeline 106 also includes a write sync buffer 308, the write sync buffer 308 buffering packets received from the ECC generator 304 before writing the packets to the solid-state memory 110. Write synchronization buffer 308 is located on the boundary between the local clock domain and the solid-state storage clock domain and provides buffering to account for clock domain differences. In other embodiments, the synchronous buffers 308, 328 may be removed using the synchronous solid state memory 110.
In one embodiment, the write data pipe 106 further includes a media encryption module 318, the media encryption module 318 receiving one or more packets directly or indirectly from the packetizer 302 and encrypting the one or more packets with an encryption key unique to the solid-state storage device 102 before sending the packets to the ECC generator 304. Typically, the entire packet (including the header) is encrypted. In another embodiment, the header is not encrypted. In this context, an encryption key is understood to mean a secret encryption key that is managed externally from an embodiment that integrates the solid-state memory 110 and that requires encryption protection. The media encryption module 318 and the corresponding media decryption module 332 provide a level of security for the data stored in the solid-state memory 110. For example, where data is encrypted with the media encryption module 318, if the solid-state memory 110 is connected to a different solid-state storage controller 104, solid-state storage device 102, or computer 112, the contents of the solid-state memory 110 typically cannot be read without considerable effort unless the same encryption key used during the writing of the data to the solid-state memory 110 is available.
In typical embodiments, the solid-state storage device 102 does not store the encryption key in non-volatile memory and does not allow external access to the encryption key. The encryption key is provided to the solid-state storage controller 104 during initialization. The solid-state storage device 102 may use and store a non-secret cryptographic nonce that is used in conjunction with the encryption key. A different nonce may be stored with each packet. Data segments may be split between multiple packets with unique nonces to improve protection by the encryption algorithm. The encryption key may be received from a client 114, a computer 112, a key manager, or another device that manages the encryption key used by the solid-state storage controller 104. In another embodiment, the solid-state memory 110 may have two or more partitions, and the solid-state storage controller 104 behaves as though there were two or more solid-state storage controllers 104, each running on a single partition within the solid-state memory 110. In such an embodiment, a unique media encryption key may be used with each partition.
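One way to picture the media encryption described above, assuming the third-party Python package "cryptography" is available, is an authenticated cipher keyed with a device-wide secret and a distinct, non-secret nonce stored with each packet. This is only a sketch of the idea, not the controller's actual algorithm.

    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_packet(media_key, packet):
        """Encrypt one packet with the device-wide media key and a fresh,
        non-secret nonce; the nonce is stored alongside the ciphertext."""
        nonce = os.urandom(12)                 # unique per packet, not secret
        ciphertext = AESGCM(media_key).encrypt(nonce, packet, None)
        return nonce, ciphertext

    def decrypt_packet(media_key, nonce, ciphertext):
        return AESGCM(media_key).decrypt(nonce, ciphertext, None)

    # The key would normally be supplied at initialization and never stored in
    # non-volatile memory on the device.
    media_key = AESGCM.generate_key(bit_length=256)
    nonce, ct = encrypt_packet(media_key, b"example packet contents")
    assert decrypt_packet(media_key, nonce, ct) == b"example packet contents"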
In another embodiment, the write data pipe 106 further includes an encryption module 314, the encryption module 314 directly or indirectly encrypting the data or metadata segment received from the input buffer 306 prior to sending the data segment to the packetizer 302, the data segment being encrypted with an encryption key received with the data segment. The encryption module 314 is different from the media encryption module 318 because the encryption keys used by the encryption module 314 to encrypt data are not common to all data stored within the solid-state storage device 102 and may differ on an object basis, and because the encryption key may not be received with the data segment (as described below). For example, the encryption key used by the encryption module 314 to encrypt a data segment may be received with the data segment or may be received as part of a command to write the object to which the data segment belongs. The solid-state storage device 102 may use and store a non-secret cryptographic nonce in each object packet that is used in conjunction with the encryption key. A different nonce may be stored with each packet. Data segments may be split between multiple packets with unique nonces to improve protection by the encryption algorithm. In one embodiment, the nonce used by the media encryption module 318 is the same nonce used by the encryption module 314.
The encryption key may be received from the client 114, computer 112, key manager, or other device that maintains the encryption key used to encrypt the data segment. In one embodiment, the encryption key is transferred to the solid-state storage controller 104 from one of the solid-state storage devices 102, computers 112, clients 114, or other external agents capable of performing industry standard methods to securely transfer and protect the private and public keys.
In one embodiment, the encryption module 314 encrypts a first packet using a first encryption key received with the first packet and encrypts a second packet using a second encryption key received with the second packet. In another embodiment, the encryption module 314 encrypts the first packet using the first encryption key received with the first packet and passes the second packet to the next stage unencrypted. Advantageously, including the encryption module 314 within the write data pipe 106 of the solid-state storage device 102 allows object-by-object or segment-by-segment data encryption without requiring a separate file system or other external system to track the different encryption keys used to store the respective objects or data segments. Each requesting device 155 or associated key manager independently manages the encryption keys used to encrypt only the objects or data segments sent by that requesting device 155.
In another embodiment, the write data pipe 106 includes a compression module 312, the compression module 312 compressing the data or metadata segment before sending the data segment to the packetizer 302. The compression module 312 generally compresses the data or metadata segment using a compression routine known to those skilled in the art to reduce the amount of storage space occupied by the segment. For example, if the data segment includes a string of 512 zeros, the compression module 312 may replace the 512 zeros with a code indicating 512 zeros, where the code occupies much less space than the 512 zeros it replaces.
In one embodiment, the compression module 312 compresses a first segment using a first compression routine and passes a second segment along uncompressed. In another embodiment, the compression module 312 compresses the first segment using the first compression routine and compresses the second segment using a second compression routine. Having this flexibility within the solid-state storage device 102 is advantageous so that clients or other devices writing data to the solid-state storage device 102 can each specify a compression routine, or so that one device can specify a compression routine while another device specifies no compression. The compression routine may also be selected according to default settings on a per-object-type or per-object-class basis. For example, a first object of a specific object class and object type may be able to override the default compression routine setting, a second object of the same object class and object type may employ the default compression routine, and a third object of the same object class and object type may not be compressed at all.
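The per-segment choice of compression routine described above might look like the following sketch, in which a writing device may name a compression routine, fall back to the default for the segment's object class, or opt out of compression entirely; the routine names and class defaults are illustrative assumptions.

    import zlib

    # Hypothetical default compression routine per object class (None = no compression).
    CLASS_DEFAULTS = {"log": "zlib", "image": None}

    def compress_segment(segment, object_class, override="default"):
        """Compress a data segment with the routine requested by the writing
        device, or fall back to the default routine for the object class."""
        routine = CLASS_DEFAULTS.get(object_class) if override == "default" else override
        if routine == "zlib":
            return "zlib", zlib.compress(segment)
        return "none", segment          # stored uncompressed

    # A run of 512 zeros collapses to a handful of compressed bytes.
    tag, packed = compress_segment(b"\x00" * 512, "log")
    assert tag == "zlib" and len(packed) < 32

    # A device may override the class default and request no compression.
    tag, packed = compress_segment(b"\x00" * 512, "log", override=None)
    assert tag == "none"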
In one embodiment, the write data pipe 106 includes a garbage collector bypass 316, the garbage collector bypass 316 receiving data segments from the read data pipe 108 as part of a data bypass in a garbage collection system. A garbage collection system typically marks packets that are no longer valid, usually because the packets are marked for deletion or because the packets have been modified and the modified data is stored in a different location. At some point, the garbage collection system determines that a particular region of memory can be recovered. The determination that a region can be recovered may be due to a lack of available storage space, a threshold percentage of data marked as invalid, a need to consolidate valid data, a threshold error detection rate for that region of memory, improving performance based on data distribution, or the like. The garbage collection algorithm may take a number of factors into account to determine when a region of memory is to be recovered.
Once a region of memory is marked for recovery, the valid packets within the region typically must be relocated. The garbage collector bypass 316 allows packets to be read into the read data pipe 108 and then transferred directly to the write data pipe 106 without being routed out of the solid-state storage controller 104. In a preferred embodiment, the garbage collector bypass 316 is part of an autonomous garbage collection system operating within the solid-state storage device 102. This allows the solid-state storage device 102 to manage data so that data is systematically spread throughout the solid-state memory 110 to improve performance and data reliability, to avoid over-using or under-using any one location or region of the solid-state memory 110, and to extend the useful life of the solid-state memory 110.
The garbage collector bypass 316 coordinates the insertion of data segments into the write data pipe 106 while other data segments are being written by a client 114 or another device. In the depicted embodiment, the garbage collector bypass 316 is located before the packetizer 302 in the write data pipe 106 and after the unpacker 324 in the read data pipe 108, but it may be located elsewhere in the write and read data pipes 106, 108. The garbage collector bypass 316 may be used during a flush of the write data pipe 106 to fill the remainder of a virtual page, thereby increasing storage efficiency within the solid-state memory 110 and reducing the frequency of garbage collection.
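As a rough illustration of the recovery decision and the bypass path just described, the sketch below marks a region for recovery once the fraction of invalid packets crosses a threshold and then re-queues the region's remaining valid packets into the write path. The threshold and data structures are assumptions made for illustration only.

    INVALID_THRESHOLD = 0.7   # illustrative: recover once 70% of a region is invalid

    def should_recover(region):
        """region is a dict like {"packets": [...], "invalid": set_of_indices}."""
        if not region["packets"]:
            return False
        return len(region["invalid"]) / len(region["packets"]) >= INVALID_THRESHOLD

    def recover_region(region, write_pipeline_queue):
        """Move the still-valid packets of a recoverable region back into the
        write path (the garbage collector bypass), then let the region be erased."""
        for index, packet in enumerate(region["packets"]):
            if index not in region["invalid"]:
                write_pipeline_queue.append(packet)   # bypass into the write data pipe
        region["packets"].clear()
        region["invalid"].clear()

    region = {"packets": [b"p0", b"p1", b"p2", b"p3"], "invalid": {0, 1, 3}}
    write_queue = []
    if should_recover(region):
        recover_region(region, write_queue)
    assert write_queue == [b"p2"]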
In one embodiment, the write data pipeline 106 includes a write buffer 320, the write buffer 320 buffering data for efficient write operations. Typically, the write buffer 320 includes sufficient capacity for packets to fill at least one virtual page within the solid-state memory 110. This allows a write operation to send an entire page of data to the solid-state memory 110 without interruption. By selecting the capacity of the write buffer 320 of the write data pipeline 106 and selecting the capacity of the buffer within the read data pipeline 108 to be the same size capacity or larger than the capacity of the write buffer within the solid state memory 110, the efficiency of writing and reading data is greater because a single write command can be designed to send an entire virtual page of data to the solid state memory 110, thereby replacing multiple commands with a single command.
While the write buffer 320 is being filled, the solid-state memory 110 remains available for other read operations. This is advantageous because other solid-state devices with smaller write buffers or no write buffers may tie up the solid-state memory when data is written to a storage write buffer and the data flowing into the storage write buffer stalls; read operations are then blocked until the entire storage write buffer is filled and programmed. Another approach, for systems without write buffers or with small write buffers, is to flush the storage write buffer before it is full in order to enable read operations. This approach is also inefficient because multiple write/program cycles are required to fill a page.
For the depicted embodiment with a write buffer 320 whose capacity is larger than a virtual page, a single write command, which includes numerous sub-commands, can be followed by a single program command to transfer the page of data from the storage write buffer in each solid-state storage element 216, 218, 220 to the designated page within each solid-state storage element 216, 218, 220. The benefits of this technique are that partial page programming, which is known to reduce data reliability and durability, is eliminated, and that the destination bank is freed for read commands and other commands while the buffer fills.
In one embodiment, write buffer 320 is an alternating buffer, where one side of the alternating buffer is filled and then designated to transfer data at the appropriate time when the other side of the alternating buffer is filled. In another embodiment, the write buffer 320 includes a first-in-first-out ("FIFO") register having a larger capacity than the virtual page of data segments. Those skilled in the art will recognize other write buffer 320 configurations that allow for the storage of virtual pages of data prior to writing the data to the solid state memory 110.
In another embodiment, the write buffer 320 has a capacity smaller than a virtual page, so that less than a page of information can be written to a storage write buffer within the solid-state memory 110. In such an embodiment, to prevent a stall in the write data pipe 106 from holding up read operations, data that needs to be moved from one location to another as part of the garbage collection process is queued using the garbage collection system. To prevent a stall of data in the write data pipe 106, the data may be fed through the garbage collector bypass 316 to the write buffer 320 and then on to the storage write buffer in the solid-state memory 110, filling the pages of a virtual page before the data is programmed. In this way, a stall of data in the write data pipe 106 does not stall reads from the solid-state storage device 102.
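The page-sized write buffering described above can be sketched as follows: packets accumulate until a full virtual page is available, and packets routed through the garbage collector bypass can be used to top off a partially filled page so that a full page can be programmed. The page size and interfaces are illustrative assumptions.

    VIRTUAL_PAGE_SIZE = 4096   # illustrative virtual page capacity in bytes

    class WriteBuffer:
        def __init__(self, program_page):
            self.buffer = bytearray()
            self.program_page = program_page   # callback that programs one full page

        def add(self, packet):
            self.buffer += packet
            while len(self.buffer) >= VIRTUAL_PAGE_SIZE:
                self.program_page(bytes(self.buffer[:VIRTUAL_PAGE_SIZE]))
                del self.buffer[:VIRTUAL_PAGE_SIZE]

        def fill_from_bypass(self, gc_packets):
            """Top off a partially filled page with packets supplied by the
            garbage collector bypass so a full page can be programmed."""
            for packet in gc_packets:
                self.add(packet)
                if not self.buffer:
                    break

    pages = []
    wb = WriteBuffer(pages.append)
    wb.add(b"a" * 3000)                       # not enough for a full page yet
    wb.fill_from_bypass([b"b" * 1096])        # bypass data completes the page
    assert len(pages) == 1 and len(pages[0]) == VIRTUAL_PAGE_SIZE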
In another embodiment, the write data pipe 106 includes a write program module 310, the write program module 310 providing one or more user-definable functions within the write data pipe 106. The write program module 310 allows a user to customize the write data pipe 106. A user may customize the write data pipe 106 based on a particular data request or application. When the solid-state storage controller 104 is an FPGA, the user can program the write data pipe 106 with custom commands and functions with relative ease. A user may also use the write program module 310 to include custom functions in an ASIC, although customizing an ASIC may be more difficult than using an FPGA. The write program module 310 may include buffers and a bypass mechanism to allow a first data segment to execute in the write program module 310 while a second data segment continues through the write data pipe 106. In another embodiment, the write program module 310 may include a processor core that can be programmed through software.
It should be noted that the write program module 310 is shown between the input buffer 306 and the compression module 312; however, the write program module 310 may be located anywhere within the write data pipe 106 and may be distributed among the different stages 302-320. In addition, multiple write program modules 310 may be distributed among the different stages 302-320 and programmed and operated independently. Further, the order of the stages 302-320 may be changed. Those skilled in the art will recognize workable changes to the order of the stages 302-320 based on the needs of a particular user.
Read data pipeline
The read data pipe 108 includes an ECC correction module 322 that determines whether a data error exists in an ECC block of a requested packet received from the solid-state memory 110 by using the ECC stored with each ECC block of the requested packet. If any errors exist and the errors are correctable using the ECC, the ECC correction module 322 then corrects them. For example, if the ECC can detect errors in six bits but can only correct three bits in error, the ECC correction module 322 corrects ECC blocks of the requested packet with up to three bits in error. The ECC correction module 322 corrects the bits in error by changing them to the correct one or zero state, so that the requested data packet is identical to when it was written to the solid-state memory 110 and the ECC was generated for the packet.
If the ECC correction module 322 determines that the requested packet contains more bits in error than the ECC can correct, the ECC correction module 322 cannot correct the errors in the corrupted ECC blocks of the requested packet and sends an interrupt. In one embodiment, the ECC correction module 322 sends an interrupt along with a message indicating that the requested packet is in error. The message may include information indicating that the ECC correction module 322 cannot correct the errors, or the inability of the ECC correction module 322 to correct the errors may be implied. In another embodiment, the ECC correction module 322 sends the corrupted ECC blocks of the requested packet along with the interrupt and/or the message.
In a preferred embodiment, a corrupted ECC block or a portion of a corrupted ECC block of a requested packet that cannot be corrected by the ECC correction module 322 is read by the master controller 224, corrected, and returned to the ECC correction module 322 for further processing by the read data pipe 108. In one embodiment, the corrupted ECC block or portion of the corrupted ECC block of the requested packet is sent to the device requesting the data. The requesting device 155 may correct the ECC block or replace the data with another copy (e.g., a backup or mirror copy), and may then use the replacement data of the requested data packet or return it to the read data pipe 108. The requesting device 155 may use the header information in the erroneous requested packet to identify the data needed to replace the corrupted requested packet or to replace the object to which the packet belongs. In another preferred embodiment, the solid-state storage controller 104 stores data using some type of RAID and is able to recover the corrupted data. In another embodiment, the ECC correction module 322 sends an interrupt and/or a message and the receiving device fails the read operation associated with the requested data packet. Those skilled in the art will recognize other options and actions that may be taken by the ECC correction module 322 after determining that one or more ECC blocks of the requested packet are corrupted and that the ECC correction module 322 is unable to correct the errors.
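The correction flow above, in which errors within the code's correction capability are repaired and anything beyond it raises an interrupt so that the master controller, a RAID facility, or the requesting device can substitute another copy, can be summarized in a sketch like the following; the error-counting interface is a stand-in for a real ECC decoder.

    CORRECTABLE_BITS = 3   # illustrative: the code corrects up to 3 bits in error
    DETECTABLE_BITS = 6    # and detects up to 6 (detection limits are not modeled here)

    def process_ecc_block(block, error_bits, correct_fn, raise_interrupt, fetch_replacement):
        """error_bits and correct_fn stand in for a real ECC decoder; the point
        of the sketch is the control flow around the correction limits."""
        if error_bits == 0:
            return block
        if error_bits <= CORRECTABLE_BITS:
            return correct_fn(block)                      # repairable in place
        # Beyond the correction capability: signal the error so a higher level
        # (master controller, RAID, or the requesting device) supplies a substitute.
        raise_interrupt("uncorrectable ECC block: %d bits in error" % error_bits)
        return fetch_replacement(block)                   # e.g. a backup or mirror copy

    interrupts = []
    fixed = process_ecc_block(b"data", 2, lambda b: b, interrupts.append, lambda b: b"mirror")
    replaced = process_ecc_block(b"data", 5, lambda b: b, interrupts.append, lambda b: b"mirror")
    assert fixed == b"data" and replaced == b"mirror" and len(interrupts) == 1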
The read data pipe 108 includes an unpacker 324, the unpacker 324 receiving ECC blocks of the requested packet directly or indirectly from the ECC correction module 322 and checking and removing one or more packet headers. The unpacker 324 may verify the packet headers by checking the packet identifier, data length, data location, and the like within the headers. In one embodiment, the header includes a hash code that may be used to verify that the packet delivered to the read data pipe 108 is the requested packet. The unpacker 324 also removes from the requested packet the headers added by the packetizer 302. The unpacker 324 may be directed not to operate on certain packets but to pass them forward without modification. One example may be a container label that is requested during a rebuild process when the object index reconstruction module 272 requires header information. Further examples include the transfer of packets of various types destined for use within the solid-state storage device 102. In another embodiment, the operation of the unpacker 324 may depend on the type of packet.
The read data pipe 108 includes an alignment module 326, the alignment module 326 receiving data from the unpacker 324 and removing excess data. In one embodiment, a read command sent to the solid-state memory 110 retrieves a packet of data. The device requesting the data may not need all of the data in the retrieved packet, and the alignment module 326 removes the excess data. If all of the data in a retrieved page is the requested data, the alignment module 326 does not remove any data.
The alignment module 326 reformats the data in the data segment of the object in a form compatible with the device requesting the data segment before the data segment is transmitted to the next stage. Typically, the size of the data segments or packets varies from level to level as the data is processed by the read data pipeline 108. The alignment module 326 uses the received data to format the data into data segments suitable for transmission to the requesting device 155, which are also suitable for concatenation together to form a response. For example, data from a portion of a first data packet may be combined with data from a portion of a second data packet. If the data segment is larger than the data requested by the requesting device, alignment module 326 may discard the unneeded data.
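A minimal sketch of the trimming and joining performed by the alignment module 326 might look like the following; the packet representation and the request interface are assumptions.

    def align(packets, request_offset, request_length):
        """packets is a list of (offset, data) tuples retrieved from storage,
        sorted by offset; return only the bytes the requesting device asked
        for, joined in order."""
        out = bytearray()
        end = request_offset + request_length
        for offset, data in packets:
            lo = max(request_offset, offset)
            hi = min(end, offset + len(data))
            if lo < hi:
                out += data[lo - offset:hi - offset]   # discard excess data
        return bytes(out)

    # Two retrieved packets of 512 bytes each; the device asked for 100 bytes
    # straddling the packet boundary, so parts of both packets are joined.
    packets = [(0, bytes(range(256)) * 2), (512, b"\xff" * 512)]
    assert len(align(packets, 462, 100)) == 100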
In one embodiment, the read data pipe 108 includes a read synchronization buffer 328, the read synchronization buffer 328 buffering one or more request packets read from the solid state memory 110 prior to processing by the read data pipe 108. Read synchronization buffer 328 is located on the boundary between the solid state storage clock domain and the local bus clock domain and provides buffering to account for clock domain differences.
In another embodiment, the read data pipe 108 includes an output buffer 330, the output buffer 330 receiving request packets from the alignment module 326 and storing the packets before they are transmitted to the requesting device. The output buffer 330 accounts for differences between when a data segment is received from the read data pipe 108 and when the data segment is transferred to other portions of the solid-state storage controller 104 or to a requesting device. The output buffer 330 also allows the data bus to receive data from the read data pipe 108 at a higher rate than the read data pipe 108 can support, to improve the efficiency with which the data bus 204 operates.
In one embodiment, the read data pipeline 108 includes a media decryption module 332, the media decryption module 332 receiving one or more encrypted request packets from the ECC error correction module 322 and decrypting one or more of the request packets using an encryption key unique to the solid-state storage device 102 before sending the one or more request packets to the unpacker 324. Typically, the encryption key used by the media decryption module 332 to decrypt the data is consistent with the encryption key used by the media encryption module 318. In another embodiment, the solid-state memory 110 may have two or more partitions and the solid-state storage controller 104 behaves as if there are two or more solid-state storage controllers 104 (each running within a separate partition within the solid-state memory 110). In such an embodiment, a unique media encryption key may be used for each partition.
In another embodiment, the read data pipe 108 includes a decryption module 334, the decryption module 334 decrypting the data segment formatted by the unpacker 324 before sending the data segment to the output buffer 330. The data segment is decrypted using an encryption key received with the read request that initiated retrieval of the requested packet received by the read synchronization buffer 328. The decryption module 334 may decrypt a first packet using the encryption key received with the read request for the first packet and then may decrypt a second packet using a different encryption key or may pass the second packet, undecrypted, to the next stage of the read data pipe 108. Typically, the decryption module 334 decrypts data segments using an encryption key that is different from the encryption key used by the media decryption module 332 to decrypt the requested packet. When a packet is stored with a non-secret cryptographic nonce, the nonce is used together with the encryption key to decrypt the data packet. The encryption key may be received from a client 114, a computer 112, a key manager, or another device that manages the encryption key used by the solid-state storage controller 104.
In another embodiment, the read data pipe 108 includes a decompression module 336, the decompression module 336 decompressing the data segment formatted by the unpacker 324. In a preferred embodiment, the decompression module 336 uses compression information stored in one or both of the packet header and the container label to select a decompression routine complementary to the routine used by the compression module 312 to compress the data. In another embodiment, the decompression routine used by the decompression module 336 is determined by the data segment requested to be decompressed. In another embodiment, the decompression module 336 selects a decompression routine based on default settings on a per-object-type or per-object-class basis. A first packet of a first object may be able to override the default decompression routine setting, a second packet of a second object of the same object class and object type may employ the default decompression routine, and a third packet of a third object of the same object class and object type may not be decompressed at all.
In another embodiment, the read data pipeline 108 includes a read program module 338, the read program module 338 including one or more user definable functions within the read data pipeline 108. The read program module 338 has similar features to the write program module 310 and allows a user to provide custom functionality to the read data pipe 108. The read program module 338 may be located at the location shown in FIG. 3, may be located elsewhere within the read data pipe 108, or may comprise multiple portions of multiple locations within the read data pipe 108. In addition, there may be multiple independently operating read program modules 338 at multiple different locations within the read data pipeline 108. Those skilled in the art will recognize other forms of read program module 338 within read data pipe 108. Just as with the write data pipe, the stages of the read data pipe 108 may be reordered, and those skilled in the art will recognize other orders of arrangement of stages within the read data pipe 108.
The solid-state storage controller 104 includes control and status registers 340 and corresponding control queues 342. The control and status registers 340 and control queues 342 facilitate the control and sequencing of commands and sub-commands associated with data processed within the write and read data pipes 106, 108. For example, a data segment in the packetizer 302 may have one or more corresponding control commands or instructions in a control queue 342 associated with the ECC generator 304. As the data segment is packetized, some of the instructions or commands may be executed within the packetizer 302. Other commands or instructions may be passed through the control and status registers 340 to the next control queue 342 as the newly formed data packet created from the data segment is passed to the next stage.
Commands or instructions may be loaded into the control queues 342 simultaneously with a packet being forwarded to the write data pipe 106, with each pipeline stage pulling the appropriate command or instruction as the respective packet is executed by that stage. Similarly, commands or instructions may be loaded into the control queues 342 simultaneously with a packet being requested from the read data pipe 108, with each pipeline stage pulling the appropriate command or instruction as the respective packet is executed by that stage. Those skilled in the art will recognize other features and functions of the control and status registers 340 and the control queues 342.
The solid-state storage controller 104 and/or the solid-state storage device 102 may also include a bank interleave controller 344, a synchronization buffer 346, a storage bus controller 348, and a multiplexer ("MUX") 350, which are described with respect to fig. 4A and 4B.
Memory bank interleaving
FIG. 4A is a schematic block diagram of one embodiment 400 of a bank interleave controller 344 within a solid state storage controller 104 in accordance with the present invention. The bank interleave controller 344 is coupled to the control and status register 340 and to the storage I/O bus 210 and the storage control bus 212 through MUX350, storage bus controller 348, and synchronization buffer 346, as described below. The bank interleave controller includes a read agent 402, a write agent 404, an erase agent 406, a management agent 408, read queues 410a-n, write queues 412a-n, erase queues 414a-n, management queues 416a-n for banks 214 in solid state memory 110, bank controllers 418a-n, bus arbiter 420, and status MUX422, which are described below. The storage bus controller 348 includes a mapping module 424 having a remapping module 430, a state capture module 426, and a NAND bus controller 438, which are described below.
The bank interleave controller 344 directs one or more commands into two or more queues in the bank interleave controller 344 and coordinates execution of the commands stored in the queues among the banks 214 of the solid-state memory 110, such that a command of a first type executes on one bank 214a while a command of a second type executes on a second bank 214b. The one or more commands are separated by command type into the queues. Each bank 214 of the solid-state memory 110 has a corresponding set of queues within the bank interleave controller 344, and each set of queues includes a queue for each command type.
The bank interleave controller 344 coordinates execution of the commands stored in the queues among the banks 214 of the solid-state memory 110. For example, a command of a first type executes on one bank 214a while a command of a second type executes on a second bank 214b. Typically, the command types and queue types include read and write commands and queues 410, 412, but may also include other commands and queues specific to the storage medium. For example, in the embodiment depicted in FIG. 4A, erase and management queues 414, 416 are included and are suitable for flash memory, NRAM, MRAM, DRAM, PRAM, and the like.
Other types of commands and corresponding queues may be included for other types of solid state memory 110 without departing from the scope of the present invention. The flexible nature of the FPGA solid state storage controller 104 allows flexibility of the storage medium. If the flash memory is swapped for another solid-state storage type, the bank interleave controller 344, the storage bus controller 348, and the MUX350 can be changed to accommodate the media type without significantly affecting the data pipes 106, 108 and other solid-state storage controller 104 operations.
In the embodiment depicted in FIG. 4A, for each bank 214, the bank interleave controller 344 includes: a read queue 410 for reading data from the solid state memory 110, a write queue 412 for writing commands to the solid state memory 110, an erase queue 414 for erasing erase blocks in the solid state memory, a management queue 416 for managing commands. The bank interleave controller 344 also includes corresponding read, write, erase and management agents 402, 404, 406, 408. In another embodiment, the control and status register 340 and control queue 342 or similar element queues commands for data sent to the banks 214 of the solid state storage 110 without the bank interleave controller 344.
In one embodiment, the agents 402, 404, 406, 408 send commands of the appropriate type destined for a particular bank 214a to the correct queue for the bank 214a. For example, the read agent 402 may receive a read command for bank-1 214b and send the read command to the bank-1 read queue 410b. The write agent 404 may receive a write command to write data to bank-0 214a of the solid-state memory 110 and then send the write command to the bank-0 write queue 412a. Similarly, the erase agent 406 may receive an erase command to erase an erase block in bank-1 214b and then send the erase command to the bank-1 erase queue 414b. The management agent 408 typically receives management commands, status requests, and the like, such as a reset command or a request to read a configuration register of a bank 214 (e.g., bank-0 214a). The management agent 408 sends the management command to the bank-0 management queue 416a.
The agents 402, 404, 406, 408 also typically monitor the status of the queues 410, 412, 414, 416 and send status, interrupts, or other messages when the queues 410, 412, 414, 416 are full, nearly full, non-functional, and the like. In one embodiment, the agents 402, 404, 406, 408 receive commands and generate corresponding sub-commands. In one embodiment, the agents 402, 404, 406, 408 receive commands through the control and status registers 340 and generate corresponding sub-commands that are forwarded to the queues 410, 412, 414, 416. Those skilled in the art will recognize other functions of the agents 402, 404, 406, 408.
The queues 410, 412, 414, 416 typically receive commands and store the commands until the commands are required to be transmitted to the solid-state memory banks 214. In typical embodiments, the queues 410, 412, 414, 416 are first-in-first-out ("FIFO") registers or similar components that operate as FIFOs. In another embodiment, the queues 410, 412, 414, 416 store commands in an order that matches data, importance, or other criteria.
The bank controllers 418 typically receive commands from the queues 410, 412, 414, 416 and generate appropriate sub-commands. For example, the bank-0 write queue 412a may receive a command to write a page of packets to bank-0 214a. The bank-0 controller 418a may receive the write command at the appropriate time and may generate one or more write sub-commands for each data packet stored in the write buffer 320 to be written to the page in bank-0 214a. For example, the bank-0 controller 418a may generate commands to verify the status of bank-0 214a and the solid-state storage array 216, a command to select the appropriate location for writing one or more data packets, a command to clear the input buffers within the solid-state storage array 216, a command to transfer the one or more data packets to the input buffers, a command to program the input buffers into the selected location, a command to verify that the data was properly programmed, and, if a program failure occurs, one or more interrupts to the master controller 224, a retry of the write to the same physical address, and a retry of the write to a different physical address. Further, along with the example write command, the storage bus controller 348 duplicates the one or more commands onto each of the storage I/O buses 210a-n, with the logical address of the command mapped to a first physical address for storage I/O bus 210a, to a second physical address for storage I/O bus 210b, and so on, as described in more detail below.
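The routing of commands by the agents into per-bank, per-type queues, and the consumption of those queues by the bank controllers, can be pictured with the following sketch; the queue structure, the command representation, and the selection order are illustrative assumptions.

    from collections import deque

    COMMAND_TYPES = ("read", "write", "erase", "manage")

    class BankInterleaveController:
        def __init__(self, num_banks):
            # One queue per command type per bank, as in FIG. 4A.
            self.queues = [{t: deque() for t in COMMAND_TYPES} for _ in range(num_banks)]

        def enqueue(self, bank, command_type, command):
            """Agents (read, write, erase, management) route each command to
            the queue of the matching type for the addressed bank."""
            self.queues[bank][command_type].append(command)

        def next_command(self, bank, preferred_order=COMMAND_TYPES):
            """A bank controller pulls the next command for its bank; a real
            implementation would apply richer selection criteria."""
            for command_type in preferred_order:
                if self.queues[bank][command_type]:
                    return command_type, self.queues[bank][command_type].popleft()
            return None

    bic = BankInterleaveController(num_banks=2)
    bic.enqueue(0, "write", {"page": 7})
    bic.enqueue(1, "erase", {"erase_block": 3})
    assert bic.next_command(0) == ("write", {"page": 7})
    assert bic.next_command(1) == ("erase", {"erase_block": 3})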
Generally, the bus arbiter 420 selects from among the bank controllers 418, pulls sub-commands from the output queues of the bank controllers 418, and forwards the sub-commands to the storage bus controller 348 in a sequence that optimizes the performance of the banks 214. In another embodiment, the bus arbiter 420 may respond to a high-level interrupt and modify the normal selection criteria. In another embodiment, the master controller 224 may control the bus arbiter 420 through the control and status registers 340. Those skilled in the art will recognize other ways in which the bus arbiter 420 may control and interleave the sequence of commands transmitted from the bank controllers 418 to the solid-state memory 110.
In general, the bus arbiter 420 coordinates the selection of the appropriate commands and corresponding data required for the command type from the bank controller 418 and sends the commands and data to the storage bus controller 348. The bus arbiter 420 also typically sends commands to the storage control bus 212 to select the appropriate bank 214. For flash or other solid state memory 110 with an asynchronous, bi-directional serial storage I/O bus 210, only one command (control information) or data set can be transferred at a time. For example, when a write command or data is transferred to the solid-state memory 110 over the storage I/O bus 210, a read command, read data, an erase command, a management command, or other status command cannot be transmitted over the storage I/O bus 210. For example, when data is read from the storage I/O bus 210, data cannot be written to the solid-state memory 110.
For example, during a write operation on bank-0, the bus arbiter 420 selects the bank-0 controller 418a, which may have a write command or a series of write sub-commands at the top of its queue that cause the storage bus controller 348 to execute the following sequence. The bus arbiter 420 forwards the write command to the storage bus controller 348, and the storage bus controller 348 sets up the write command by: selecting bank-0 214a over the storage control bus 212, sending a command to clear the input buffers of the solid-state storage elements 110 associated with bank-0 214a, and sending a command to verify the status of the solid-state storage elements 216, 218, 220 associated with bank-0 214a. The storage bus controller 348 then transmits the write command over the storage I/O bus 210 with the physical addresses, including the address of the erase block for each individual physical solid-state storage element 216a-m, as mapped from the logical erase block address. The storage bus controller 348 then muxes the write buffer 320 through the write synchronization buffer 308 to the storage I/O bus 210 through the MUX 350 and streams the write data to the appropriate page. When the page is full, the storage bus controller 348 causes the solid-state storage elements 216a-m associated with bank-0 214a to program the input buffers into the memory cells of the solid-state storage elements 216a-m. Finally, the storage bus controller 348 verifies the status to ensure that the page was correctly programmed.
The read operation is similar to the write operation example above. During a read operation, generally, the bus arbiter 420 or other component of the bank interleave controller 344 receives data and corresponding state information and sends the data to the read data pipeline 108 while sending the state information to the control and status registers 340. In general, a read data command transmitted from the bus arbiter 420 to the storage bus controller 348 will cause the multiplexer 350 to transmit the read data to the read data pipe 108 via the storage I/O bus 210 and to send status information to the control and status register 340 via the status multiplexer 422.
The bus arbiter 420 coordinates the different command types and data access patterns so that only the appropriate command type or corresponding data is on the bus at any given time. If the bus arbiter 420 has selected a write command and the write sub-command and corresponding data are being written to the solid state memory 110, the bus arbiter 420 will not allow other command types to exist on the storage I/O bus 210. Advantageously, the bus arbiter 420 uses timing information (e.g., predetermined command execution times) along with the received information regarding the state of the banks 214 to coordinate the execution of different commands on the bus, which aims to minimize or eliminate bus downtime.
The master controller 224, through the bus arbiter 420, typically uses predetermined completion times and status information for the commands stored in the queues 410, 412, 414, 416 so that a sub-command associated with a command executes on one bank 214a while other sub-commands of other commands execute on other banks 214b-n. When one command is fully executed on a bank 214a, the bus arbiter 420 directs another command to the bank 214a. The bus arbiter 420 may also coordinate commands that are not stored in the queues 410, 412, 414, 416 along with the commands that are stored in the queues 410, 412, 414, 416.
For example, an erase command may be issued to erase a group of erase blocks within the solid-state memory 110. An erase command may take 10 to 1000 times more time to execute than a write or read command, or 10 to 100 times more time to execute than a program command. For N banks 214, the bank interleave controller 344 may split the erase command into N commands, each to erase a virtual erase block of one bank 214. While bank-0 214a is executing an erase command, the bus arbiter 420 may select other commands for execution on the other banks 214b-n. The bus arbiter 420 may also work with other components (e.g., the storage bus controller 348, the master controller 224, etc.) to coordinate command execution among the buses. Coordinating execution of commands using the bus arbiter 420, the bank controllers 418, the queues 410, 412, 414, 416, and the agents 402, 404, 406, 408 of the bank interleave controller 344 can dramatically increase performance compared with other solid-state storage systems without bank interleave functionality.
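The interleaving benefit described above, in which a long-running erase on one bank does not block shorter commands on the other banks, can be sketched with a simple arbiter model. The command durations and the scheduling policy are illustrative assumptions and are not the actual selection criteria of the bus arbiter 420.

    # Illustrative relative durations (an erase is far slower than a read or write).
    DURATION = {"read": 1, "write": 2, "erase": 100}

    def arbitrate(bank_queues):
        """bank_queues maps bank -> list of pending commands.  The bus carries
        one command at a time; whenever the bus is free the arbiter issues a
        command to an idle bank, so short commands on other banks proceed while
        a long erase occupies one bank."""
        busy_until = {bank: 0 for bank in bank_queues}
        schedule, tick = [], 0
        while any(bank_queues.values()):
            for bank in sorted(bank_queues):
                if bank_queues[bank] and busy_until[bank] <= tick:
                    command = bank_queues[bank].pop(0)
                    busy_until[bank] = tick + DURATION[command]
                    schedule.append((tick, bank, command))
                    break          # only one command is placed on the bus per tick
            tick += 1
        return schedule

    schedule = arbitrate({0: ["erase"], 1: ["read", "write", "read"]})
    # Bank 1's commands are issued at ticks 1, 2, and 4, while bank 0's erase
    # (issued at tick 0) is still executing.
    assert schedule[0] == (0, 0, "erase")
    assert all(tick < 10 for tick, bank, cmd in schedule if bank == 1)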
In one embodiment, the solid-state controller 104 includes one bank interleave controller 344 that serves all of the storage elements 216, 218, 220 of the solid-state memory 110. In another embodiment, the solid-state controller 104 includes a bank interleave controller 344 for each row of storage elements 216a-m, 218a-m, 220a-m. For example, one bank interleave controller 344 serves one row of storage elements SSS 0.0-SSS 0.N 216a, 218a, 220a, a second bank interleave controller 344 serves a second row of storage elements SSS 1.0-SSS 1.N 216b, 218b, 220b, and so on.
FIG. 4B is a schematic block diagram illustrating an alternative embodiment 401 of a bank interleave controller within a solid-state storage device in accordance with the present invention. The components 210, 212, 340, 346, 348, 350, 402-430 depicted in the embodiment shown in FIG. 4B are substantially similar to the bank interleave apparatus 400 described with respect to FIG. 4A, except that each bank 214 includes a single queue 432a-n, and the read commands, write commands, erase commands, management commands, and the like for a bank (e.g., bank-0 214a) are directed to a single queue 432a for that bank. In one embodiment, the queues 432 are FIFOs. In another embodiment, commands may be pulled from a queue 432 in an order other than the order in which they were stored. In an alternative embodiment (not shown), the read agent 402, write agent 404, erase agent 406, and management agent 408 may be combined into a single agent that assigns commands to the appropriate queues 432a-n.
In another alternative embodiment (not shown), commands are stored in separate queues from which the commands may be pulled in an order other than the order in which they were stored, so that the bank interleave controller 344 can execute a command on one bank 214a while other commands are executing on the remaining banks 214b-n. Those skilled in the art will readily recognize other queue configurations and types that allow a command to execute on one bank 214a while other commands execute on other banks 214b-n.
Storage-specific components
The solid-state storage controller 104 includes a synchronization buffer 346, the synchronization buffer 346 buffering commands and status messages sent to and received from the solid-state memory 110. The synchronization buffer 346 is located on the boundary between the solid-state memory clock domain and the local bus clock domain and provides buffering to account for the clock domain differences. The synchronization buffer 346, the write synchronization buffer 308, and the read synchronization buffer 328 may operate independently or in concert to buffer data, commands, status messages, and the like. In the preferred embodiment, the synchronization buffer 346 is located where the number of signals crossing the clock domains is minimized. Those skilled in the art will recognize that synchronization between clock domains may optionally be performed elsewhere in the solid-state storage device 102 to optimize some aspects of the design implementation.
The solid state storage controller 104 includes a storage bus controller 348 that interprets and translates commands for data sent to or read from the solid state storage 110 and receives status messages from the solid state storage 110 based on the type of solid state storage 110. For example, the storage bus controller 348 may have different timing requirements for different storage types, different performance characteristics, different manufacturers' memories, and so forth. The storage bus controller 348 also sends control commands to the storage control bus 212.
In a preferred embodiment, solid-state storage controller 104 includes a MUX350, the MUX350 comprising an array of multiplexers 350a-n, where each multiplexer is for a row of solid-state storage array 110. For example, multiplexer 350a is associated with solid state storage elements 216a, 218a, 220 a. The MUX350 routes data from the write data pipe 106 and commands from the storage bus controller 348 to the solid state memory 110 via the storage I/O bus 210 through the storage bus controller 348, the synchronization buffer 346, and the bank interleave controller 344, and routes data and status messages from the solid state memory 110 to the read data pipe 108 and the control and status register 340 via the storage I/O bus 210.
In a preferred embodiment, the solid-state storage controller 104 includes a MUX 350 for each row of solid-state storage elements (e.g., SSS 0.1 216a, SSS 0.2 218a, SSS 0.N 220a). A MUX 350 combines data from the write data pipe 106 with commands sent to the solid-state memory 110 via the storage I/O bus 210, and separates data to be processed by the read data pipe 108 from commands. Packets stored in the write buffer 320 are directed on a bus out of the write buffer 320, through the write sync buffer 308 for each row of solid-state storage elements (SSS x.0 to SSS x.N 216, 218, 220), to the MUX 350 for that row. The MUX 350 receives commands and reads data from the storage I/O bus 210. The MUX 350 also passes status messages to the storage bus controller 348.
The storage bus controller 348 includes a mapping module 424. The mapping module 424 maps the logical address of an erase block to one or more physical addresses of erase blocks. For example, a solid-state storage 110 with an array of 20 storage elements (e.g., SSS 0.0 to SSS M.0 216) per bank 214a may have the logical address of a particular erase block mapped to 20 physical erase-block addresses, one physical address per storage element. Because the storage elements are accessed in parallel, erase blocks at the same position in each of the storage elements in a row of storage elements 216a, 218a, 220a share a physical address. To select one erase block (as in storage element SSS 0.0 216a) instead of all erase blocks in the row (as in storage elements SSS 0.0, 0.1, ... 0.N 216a, 218a, 220a), one bank (in this case bank-0 214a) is selected.
This logical-to-physical mapping of erase blocks is advantageous because, if one erase block is damaged or inaccessible, the mapping can be changed to point to another erase block. This reduces the chance of losing an entire virtual erase block when the erase block of one element is damaged. The remapping module 430 changes the mapping of the logical address of an erase block to the one or more physical addresses of a virtual erase block (spread across the array of storage elements). For example, virtual erase block 1 may be mapped to erase block 1 of storage element SSS 0.0 216a, to erase block 1 of storage element SSS 1.0 216b, ..., and to erase block 1 of storage element M.0 216m; virtual erase block 2 may be mapped to erase block 2 of storage element SSS 0.1 218a, to erase block 2 of storage element SSS 1.1 218b, ..., and to erase block 2 of storage element M.1 218m, and so on.
If erase block 1 of storage element SSS 0.0 216a is damaged, encounters errors due to wear, or cannot be used for some reason, the remapping module 430 may change the logical-to-physical mapping for the logical address that pointed to erase block 1 of virtual erase block 1. If a spare erase block of storage element SSS 0.0 216a (call it erase block 221) is available and not currently mapped, the remapping module 430 may change the mapping of virtual erase block 1 to point to erase block 221 of storage element SSS 0.0 216a, while continuing to point to erase block 1 of storage element SSS 1.0 216b, to erase block 1 of storage element SSS 2.0 (not shown), ..., and to erase block 1 of storage element M.0 216m. The mapping module 424 or the remapping module 430 may map erase blocks in a fixed order (virtual erase block 1 to erase block 1 of the storage elements, virtual erase block 2 to erase block 2 of the storage elements, etc.) or may map the erase blocks of the storage elements 216, 218, 220 in an order based on some other criterion.
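To illustrate the remapping described above, the following is a minimal sketch, assuming a simple Python table in which a virtual erase block resolves to one physical erase block per storage element; the class and field names (RemapTable, spare_blocks) are illustrative assumptions and not part of the described embodiments.

```python
# Minimal sketch (not the patented implementation): a logical-to-physical
# erase-block map in which a virtual erase block resolves to one physical
# erase block per storage element in a row, and a bad block can be remapped.

class RemapTable:
    def __init__(self, num_elements, blocks_per_element):
        # virtual erase block v initially maps to physical block v on every element
        self.map = {v: [v] * num_elements for v in range(blocks_per_element)}
        # per-element pool of spare (unmapped) physical erase blocks
        self.spare_blocks = [set() for _ in range(num_elements)]

    def physical_blocks(self, virtual_block):
        """Return the physical erase block used on each storage element."""
        return self.map[virtual_block]

    def remap(self, virtual_block, element, new_physical_block):
        """Point one element of a virtual erase block at a spare physical block."""
        self.spare_blocks[element].discard(new_physical_block)
        self.map[virtual_block][element] = new_physical_block


table = RemapTable(num_elements=20, blocks_per_element=4096)
# erase block 1 of element 0 is worn out; substitute spare block 221
table.spare_blocks[0].add(221)
table.remap(virtual_block=1, element=0, new_physical_block=221)
print(table.physical_blocks(1)[:3])   # [221, 1, 1]
```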
In one embodiment, the erase blocks may be grouped by access time. Grouping by access time can balance command execution times so that a command executed across the erase blocks of a virtual erase block (e.g., programming or writing data into the pages of a specified erase block) is not limited by the slowest erase block. In another embodiment, the erase blocks may be grouped by wear level, health, and the like. Those skilled in the art will recognize other issues to consider when mapping or remapping erase blocks.
In one embodiment, the storage bus controller 348 includes a status capture module 426 that receives status messages from the solid-state memory 110 and sends the status messages to the status MUX 422. In another embodiment, when the solid-state memory 110 is flash memory, the storage bus controller 348 includes a NAND bus controller 428. The NAND bus controller 428 directs commands from the read and write data pipes 106, 108 to the correct locations in the solid-state memory 110, coordinates the timing of command execution according to the characteristics of the flash memory, and so on. If the solid-state memory 110 is another type of solid-state memory, the NAND bus controller 428 is replaced by a bus controller suited to that storage type. Those skilled in the art will recognize other functions of the NAND bus controller 428.
Flow chart
FIG. 5 is a schematic flow chart diagram illustrating one embodiment of a method 500 for managing data in a solid-state storage device 102 using a data pipeline in accordance with the present invention. The method 500 begins (step 502) and the input buffer 306 receives one or more data segments to be written to the solid-state memory 110 (step 504). Typically, the one or more data segments comprise at least a portion of an object, but may be an entire object. The packetizer 302 may also create one or more object-specific packets. The packetizer 302 adds a packet header to each packet, which typically includes the length of the packet and a sequence number within the object. The packetizer 302 receives the one or more data or metadata segments stored in the input buffer 306 (step 504) and packetizes the one or more data or metadata segments by creating one or more packets sized for the solid-state memory 110 (step 506), where each packet includes a packet header and data from the one or more segments.
Typically, the first package includes an object identifier that identifies the object for which the package is created. The second packet may include a packet header with information used by the solid-state storage device 102 to associate the second packet with the object identified in the first packet, the packet header also having offset information and data to locate the second packet within the object. The solid state storage device controller 202 manages the banks 214 and the physical area to which packets flow.
The ECC generator 304 receives the packets from the packetizer 302 and generates ECC for the data packets (step 508). Typically, there is no fixed relationship between packets and ECC blocks. An ECC block may include one or more packets, and a packet may span one or more ECC blocks. A packet may start and end anywhere within a single ECC block, or may start anywhere within a first ECC block and end anywhere in a subsequent ECC block.
The write sync buffer 308 buffers the packets, distributed into their corresponding ECC blocks, before the ECC blocks are written to the solid-state memory 110 (step 510); the solid-state storage controller 104 then writes the data at an appropriate time to account for clock domain differences (step 512), and the method 500 ends (step 514). The write sync buffer 308 is located on the boundary between the local clock domain and the clock domain of the solid-state memory 110. Note that, for convenience, the method 500 describes receiving one or more data segments and writing one or more data packets, but typically a stream or group of data segments is received. Typically, a number of ECC blocks comprising a complete virtual page of the solid-state memory 110 are written to the solid-state memory 110. Typically, the packetizer 302 receives data segments of one size and generates packets of another size, which requires combining data or metadata segments, or portions of such segments, so that each packet captures all of the data of the segments it contains.
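The following sketch illustrates the general shape of the write path summarized in method 500 (packetize, generate ECC, buffer); the header layout, the packet and ECC block sizes, and the use of CRC32 as a stand-in for real ECC are assumptions made for illustration only and are not the device's actual formats.

```python
# Minimal sketch of the write-path steps described above: packetize a data
# segment, group the packet bytes into ECC blocks, and queue the blocks in a
# stand-in for the write sync buffer.

import struct
import zlib

PACKET_SIZE = 512          # assumed packet size
ECC_BLOCK_SIZE = 2048      # assumed ECC block size (may span several packets)

def packetize(object_id, segment):
    """Split a data segment into fixed-size packets, each with a small header."""
    packets = []
    for seq, off in enumerate(range(0, len(segment), PACKET_SIZE - 16)):
        payload = segment[off:off + PACKET_SIZE - 16]
        header = struct.pack(">QII", object_id, seq, len(payload))
        packets.append(header + payload)
    return packets

def ecc_blocks(packets):
    """Group packet bytes into ECC blocks; a CRC32 stands in for real ECC."""
    stream = b"".join(packets)
    blocks = []
    for off in range(0, len(stream), ECC_BLOCK_SIZE):
        chunk = stream[off:off + ECC_BLOCK_SIZE]
        blocks.append(chunk + struct.pack(">I", zlib.crc32(chunk)))
    return blocks

write_sync_buffer = []                    # stand-in for the write sync buffer
for blk in ecc_blocks(packetize(object_id=42, segment=b"x" * 3000)):
    write_sync_buffer.append(blk)         # buffered until the storage clock domain drains it
print(len(write_sync_buffer), "ECC blocks queued")
```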
FIG. 5B is a schematic flow chart diagram illustrating one embodiment of a method 501 for an in-server SAN in accordance with the present invention. The method 501 begins (step 552) and the storage communication module 162 facilitates communication between the first storage controller 152a and at least one device external to the first server 112a (step 554). The communication between the first storage controller 152a and the external device is independent of the first server 112a. The first storage controller 152a resides within the first server 112a, and the first storage controller 152a controls at least one storage device 154a. The first server 112a includes a network interface 156a coupled to the first server 112a and the first storage controller 152a. The in-server SAN module 164 transmits a storage request using a network protocol and/or a bus protocol, independently of the first server 112a, with the transmitted request received by a client 114, 114a (step 556), and the method 501 ends (step 558).
FIG. 6 is a schematic flow chart diagram illustrating yet another embodiment of a method 600 for managing data in a solid-state storage device 102 using a data pipeline in accordance with the present invention. The method 600 begins (step 602) and the input buffer 306 receives one or more data or metadata segments to be written to the solid-state memory 110 (step 604). The packetizer 302 adds a packet header to each packet, which typically includes the length of the packet and a sequence number within the object. The packetizer 302 receives the one or more segments stored in the input buffer 306 (step 604) and packetizes the one or more segments by creating one or more packets sized for the solid-state memory 110 (step 606), where each packet includes a packet header and data from the one or more segments.
The ECC generator 304 receives the packets from the packetizer 302 and generates one or more ECC blocks for the packets (step 608). The write sync buffer 308 buffers the packets, distributed into their corresponding ECC blocks, before the ECC blocks are written to the solid-state memory 110 (step 610); the solid-state storage controller 104 then writes the data at an appropriate time to account for clock domain differences (step 612). When data is requested from the solid-state memory 110, ECC blocks comprising one or more data packets are read into the read sync buffer 328 and buffered (step 614). The ECC blocks of the packets are received over the storage I/O bus 210. Since the storage I/O bus 210 is bidirectional, write operations, command operations, and the like are suspended while data is being read.
The ECC error correction module 322 receives the ECC blocks of the requested packets buffered in the read sync buffer 328 and corrects errors in each ECC block as necessary (step 616). If the ECC error correction module 322 determines that one or more errors exist in an ECC block and the errors are correctable using the ECC, the ECC error correction module 322 corrects the errors in the ECC block (step 616). If the ECC error correction module 322 determines that a detected error is not correctable using the ECC, the ECC error correction module 322 sends an interrupt.
The depacketizer 324 receives the requested packets after the ECC error correction module 322 has corrected any errors (step 618) and depacketizes each packet by checking and removing its packet header (step 618). The alignment module 326 receives the depacketized packets, removes unwanted data, and re-formats the data into data segments of an object in a form compatible with the device requesting the data segments (step 620). The output buffer 330 receives the requested, depacketized data and buffers the data segments prior to transmission to the requesting device (step 622), and the method 600 ends (step 624).
FIG. 7 is a schematic flow chart diagram illustrating one embodiment of a method 700 for managing data in a solid-state storage device 102 using bank interleaving in accordance with the present invention. The method 700 begins (step 702) and the bank interleave controller 344 directs one or more commands to two or more queues 410, 412, 414, 416 (step 704). Typically the agents 402, 404, 406, 408 direct the commands to the queues 410, 412, 414, 416 according to command type (step 704). Each set of queues 410, 412, 414, 416 includes a queue for each command type. The bank interleave controller 344 coordinates execution of the commands stored in the queues 410, 412, 414, 416 among the banks 214 (step 706) so that a command of a first type executes on one bank 214a while a command of a second type executes on a second bank 214b. The method 700 ends (step 708).
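A minimal sketch of the queue-per-bank idea behind method 700 follows; the queue structure and the simple round-robin scheduling policy are illustrative assumptions, not the scheduling actually used by the bank interleave controller 344.

```python
# Minimal sketch of bank interleaving: commands are queued per bank by type,
# and the controller may execute one command type on one bank while a
# different type runs on another bank.

from collections import deque

NUM_BANKS = 4
COMMAND_TYPES = ("read", "write", "erase", "mgmt")

# one queue per (bank, command type)
queues = {b: {t: deque() for t in COMMAND_TYPES} for b in range(NUM_BANKS)}

def enqueue(bank, ctype, command):
    queues[bank][ctype].append(command)

def schedule_round():
    """Pick at most one command per bank for this round."""
    issued = []
    for bank in range(NUM_BANKS):
        for ctype in COMMAND_TYPES:
            if queues[bank][ctype]:
                issued.append((bank, ctype, queues[bank][ctype].popleft()))
                break            # this bank is busy for the rest of the round
    return issued

enqueue(0, "write", "program page 7")
enqueue(1, "read", "read page 3")
enqueue(2, "erase", "erase block 12")
print(schedule_round())
```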
Storage space recovery
FIG. 8 is a schematic block diagram illustrating one embodiment of an apparatus 800 for garbage collection in a solid-state storage device 102 in accordance with the present invention. The apparatus 800 includes a sequential storage module 802, a storage section selection module 804, a data recovery module 806, and a storage section recovery module 808, which are described below. In further embodiments, the apparatus 800 includes an erase module 810 and a garbage marking module 812.
The apparatus 800 includes a sequential storage module 802 that writes data packets sequentially within a page of a storage section. The packets are stored sequentially whether they are new packets or modified packets. In this embodiment, a modified packet is written to a new location rather than back to the location where the packet was previously stored. In one embodiment, the sequential storage module 802 writes a packet to a first location in a page of the storage section, writes the next packet to the next location, and continues to subsequent locations until the page is full. The sequential storage module 802 then begins filling the next page in the storage section. This continues until the storage section is full.
In a preferred embodiment, the sequential storage module 802 begins writing packets to the storage write buffers of the storage elements (e.g., SSS 0.0 to SSS M.0 216) of one bank (bank-0 214a). When the storage write buffers are full, the solid-state storage controller 104 causes the data in the storage write buffers to be programmed into a designated page in the storage elements 216 of bank 214a. Another bank (e.g., bank-1 214b) is then selected, and the sequential storage module 802 begins writing packets to the storage write buffers of the storage elements 218 of bank 214b while the first bank-0 214a is programming the designated page. When the storage write buffers of bank 214b are full, their contents are programmed into another designated page of each storage element 218. This process is efficient because the storage write buffers of one bank 214b can be filled while another bank 214a is programming a page.
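The append-only, bank-alternating write pattern described above can be sketched as follows; the buffer size, page size, and class names are assumptions made for illustration rather than parameters of the described device.

```python
# Minimal sketch of the sequential (append-only) write pattern: packets fill
# one bank's write buffer; when the buffer is full it is programmed to a page
# and the next bank's buffer starts filling.

PAGE_SIZE = 4096
NUM_BANKS = 2

class SequentialWriter:
    def __init__(self):
        self.bank = 0
        self.buffers = [bytearray() for _ in range(NUM_BANKS)]
        self.programmed = []          # (bank, page_bytes) records, in write order

    def append(self, packet):
        buf = self.buffers[self.bank]
        buf.extend(packet)
        if len(buf) >= PAGE_SIZE:
            # program the full buffer to a page, then switch banks so the next
            # buffer fills while this bank is busy programming
            self.programmed.append((self.bank, bytes(buf[:PAGE_SIZE])))
            del buf[:PAGE_SIZE]
            self.bank = (self.bank + 1) % NUM_BANKS

writer = SequentialWriter()
for i in range(20):
    writer.append(bytes(512))         # modified data is appended, never overwritten in place
print(len(writer.programmed), "pages programmed")
```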
A storage section is a portion of the solid-state memory 110 in the solid-state storage device 102. Typically the storage section is an erase block. For flash memory, an erase operation on an erase block writes a 1 to every bit of the erase block by charging each cell. This is a lengthy process compared to a program operation, which starts from the all-1s state and changes selected bits to 0 by discharging the cells to be written with a 0. However, where the solid-state memory 110 is not flash memory, or is a flash memory whose erase cycle takes about the same time as other operations such as a read or a program, the storage section need not be erased.
As used herein, a storage section is equivalent in extent to an erase block but may or may not be erased. As used herein, an erase block may be a region of a specified size within a storage element (e.g., SSS 0.0 216a) and typically includes a certain number of pages. Where "erase block" is used in conjunction with flash memory, it is typically a storage section that is erased prior to being written. Where it is used in conjunction with "solid-state memory," it may or may not be erased. As used herein, an erase block may refer to a single erase block or to a group of erase blocks with one erase block in each row of storage elements (e.g., SSS 0.0 to SSS M.0 216a-n); such a group may also be referred to herein as a virtual erase block. When referring to the logical construct associated with a virtual erase block, the erase block may be referred to herein as a logical erase block ("LEB").
Typically, packets are stored sequentially in the order in which they are processed. In one embodiment, where the write data pipe 106 is used, the sequential storage module 802 stores packets in the order in which they leave the write data pipe 106. This order may result from mixing packets formed from data segments received from a requesting device 155 with valid data packets read from another storage section as the valid data is recovered from that storage section in the recovery operation described below. Re-routing recovered, valid data packets may include use of the garbage collector bypass 316 into the write data pipe 106, the garbage collector bypass 316 being described above in connection with the solid-state storage controller 104 of FIG. 3.
The apparatus 800 includes a storage section selection module 804 that selects a storage section for recovery. A storage section may be selected for recovery in order to reuse it: the sequential storage module 802 can again write data to the recovered storage section once it is returned to the storage pool. A storage section may also be selected for recovery in order to recover its valid data before the storage section is temporarily or permanently removed from the storage pool because the storage section has been determined to be faulty, unreliable, in need of refresh, or for some other reason. In another embodiment, the storage section selection module 804 selects a storage section for recovery by identifying a storage section or erase block with a large amount of invalid data.
In another embodiment, the storage section selection module 804 selects a storage section for recovery by identifying a storage section or erase block with little wear. For example, identifying a lightly worn storage section or erase block may include identifying a storage section with little invalid data, a low erase cycle count, a low bit error rate, or a low program count (the number of times a page of data in a buffer has been programmed to a page of the storage section; the program count may be measured from when the device was put into service, from when the storage section was last erased, from any other event, or from combinations thereof). The storage section selection module 804 may also identify a lightly worn storage section using any combination of the above or other parameters. Identifying a lightly worn storage section and selecting it for recovery beneficially allows under-utilized storage sections to be found and put back into use, evening out wear.
In another embodiment, the storage section selection module 804 selects a storage section for recovery by identifying a storage section or erase block with heavy wear. For example, identifying a heavily worn storage section or erase block may include identifying a storage section with a long erase cycle, a high bit error rate, a non-recoverable ECC block, or a high program count. The storage section selection module 804 may also identify a heavily worn storage section using any combination of the above or other parameters. Identifying a heavily worn storage section and selecting it for recovery beneficially allows over-used storage sections to be found so that they can be refreshed using an erase cycle or, if they cannot be restored to a usable condition, taken out of service.
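The selection criteria above can be illustrated with a small scoring sketch; the specific weights, thresholds, and field names are assumptions for illustration and do not reflect the actual selection logic of the storage section selection module 804.

```python
# Minimal sketch of storage-section selection: score each erase block by
# invalid data and wear indicators, then pick either a lightly worn block
# (for reuse) or a heavily worn block (for refresh/retirement).

from dataclasses import dataclass

@dataclass
class EraseBlockStats:
    block_id: int
    invalid_ratio: float    # fraction of packets marked invalid
    erase_count: int
    bit_error_rate: float
    program_count: int

def wear_score(s: EraseBlockStats) -> float:
    # higher score = more worn / more damaged (weights are assumptions)
    return s.erase_count * 1.0 + s.bit_error_rate * 1e6 + s.program_count * 0.01

def select_for_reuse(blocks):
    """Prefer blocks with much invalid data and little wear."""
    return max(blocks, key=lambda s: s.invalid_ratio - 0.001 * wear_score(s))

def select_for_retirement(blocks):
    """Prefer the most worn block, e.g. to refresh it or take it out of service."""
    return max(blocks, key=wear_score)

blocks = [
    EraseBlockStats(1, invalid_ratio=0.80, erase_count=120, bit_error_rate=1e-7, program_count=3000),
    EraseBlockStats(2, invalid_ratio=0.10, erase_count=900, bit_error_rate=5e-5, program_count=9000),
]
print(select_for_reuse(blocks).block_id, select_for_retirement(blocks).block_id)
```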
The apparatus 800 includes a data recovery module 806 that reads valid data packets from the storage section selected for recovery, queues the valid data packets with other data packets to be written sequentially by the sequential storage module 802, and updates the index with the new physical addresses of the valid data written by the sequential storage module 802. Typically the index is an object index that maps the data object identifier of an object to the physical addresses where the packets derived from the data object are stored in the solid-state memory 110.
In one embodiment, the apparatus 800 includes a storage section recovery module 808 that prepares the storage section for use or reuse and marks the storage section as available to the sequential storage module 802 for sequential writing of data packets after the data recovery module 806 has copied the valid data from the storage section. In another embodiment, the apparatus 800 includes a storage section recovery module 808 that marks the storage section selected for recovery as unavailable for storing data. Typically this is because the storage section selection module 804 has identified the storage section or erase block as heavily worn and therefore no longer suitable for reliable data storage.
In one embodiment, the apparatus 800 is located in the solid-state storage device controller 202 of the solid-state storage device 102. In another embodiment, the apparatus 800 controls the solid-state storage device controller 202. In another embodiment, a portion of the apparatus 800 is located in the solid-state storage device controller 202. In another embodiment, the object index updated by the data recovery module 806 is also located in the solid-state storage device controller 202.
In one embodiment, the storage section is an erase block, and the apparatus 800 includes an erase module 810 that erases the erase block selected for recovery after the data recovery module 806 has copied the valid packets from the selected erase block and before the storage section recovery module 808 marks the erase block as available. For flash memory and other solid-state storage that takes much longer to erase than to read or program, efficient operation requires that the erase operation be performed before the erase block is made available for writing new data. Where the solid-state storage 110 is arranged in banks 214, the erase module 810 may perform the erase operation on one bank while other banks execute reads, writes, or other operations.
In one embodiment, the apparatus 800 includes a garbage marking module 812 that identifies a packet in a storage section as invalid in response to an operation indicating that the packet is no longer valid. For example, if a packet is deleted, the garbage marking module 812 identifies the packet as invalid. A read-modify-write operation is another way a packet can be identified as invalid. In one embodiment, the garbage marking module 812 identifies the packet as invalid by updating an index. In another embodiment, the garbage marking module 812 identifies the packet as invalid by storing another packet indicating that the invalid packet has been deleted. Advantageously, storing information about deleted packets in the solid-state memory 110 enables the object index reconstruction module 262 or a similar module to reconstruct the object index with an entry indicating that the invalid packet has been deleted.
In one embodiment, the apparatus 800 may be used to fill the remainder of a virtual page of data in response to a flush command, to improve overall performance. The flush command halts the flow of data into the write data pipe 106 until the write data pipe 106 empties and all packets have been permanently written to the non-volatile solid-state memory 110. Filling the remainder of the virtual page reduces the amount of garbage collection required, the time spent erasing storage sections, and the time spent programming virtual pages. For example, a flush command may be received when only one small packet is ready to be written to a virtual page of the solid-state memory 110. Programming a nearly empty virtual page may waste space that must soon be recovered, causing the valid data in the storage section to be garbage collected unnecessarily and the storage section to be erased, recovered, and returned to the pool of available space for writing by the sequential storage module 802.
As described above, for flash memory and other similar storage, an erase operation takes a significant amount of time, so marking packets as invalid is more efficient than actually erasing invalid packets immediately. The apparatus 800 allows the garbage collection system to operate autonomously within the solid-state memory 110, separating erase operations from reads, writes, and other faster operations, so that the solid-state storage device 102 can operate faster than many other solid-state storage systems or data storage devices.
FIG. 9 is a schematic flow chart diagram illustrating one embodiment of a method 900 for storage space recovery in accordance with the present invention. The method 900 begins (step 902) and the sequential storage module 802 sequentially writes data packets in a storage section (step 904). The storage section is a portion of the solid-state memory 110 in the solid-state storage device 102. Typically the storage section is an erase block. The data packets are derived from an object and are stored sequentially in the order in which they are processed.
The storage section selection module 804 selects a storage section for recovery (step 906), and the data recovery module 806 reads the valid data packets from the storage section selected for recovery (step 908). Typically a valid packet is a packet that has not been marked as erased, deleted, or otherwise invalid, and whose data is considered valid or "good." The data recovery module 806 queues the valid data packets with other data packets already queued to be written sequentially by the sequential storage module 802 (step 910). The data recovery module 806 updates the index with the new physical addresses of the valid data written by the sequential storage module 802 (step 912). The index includes a mapping of object identifiers to the physical addresses of the corresponding data packets stored in the solid-state memory 110.
After the data recovery module 806 has copied the valid data from the storage section, the storage section recovery module 808 marks the storage section selected for recovery as available to the sequential storage module 802 for sequential writing of data packets (step 914), and the method 900 ends (step 916).
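A minimal sketch of the recovery flow of method 900 follows (select a storage section, requeue its valid packets, update the index, make the section available); the Packet fields and index layout are illustrative assumptions, not the structures of the described embodiments.

```python
# Minimal sketch of storage space recovery: copy valid packets forward for
# sequential rewriting, update the object index with new physical addresses,
# then treat the section as available for reuse.

from collections import deque

class Packet:
    def __init__(self, object_id, offset, data, valid=True):
        self.object_id, self.offset, self.data, self.valid = object_id, offset, data, valid

def recover_section(section_packets, write_queue, object_index, next_address):
    """Copy valid packets forward and point the index at their new locations."""
    for pkt in section_packets:
        if not pkt.valid:
            continue                       # invalid packets are simply dropped
        write_queue.append(pkt)            # queued with other packets for sequential writing
        object_index[(pkt.object_id, pkt.offset)] = next_address
        next_address += len(pkt.data)
    section_packets.clear()                # section is now available for reuse (erase elided)
    return next_address

index = {}
queue = deque()
section = [Packet(7, 0, b"a" * 512), Packet(7, 512, b"b" * 512, valid=False)]
recover_section(section, queue, index, next_address=1_000_000)
print(len(queue), index)                   # one valid packet requeued, index updated
```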
Progressive RAID
FIG. 10 is a schematic block diagram illustrating one embodiment of a system 1600 for progressive RAID in accordance with the present invention. The system 1600 includes N storage devices 150 and M parity-mirror storage devices 1602 accessible by one or more clients 114 over a computer network. The N storage devices 150 and the parity-mirror storage device 1602 may be located in one or more servers 112. The storage device 150, server 112, computer network 116, and client 114 are substantially the same as described above. The parity-mirror storage device 1602 is generally similar or identical to the N storage devices 150 and is generally designated as the parity-mirror storage device 1602 for the stripe.
In one embodiment, the N storage devices 150 and the M parity-mirror storage devices 1602 are included in or accessible by one server 112, or may be networked together using a system bus. In another embodiment, the N storage devices 150 and the M parity-mirror storage devices 1602 are each included in or accessible through separate servers 112a-(n+m). For example, the storage devices 150 and the parity-mirror storage devices 1602 may be part of the in-server SAN described above in connection with the system 103 of FIG. 1C and the method 501 of FIG. 5B.
In one embodiment, the parity-mirror storage device 1602 stores all of the parity data segments of the stripes stored in the progressive RAID. In another, preferred embodiment, the storage devices 150 assigned to the storage device set 1604 of the progressive RAID take turns being assigned as the parity-mirror storage device 1602 for a particular stripe, the assignment rotating so that the parity data segments of successive stripes rotate among the N + M storage devices 150. This embodiment has a performance advantage over assigning a single storage device 150 as the parity-mirror storage device 1602 for every stripe: by rotating the parity data segments, the overhead associated with computing and storing the parity data segments is distributed.
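The rotation of the parity-mirror assignment can be illustrated with a simple placement rule; the modulo scheme below is an assumption made for illustration and is not necessarily the assignment used by the parity rotation module described below.

```python
# Minimal sketch of rotating the parity-mirror assignment across the N+M
# storage devices so that parity work is spread out.

def parity_mirror_devices(stripe_index, n_data, m_parity):
    """Return (parity_device_ids, data_device_ids) for one stripe."""
    total = n_data + m_parity
    parity = [(stripe_index + k) % total for k in range(m_parity)]
    data = [d for d in range(total) if d not in parity]
    return parity, data

for stripe in range(4):
    print(stripe, parity_mirror_devices(stripe, n_data=4, m_parity=1))
# stripe 0 -> parity on device 0, stripe 1 -> device 1, and so on
```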
In one embodiment, the storage devices 150 are solid-state storage devices 102, where each solid-state storage device 102 has an associated solid-state memory 110 and solid-state storage controller 104. In another embodiment, each storage device 150 includes a solid-state storage controller 104 and associated solid-state storage 110 acting as a cache for other lower-cost, lower-performance storage, such as tape storage or hard disk drives. In another embodiment, one or more servers 112 include one or more clients 114 that send storage requests to the progressive RAID. Those skilled in the art will recognize that progressive RAID may be arranged in other system configurations having N storage devices 150 and one or more parity-mirror storage devices 1602.
FIG. 11 is a schematic block diagram illustrating one embodiment of an apparatus 1700 for progressive RAID in accordance with the present invention. In various embodiments, the apparatus 1700 includes a storage request receiving module 1702, a striping module 1704, a parity-mirror module 1706, a parity progression module 1708, a parity rotation module 1710, a mirror set module 1712, an update module 1714, a mirror repair module 1716 with a direct client response module 1718, a pre-consolidation repair module 1720, a post-consolidation repair module 1722, a data reconstruction module 1724, and a parity reconstruction module 1726, which are described below. The modules 1702-1726 are depicted as being located within the server 112, but some or all of the functionality of the modules 1702-1726 may be distributed across multiple servers 112, storage controllers 152, storage devices 150, and clients 114.
The apparatus 1700 includes a storage request receiving module 1702 that receives a request to store data, where the data is data of a file or data of an object. In one embodiment, the storage request is an object request. In another embodiment, the storage request is a block storage request. In one embodiment, the storage request does not include the data itself, but includes commands used by the storage devices 150 and the parity-mirror storage devices 1602 to DMA or RDMA the data from the client 114 or another source. In another embodiment, the storage request includes the data to be stored as a result of the storage request. In another embodiment, the storage request includes a command that causes the data to be stored to the storage device set 1604. In another embodiment, the storage request includes a plurality of commands. Those skilled in the art will recognize other storage requests to store data that are suitable for progressive RAID.
The data is stored at a location accessible to the apparatus 1700. In one embodiment, the data resides in random access memory ("RAM"), such as RAM of the client 114 or of a server 112. In another embodiment, the data is stored on a hard disk drive, tape storage, or other mass storage device. In one embodiment, the data is arranged as an object or file. In another embodiment, the data is arranged as data blocks that are part of an object or file. Those skilled in the art will recognize other forms and locations of data that may be the subject of a storage request.
The apparatus 1700 includes a striping module 1704 that computes a stripe shape for the data. The stripe shape includes one or more stripes, where each stripe includes a set of N data segments. Typically the number of data segments in a stripe depends on how many storage devices 150 are assigned to the RAID group. For example, if RAID 5 is used, one storage device 150 is assigned as the parity-mirror storage device 1602a to store the parity data for a particular stripe. If four other storage devices 150a, 150b, 150c, 150d are assigned to the RAID group, the stripe will have four data segments in addition to the parity data segment. The striping module 1704 writes the N data segments of a stripe to the N storage devices 150a-n, such that each of the N data segments is written to a different storage device 150a, 150b, ..., 150n in the set 1604 of storage devices 150 assigned to the stripe. Those skilled in the art will appreciate the various combinations of storage devices 150 assigned to a RAID group for a particular RAID level, as well as how stripe shapes are generated and how data is divided into the N data segments of a stripe.
The apparatus 1700 includes a parity-mirror module 1706 that writes the set of N data segments of the stripe to one or more parity-mirror storage devices 1602 in the storage device set 1604, where the parity-mirror storage devices 1602 are devices other than the N storage devices 150. The N data segments are retained for a later calculation of one or more parity data segments. The parity-mirror module 1706 copies the set of N data segments to the parity-mirror storage device 1602 without immediately computing a parity data segment, so the write completes more quickly than if a parity data segment were computed before storing the N data segments. After the N data segments are stored on the parity-mirror storage device 1602, they can still be read efficiently or used to repair data even if one of the N storage devices 150 is unavailable. A further advantage is that data can be read as in a RAID 0 configuration, with all N data segments available from one storage device (e.g., 1602a). Where there is more than one parity-mirror storage device (e.g., 1602a, 1602b), the parity-mirror module 1706 copies the N data segments to each of the parity-mirror storage devices 1602a, 1602b.
The apparatus 1700 includes a parity progression module 1708 that computes one or more parity data segments for the stripe in response to a storage consolidation operation. The one or more parity data segments are computed from the N data segments stored on the parity-mirror storage device 1602. The parity progression module 1708 stores a parity data segment on each of the one or more parity-mirror storage devices 1602. The storage consolidation operation recovers at least storage space and/or data of at least one of the one or more parity-mirror storage devices 1602. For example, the storage consolidation operation may be the garbage collection of the solid-state storage device 102 described above in relation to the apparatus 800 and method 900 of FIGS. 8 and 9. The storage consolidation operation may also be a defragmentation operation for a hard disk drive or another similar operation that consolidates data to free storage space. A storage consolidation operation, as used herein, may also include an operation to recover data, such as an operation to recover from an error when a storage device 150 is unavailable, or an operation to read data from the parity-mirror storage device 1602 for some other reason. In another embodiment, the parity progression module 1708 computes the parity data segments only when the parity-mirror storage device 1602 is not otherwise busy.
Advantageously, by delaying the computation and storage of the parity data segments of a stripe until the parity-mirror storage device 1602 requires more storage space or a storage consolidation operation is otherwise required, the N data segments on the parity-mirror storage device 1602 remain available for reading data segments, recovering data, and rebuilding data onto a storage device 150. The parity progression module 1708 may operate autonomously as a background operation, independent of the operation of the storage request receiving module 1702, the striping module 1704, or the parity-mirror module 1706. Those skilled in the art will recognize other reasons for delaying the computation of parity data segments as part of a progressive RAID operation.
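A minimal sketch of this progressive flow follows: segments are striped to the N devices, the full set is also copied to the parity-mirror device, and XOR parity is computed only when a consolidation is triggered. The in-memory dictionaries and the single XOR parity are illustrative assumptions, not the patented implementation.

```python
# Minimal sketch of a progressive RAID write with deferred parity computation.

from functools import reduce

N = 4
devices = {d: {} for d in range(N)}          # data devices: stripe -> segment
parity_mirror = {}                           # stripe -> list of N segments OR parity

def write_stripe(stripe_id, data, seg_size):
    segments = [data[i:i + seg_size] for i in range(0, seg_size * N, seg_size)]
    for d, seg in enumerate(segments):
        devices[d][stripe_id] = seg            # striping module: one segment per device
    parity_mirror[stripe_id] = list(segments)  # parity-mirror module: full copy, no parity yet

def consolidate(stripe_id):
    """Storage consolidation: replace the N segments with their XOR parity."""
    segs = parity_mirror[stripe_id]
    if isinstance(segs, list):                 # not yet progressed
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), segs)
        parity_mirror[stripe_id] = parity      # space for N segments is reclaimed

write_stripe(0, bytes(range(16)), seg_size=4)
consolidate(0)
print(parity_mirror[0])
```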
In one embodiment, some or all of the functions of the modules 1702-1708 (receiving a data storage request, computing a stripe shape and writing the N data segments to the N storage devices, writing the set of N data segments to the parity-mirror storage device, and computing the parity data segments) are implemented within the storage devices 150 of the storage device set 1604, the clients 114, or a third-party RAID management device. The third-party RAID management device may be a server 112 or another computer.
In one embodiment, the apparatus 1700 includes a parity rotation module 1710 that alternates, for each stripe, which storage devices 150 in the storage device set 1604 are assigned as the one or more parity-mirror storage devices 1602 for the stripe. As described above with respect to the system 1600 of FIG. 10, rotating which storage devices 150 serve as the parity-mirror storage devices of the stripes spreads the work of computing parity data segments across the storage devices 150 of the storage device set 1604.
In another embodiment, the storage device set 1604 is a first storage device set, and the apparatus 1700 includes a mirror set module 1712 that establishes one or more storage device sets in addition to the first storage device set 1604, such that each of the one or more additional storage device sets includes at least an associated striping module 1704 that writes the N data segments to the N storage devices 150 of that additional storage device set. In a related embodiment, each of the one or more additional storage device sets also includes an associated parity-mirror module 1706 that stores the set of N data segments and a parity progression module 1708 that computes the one or more parity data segments. Where the mirror set module 1712 establishes one or more mirrored storage device sets, the RAID may be a multiple-level RAID, such as RAID 50. In this embodiment, the RAID level may progress from RAID 10 (striped and mirrored data) to RAID 50 or 60 (parity data segments computed and stored for each storage device set 1604).
In one embodiment, the apparatus 1700 includes an update module 1714. The update module 1714 is typically used while the N data segments on the parity-mirror storage device 1602 have not yet been progressed into parity data segments. The update module 1714 receives an updated data segment, where the updated data segment corresponds to an existing data segment of the N data segments stored on the N storage devices 150. The update module 1714 copies the updated data segment to the storage device 150 of the stripe that stores the existing data segment, and also to the one or more parity-mirror storage devices 1602 of the stripe. The update module 1714 replaces the existing data segment stored on the storage device 150 of the N storage devices 150a-n with the updated data segment, and replaces the existing data segment stored on the one or more parity-mirror storage devices 1602 with the updated data segment.
In one embodiment, replacing a data segment includes writing the data segment to the storage device 150 and marking the corresponding data segment as invalid for subsequent garbage collection. An example of this embodiment has been described in the solid-state memory 110 and garbage collection device description above in relation to fig. 8 and 9. In another embodiment, replacing the data segment includes overwriting the existing data segment with the updated data segment.
In one embodiment, the storage device set 1604 is a first storage device set, and the apparatus 1700 includes a mirror repair module 1716 that recovers a data segment stored on a storage device 150 of the first storage device set 1604 when that storage device 150 is unavailable. The data segment is recovered from a mirror storage device containing a copy of the data segment. The mirror storage device is one of a second set of one or more storage devices 150 that store copies of the N data segments.
In yet another embodiment, the mirror repair module 1716 recovers the data segment in response to a read request from a client 114 to read the data segment. In a related embodiment, the mirror repair module 1716 further includes a direct client response module 1718 that sends the requested data segment from the mirror storage device directly to the client 114. In this embodiment, the requested data segment is copied to the client 114 without waiting for the data segment to be restored to the first storage device set 1604 before it is transmitted to the client 114.
In one embodiment, the apparatus 1700 includes a pre-consolidation repair module 1720 that recovers a data segment stored on a storage device 150 of the storage device set 1604 in response to a request to read the data segment. In this embodiment, the storage device 150 is unavailable, and the data segment is recovered from the one or more parity-mirror storage devices 1602 before the parity progression module 1708 generates the one or more parity data segments on the one or more parity-mirror storage devices 1602.
In another embodiment, the apparatus 1700 includes a post-consolidation repair module 1722 that recovers a data segment stored on a storage device 150 of the storage device set. In one embodiment, when the storage device 150 is unavailable and the parity progression module 1708 has already generated the one or more parity data segments, the data segment is recovered using the one or more parity data segments stored on the one or more parity-mirror storage devices 1602. For example, the post-consolidation repair module 1722 regenerates the missing data segment using the parity data segments together with the data segments available on the remaining available storage devices of the N storage devices 150.
In one embodiment, the apparatus 1700 includes a data reconstruction module 1724 that, in a rebuild operation, stores a recovered data segment onto a replacement storage device, where the recovered data segment matches an unavailable data segment stored on an unavailable storage device 150. The unavailable storage device 150 is one of the N storage devices 150 of the storage device set 1604. Typically the rebuild operation occurs after the failure of the storage device 150 that stored the unavailable data segment. The rebuild operation stores data segments on the replacement storage device to match the data segments previously stored on the unavailable storage device 150.
The rebuild operation may recover the data segment from several sources. For example, if a matching data segment is located on the parity-mirror storage device 1602, the data segment may be recovered from the parity-mirror storage device 1602 before the parity-mirror storage device 1602 is progressed. In another example, the data segment may be recovered from a mirror storage device containing a copy of the unavailable data segment. The data segment is typically recovered from the mirror storage device if it is no longer present on the one or more parity-mirror storage devices 1602, but may be recovered from the mirror storage device even if a matching data segment is still available on a parity-mirror storage device 1602.
In another embodiment, if the recovered data segment is not located on a parity-mirror storage device 1602 or on a mirror storage device, the recovered data segment is regenerated from the one or more parity data segments and the available data segments of the N data segments. The missing data segment is typically regenerated in this way when no matching data segment is available on any other storage device 150.
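For the common single-parity (XOR) case, regenerating a missing data segment from the parity data segment and the remaining available data segments can be sketched as follows; multi-parity configurations (M > 1) would require a true erasure code, so the XOR assumption here is purely illustrative.

```python
# Minimal sketch: missing segment = parity XOR (all available segments).

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_segment(available_segments, parity_segment):
    """Regenerate the missing data segment under single XOR parity."""
    missing = parity_segment
    for seg in available_segments:
        missing = xor_bytes(missing, seg)
    return missing

segments = [b"\x01\x02", b"\x0f\x10", b"\xaa\xbb", b"\x00\xff"]
parity = segments[0]
for s in segments[1:]:
    parity = xor_bytes(parity, s)

# pretend the device holding segments[2] failed
recovered = rebuild_segment(segments[:2] + segments[3:], parity)
print(recovered == segments[2])   # True
```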
In another embodiment, the apparatus 1700 includes a parity reconstruction module 1726 that, in a parity rebuild operation, stores a recovered parity data segment onto a replacement storage device, where the recovered parity data segment matches an unavailable parity data segment stored on an unavailable parity-mirror storage device. The unavailable parity-mirror storage device is one of the one or more parity-mirror storage devices 1602. The parity rebuild operation stores parity data segments on the replacement storage device to match the parity data segments previously stored on the unavailable parity-mirror storage device.
The data used to regenerate a recovered parity data segment in a rebuild operation may come from several sources. In one example, the recovered parity data segment is recovered using a parity data segment stored on the parity-mirror storage device 1602 of a second storage device set that stores a mirror copy of the stripe. When a mirror copy is available, using the mirrored parity data segment is desirable because the recovered parity data segment does not have to be recalculated. In another embodiment, if the N data segments on the N storage devices are available, the recovered parity data segment is regenerated from the N data segments stored on the N storage devices 150. Typically, when a single failure affects only the parity-mirror storage device 1602 being rebuilt, the N data segments on the N storage devices 150 are available.
In another embodiment, if one or more of the N data segments are not available from the N storage devices 150 of the first storage device set 1604 and a matching parity data segment is not available from the second storage device set, the recovered parity data segment is regenerated from the one or more storage devices 150 of the second storage device set that store copies of the N data segments. In yet another embodiment, the recovered parity data segment is generated from the available data segments and non-matching parity data segments, wherever those available data segments and non-matching parity data segments are located among the one or more storage device sets.
The data reconstruction module 1724 and the parity reconstruction module 1726 are typically used together to rebuild data segments and parity data segments onto a rebuilt storage device 150 when the parity-mirror assignment rotates among the storage devices 150 of the storage device set 1604. When a second parity-mirror storage device 1602b is available, the data reconstruction module 1724 and the parity reconstruction module 1726 can rebuild two failed storage devices 150, 1602 of the storage device set 1604. When the parity-mirror storage device 1602 has not yet been progressed to generate parity data segments, recovering a data segment or a storage device 150 is faster than when the parity-mirror storage device 1602 has been progressed, that is, when the parity data segments of the stripes have already been computed and stored and the N data segments on the parity-mirror storage device 1602 used to compute them have been deleted.
FIG. 12 is a schematic block diagram illustrating one embodiment of an apparatus 1800 for updating a data segment using progressive RAID in accordance with the present invention. Generally, the apparatus 1800 pertains to a RAID group in which one or more parity-mirror storage devices have been progressed, so that they contain parity data segments but no longer contain the N data segments used to generate the parity data segments. The apparatus 1800 includes an update receiving module 1802, an update copy module 1804, and a parity update module 1806, which are described below. The modules 1802-1806 of the apparatus 1800 are depicted as being located in the server 112, but may also be located in the storage devices 150, the clients 114, a combination of these devices, or may be distributed across multiple devices.
The stripes, data segments, storage devices 150, storage device sets 1604, parity data segments, and one or more parity-mirror storage devices 1602 are substantially the same as those described above in connection with the apparatus 1700 of FIG. 11. The apparatus 1800 includes an update receiving module 1802 that receives an updated data segment, where the updated data segment corresponds to an existing data segment of an existing stripe. In another embodiment, the update receiving module 1802 may accept multiple updates and may process the updates together or separately.
The apparatus 1800 includes an update copy module 1804 that copies the updated data segment to the storage device 150 that stores the corresponding existing data segment, and also to the one or more parity-mirror storage devices 1602 of the existing stripe. In another embodiment, the update copy module 1804 copies the updated data segment to either the parity-mirror storage device 1602 or the storage device 150 that stores the existing data segment, and the receiving device then forwards a copy of the updated data segment to the other device 1602, 150.
The apparatus 1800 includes a parity update module 1806 that computes one or more updated parity data segments for the one or more parity-mirror storage devices of the existing stripe in response to a storage consolidation operation. The storage consolidation operation is the same as described above in connection with the apparatus 1700 of FIG. 11. The storage consolidation operation recovers at least storage space and/or data on the one or more parity-mirror storage devices 1602 using the one or more updated parity data segments. By waiting to update the one or more parity data segments, the update is deferred until it is more convenient or until the storage space is actually needed.
In one embodiment, the updated parity data segment is computed from the existing parity data segment, the updated data segment, and the existing data segment. In one embodiment, the existing data segment remains in place until it is read to generate the updated parity data segment. One advantage of this embodiment is that the overhead of copying the existing data segment to the parity-mirror storage device 1602 or to another location where the updated parity data segment is generated can be avoided if it never becomes necessary. One disadvantage is that if the storage device 150 holding the existing data segment fails, the existing data segment must be recovered before the updated parity data segment can be generated.
In another embodiment, when the storage device 150 of the N storage devices 150a-n that stores the existing data segment receives a copy of the updated data segment, the existing data segment is copied to the parity-mirror storage device 1602 and stored there until the storage consolidation operation. In another embodiment, if a storage consolidation operation on the storage device 150 of the N storage devices 150a-n that stores the existing data segment occurs before the storage consolidation operation that triggers computation of the updated parity data segment, the existing data segment is copied to the parity-mirror storage device 1602 in response to the storage consolidation operation on that storage device 150. The latter embodiment is advantageous because the existing data segment is not copied until required, by whichever storage consolidation operation comes first: that of the storage device 150 storing the existing data segment or that of the parity-mirror storage device 1602.
In one embodiment, the updated parity data segment is calculated from an existing parity data segment, an updated data segment, and a delta data segment, wherein the delta data segment is the difference between the updated data segment and the existing data segment. Typically, generating the delta data segment is a partial scheme or an intermediate step in updating the parity data segment. The advantage of generating delta data segments is that they are very compressible and can be compressed before transmission.
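For XOR parity, the delta-based update described above reduces to a standard identity: the delta is the XOR of the existing and updated segments, and the updated parity is the existing parity XOR that delta. The sketch below assumes single XOR parity and is shown only as an illustration, not as the patented computation.

```python
# Minimal sketch of the parity-update arithmetic for XOR parity:
#   delta      = old_segment XOR new_segment
#   new_parity = old_parity XOR delta   (== old_parity XOR old XOR new)

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

old_segment = b"\x10\x20\x30"
new_segment = b"\x11\x22\x33"
other_segment = b"\x0a\x0b\x0c"

old_parity = xor_bytes(old_segment, other_segment)

delta = xor_bytes(old_segment, new_segment)        # highly compressible if few bytes changed
new_parity_via_delta = xor_bytes(old_parity, delta)
new_parity_direct = xor_bytes(new_segment, other_segment)

print(new_parity_via_delta == new_parity_direct)   # True
```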
In one embodiment, the delta data segment is stored on the storage device that stores the existing data segment until the delta data segment is read to generate the updated parity data segment. In another embodiment, the delta data segment is copied to the parity-mirror storage device 1602 when the storage device 150 storing the existing data segment receives the copy of the updated data segment. In another embodiment, the delta data segment is copied to the parity-mirror storage device 1602 in response to a storage consolidation operation on the storage device 150 storing the existing data segment. As with copying the existing data segment, the latter embodiment is advantageous because the delta data segment is not moved until required, by whichever storage consolidation operation comes first: that of the storage device 150 storing the existing data segment or the one that triggers computation of the updated parity data segment.
In various embodiments, some or all of the operations of the modules 1802, 1804, 1806 (receiving the updated data segment, copying the updated data segment, and computing the updated parity data segments) are performed in the storage devices 150 of the storage device set 1604, in the client 114, or in a third-party RAID management device. In another embodiment, the storage consolidation operation occurs autonomously and independently of the operation of the update receiving module 1802 and the update copy module 1804.
FIG. 13 is a schematic flow chart diagram illustrating one embodiment of a method 1900 for managing data using progressive RAID in accordance with the present invention. The method 1900 begins (step 1902) and the storage request receiving module 1702 receives a request to store data (step 1904), where the data is data of a file or data of an object. The striping module 1704 computes a stripe shape for the data and writes the N data segments to the N storage devices 150 (step 1906). The stripe shape includes one or more stripes. Each stripe includes a set of N data segments, where each of the N data segments is written to a different storage device 150 of the storage device set 1604 assigned to the stripe.
The parity-mirror module 1706 writes the set of N data segments of the stripe to one or more parity-mirror storage devices 1602 of the storage device set 1604 (step 1908). The one or more parity-mirror storage devices are devices other than the N storage devices 150a-n. The parity progression module 1708 determines whether a storage consolidation operation is pending (step 1910). If the parity progression module 1708 determines that no storage consolidation operation is pending, the method 1900 returns and continues to determine whether a storage consolidation operation is pending (step 1910). In another embodiment, the storage request receiving module 1702, the striping module 1704, and the parity-mirror module 1706 continue to receive storage requests, compute stripe shapes, and store data segments.
If the parity progression module 1708 determines that a storage consolidation operation is pending (step 1910), the parity progression module 1708 computes the parity data segment of the stripe (step 1912). The parity data segment is computed from the N data segments stored on the parity-mirror storage device 1602. The parity progression module 1708 stores the parity data segment on the parity-mirror storage device 1602 (step 1912) and the method 1900 ends (step 1914). The storage consolidation operation occurs independently of receiving the request to store the N data segments (step 1904), writing the N data segments to the N storage devices (step 1906), and writing the N data segments to the one or more parity-mirror storage devices (step 1908). The storage consolidation operation recovers at least storage space or data of the parity-mirror storage device 1602.
FIG. 14 is a schematic flow chart diagram illustrating one embodiment of a method 2000 for updating data segments using progressive RAID processing in accordance with the present invention. The method 2000 begins (step 2002) and the update receiving module 1802 receives an updated data segment (step 2004), wherein the updated data segment corresponds to an existing data segment of an existing stripe. The update copy module 1804 copies the updated data segments to the storage device 150 (which stores the corresponding existing data segments) and also copies the updated data segments to one or more parity-mirror storage devices 1602 (which correspond to the existing stripes) (step 2006).
The parity update module 1806 determines whether there is a pending storage consolidation operation (step 2008). If the parity update module 1806 determines that there is no pending storage consolidation operation, the parity update module 1806 waits for a storage consolidation operation. In one embodiment, the method 2000 returns, receives additional updated data segments (step 2004), and copies the updated data segments (step 2006). If the parity update module 1806 determines that there is a pending storage consolidation operation, the parity update module 1806 computes one or more updated parity data segments for the one or more parity-mirror storage devices 1602 of the existing stripe (step 2010) and the method 2000 ends (step 2012).
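Continuing the sketch above, and again only as a hedged illustration with hypothetical names, the update path of method 2000 can be modeled as follows: the updated segment is copied immediately to the data device and, when parity has not yet been generated, to the parity-mirror, while folding the change into an already-generated parity segment is deferred until a storage consolidation operation. With XOR parity, the updated parity segment equals the existing parity segment XOR the existing data segment XOR the updated data segment, which is the delta relationship used below.

```python
# Sketch of the update path of method 2000, extending ProgressiveStripe above.
# Assumes XOR parity: updated_parity = existing_parity ^ existing_seg ^ updated_seg.
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class UpdatableStripe(ProgressiveStripe):
    def __init__(self, n_devices):
        super().__init__(n_devices)
        # Stands in for the copies held on the parity-mirror device until consolidation.
        self.pending_updates = {}                 # device index -> (existing, updated)

    def update_segment(self, index: int, updated: bytes):
        """Steps 2004-2006: copy the updated segment to the device holding the
        existing segment and to the parity-mirror; defer the parity update."""
        existing = self.data_devices[index]
        self.pending_updates[index] = (existing, updated)
        self.data_devices[index] = updated        # replace the existing segment
        if self.parity_mirror:                    # parity not generated yet:
            self.parity_mirror[index] = updated   # just replace the mirrored copy

    def consolidate_updates(self):
        """Steps 2008-2010: on a storage consolidation operation, fold each pending
        update into the previously generated parity segment."""
        if self.parity is not None:
            for existing, updated in self.pending_updates.values():
                delta = xor_bytes(existing, updated)      # delta data segment
                self.parity = xor_bytes(self.parity, delta)
        self.pending_updates.clear()              # mirror space recovered

stripe = UpdatableStripe(n_devices=4)
stripe.write(b"example object data for one stripe")
stripe.consolidate()                              # parity generated once
new_seg = b"NEW DATA!".ljust(len(stripe.data_devices[1]), b"\0")
stripe.update_segment(1, new_seg)
stripe.consolidate_updates()                      # deferred parity update applied
```

If the update arrives before the parity segment has ever been generated, the sketch simply overwrites the mirrored copy, so the next consolidation computes parity directly from up-to-date segments and no separate parity update is required in that case.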
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (27)

1. An apparatus for reliably storing data with high performance, the apparatus comprising:
a storage request receiving module that receives a request to store data, wherein the data comprises data of a file or data of an object;
a striping module that computes a stripe shape for the data, the stripe shape containing one or more stripes, each stripe comprising a set of N data segments, and that writes the N data segments of a stripe to N storage devices, wherein each of the N data segments is written to a different storage device in a set of storage devices assigned to the stripe;
a parity-mirror module that writes the set of N data segments of the stripe to one or more parity-mirror storage devices of the set of storage devices, the one or more parity-mirror storage devices being devices other than the N storage devices; and
a parity progression module that computes one or more parity data segments for the stripe in response to a storage consolidation operation, the one or more parity data segments computed from the N data segments stored in the one or more parity-mirror storage devices, the parity progression module further storing the one or more parity data segments on each of the one or more parity-mirror storage devices, the storage consolidation operation recovering at least one of storage space and data of at least one of the one or more parity-mirror storage devices.
2. The apparatus of claim 1, further comprising a parity rotation module that alternately designates storage devices within the storage device set assigned to each stripe as the one or more parity-mirror storage devices of the stripe.
3. The apparatus of claim 1, wherein the storage consolidation operation is performed autonomously with respect to storage operations of the storage request receiving module, the striping module, and the parity-mirror module.
4. The apparatus of claim 1, wherein the set of storage devices comprises a first set of storage devices, the apparatus further comprising a mirror set module that generates one or more additional sets of storage devices other than the first set of storage devices, wherein each of the one or more additional sets of storage devices comprises an associated striping module that writes the N data segments to N storage devices of that additional set of storage devices.
5. The apparatus of claim 4, wherein each of the one or more additional sets of storage devices comprises an associated parity-mirror module that stores the set of N data segments and an associated parity progression module that computes the one or more parity data segments.
6. The apparatus of claim 1, further comprising an update module that updates a data segment by:
receiving an updated data segment, the updated data segment corresponding to an existing data segment of the N data segments stored in the N storage devices;
copying the updated data segment to the storage device of the stripe storing the existing data segment and also to the one or more parity-mirror storage devices of the stripe;
replacing the existing data segment stored in the storage device of the N storage devices with the updated data segment; and
wherein the parity progression module, in response, replaces the corresponding existing data segment stored within the one or more parity-mirror storage devices with the updated data segment when the parity progression module has not yet generated the one or more parity data segments within the one or more parity-mirror storage devices.
7. The apparatus of claim 1, wherein the set of storage devices comprises a first set of storage devices, the apparatus further comprising a mirror repair module that recovers a data segment stored in a storage device of the first set of storage devices when the storage device of the first set of storage devices is unavailable, the data segment being recovered from a mirror storage device containing a copy of the data segment, the mirror storage device comprising one of a set of one or more storage devices storing a copy of the N data segments.
8. The apparatus of claim 7, wherein the mirror repair module restores the data segment in response to a read request from a client to read the data segment.
9. The apparatus of claim 8, wherein the mirror repair module further comprises a direct client response module that sends the requested data segment from the mirror storage device to the client.
10. The apparatus of claim 1, further comprising a pre-consolidation repair module that restores a data segment stored in a storage device of the storage device set when the storage device is unavailable, in response to a request to read the data segment, the data segment being restored from the one or more parity-mirror storage devices before the parity progression module generates the one or more parity data segments within the one or more parity-mirror storage devices.
11. The apparatus of claim 1, further comprising a post-consolidation repair module that recovers a data segment stored within a storage device of the storage device set when the storage device is unavailable, wherein the data segment is recovered using the one or more parity data segments stored within the one or more parity-mirror storage devices after the parity progression module has generated the one or more parity data segments in response to the storage consolidation operation.
12. The apparatus of claim 1, further comprising:
a data reconstruction module that stores a recovered data segment to a replacement storage device in a reconstruction operation, the recovered data segment matching an unavailable data segment stored within an unavailable storage device, the unavailable storage device comprising one of the N storage devices, the reconstruction operation restoring the data segment to the replacement storage device to match the data segment previously stored within the unavailable storage device, the recovered data segment being recovered by the reconstruction operation in one of the following ways:
if the matching data segment is located within the one or more parity-mirror storage devices, recovering from the matching data segment stored within the one or more parity-mirror storage devices;
if the matching data segment is not within the one or more parity-mirror storage devices, recovering from a mirror storage device containing a copy of the unavailable data segment, the mirror storage device comprising one of a set of one or more storage devices that store a copy of the N data segments; and
if the matching data segment is not located within the one or more parity-mirror storage devices or the mirror storage device, recovering from a regenerated data segment regenerated from the one or more parity data segments and available data segments of the N data segments.
13. The apparatus of claim 1, further comprising:
a parity reconstruction module that reconstructs a recovered parity data segment within a replacement storage device in a parity reconstruction operation, the recovered parity data segment matching an unavailable parity data segment stored in an unavailable parity-mirror storage device, the unavailable parity-mirror storage device comprising one of the one or more parity-mirror storage devices, the parity reconstruction operation restoring the parity data segment to the replacement storage device to match the parity data segment previously stored in the unavailable parity-mirror storage device, the recovered parity data segment being regenerated for the parity reconstruction operation in one of the following ways:
regenerating from a matching parity data segment stored within a parity-mirror storage device of a second storage device set, the second storage device set storing a mirror copy of the stripe;
if the N data segments are available in the N storage devices, regenerating from the N data segments stored in the N storage devices;
if one or more of the N data segments are not available from the N storage devices and the matching parity data segment is not available within the second set of storage devices, regenerating from copies of the N data segments stored in one or more storage devices of the second set of storage devices; and
regenerating from available data segments and non-matching parity data segments regardless of where the available data segments and the non-matching parity data segments are located within the one or more storage device sets.
14. The apparatus of claim 1, wherein the N storage devices comprise N solid-state storage devices, each of the N solid-state storage devices having a solid-state controller.
15. The apparatus of claim 1, wherein at least one of receiving a data storage request, computing a stripe shape and writing N data segments to N storage devices, writing a set of N data segments to a parity-mirror storage device, and computing a parity data segment occurs in one of:
a storage device of the storage device set;
a client; and
a third party RAID management device.
16. An apparatus for updating data in a progressive Redundant Array of Independent Drives (RAID) group, the apparatus comprising:
an update receiving module that receives an updated data segment, the updated data segment corresponding to an existing data segment of an existing stripe, the stripe including data from a file or object split into one or more stripes, each stripe including N data segments and one or more parity data segments, the N data segments being stored within storage devices of a set of storage devices assigned to the stripe, each parity data segment being generated from the N data segments of the stripe and stored in one or more parity-mirror storage devices assigned to the stripe, the set of storage devices including the one or more parity-mirror storage devices, the existing stripe including the N existing data segments and the one or more existing parity data segments;
an update copy module that copies the updated data segment to a storage device that stores the corresponding existing data segment and also to the one or more parity-mirror storage devices that correspond to the existing stripe; and
a parity update module that computes one or more updated parity data segments for the one or more parity-mirror storage devices of the existing stripe in response to a storage consolidation operation, the storage consolidation operation utilizing the one or more updated parity data segments to recover at least one of storage space and data within the one or more parity-mirror storage devices.
17. The apparatus of claim 16, wherein the updated parity data segment is calculated from the existing parity data segment, the updated data segment, and the existing data segment.
18. The apparatus of claim 17, wherein the existing data segment is one or more of:
left in place until the existing data segment is read to generate the updated parity data segment;
copied to the one or more parity-mirror storage devices in response to the storage device of the N storage devices storing the existing data segment receiving a copy of the updated data segment; and
copied to the one or more parity-mirror storage devices in response to a storage consolidation operation on the storage device of the N storage devices storing the existing data segment.
19. The apparatus of claim 16, wherein the updated parity data segment is calculated from an existing parity data segment, an updated data segment, and a delta data segment, the delta data segment resulting from a difference between the updated data segment and the existing data segment.
20. The apparatus of claim 19, wherein the delta data segment is one or more of:
stored in the storage device that stores the existing data segment until the delta data segment is read to generate the updated parity data segment;
copied to the one or more parity-mirror storage devices in response to the storage device storing the existing data segment receiving a copy of the updated data segment; and
copied to the one or more parity-mirror storage devices in response to a storage consolidation operation on the storage device storing the existing data segment.
21. The apparatus of claim 16, wherein at least one of receiving the updated data segment, replicating the updated data segment, and calculating the updated parity data segment occurs in one of:
a storage device of the storage device set;
a client; and
a third party RAID management device.
22. The apparatus of claim 16, wherein the storage consolidation operation is performed autonomously with respect to operations of the update receiving module and the update copy module.
23. A system for reliably storing data with high performance, the system comprising:
a set of storage devices, the set of storage devices assigned to a stripe, the set of storage devices comprising N storage devices and one or more parity-mirror storage devices other than the N storage devices;
a storage request receiving module that receives a request to store data, wherein the data comprises data of a file or data of an object;
a striping module that computes a stripe shape for the data, the stripe shape containing one or more stripes, each stripe comprising a set of N data segments, and that writes the N data segments of the stripe to the N storage devices, wherein each of the N data segments is written to a different storage device in the set of storage devices;
a parity-mirror module to write a set of N data segments of a stripe to each of one or more parity-mirror storage devices; and
a parity progression module that computes one or more parity data segments for the stripe in response to a storage consolidation operation, the one or more parity data segments computed from the N data segments stored within the one or more parity-mirror storage devices, the parity progression module further storing the one or more parity data segments on each of the one or more parity-mirror storage devices, the storage consolidation operation being performed autonomously from storage operations of the storage request receiving module, the striping module, and the parity-mirror module, the storage consolidation operation recovering at least one of storage space and data in the one or more parity-mirror storage devices.
24. The system of claim 23, further comprising one or more servers comprising the N storage devices and the one or more parity-mirror storage devices.
25. The system of claim 24, further comprising one or more clients within the one or more servers, wherein the storage request receiving module receives the request to store data from at least one of the one or more clients.
26. A computer program product comprising a computer readable medium having computer usable program code executable to perform reliable, high performance storage of data, the operations of the computer program product comprising:
receiving a data storage request, wherein the data comprises data of a file or data of an object;
computing a stripe shape for the data, the stripe shape containing one or more stripes, each stripe comprising a set of N data segments, and writing the N data segments to N storage devices, wherein each of the N data segments is written to a different storage device in the set of storage devices assigned to the stripe;
writing the set of N data segments of the stripe to one or more parity-mirror storage devices of the set of storage devices, the one or more parity-mirror storage devices being devices other than the N storage devices; and
in response to a storage consolidation operation, computing one or more parity data segments for the stripe, the one or more parity data segments computed from the N data segments stored on the one or more parity-mirror storage devices, the one or more parity data segments being stored on the one or more parity-mirror storage devices, the storage consolidation operation being performed autonomously from receiving the request to store the N data segments, writing the N data segments to the N storage devices, and writing the set of N data segments to the one or more parity-mirror storage devices, the storage consolidation operation recovering at least one of storage space and data in the one or more parity-mirror storage devices.
27. A computer program product comprising a computer readable medium having computer usable program code executable to perform reliable, high performance storage of data, the operations of the computer program product comprising:
receiving an updated data segment, the updated data segment corresponding to an existing data segment of an existing stripe, the stripe including data from a file or object partitioned into one or more stripes, each stripe including N data segments and one or more parity data segments, the N data segments stored within storage devices of a set of storage devices assigned to the stripe, each parity data segment generated from the N data segments of the stripe and stored within one or more parity-mirror storage devices assigned to the stripe, the set of storage devices including the one or more parity-mirror storage devices, the existing stripe including the N existing data segments and the one or more existing parity data segments;
copying the updated data segment to a storage device storing the corresponding existing data segment and also to the one or more parity-mirror storage devices corresponding to the existing stripe; and
computing one or more updated parity data segments for the one or more parity-mirror storage devices of the existing stripe in response to a storage consolidation operation, the storage consolidation operation recovering at least one of storage space and data in the one or more parity-mirror storage devices using the one or more updated parity data segments.
HK10109328.2A (filed 2007-12-06, priority 2006-12-06): Apparatus, system, and method for data storage using progressive raid (en)

Applications Claiming Priority (2)

US 60/873,111 (priority date 2006-12-06)
US 60/974,470 (priority date 2007-09-22)

Publications (1)

HK1142971A (en), publication date 2010-12-17


[8]ページ先頭

©2009-2025 Movatter.jp