RELATED APPLICATION
The present invention claims priority to commonly owned Indian Patent Application Serial No. 358/DEL/2007, entitled SYSTEM AND METHOD FOR INDEXING USER DATA ON STORAGE SYSTEMS, by Yusuf Batterywala, filed on Feb. 21, 2007, the contents of which are hereby incorporated by reference.
FIELD OF THE INVENTION
The present invention relates to storage systems and, in particular, to indexing user data on storage systems.
BACKGROUND OF THE INVENTION
A storage system is a computer that provides storage service relating to the organization of information on writable persistent storage devices, such as memories, tapes or disks. The storage system is commonly deployed within a storage area network (SAN) or a network attached storage (NAS) environment. When used within a NAS environment, the storage system may be embodied as a file server including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g., the disks. Each “on-disk” file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual user data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories is stored. As used herein, a file is defined to be any logical storage container that contains a fixed or variable amount of data storage space, and that may be allocated storage out of a larger pool of available data storage space. As such, the term file, as used herein and unless the context otherwise dictates, can also mean a container, object or any other storage entity that does not correspond directly to a set of fixed data storage devices. A file system is, generally, a computer system for managing such files, including the allocation of fixed storage space to store files on a temporary basis.
The file server, or storage system, may be further configured to operate according to a client/server model of information delivery to thereby allow many client systems (clients) to access shared resources, such as files, stored on the storage system. Sharing of files is a hallmark of a NAS system, which is enabled because of its semantic level of access to files and file systems. Storage of information on a NAS system is typically deployed over a computer network comprising a geographically distributed collection of interconnected communication links, such as Ethernet, that allow clients to remotely access the information (files) on the storage system. The clients typically communicate with the storage system by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP).
In the client/server model, the client may comprise an application executing on a computer that “connects” to the storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. NAS systems generally utilize file-based access protocols; therefore, each client may request the services of the storage system by issuing file system protocol messages (in the form of packets) to the file system over the network identifying one or more files to be accessed without regard to specific locations, e.g., blocks, in which the data are stored on disk. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS), the Network File System (NFS) and the Direct Access File System (DAFS) protocols, the utility of the storage system may be enhanced for networking clients.
A SAN is a high-speed network that enables establishment of direct connections between a storage system and its storage devices. The SAN may thus be viewed as an extension to a storage bus and, as such, an operating system of the storage system enables access to stored information using block-based access protocols over the “extended bus”. In this context, the extended bus is typically embodied as Fibre Channel (FC) or Ethernet media adapted to operate with block access protocols, such as Small Computer Systems Interface (SCSI) protocol encapsulation over FC or TCP/IP/Ethernet.
A SAN arrangement or deployment allows decoupling of storage from the storage system, such as an application server, and some level of information storage sharing at the application server level. There are, however, environments wherein a SAN is dedicated to a single server. In some SAN deployments, the information is organized in the form of databases, while in others a file-based organization is employed. Where the information is organized as files, the client requesting the information maintains file mappings and manages file semantics, while its requests (and server responses) address the information in terms of block addressing on disk using, e.g., a logical unit number (lun).
Certain storage systems may support multi-protocol access and, to that end, enable clients to access data via both block and file-level requests. One example of such a storage system is described in U.S. patent application Ser. No. 10/215,917, entitled MULTI-PROTOCOL STORAGE APPLIANCE THAT PROVIDES INTEGRATED SUPPORT FOR FILE AND BLOCK ACCESS PROTOCOLS, by Brian Pawlowski, et al.
One common use for a storage system that supports block-based protocols is to export one or more data containers, such as luns, for use by a client of the storage system. The client typically includes an operating system and/or a volume manager that forms the data containers into one or more volume (or disk) groups. A volume group is a set of luns aggregated to provide a storage space that may be utilized by the client to overlay one or more file systems or other structured storage thereon. As used herein, the term storage space means storage managed by a client that utilizes one or more data containers hosted by one or more storage systems, an example of which is a file system overlaid onto a volume group that comprises one or more luns stored within a plurality of volumes of a single storage system or within a plurality of volumes of a plurality of storage systems. Another example of a storage space is a volume group managed by a client to enable an application, such as a database application, to store structured data thereon.
Storage system users often may wish to search the data containers stored on a storage system to identify those containers that contain user data matching one or more search criteria, such as phrases and terms. As noted, a data container may include a file, a directory, a virtual disk (vdisk), or other data construct that is addressable via a storage system. For example, a user may wish to search and locate all data containers serviced by the storage system that contain user data matching the phrase “Accounts Receivable.” By enabling searching of data containers on storage systems, users may improve utilization of their data, especially in large enterprises where the number of data containers may be substantial, e.g., in the tens or hundreds of millions.
To identify data containers that contain user data that match the search criteria, a search process may need to examine all of the data containers within the storage system every time a search is requested. In a typical storage system having a substantial number of data containers, this is not a practical solution due to the substantial amount of time required to access and process every data container to determine if it contains the search criteria. To enable faster searching, a search index of information associated with the data containers may be generated for the storage system. The storage system search index may be constructed by performing a file system “crawl” through the entire file system (or other data container organizational structure) serviced by the storage system. Typically, a file system crawl involves accessing every data container within the file system to obtain the necessary index information. However, such a file system crawl is expensive both in terms of disk input/output operations and processing time, and suffers from the same practical problems of directly accessing each data container. That is, the file system crawl may substantially impede access to the file system, e.g., for tens of minutes at a time, which results in an unacceptable loss of performance.
Furthermore, the file system crawl is typically performed at regular intervals (periodically) to maintain up-to-date index information. As a result of the substantial processing time required, a further disadvantage of the file system crawl is that the periodic search index information may be inconsistent with the current state of the file system, i.e., the index information only represents the file system as of the completion of the last file system crawl.
A further noted disadvantage arises in a storage system environment where a client overlays a file system or other structured storage onto storage space provided by a storage system. In such an environment, indexing functionality within the storage system may not operate properly, as the overlaid file system may utilize a different format than that of the storage system's native file system format. This prevents a storage system administrator, who may support a plurality of differing vendors of clients, from being able to quickly and efficiently search through user data to enable users to identify data containers containing desired search terms.
SUMMARY OF THE INVENTION
The present invention overcomes the disadvantages of the prior art by providing a system and method for indexing user data of data containers stored on storage space provided by one or more storage systems. A management module configured to implement indexing and searching functionality executes on a management server that is operatively interconnected with each storage system. Each client of the storage system executes a novel client side agent that is configured to detect changes to user data stored by the client on the storage system. In response to detecting that the data has been changed/modified, the agent examines each modified data container and parses the modified data to identify new and/or modified index terms or the creation/deletion of data containers. Notably, the client-side agent may utilize client based file system (or other storage management) functionality to access the data overlaid onto the storage space exported by a storage system.
Once the data has been parsed to identify new/modified index terms, the agent transmits the parsed data to the data management module executing on the management server. The data management module receives the parsed data and updates a search database using the received data.
Upon initiating a search, the user enters a search query into a user interface of the data management module. In response, the data management module formulates a database query and forwards it to the search database, which processes the query and returns the results to the data management module. The data management module then displays the results of the query to the user. The data management module may filter the displayed results based on access controls determined by, e.g., permissions associated with the user entering the query.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be better understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:
FIG. 1 is a schematic block diagram of an exemplary storage system environment in accordance with an illustrative embodiment of the present invention;
FIG. 2 is a schematic block diagram of an exemplary storage operating system executing on a storage system in accordance with an illustrative embodiment of the present invention;
FIG. 3 is a schematic block diagram showing file systems overlaid onto a volume group comprising one or more luns in accordance with an illustrative embodiment of the present invention;
FIG. 4 is a schematic block diagram of an exemplary word table data structure of a database in accordance with an illustrative embodiment of the present invention;
FIG. 5 is a schematic block diagram of an exemplary file table data structure of a database in accordance with an illustrative embodiment of the present invention;
FIG. 6 is a schematic block diagram of an exemplary content table data structure of a database in accordance with an illustrative embodiment of the present invention;
FIG. 7 is a flowchart detailing the steps of a procedure for updating search index information in accordance with an illustrative embodiment of the present invention; and
FIG. 8 is a flowchart detailing the steps of a procedure for querying a search database in accordance with an illustrative embodiment of the present invention.
DETAILED DESCRIPTION
A. Network Environment
FIG. 1 is a schematic block diagram of an environment 100 including a storage system 120 that may be advantageously used with the present invention. The storage system is illustratively a computer that provides storage service relating to the organization of information on storage devices, such as disks 130 of a disk array 160. It should be noted that in alternate embodiments, a plurality of storage systems 120 may be utilized. As such, the description of a single storage system should be taken as exemplary only. The storage system 120 comprises a processor 122, a memory 124, a network adapter 126 and a storage adapter 128 interconnected by a system bus 125. The storage system 120 also includes a storage operating system 200 that preferably implements a high-level module, such as a file system, to logically organize the information as a hierarchical structure of data containers, such as directories, files and special types of files called virtual disks (hereafter logical units or “luns”) on the disks.
In the illustrative embodiment, the memory 124 comprises storage locations that are addressable by the processor and adapters for storing software program code and data structures. The processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the data structures. Storage operating system 200, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the system 120 by, inter alia, invoking storage operations executed by the storage system. It will be apparent to those skilled in the art that other processing and memory means, including various computer readable media, may be used for storing and executing program instructions pertaining to the inventive technique described herein.
The network adapter 126 comprises the mechanical, electrical and signaling circuitry needed to connect the storage system 120 to a client 110 over a computer network 105, which may comprise a point-to-point connection or a shared medium, such as a local area network (LAN) or wide area network (WAN). Illustratively, the computer network 105 may be embodied as an Ethernet network or a Fibre Channel (FC) network. The client 110 may communicate with the storage system over network 105 by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP) or SCSI encapsulated in FC (FCP).
The client 110 may be a general-purpose computer configured to execute an operating system (OS) 172, an application 173 and a novel agent 175. The operating system 172 may be a conventional operating system such as Microsoft Windows available from Microsoft Corp., a Linux-based operating system, etc. The operating system may include the functionality of a file system and/or a volume manager and execute one or more applications 173, such as a database that utilizes raw storage space made available by the operating system. The operating system 172 and applications 173 executing thereon may utilize storage exported by the storage system 120. As noted above, the operating system 172 may overlay a file system or other form of structured storage onto one or more luns exported from the storage system 120.
A novel agent 175 executes within the client 110 to identify modifications that occur to data containers managed by the client and to update a data management module 155 executing on a management server 150 in accordance with the principles of the present invention. Specifically, the agent 175 tracks modifications to the data containers by, e.g., routinely scanning for changes to user data of the containers. In response to identifying such changes, the agent parses the data to identify new and/or modified index terms. In alternate embodiments, the agent 175 may have plugin modules associated therewith. Such plugin modules may add functionality to parse differing data formats, enabling the agent 175 to parse a greater variety of data container formats.
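By way of illustration only, such a plugin mechanism might resemble the following Python sketch; the names register_parser, default_parser and parse_terms are hypothetical and are not part of the described embodiment.

    import re

    _PARSERS = {}  # maps a file extension to a format-specific parser plugin

    def register_parser(extension, parser):
        """Associate a plugin parser with a data container format."""
        _PARSERS[extension] = parser

    def default_parser(data):
        """Fallback: extract word-like index terms from raw text."""
        return set(re.findall(r"[A-Za-z]+", data.lower()))

    def parse_terms(filename, data):
        """Dispatch to a format-specific plugin, else the default parser."""
        ext = filename.rsplit(".", 1)[-1].lower()
        return _PARSERS.get(ext, default_parser)(data)

    # A plugin for an additional format may be registered without changing the agent:
    register_parser("csv", lambda data: default_parser(data.replace(",", " ")))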
These parsed index terms are then forwarded to data management module 155 executing on management server 150. Illustratively, the management server 150 is a separate computer executing within environment 100. However, in alternate embodiments, the functionality of the management server 150 and/or data management module 155 may be integrated with client 110 and/or storage system 120. As such, the description of a separate management server 150 should be taken as exemplary only. The data management module 155 provides functionality for indexing and searching user data overlaid onto storage space provided by the storage system 120. The management server 150 also includes an exemplary user interface 152 to enable administrators and/or other users access to the data management module 155 for purposes of, e.g., entering search queries.
The management server 150 is operatively interconnected with a search database 158 utilized to maintain index information for user data. The search database 158 may be implemented within the management server 150 or via a separate database server. As described further below, the search database 158 may be implemented using a variety of data structures, e.g., tables, to track particular search terms, as well as the data containers containing those terms, for purposes of responding to queries entered by a user.
The client 110 may interact with the storage system 120 in accordance with a client/server model of information delivery. That is, the client may request the services of the storage system, and the system may return the results of the services requested by the client, by exchanging packets over the network 105. The clients may issue packets including file-based access protocols, such as the Common Internet File System (CIFS) protocol or Network File System (NFS) protocol, over TCP/IP when accessing information in the form of files and directories. Alternatively, the client may issue packets including block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol encapsulated over TCP (iSCSI) and SCSI encapsulated over Fibre Channel (FCP), when accessing information in the form of blocks.
The storage adapter 128 cooperates with the storage operating system 200 executing on the system 120 to access information requested by a user (or client). The information may be stored on any type of attached array of writable storage device media such as video tape, optical, DVD, magnetic tape, bubble memory, electronic random access memory, micro-electro mechanical and any other similar media adapted to store information, including data and parity information. However, as illustratively described herein, the information is preferably stored on the disks 130, such as HDD and/or DASD, of array 160. The storage adapter includes input/output (I/O) interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, FC serial link topology.
Storage of information on array 160 is preferably implemented as one or more storage “volumes” that comprise a collection of physical storage disks 130 cooperating to define an overall logical arrangement of volume block number (vbn) space on the volume(s). Each logical volume is generally, although not necessarily, associated with its own file system. The disks within a logical volume/file system are typically organized as one or more groups, wherein each group may be operated as a Redundant Array of Independent (or Inexpensive) Disks (RAID). Most RAID implementations enhance the reliability/integrity of data storage through the redundant writing of data “stripes” across a given number of physical disks in the RAID group, and the appropriate storing of parity information with respect to the striped data. An illustrative example of a RAID implementation is a RAID-4 level implementation, although it should be understood that other types and levels of RAID implementations may be used in accordance with the inventive principles described herein.
B. Storage Operating System
To facilitate access to the disks 130, the storage operating system 200 implements a write-anywhere file system that cooperates with virtualization modules to “virtualize” the storage space provided by disks 130. The file system logically organizes the information as a hierarchical structure of named data containers, such as directories and files, on the disks. Each “on-disk” file may be implemented as a set of disk blocks configured to store information, such as data, whereas the directory may be implemented as a specially formatted file in which names and links to other files and directories are stored. The virtualization modules allow the file system to further logically organize information as a hierarchical structure of data containers, such as blocks, on the disks that are exported as named luns.
In the illustrative embodiment, the storage operating system is preferably the NetApp® Data ONTAP® operating system available from Network Appliance, Inc., Sunnyvale, Calif. that implements a Write Anywhere File Layout (WAFL®) file system. However, it is expressly contemplated that any appropriate storage operating system may be enhanced for use in accordance with the inventive principles described herein. As such, where the term “WAFL” is employed, it should be taken broadly to refer to any file system that is otherwise adaptable to the teachings of this invention.
FIG. 2 is a schematic block diagram of the storage operating system 200 that may be advantageously used with the present invention. The storage operating system comprises a series of software layers organized to form an integrated network protocol stack or, more generally, a multi-protocol engine that provides data paths for clients to access information stored on the storage system using block and file access protocols. The protocol stack includes a media access layer 210 of network drivers (e.g., gigabit Ethernet drivers) that interfaces to network protocol layers, such as the IP layer 212 and its supporting transport mechanisms, the TCP layer 214 and the User Datagram Protocol (UDP) layer 216. A file system protocol layer provides multi-protocol file access and, to that end, includes support for the Direct Access File System (DAFS) protocol 218, the NFS protocol 220, the CIFS protocol 222 and the Hypertext Transfer Protocol (HTTP) protocol 224. A VI layer 226 implements the VI architecture to provide direct access transport (DAT) capabilities, such as RDMA, as required by the DAFS protocol 218.
An iSCSI driver layer 228 provides block protocol access over the TCP/IP network protocol layers, while a FC driver layer 230 receives and transmits block access requests and responses to and from the storage system. The FC and iSCSI drivers provide FC-specific and iSCSI-specific access control to the blocks and, thus, manage exports of luns to either iSCSI or FCP or, alternatively, to both iSCSI and FCP when accessing the blocks on the storage system. In addition, the storage operating system includes a storage module embodied as a RAID system 240 that manages the storage and retrieval of information to and from the volumes/disks in accordance with I/O operations, and a disk driver system 250 that implements a disk access protocol such as, e.g., the SCSI protocol.
Bridging the disk software layers with the integrated network protocol stack layers is a virtualization system that is implemented by a file system 280 interacting with virtualization modules illustratively embodied as, e.g., vdisk module 290 and SCSI target module 270. The vdisk module 290 is layered on the file system 280 to enable access by administrative interfaces, in response to a user (system administrator) issuing commands to the storage system. The SCSI target module 270 is disposed between the FC and iSCSI drivers 228, 230 and the file system 280 to provide a translation layer of the virtualization system between the block (lun) space and the file system space, where luns are represented as blocks.
The file system is illustratively a message-based system that provides logical volume management capabilities for use in access to the information stored on the storage devices, such as disks. That is, in addition to providing file system semantics, the file system 280 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of storage bandwidth of the disks, and (iii) reliability guarantees, such as mirroring and/or parity (RAID). The file system 280 illustratively implements the WAFL file system (hereinafter generally the “write-anywhere file system”) having an on-disk format representation that is block-based using, e.g., 4 kilobyte (kB) blocks and using index nodes (“inodes”) to identify files and file attributes (such as creation time, access permissions, size and block location). The file system uses files to store metadata describing the layout of its file system; these metadata files include, among others, an inode file. A file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk.
Broadly stated, all inodes of the write-anywhere file system are organized into the inode file. A file system (fs) info block specifies the layout of information in the file system and includes an inode of a file that includes all other inodes of the file system. Each logical volume (file system) has an fsinfo block that is preferably stored at a fixed location within, e.g., a RAID group. The inode of the root fsinfo block may directly reference (point to) blocks of the inode file or may reference indirect blocks of the inode file that, in turn, reference direct blocks of the inode file. Within each direct block of the inode file are embedded inodes, each of which may reference indirect blocks that, in turn, reference data blocks of a file.
Operationally, a request from the client 110 is forwarded as a packet over the computer network 105 and onto the storage system 120 where it is received at the network adapter 126. A network driver (of layer 210 or layer 230) processes the packet and, if appropriate, passes it on to a network protocol and file access layer for additional processing prior to forwarding to the write-anywhere file system 280. Here, the file system generates operations to load (retrieve) the requested data from disk 130 if it is not resident “in core”. If the information is not in core, the file system 280 indexes into the inode file using the inode number to access an appropriate entry and retrieve a logical vbn. The file system then passes a message structure including the logical vbn to the RAID system 240; the logical vbn is mapped to a disk identifier and disk block number (disk, dbn) and sent to an appropriate driver (e.g., SCSI) of the disk driver system 250. The disk driver accesses the dbn from the specified disk 130 and loads the requested data block(s) in memory 124 for processing by the storage system. Upon completion of the request, the storage system (and operating system) returns a reply to the client 110 over the network 105.
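The read path just described may be summarized in a short sketch. This is a heavily simplified, hypothetical rendering for illustration only; the structure names (inode_file, block_map, raid_map) are invented and do not correspond to actual Data ONTAP internals.

    def handle_read(inode_number, file_block, inode_file, raid_map, disks, cache):
        inode = inode_file[inode_number]      # index into the inode file by inode number
        vbn = inode["block_map"][file_block]  # retrieve the logical vbn for this block
        if vbn in cache:                      # data already resident "in core"
            return cache[vbn]
        disk_id, dbn = raid_map[vbn]          # RAID system maps vbn -> (disk, dbn)
        data = disks[disk_id].read(dbn)       # disk driver accesses the dbn
        cache[vbn] = data                     # load the requested block in memory
        return data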
It should be noted that the software “path” through the storage operating system layers described above needed to perform data storage access for the client request received at the storage system may alternatively be implemented in hardware. That is, in an alternate embodiment of the invention, a storage access request data path may be implemented as logic circuitry embodied within a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC). This type of hardware implementation increases the performance of the storage service provided by storage system 120 in response to a request issued by client 110. Moreover, in another alternate embodiment of the invention, the processing elements of adapters 126, 128 may be configured to offload some or all of the packet processing and storage access operations, respectively, from processor 122, to thereby increase the performance of the storage service provided by the system. It is expressly contemplated that the various processes, architectures and procedures described herein can be implemented in hardware, firmware or software.
As used herein, the term “storage operating system” generally refers to the computer-executable code operable to perform a storage function in a storage system, e.g., that manages data access and may, in the case of a file server, implement file system semantics. In this sense, the ONTAP software is an example of such a storage operating system implemented as a microkernel and including the file system module to implement the file system semantics and manage data access. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows XP®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
In addition, it will be understood to those skilled in the art that the inventive technique described herein may apply to any type of special-purpose (e.g., file server, filer or multi-protocol storage appliance) or general-purpose computer, including a standalone computer or portion thereof, embodied as or including a storage system 120. An example of a multi-protocol storage appliance that may be advantageously used with the present invention is described in previously mentioned U.S. patent application Ser. No. 10/215,917 titled MULTI-PROTOCOL STORAGE APPLIANCE THAT PROVIDES INTEGRATED SUPPORT FOR FILE AND BLOCK ACCESS PROTOCOLS, filed on Aug. 8, 2002. Moreover, the teachings of this invention can be adapted to a variety of storage system architectures including, but not limited to, a network-attached storage environment, a storage area network and disk assembly directly-attached to a client or host computer. The term “storage system” should therefore be taken broadly to include such arrangements in addition to any subsystems configured to perform a storage function and associated with other equipment or systems.
C. File System Organization
FIG. 3 is a schematic block diagram of an exemplary environment 300 showing a number of file systems overlaid onto logical volumes existing within volume groups in accordance with an embodiment of the present invention. At the base of the environment 300 are disks 130 associated with storage system 120. Overlaid onto the disks 130 is a storage system volume 305 that includes a number of luns 310A-D, which may be exported by the storage system. In the illustrative environment 300, a plurality of volume groups 315 is maintained and managed by a client 110 of the storage system 120. The volume groups 315 are illustratively managed by file system or volume manager functionality within the operating system 172 of the client 110. Alternately, a volume manager system may be implemented to organize the volumes into volume groups for use by clients. As such, the description of the storage space as organized into client volume groups by a client operating system should be taken as exemplary only. The first volume group 315A comprises luns 310A, B, whereas the second volume group 315B comprises luns 310C, D. The client may overlay a number of file systems onto logical volumes defined within the volume groups 315, although this is not a requirement of the illustrative environment. That is, instead of overlaying file systems onto the logical volumes, the client may access the logical volume as a region of raw data storage. In the illustrative environment 300, however, file systems 320A, B are overlaid onto host logical volume 325A within volume group 315A, and similarly, file system 320C is overlaid onto host logical volume 325B within volume group 315B.
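The layering of FIG. 3 may be modeled as a simple nested mapping. The sketch below is illustrative only; the keys mirror the reference numerals in the text, but the structure itself is an assumption rather than part of the described embodiment.

    environment_300 = {
        "storage_system_volume_305": ["lun_310A", "lun_310B", "lun_310C", "lun_310D"],
        "volume_groups": {
            "315A": {"luns": ["lun_310A", "lun_310B"],
                     "logical_volumes": {"325A": ["fs_320A", "fs_320B"]}},
            "315B": {"luns": ["lun_310C", "lun_310D"],
                     "logical_volumes": {"325B": ["fs_320C"]}},
        },
    }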
As noted above, user data contained in file systems 320 or logical volumes 325 may be stored in a format different than that utilized by the storage system for volume 305 and/or luns 310. Thus, the agent 175 may utilize the operating system 172 to access the user data stored in file systems 320 and/or logical volumes 325.
D. Indexing User Data
The present invention provides a system and method for indexing user data of data containers stored on one or more storage systems. This enables users to locate data quickly, without the requirement of performing slow file system crawls to locate data containers containing identified search terms. A management module configured to implement indexing and searching functionality executes on a management server that is operatively interconnected with the storage system. Each client of the storage system executes a novel client side agent that is configured to detect changes to data stored by the client on the storage system. In response to detecting that data has been modified, the agent examines the modified data containers and parses the modified data to identify new and/or modified index terms or the creation/deletion of data containers. Notably, the client-side agent may utilize client based file system (or other storage management) functionality to access the data overlaid onto storage space exported by the storage system.
Once the data has been parsed to identify new/modified index terms, the agent transmits the parsed data to the management module executing on the management server. The management module receives the parsed data and updates a search database using the received data.
Upon initiating a search, the user enters an appropriate search query into a user interface of the data management module. For example, a user may desire to locate all files with the term “Accounts Receivable.” In response, the data management module formulates a database query and forwards it to the search database, which processes the query and returns results to the data management module. The data management module then displays the results of the query to the user. As noted, a crawl is not performed through the file system (or storage space) on the storage system to locate the data containers, thereby improving search times and reducing the amount of resources consumed by the search.
As noted above, the search database 158 is utilized by the data management module 155 to track index information in accordance with an illustrative embodiment of the present invention. The database may be implemented using conventional database techniques, such as using a structured query language (SQL) database. Illustratively, the search database manages associations between specific index search words, files (or other data containers) and the storage system hosting the data containers.
The search database 158 illustratively implements a schema that organizes information as a plurality of data structures (such as tables) including, e.g., a word table data structure 400, a file table data structure 500 and/or one or more content table data structures 600. It should be noted that the description of various data structures contained within the search database should be taken as exemplary only and that alternate techniques for organizing search information may be utilized in accordance with the principles of the present invention. As such, the search database schema described herein should be taken as exemplary only.
FIG. 4 is a schematic block diagram of an exemplary word table data structure 400 for use in a search database in accordance with an embodiment of the present invention. The data structure 400 includes a plurality of entries 405, each of which comprises a word field 410, a word identifier (ID) field 415 and, in alternate embodiments, additional fields 420. Each word field 410 contains a text string of a particular search indexed word. For example, word field 410 may contain the string “revenue.” The word ID field 415 contains a numeric identifier associated with the word contained within the word field 410. Thus, for example, the index term “revenue” may be associated with a word ID of 101. The word ID may then be utilized throughout the search database to represent the word contained in word field 410. Thus the word table data structure 400 associates words with numeric identifiers. For search terms that include multiple words, e.g., “Accounts Receivable,” a plurality of entries will be created, one for each word of the search term. In such embodiments, the data management module 155 may correlate searches so that the results displayed are only those data containers that include all of the words of the search term.
FIG. 5 is a schematic block diagram of an exemplary file table data structure 500 for use in a search database in accordance with an embodiment of the present invention. The file table data structure 500 includes a plurality of entries 505. Each entry 505 illustratively includes a filename field 510, a file identifier (ID) field 515, a file type field 520, a storage system type field 525, a storage system name field 530, a host (client) names field 535 and, in alternate embodiments, additional fields 540. The filename field 510 contains a file name of the data container stored within the storage system by the client. The file identifier field 515 contains an identifier of the file or other data container. The file type field 520 identifies a type of file, e.g., a text file, a Microsoft Word file, etc. The storage system type field 525 identifies whether the data container is stored within, e.g., a SAN or a NAS environment. Illustratively this may be implemented as a Boolean value; however, in alternate embodiments, additional techniques to differentiate the storage environment may be utilized. The storage system name field 530 identifies the storage system storing the data container identified by the entry 505. The client names field 535 identifies clients that have access to the identified data container.
FIG. 6 is a schematic block diagram of an exemplary content table data structure 600 for use in a search database in accordance with an embodiment of the present invention. The content table data structure 600 includes a plurality of entries 605. Each entry 605 includes a file ID field 515, a word ID field 415 and, in alternate embodiments, additional fields 610. The file ID field 515 identifies a file (or other data container) stored by a client on the storage system. The word ID field 415 identifies a word from word table data structure 400 described above in reference to FIG. 4. Thus, the content table data structure 600 provides a mapping between files and search index words. By examining the word table data structure 400 for a given search index word, a word ID 415 may be located. Then, by examining the content table data structure 600, one or more file IDs 515 of data containers containing the identified word may be identified. Utilizing the file identifiers 515, the database may examine the file table data structure 500 to obtain additional information relating to the data containers.
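A minimal sketch of this schema and lookup flow follows, assuming a SQLite-style SQL database; the exact column names and types are assumptions that merely follow the fields of FIGS. 4-6.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.executescript("""
    CREATE TABLE word_table (
        word    TEXT,                             -- word field 410
        word_id INTEGER PRIMARY KEY               -- word ID field 415
    );
    CREATE TABLE file_table (
        filename            TEXT,                 -- filename field 510
        file_id             INTEGER PRIMARY KEY,  -- file ID field 515
        file_type           TEXT,                 -- file type field 520
        storage_system_type TEXT,                 -- SAN/NAS field 525
        storage_system_name TEXT,                 -- storage system name field 530
        client_names        TEXT                  -- host (client) names field 535
    );
    CREATE TABLE content_table (
        file_id INTEGER REFERENCES file_table(file_id),  -- field 515
        word_id INTEGER REFERENCES word_table(word_id)   -- field 415
    );
    """)

    # The lookup flow described above: word -> word ID -> file IDs -> file entries.
    rows = db.execute("""
        SELECT f.filename, f.storage_system_name
          FROM word_table w
          JOIN content_table c ON c.word_id = w.word_id
          JOIN file_table f    ON f.file_id = c.file_id
         WHERE w.word = ?
    """, ("revenue",)).fetchall()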
FIG. 7 is a flowchart detailing the steps of a procedure 700 for updating search index information in accordance with an embodiment of the present invention. The procedure 700 begins in step 705 and continues to step 710 where the agent 175 determines that data has been changed on client 110 by, e.g., periodically querying a file system or other component of the operating system 172 to identify recently modified data containers. Alternately, the agent 175 may routinely perform a scan of the file system and/or overlaid storage to determine recently modified data containers. It should be noted that in an alternate embodiment of the present invention, the agent may be configured to exclude certain data containers from being indexed. For example, should an administrator know that a particular storage system is under heavy I/O, the administrator may configure the system to not scan and/or index data containers associated with that storage system.
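One plausible implementation of the change detection of step 710 is a periodic modification-time scan. The procedure leaves the mechanism open, so the following sketch is an assumption rather than the described embodiment; the mount point path is hypothetical.

    import os, time

    def find_modified(root, since):
        """Yield paths under 'root' whose modification time is after 'since'."""
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    if os.path.getmtime(path) > since:
                        yield path
                except OSError:
                    continue  # container deleted between listing and stat

    last_scan = time.time() - 300  # e.g., scan every five minutes
    changed = list(find_modified("/mnt/overlaid_fs", last_scan))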
In response to determining that data has been changed, the agent 175 parses the changed data to identify new/modified index terms and/or new/deleted data containers in step 715. Such parsing may be performed by, for example, the agent 175 reading the new and/or modified data containers to identify certain search terms. The agent may parse the data by, for example, invoking file system and/or volume manager read functionality contained within the operating system. Alternately, the agent 175 may utilize an application, such as a database application executing on the client, to perform read operations to identify the new index terms. The agent then transmits the parsed data to the data management module in step 720. The agent 175 may transmit the data to the data management module using conventional remote procedure calls (RPCs). However, in alternate embodiments, the agent 175 may transmit data using any acceptable point-to-point data transmission technique. As such, description of the utilization of RPCs should be taken as exemplary only.
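The hand-off of step 720 might look as follows, using Python's standard-library XML-RPC purely as a stand-in for the conventional RPCs mentioned above; the endpoint URL and the update_index method name are hypothetical.

    import xmlrpc.client

    def transmit(parsed_terms, server_url="http://mgmt-server:8000/"):
        """Send {filename: set-of-terms} mappings to the data management module."""
        proxy = xmlrpc.client.ServerProxy(server_url)
        for filename, terms in parsed_terms.items():
            proxy.update_index(filename, sorted(terms))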
As the agent 175 executes on the client 110 and may access file system and/or volume manager functionality of the operating system 172, data may be indexed that is not stored in the storage system's native data format. That is, client data may utilize any form of data format overlaid onto luns (or other storage) exported from the storage system 120. By utilizing the novel agent 175, user data may be indexed regardless of the data format utilized by the storage system. Furthermore, the present invention permits user data indexing on storage systems that do not include indexing functionality within a storage operating system.
The data management module 155 receives the parsed data in step 725 and updates the search database 158 with the received parsed data in step 730. For example, the data management module receives the parsed data and generates appropriate word entries within word table data structure 400. Furthermore, the data management module may create additional associations within the content table data structure 600. If the parsed data signifies that a new data container has been created, a new entry may be generated in the file table data structure 500. By updating the search database, the data management module enables future queries to return the most up-to-date information.
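Steps 725-730 could be realized against the hypothetical schema sketched earlier; the helper below inserts new words as needed and records file-to-word associations, and is an illustrative assumption rather than the patented implementation.

    def update_search_database(db, file_entry, terms):
        """file_entry: (filename, file_type, storage_system_type,
        storage_system_name, client_names); terms: parsed index words."""
        cur = db.execute(
            "INSERT INTO file_table (filename, file_type, storage_system_type,"
            " storage_system_name, client_names) VALUES (?, ?, ?, ?, ?)",
            file_entry)
        file_id = cur.lastrowid
        for term in terms:
            row = db.execute("SELECT word_id FROM word_table WHERE word = ?",
                             (term,)).fetchone()
            word_id = row[0] if row else db.execute(
                "INSERT INTO word_table (word) VALUES (?)", (term,)).lastrowid
            db.execute("INSERT INTO content_table (file_id, word_id)"
                       " VALUES (?, ?)", (file_id, word_id))
        db.commit()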
Procedure 700 may be performed by a plurality of agents that may update a single data management module 155 and search database 158. However, in alternate embodiments, a plurality of data management modules may be updated. If a single data management module and search database are utilized, a user may be able to perform broader searches by querying a central data management module. The procedure 700 then completes in step 735.
FIG. 8 is a flowchart detailing the steps of a procedure 800 for responding to search queries in accordance with an embodiment of the present invention. The procedure 800 begins in step 805 and continues to step 810 where a user enters a query into the user interface 152 of the data management module 155. An exemplary query may be to identify all data containers containing a particular search term. Such a query may be utilized by the user (e.g., an administrator) to identify particular subsets of the overall data containers for use in data management. The user interface 152 then forwards the query to the data management module 155 in step 815. The user interface may forward the queries via a local procedure call (LPC), if the user interface and data management module execute on the same management server 150, or via an RPC should the data management module and the user interface execute on differing servers. It should be noted that in alternate embodiments other forms of interprocess communication may be utilized. Furthermore, the user interface may be integrated into the data management module, thereby obviating the need for interprocess communication. The data management module formulates a query, e.g., a SQL query, and forwards the query to the database in step 820. The query represents the query entered by the administrator in step 810 above. It should be noted that the description of the use of SQL should be taken as exemplary only and that other forms of database querying techniques may be utilized in accordance with the principles of the present invention.
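For a multi-word search term such as “Accounts Receivable,” the formulation of step 820 might require every word of the term to match, as noted in the discussion of FIG. 4. The following sketch assumes the hypothetical schema shown earlier.

    def formulate_query(words):
        """Build a SQL query matching only containers holding ALL given words."""
        placeholders = ", ".join("?" for _ in words)
        sql = f"""
            SELECT c.file_id
              FROM content_table c
              JOIN word_table w ON w.word_id = c.word_id
             WHERE w.word IN ({placeholders})
             GROUP BY c.file_id
            HAVING COUNT(DISTINCT w.word) = ?"""
        return sql, [*words, len(words)]

    sql, params = formulate_query(["accounts", "receivable"])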
The database performs the query and responds with the results of the query in step 825. In the example of the database schema described above in reference to FIGS. 4-6, the database may examine the word table data structure 400 to identify appropriate word IDs associated with the index terms to be queried. The word IDs may then be located within the content table data structure 600 to identify one or more file identifiers of data containers including the identified search terms. The file identifiers may then be utilized to identify the appropriate entries in table 500. The information in table 500 may further be utilized to formulate the response to the query. The data management module displays results based on, e.g., access permissions via the user interface in step 830. In the illustrative embodiment, the data management module may perform an access control check before displaying the results of the query. Thus, for example, should a plurality of clients utilize a common data management module, results returned from the search database 158 may include matches from clients other than the one associated with the particular user. Thus, the data management module 155 may filter responses before display to prevent users from obtaining matches on other clients. The procedure 800 then completes in step 835.
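The access control check of step 830 could be as simple as testing the host (client) names field 535 against the querying user's client; the comma-separated representation below is an assumption made for illustration.

    def filter_results(rows, user_client):
        """rows: (filename, client_names) tuples; keep only rows whose
        client_names list includes the querying user's client."""
        return [row for row in rows
                if user_client in (row[1] or "").split(",")]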
The foregoing description has been directed to specific embodiments of this invention. It will be apparent, however, that other variations and modifications may be made to the described embodiments, with the attainment of some or all of their advantages. For instance, it is expressly contemplated that the teachings of this invention can be implemented as software, including a computer-readable medium having program instructions executing on a computer, hardware, firmware, or a combination thereof. Accordingly, this description is to be taken only by way of example and not to otherwise limit the scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.