CROSS-REFERENCE TO RELATED APPLICATIONS This application is a divisional application of and claims the benefit of “METHOD, SYSTEM, PROGRAM, AND DATA STRUCTURE FOR CLEANING A DATABASE TABLE”, having application Ser. No. 09/399,694, filed Sep. 21, 1999, the disclosure of which is incorporated herein by reference in its entirety.
This application is related to the following commonly-assigned patents and co-pending patent applications, all of which are filed on the same date herewith, and which are incorporated herein by reference in their entirety:
- “Method, System, Program, And Data Structure for Transforming Database Tables,” to Mark A. Cesare, Tom R. Christopher, Julie A. Jerves, Richard H. Mandel III, and having attorney docket number ST9-99-034, application Ser. No. 09/400,507, and U.S. Pat. No. 6,920,443, issue don Jul. 19, 2005;
- “Method, System, Program, And Data Structure for Pivoting Columns in a Database Table,” to Mark A. Cesare, Julie A. Jerves, and Richard H. Mandel III, and having attorney docket number ST9-99-035, application Ser. No. 09/400,057, and U.S. Pat. No. 6,604,095, issued on Aug. 5, 2003;
- “Method, System, and Program for Inverting Columns in a Database Table,” to Mark A. Cesare, Julie A. Jerves, and Richard H. Mandel III, and having attorney docket no. ST9-99-038, application Ser. No. 09/400,690, and U.S. Pat. No. 6,748,389, issued on Jun. 8, 2004; and
- “Method, System, Program, And Data Structure For Cleaning a Database Table Using a Look-up Table,” Mark A. Cesare, Julie A. Jerves, and Richard H. Mandel III, and having attorney docket no. ST9-99-036, application Ser. No. 09/401,006, and U.S. Pat. No. 6,965,888, issued Nov. 15, 2005.
BACKGROUND OF THE INVENTION 1. Field of the Invention
The present invention relates to a method, system, program, and data structure for cleaning a database table and, in particular, for performing clean operations on columns in the database table.
2. Description of the Related Art
Data records in a computer database are maintained in tables, which are a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records which comprise the rows. A data warehouse is a large scale database including millions or billions of records defining business or other types of transactions or activities. Data warehouses contain a wide variety of data that present a coherent picture of business or organizational conditions over time. Various data analysis and mining tools are provided with the data warehouse to allow users to effectively analyze, manage and access large-scale databases to support management decision making. Data mining is the process of extracting valid and previously unknown information from large databases and using it to make crucial business decisions. In many real-world domains such as marketing analysis, financial analysis, fraud detection, etc, information extraction requires the cooperative use of several data mining operations and techniques.
Once the desired database tables have been selected and the data to be mined has been identified, transformations on the data may be necessary. Transformations vary from conversions of one type of data to another, e.g., converting nominal values into numeric ones so that they can be processed by a neural network, to definition of new attributes, i.e., derived attributes. New attributes are defined either by applying mathematical or logical operators on the values of one or more database attributes. The transformed data is stored in a target database where it may then be mined using one or more techniques to extract the desired type of information necessary to make the organizational decisions. Further details of data mining are described in the International Business Machines Corporation (IBM) publication entitled “White Paper: Data Mining Solutions” (IBM Copyright, 1996)
Data transformation refers to the process of filtering, merging, decoding, and translating source data to create validated data for the data warehouse and data mining tools. For example, a numeric regional code might be replaced with the name of the region. Data transformations and cleansing is used when data is inconsistent or incompatible between sources. In such case, some level of data cleansing is needed to ensure data consistency and accuracy. Some of the current techniques for transforming and cleansing data include the use of an SQL WHERE clause to limit the rows extracted from the source table. Further, formulas and expressions specified in the column definition window and constants and tokens are used to eliminate and modify data.
Previous versions of IBM Visual Warehouse included programs to allow users to perform numerous functions on the source data. For instance, if one database table has revenue data in U.S. dollars and another data table stores revenue data in foreign currency denominations, then the foreign revenue data must be cleansed before both sets of data can be analyzed together. Transformation operations may be performed using application programs external to the database program that process and transform tables of data records. Further details of data warehousing and data transforms, are described in the IBM publications “Managing Visual Warehouse, Version 3.1,” IBM document no. GC26-8822-01 (IBM Copyright, January, 1998), which is incorporated herein by reference in its entirety.
Notwithstanding current programs for cleansing data, there is a need in the art to provide users greater control over operations to clean input data.
SUMMARY OF THE PREFERRED EMBODIMENTS To overcome the limitations in the prior art described above, preferred embodiments disclose a method, system, program, and data structure for performing a clean operation on an input table. The input table to clean is indicated in an input data table name. At least one rule definition is processed to clean the input table. Each rule definition indicates a find criteria, a replacement value, and an input data column in the input table. For each rule definition, the input data column is searched for any fields that match the find criteria. The replacement value for the particular rule definition is inserted in the fields in the input data column that match the find criteria. Subsequent applications of additional rule definitions applied to the same input data column operate on replacement values inserted in the input data column during previously applied rule definitions. At least one of said rule definition includes multiple find criteria and a corresponding replacement value for each find criteria.
In further embodiments, each rule definition is associated with one rule table including the find criteria and replacement value. In such case, a rule table column parameter is provided for each rule definition indicating the columns in the rule table including the find criteria and replacement value for that rule definition. In certain embodiments, two rule definitions may have the same rule table. In such case, the rule table column parameters indicate different columns in the same rule table including the find criteria and replacement value for each rule definition. In still further embodiments, a separate rule table may include the find criteria and replacement value for different rule definitions.
Still further, a rule definition may include multiple find criteria and a corresponding replacement value for each find criteria. In such case, the step of searching the input data column comprises applying each of the multiple find criteria to one field until a match occurs or none of the multiple find criteria are found to match the field content. When a match is found, the replacement value corresponding to the find criteria is inserted in the field having the matching content.
In preferred embodiments, the rule definition may define a find and replace rule, a discretization rule or a numeric clip rule. Different rule definitions may define different rule types.
In preferred embodiments, the rule definitions may be communicated from one computer system, such as a client, to a computer system including the input data table, such as a database server. The rule definitions are then executed against the input table on the database server including the input tables.
Preferred embodiments provide a data command structure including one or more rule definitions for performing different operations on the data in an input data table. The preferred embodiments provide a command structure that accommodates multiple types of clean operations to be performed on an input data table before the input data table is written to the output table. Further, preferred embodiments allow a client to transfer clean commands including to the database server including the database for execution on the database server. This reduces network traffic as the database tables subject to the clean operation do not have to be transferred between the database server and the client constructing the clean commands. Further, in preferred embodiments, the rule definitions are maintained in rule tables in the server. This further reduces network traffic as the clean command need only specify the location of rules to apply and does not have to provide tables of rules.
BRIEF DESCRIPTION OF THE DRAWINGS Referring now to the drawings in which like reference numbers represent corresponding parts throughout:
FIG. 1 illustrates a computing environment in which preferred embodiments are implemented;
FIG. 2 illustrates the parameters used in a transform command to clean input tables in accordance with preferred embodiments of the present invention;
FIGS. 3a,3b,4, and5 illustrate examples of a rule table to clean data in accordance with preferred embodiments of the present invention;
FIGS. 6a,6b,6c,6d, and6eillustrate logic to clean an input data table in accordance with preferred embodiments of the present invention;
FIG. 7 illustrates an example of an input data table; and
FIGS. 8a,8b,8c, and8dillustrate examples of rule table to apply to clean columns in the input data table inFIG. 7 in accordance with preferred embodiments of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS In the following description, reference is made to the accompanying drawings which form a part hereof and which illustrate several embodiments of the present invention. It is understood that other embodiments may be utilized and structural and operational changes may be made without departing from the scope of the present invention.
Computing EnvironmentFIG. 1 illustrates a computing environment2 in which preferred embodiments are implemented. The environment2 includes aserver4 andclient6. Theserver4 andclient6 would include an operating system, such asMICROSOFT WINDOWS 98 and WINDOWS NT, AIX, OS/390, OS/400, OS/2, and SUN SOLARIS,** and may be comprised of any suitable server and client architecture known in the art. Theserver4 andclient6 include a database program8aand8b, wherein8acomprises theserver4 side of the database program and8bcomprises theclient6 side. Theserver4 andclient6 may communicate via any communication means known in the art, such as a telephone line, dedicated cable or network line, etc, using any protocol known in the art including TCP/IP network (e.g., an Intranet, the Internet), LAN, Ethernet, WAN, System Area Network (SAN), Token Ring, etc. Alternatively, there may be separate and different networks between theservers4 andclient6.
**Microsoft, Windows, Windows NT are registered trademarks and Microsoft SQL Server is a trademark of Microsoft Corporation; DB2, AIX, OS/390, OS/400, and OS/2 are registered trademarks of IBM; and Oracle8 is a trademark of Oracle Corporation; and Solaris is a trademark of Sun Microsystem, Inc.
The client/server database programs8a, b, may be comprised of any client/server database program known in the art, such as DB2, Oracle Corporation's ORACLE 8, Microsoft SQL Server,** etc. The database programs8aand8bare used to access operations and perform operations with respect to information maintained in one ormore databases10. The database(s)10 would consist of multiple tables having rows and columns of data, e.g., tables14 and18. Further details of the architecture and operation of a database program are described in the IBM publications “DB2 for OS/390: Administration Guide,Version 5” IBM document no. SC26-8957-01 (Copyright IBM. Corp., June, 1997) and “A Complete Guide to DB2 Universal Database,” by Don Chamberlin (1998), which publications are incorporated herein by reference in its entirety.
**Microsoft, Windows, Windows NT are registered trademarks and Microsoft SQL Server is a trademark of Microsoft Corporation; DB2, AIX, OS/390, OS/400, and OS/2 are registered trademarks of IBM; and Oracle8 is a trademark of Oracle Corporation; and Solaris is a trademark of Sun Microsystem, Inc.
In preferred embodiments, the clean transform program is implemented using the IBM stored procedure database program structure. A stored procedure is a block of procedural constructs and embedded SQL statements, i.e., an application program, that is stored in a database and can be called by name. Stored procedures allow an application program to execute in two parts. One part runs on the client and the other on the server. This allows one client call to produce several accesses of the database from the application program executing on the system, i.e., server including the database. Stored procedures are particularly useful to process a large number of database records, e.g., millions to billions of records, without having to transfer data between theserver4 andclient6. The client stored procedure passes input information to the server stored procedure which then, executing within the database program8 including thedatabase10, processes numerous database records according to such client input information. The server stored procedure program is initiated by the client, and during execution the client cannot communicate with the stored procedure executing in the server. Further details of stored procedures are described in the publication “A Complete Guide to DB2 Universal Database,” “A Complete Guide to DB2 Universal Database,” which was incorporated by reference above.
The clean transform of the preferred embodiments is implemented as a storedprocedure application program12 in theserver4. The clean transform storedprocedure12 receives as input a name of an input table14 in thedatabase10, and transformrules16 from theclient6 specifying the clean operations to perform on the data in the named input table14. The results of the clean operations performed by the clean transform storedprocedure12 in response to the transform rules16 are generated into the output table18. Alternatively, the transformed, i.e., cleaned input table is written to thedatabase10 to overwrite the previous version of the input table.
The client side of the clean transform storedprocedure20 generates the transform rules16 that specify the clean operations to perform and initiate execution of the clean transform storedprocedure12. The rules specified by theclient side20 are capable of implementing the following clean operations at the server4: replace values in selected data columns that are missing, nonvalid, or inconsistent with appropriate substitute values; remove unsuitable data rows; clip numeric values; perform numeric discretization; remove excess white space from text; copy columns from the source table to the target table. Further, the client generatedrules16 can instruct the clean transform storedprocedure12 to ignore case and white spaces when locating strings and permit a tolerance value for numeric data.
In preferred embodiments, the client side8bcan pass parameters to the clean transform stored procedure as “host variables” or embedded in a CALL statement. In either case, the parameters or transformrules16 must be specified to control the operation of theclean transform12.
Structure and Operation of the Clean Transform Rules A graphical user interface (GUI) at theclient6 may be used to enter various input parameters to control a clean operation. In response to such user input, theclient side20 would generate an application program interface (API) call to the clean transform storedprocedure12 including the parameters shown inFIG. 2 to initiate and control the operation of the clean transform storedprocedure12. Theclean parameter field50 shown inFIG. 2 includes aninput table name52,output table name54,log table name56,run ID58, and replacement rule definition(s)60.
Theinput table name52 specifies the input table14 table including the data subject to the clean operations and theoutput table name54, which is optional, specifies the output table18 where the cleaned data from the input table14 is placed. Thelog table name56 indicates the name of a log file into which warnings and exceptions that occur during execution of the clean transform storedprocedure12 are stored. Therun ID58 appears in the log file and is used to identify the transform operations for which the log entry was made. The replacement rule definition(s)60 identifies an input column from the input table14 and an output column in the output table18, and one or more rules to use when processing the input columns. The specified field ofparameters50 defines the transform rules16 the client side of the clean transform storedprocedure20 presents to the clean transform storedprocedure12 in theserver4.
Thereplacement rule definitions60 parameter comprises one or more definitions, such that eachdefinition80 includes sub-parameters82-100. Thus, therule definitions parameter60 may specify multipledistinct rule definitions80, each describing particular operations to perform on specified columns in the input table. Below is a description of the sub-parameters included in each definition80 a user may specify in the rule definitions parameter60:
- InputData Column Name82—defines an existing column in the table having the InputData Table Name52 that contains the data to be operated on in accordance with the rules specified in the rule table and other parameters. This parameter is required in thedefinition80. The input data column must be capable of being modified, so that the clean transform storedprocedure12 will not update a value which has a constraint, such as for a unique key column or a referential constraint.
- Output Data Column Name84: defines the name of a column in the table having the OutputData Table Name54 where cleaned data is placed. This parameter is optional, and the default is the inputdata column name82. Thus, if no output data column or output table is specified, clean data is placed back into the copy of the input table in memory being processed.
- Rule Table Name86: Provides the name of the table containing the “find and replace” rule patterns. This sub-parameter is optional; if not provided the values for the column are copied to the output table, and are only modified if the option to compress or remove white space is specified. In preferred embodiments, the rule tables are maintained in theserver4 and are specified in the rule definitions.
- Rule Table Sort-Key Column Name88: When a rule table includes multiple rules to apply to the input data column, thisparameter88 indicates a sort key-column in the rule table that provides an ordering in which the rules are applied to rows (fields) in the input data column.
- Rule90: This sub-parameter is required if therule table name86 is specified. This sub-parameter defines the type of rule included in the rule table to use when processing the values in the input data column. Further details ofrule90 are described below.
- Rule Table Columns92: This sub-parameter is specified if the rule table name is specified. The number of columns depend on the rules, examples of which are discussed below. If a rule table has different columns for different rule definitions, i.e., one rule table stores multiple rule definitions, then therule table columns92 parameter indicates those columns in the rule table including the specific rules for one rule definition.
- Row Clean Indicator94: Boolean YES/NO value. If YES is specified and there is a match in the input data column to the “find” condition, then the input value is not copied to the output table and the located matching rows are deleted from the temporary input table. Note that the setting this parameter to YES will affect subsequent attempts to process the transitional input data columns as the matching entries in the input data columns subject to the rowclean indicator94 are empty. If the value for this sub-parameter is NO, the matching entry in the input data column is copied to the output data column or the input data column. Further, any replacement value in a rule table is ignored and need not be specified when the row clean indicator is used. This sub-parameter is optional, and is ignored unless arule table name86 is specified.
- Rule Escape Character96: Certain characters have special meaning, such as the percent sign (%), which represents any string of zero or more characters, or the underscore character (), which represents a single character. The use of the rule escape character in the search string followed by the special character, means that the query looks for the actual special character following the escape character, and ignores the special meaning.
- White Space Indicator98: Optional boolean value that indicates whether white spaces are ignored when searching
- Numeric Tolerance100: Specifies a tolerance value when doing a “find” operation with respect to numeric values. Thus, numbers within the “tolerance” range of the searched upon number will produce a match.
A rule table may include columns for different rule definitions. In preferred embodiments the rule table is maintained in the database program8afor direct access by theclean transform12. In this way, users can utilize predefined clean rules in a rule table in the database program8a. The rule table may include one find column used in two rule definitions and different columns for the replacement values for the two rule definitions having the same find value. Thus, both rule definitions search on the same criteria, but provide different replacement values. Alternatively, the rule table may have different columns for the find criteria for two rule definitions, but the same replacement value for the different columns including the find criteria. Still further, the rule table may include multiple search criteria for a single rule definition and a corresponding search value for each of the search criteria. In this way, multiple search criteria could be applied to the same input column as part of one rule definition. Upon finding the first find value matching the field content, the replacement value corresponding to that matching find value would be inserted into the matching field.
Moreover, multiple rules in different columns of the same rule table or in different rule tables may sequentially operate against the same input data column. If more then one set of rules is to be applied to an input data column, adefinition80 must be specified for each rule. The order of the definitions listed in thereplacement rule definitions60 parameter indicates the order in which the rules are applied to the input data column. In this way, values in the input data column may be modified in sequence and processed according to clean operations specified in different rule definitions. In preferred embodiments, the processing of subsequent rule definitions on the same column will depend on any replacement values inserted in the field during the application of a previous rule definition, not the original content in the input data column. Thus, rule definitions are applied to the copy of the input data table in memory as it is being processed. The find operation with respect to a field will apply to any previously inserted replacement value inserted into the field. If this inserted replacement value matches the search criteria of the subsequent rule definition, then another replacement value will be inserted into the field.
Therule sub-parameter90 indicates the type of rule indicated in the table identified in the ruletable name sub-parameter86. Possible values for rule type could include find and replace, discretize, numeric clip or any other find and replace type operations known in the art. A find and replace locates a field in the input data column matching the find value. This find value is specified in one column of the rule table. The column including the find rule is provided in the first rule table column identified insub-parameter92. This rule table column would further identify a second column in the rule table including a replacement value. If the find operation produces a match on the find value, then the replacement value is inserted in the field having the content matching the find value. The specification of the value to find in the rule table column must match the type of data in the input data column, e.g., only numbers are allowed in numeric “find” columns, pattern strings are allowed for character columns. Further for either numeric or character data, the database null value can be used as a find or replacement value.
FIGS. 3aand3billustrate a possible format for rule tables for a find and replacement rule type.FIG. 3aillustrates a two column find and replace rule table for an input data column including character values. The find pattern is what is searched for in the input data column and the replacement text value is what is inserted into the field matching the find pattern.FIG. 3billustrates a similar two column find and replace rule table for an input column having numerical values.
If therule90 indicates a discretization type, then there is both an upper and lower bound for the find value. The find operation can specify to search for matching values in the input data column that are between the upper and lower bounds, outside of the upper and lower bounds, less than the lower bound, or greater than the upper bounds. Another column in the rule table would specify a replacement operation to perform. If the upper and lower boundaries specified character strings, then SQL rules would be used to determine whether strings in the input data column satisfy the search criteria.FIG. 4 illustrates a table format for discretization for character and numeric data types. The replacement value is inserted into every field in the input data column having a value between the upper and lower bounds. Placement of the NULL indicator in the upper or lower bounds can affect whether the find seeks all fields in the input data column greater or less than one of the bounds. The discussion below with respect toFIGS. 6band6cexplains how a NULL value in one of the find values affects the search criteria.
If therule90 indicates a numeric clip type, then the find operation finds fields in the input data column that are less than, equal to or greater than specified values.FIG. 5 shows that discretization includes an upper and lower values for both the find and replace. Any values in the input data column less than the lower bound and greater than the upper bound are replaced with the lower and upper replacement value, respectively. As with discretization, the use of the NULL value in one of the find values can affect the find operation, as discussed below with respect toFIGS. 6dand6e.
For find and replacement, discretization, and numeric clip rule types, the database NULL value can be used as the find or replacement value. The NULL value indicates the absence of information. The database NULL value is only allowed for a pattern find value or replacement value if the database input column allows for NULL values, i.e., the input data column was not defined with a NOT NULL clause. The database NULL value may also be used to determine how to consider lower and upper bound values in the discretization and numeric clip operations, as described below.
FIGS. 6a, b, c, d, andeillustrate logic implemented in the clean transform stored procedure12 (clean transform12) to clean input data columns according to theclean parameters50 includingdefinitions80 in the replacement rule definition(s)60 parameter. Theseparameters50 andsub-parameters80 comprise the transform rules16. As discussed, a user at theclient6 would specify certain clean operations to perform and the client side of theclean transform20 would generate an API function call including theclean parameters50 anddefinitions80 from information the user entered in the GUI. Control begins atblock200 with theclean transform12 receiving theclean parameters50 and a set of sub-parameters80 for each rule definition in an API function call. Theclean transform12 accesses (at block202) the input table14 from thedatabase10 indicated in the inputdata table name52. For eachrule definition80 specified in thereplacement rule definitions60 parameter, theclean transform12 executes a loop to perform operations inFIGS. 6a, b, c, beginning atblock204.
In processing each rule definition, theclean transform12 accesses (at block206) the input data column from the input table14 indicated in the inputdata column name82 parameter for the rule definition and accesses (at block208) the rule table in theserver4 indicated in the ruletable name parameter86 and input columns in the rule table indicated in the ruletable columns parameter92 for the rule definition. Theclean transform12 determines (at block210) whether the rule table sort-key column name is non-empty. If so, theclean transform12 sorts (at block210) the rules in the rule columns for the rule definition according to the order specified in the sort-key column in the rule table. As discussed, a sort-key column may be provided if a rule table includes multiple rules to apply for the rule definition. Otherwise, if the sort key-column parameter88 is empty, the rule columns are sorted in ascending order. Afterblock212 or the yes branch ofblock210, theclean transform12 determines (at block214) whether therule parameter90 indicates a find and replace clean operation.
If the clean operation for the current definition is a find and replace, theclean transform12 determines (at block216) one or more find values or patterns (if the data type is a character) from the first column in the rule table indicated in the ruletable columns parameter92. Theclean transform12 inserts (at block218) the replacement value/pattern in the column in the rule table having the same column number as the second column in therule table columns92 sub-parameter if one of the find values or patterns matches the field content. Note that if there are multiple find values/patterns in the rule table, then there is a different corresponding replacement value for each separate find/value pattern. The replacement value corresponding to the first matching find value is inserted in the field. The clean transform then returns (at block220) to block204 to process thenext rule definition80 in therule definition parameters60.
During the find and replace searching, and the searching operations for the discretization and numeric clip rule types, other sub-parameters are considered. If a rule escape character is indicated insub-parameter96, theclean transform12 searches for a character matching the character following the escape character, which may be a character that usually has specific search meaning, such as a percent (%), underscore (), comma (,) or semicolon (;). Further, if the rowclean indicator sub-parameter94 is set to YES, then any field, i.e., row, in the input data column matching the find value/pattern is not copied to the output table in the current rule definition being applied, or any further rule definitions that may apply to the field or row. If anumeric tolerance sub-parameter100 is provided, than number fields in the input data column will return a match if the field value is equal to the find value within a range specified by the numeric tolerance. Further, if the ignore whitespace indicator sub-parameter98 is YES, then white spaces are removed from the field when determining whether the field matches the find value/parameter; otherwise, white spaces are considered characters.
If the rule indicated in therule sub-parameter90 is discretization (at block240), then theclean transform12 determines (at block242) the upper bound, lower bound and replacement values from the accessed rule table. The upper, lower, and replacement columns to access from the rule table maintained in the database program8aare indicated in the first, second, and third columns in the rule table columns sub-parameter92, respectively. Theclean transform12 determines (at block244) whether the upper and lower bounds are both NULL. If so, theclean transform12 does the replacement (at block246) in every field in the current input data column. Otherwise, theclean transform12 begins a loop atblock248 to perform for every field in the input data column. In this loop, theclean transform12 executes a nested loop betweenblocks249 and274 to consider each rule in the rule table, when there are multiple rules. For each rule, theclean transform12 determines (at block250) whether the accessed field in the input data column is NULL. If so, theclean transform12 determines (at block252) whether either the upper or lower bounds are NULL. If so, theclean transform12 inserts (at block254) the replacement value in the rule table into the field; otherwise, no replacement value is inserted (at block256). Fromblocks254 or256, control transfers to block258 where theclean transform12 performs another iteration of the loop beginning atblock248 for the next field (row) in the input data column.
If the field in the input data columns is not NULL (from the NO branch at block250), then theclean transform12 determines (at block260) whether both the upper and lower bounds are NULL. If so, theclean transform12 inserts the replacement value into the field and proceeds to block258 to consider any further rows in the input data column. Otherwise, theclean transform12 determines (at block264) whether only the upper bound is NULL. If so, theclean transform12 inserts (at block266) the replacement value into the current field if the field in the input data column is greater than the lower bound minus the numeric tolerance for numeric data types where a numeric tolerance is specified. If no numeric tolerance is provided inparameter100 or the input column is a character type, then there is no consideration of numeric tolerance when performing the find operation. If only the lower bound is NULL (at block268), then theclean transform12 inserts (at block270) the replacement value into the current field if the field in the input data column is less than the upper bound plus any numeric tolerance for numeric data types where a numeric tolerance is specified. If neither the upper nor lower bound are NULL, then theclean transform12 inserts (at block272) the replacement value in the field if the field value is less than or equal to the upper bound plus any provided numeric tolerance AND less than or equal to the lower bound minus any provided numeric tolerance. After applying one rule in the rule table atblocks254,256,262,266,270 or272, theclean operation12 then considers any further rules atblock274 for the current row in the input data table. After finding a match or considering all rules in the rule table for a given field, the clean transform proceeds (at block258) to consider the next row (field) in the input data column.
After all rows in the input data column are considered fromblocks246 or258, the clean transform proceeds (at block276) to block204 to execute the next rule definition against a specified input data column.
If the rule indicated in therule parameter90 is a numeric clip (atblock300 inFIG. 6d), then the clean transform determines (at block302) the upper bound, lower bound, and replacement value in the columns of the rule table indicated in the ruletable columns parameter92. The clean transform then determines (at block302) whether both upper and lower bounds are NULL. If so, theclean transform12 skips (at block304) the input data column, and inserts no replacement values into any row. Otherwise, theclean transform12 begins a loop atblock306 to process each row in the input data column. Theclean transform12 begins a nested loop atblock307 to consider each rule in the rule table, if there are multiple rules. Theclean transform12 determines (at block308) whether the field content is NULL. If so, theclean transform12 does not insert (at block310) the replacement value for any into the field and proceeds (at block312) to consider the next row in the column until all rows are processed. If only the upper bound is null (at block314), then theclean transform12 inserts the lower replacement value in the field (at block316) if the field value is less than the lower bound minus any numeric tolerance indicated insub-parameter100. If only the lower bound is NULL (at block318), then theclean transform12 inserts the upper replacement value in the field (at block320) if the field value is greater than the upper bound plus any numeric tolerance. If neither the lower nor upper bound are NULL, then theclean transform12 inserts (at block322) the lower replacement value if the field is less than or equal to the lower bound minus any numeric tolerance OR inserts the upper replacement value if the field is greater than or equal to the upper bound plus any tolerance. After applying one rule against one field atblocks310,316,320 or322, theclean transform12 then proceeds (at block323) back to block307 to consider the next rule in the rule table, if there are further rules. After applying all rules in a rule table to one field (row), theclean transform12 proceeds (at block312) back to block36 to consider the next field (row) in the input data column.
After processing all rules in the rule table and rows in the input data column fromblock312 or block324, theclean transform12 proceeds (at block324) to block204 to consider any further rule definitions in the replacementrule definition parameter60. After thelast rule definition80 in the replacementrule definitions parameter60 is processed atblocks220,274 or324, then theclean transform12 then determines (at block326) whether the outputdata table name54 specifies an output data table18 in thedatabase10 to receive any cleaned or modified input data columns. If so, then theclean transform12 writes (at block328) the input data columns, including replaced and cleaned fields, to the specified output data table19. Otherwise, the modified and processed input data columns are written (at block330) to the input table14.
Preferred embodiments provide a command data structure to control a stored procedure program to clean columns of data in an input table thedatabase10. The rules to clean the tables may be provided in a rule table data structure stored in theserver4 that provides one or more clean rules for different columns and in different sequences. This allows a single instance of execution of the clean program to perform numerous types of operations on the columns in the input table. Preferred embodiments thus provide a flexible and programmable data structure and program to provide fine grained control of clean operations. Further, with preferred embodiments, theclient6 does not increase network traffic because the client does not transfer the tables or the rule tables to clean between thedatabase server4. Instead, theclient6 merely provides a command data structure including various parameters and rules to a storedprocedure12 that executes in theserver4 to perform the clean operations within the database program8 on theserver4. Such savings in network traffic can be significant when very large database tables, including millions or billions, of records are cleaned.
FIGS. 7 and 8a, b, c, dillustrate an example of how rule tables may be applied to clean an input data table.FIG. 7 illustrates an input data table to be cleaned including columns concerning sales information: city, indicating the sale location; product name; product category; dollar sales; and unit sales.FIGS. 8a, b, c, anddillustrate different rule tables to apply to columns in the input table inFIG. 7.FIG. 8aillustrates a find and replace rule to locate certain strings in the City column that begin and end with “%”. The first percent in the find value represents any string of zero or more characters. An escape character of “/” is indicated inparameter96 so the search criteria includes strings that start and end with the percent sign (%). The replace value is on the right. Thus,FIG. 8aillustrates a rule table including multiple rules having search criteria to apply against each field in a column, e.g., the CITY column, of the input data table.FIG. 8awould remove the abbreviations in the CITY input data column and replace them with a complete city name for the replacement value corresponding to the find value matching the field in the City column.
FIG. 8bprovides a numeric clip rule table. In the Unit Sales column, values less than 100 are considered invalid and will be rounded up to 100, and any sales number greater than 15,000 is considered NULL. Later, the rule table inFIG. 8dwill be used to flag rows with a NULL value to avoid in calculations. Thus, the rule table inFIG. 8bprocesses each field in the Unit Sales data column and replaces values less than 100 with 100 and replaces values greater than 15,000 with NULL. A numeric tolerance may be specified to apply the rule to values “close enough” to the lower or upper bound.
FIG. 8cis a discretization rule table used to replace the code values in the Product Category column of the table inFIG. 7 with meaningful descriptive terms. Code values within certain string ranges are a particular type of product. For instance, a Category code in the range from “HA” to “JZ” is a hardware product. The application of the rule table inFIG. 8cdetermines which rule has a lower and upper bound that includes the Product Category field and applies the corresponding Replace Value for the rule having the matching find bounds. The Upper Bound in the last row of the rule table inFIG. 8cincludes a NULL value. This means the replacement value is inserted in any field in the Product Category column having a string value greater than “RX.”FIG. 8cfurther provides a sort column, which would be included in the sort-keycolumn name parameter88, indicating the order in which to sort the rows in the rule table before applying the rule table to the input column
FIG. 8dis a find and replace table used to clean up the already processed data in the input table inFIG. 7. A rule definition including the rule table inFIG. 8dand having the rowclean indicator parameter94 set to YES would eliminate from further processing and from the final revised table any row in the input table (FIG. 7) having a product name of NULL. If the compress white space indicator is set in the parameters with the rule definition includingFIG. 8dapplied to the Product Name column, then any white space will be removed from the product name, such as the spaces between “Pick Axe” in the first data row of the input table inFIG. 7.
CONCLUSION This concludes the description of the preferred embodiments of the invention. The following describes some alternative embodiments for accomplishing the present invention.
The preferred embodiments may be implemented as a method, apparatus or article of manufacture using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. The term “article of manufacture” (or alternatively, “computer program product”) as used herein is intended to encompass one or more computer programs and data files accessible from one or more computer-readable devices, carriers, or media, such as a magnetic storage media, “floppy disk,” CD-ROM, a file server providing access to the programs via a network transmission line, holographic unit, etc. Of course, those skilled in the art will recognize that many modifications may be made to this configuration without departing from the scope of the present invention.
Preferred embodiments were described with respect to specific data structures, such as a rule table having columns of rules, and an arrangement of parameters to provide a vehicle for transferring commands to the clean transform stored procedure program. However, those skilled in the art will recognize that modifications may be made to the architecture of the data structures used to convey multiple clean rules and still remain within the scope of the present invention.
Preferred embodiments were described with respect to three rule types, find and replace, discretization, and numeric clip. In further embodiments, other types of rules may be provided and included in the command data structure of the preferred embodiments to perform different types of clean operations known in the art.
In preferred embodiments, the clean transform program was executed in a stored procedure type program, such as that used in the IBM DB2 database system. However, in further embodiments, different types of application programs, other than stored procedure programs, may be executed in theserver4 or even theclient6 to perform clean operations in accordance with the command data structures of the preferred embodiments.
In preferred embodiments, the input table and output table were included in a database in the server in which the clean transform program is executing. In alternative embodiments, the rule, input, and output tables may be distributed at different storage locations at different network devices.
In preferred embodiments, a client constructed the clean operation command and communicated such commands to the database server. In alternative embodiments, the clean operation command of the preferred embodiments may be executed on the machine used to construct the command.
In summary, preferred embodiments disclose a method, system, program, and data structure for performing a clean operation on an input table. The input table to clean is indicated in an input data table name. At least two one rule definition is processed to clean the input table. Each rule definition indicates a find criteria, a replacement value, and an input data column in the input table. For each rule definition, the input data column is searched for any fields that match the find criteria. The replacement value for the particular rule definition is inserted in the fields in the input data column that match the find criteria. Subsequent applications of additional rule definitions applied to the same input data column operate on replacement values inserted in the input data column during previously applied rule definitions.
The foregoing description of the preferred embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto. The above specification, examples and data provide a complete description of the manufacture and use of the composition of the invention. Since many embodiments of the invention can be made without departing from the spirit and scope of the invention, the invention resides in the claims hereinafter appended.