SYSTEM AND METHOD FOR BUILDING AND
VALIDATING A CREDIT SCORING FUNCTION
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims the benefit of and priority to U.S. Application Serial No. 13/6[illegible]2,260, entitled "System And Method For Building And Validating A Credit Scoring Function", filed September 18, 2012, which is a continuation-in-part application of U.S. Application 13/454,970, entitled "System And Method For Providing Credit To Underserved Borrowers", filed April 24, 2012, which claims priority to U.S. Provisional Application No. 61/545,496, entitled "Using Machine Learning to Guide Human Questioning in Underbanked Underwriting", filed October [illegible], 2011, which are all hereby incorporated in their entirety by reference.
TECHNICAL FIELD
[0001] This invention relates generally to the personal finance and banking field, and more particularly to the field of credit scoring methods and systems.
BACKGROUND AND SUMMARY
[0002] People use credit daily for purchases large and small. In the 1950s, credit decisions were made by bank credit officials; these officials knew the applicant, since they usually lived in the same town, and would make credit decisions based on this knowledge. This was effective, but extremely limited, since there are relatively fewer credit officials than potential borrowers. In the [illegible], the FICO score made credit far more available, effectively removing the credit officer from the process. However, the risk management function still needs to be done. Lenders, such as banks and credit card companies, use credit scores to evaluate the potential risk posed by lending money to consumers. In order to determine who is entitled to credit, and who is not, banks use credit scoring functions that purport to measure the creditworthiness of a person or entity (i.e., the likelihood that the person will pay his or her debts). Traditional credit scoring functions are based on human-built transformations comprised of a small number of variables.
[0003] Traditional functions calculate a creditworthiness score using a three-step process. First, they look at sample data for each variable (such as salary, credit use, payment history, etc.). Second, the system will bin the values of each variable by assigning a numerical score (such as 0 to 10 for payment frequency: 0 = no payment history; 1 = does not pay frequently; and 10 = perfect payment track record). Finally, after all the variables are transformed, the system will use either a fixed formula, or a compilation of formulas, or a machine learning algorithm to construct a formula, to produce a composite score.
[0004] Traditional credit scoring transformations were largely developed in the 1950s and 1960s, when computing power and access to information were very difficult to acquire. Consequently, traditional transformations are of the simplest form possible, and are limited to (a) single numeric variables for which fill-in values are easy to compute; (b) straightforward numeric interpretations of non-numeric variables; and/or (c) string variables with very few values. For example, traditional transformations work for salaries (which are numbers), dates and times (when converted into a Julian date or equivalent), addresses (when considered as latitude-longitude pairs), or even payment frequencies, when constrained to recognizable patterns (monthly, semi-monthly, weekly, bi-weekly, etc.). These transformations may even allow intermediary computations based on easily discovered relationships between fields, such as the interval between two dates or the distance between two locations.
[0005] However, traditional credit scoring transformations do not work well for groups of variables, especially when data is partially or completely missing. They do not work at all for data elements which cannot be transformed. For example, an address record for Folsom State Prison may be represented as "P.O. Box 910, Represa, CA 95671" or "300 Prison Road, Represa, CA 95671", but both refer to the same entity. Assuming a borrower's credit profile listed both addresses, a traditional credit scoring function might count the borrower as having multiple jobs, and in turn discount his/her credit score by incorrectly presuming that the borrower's employment is less stable (i.e., affecting a calculation for a predicted paycheck).
[0006] In addition, traditional credit scoring transformations are generally limited to correcting string variables (such as addresses) for misspellings or non-standard capitalization. Advanced transformations are usually made by humans. Machine learning algorithms are generally not employed, because of their limitations in cultural knowledge and understanding. For example, a human operator would analyze the borrower's employment addresses at "P.O. Box 910, Represa, CA 95671" and "Post Office Box 910, Represa, CA 95671" and be able to understand that both are the same location. This is normally managed by asking services to standardize addresses into USPS standard form. However, significant information is lost by standardizing addresses, such as whether the applicant used upper case and lower case, or just lower case.
[0007] As a consequence of the need for human quality control, traditional transformations are also limited in the amount of data which can be reasonably processed. Each transformation and filling-in operation may require a human to invest a significant amount of time to analyze one or more data fields, and then carefully manipulate the contents of the field. Such restraints limit the number of fields to an amount which can be understood by a single person in a reasonable period of time, and, as a result, there are relatively few risk models (such as a FICO score by Fair Isaac Corporation, Experian bureau scores, Pinnacle by Equifax, or Precision by TransUnion) with more than a few tens of variables (e.g., a FICO score is based on five basic metrics, including payment history, credit utilization, length of credit history, types of credit used, and recent searches for credit). None of the traditional credit scoring transformations consider hundreds of input variables, much less thousands, tens of thousands, or millions. Adding all this data enables the automated models to mimic the old-world credit officers while still retaining - and increasing - credit availability.
[0008] Accordingly, improved systems for building and validating credit scores would be desirable.
[0009] To improve upon existing systems, preferred embodiments of the present invention provide a system and method for building and validating a credit scoring function based on a creditor's target. One preferred method for building and validating such a credit scoring function can include generating a borrower dataset at a first computer in response to receipt of a borrower profile (Raw Data); formatting the borrower dataset into a plurality of variables (Transformed Data); and independently processing each of the plurality of variables using one or more algorithms (statistical, financial, machine learning, etc.) to generate a plurality of independent decision sets describing specific aspects of a borrower (Meta-Variables). As described below, the preferred method can further include feeding the Meta-Variables into statistical, financial, and other algorithms each with a different predictive "skill" (Models). Each of the Models may then "vote" their individual confidence, which then may be ensembled into a final score (Score). Other variations, features, and aspects of the system and method of the preferred embodiment are described in detail below with reference to the appended drawings.
[0010] The preferred embodiments of the present invention may also be used to provide a creditworthiness score for individuals who do not qualify under traditional credit scoring. Because certain borrowers either have an incomplete or non-existent record (based on the lack of data using traditional variables), traditional credit scoring transformations ultimately result in "non-creditworthy" scores. Thus, there are millions of individuals who do not have access to traditional credit - the so-called "underbanked" - who must survive day-to-day without such support from the financial and banking industries. By utilizing the extremely broad scope of data available from public, proprietary, and social networking data sources, as well as from the borrower himself, the present invention allows a lender to utilize new sources of information to compile risk profiles in ways traditional models could not accomplish, and in turn serve a completely new market. The present invention could be used independently (by simply generating individualized credit scores) or, in the alternative, the present invention could also be interfaced with, and used in conjunction with, a system and method for providing credit to underserved borrowers. An example of such systems and methods is described in US Patent Application No. 13/454,970, entitled "System and Method for Providing Credit to Underserved Borrowers," to Douglas Merrill et al., which is hereby incorporated by reference in its entirety ("Merrill Application").
[0011] Other systems, methods, features and advantages of the invention will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the accompanying claims.
BRIEF DESCRIPTION OF THE FIGURES
[0012] In order to better appreciate how the above-recited and other advantages and objects of the inventions are obtained, a more particular description of the embodiments briefly described above will be rendered by reference to specific embodiments thereof, which are illustrated in the accompanying drawings. It should be noted that the components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Moreover, in the figures, like reference numerals designate corresponding parts throughout the different views. However, like parts do not always have like reference numerals. Moreover, all illustrations are intended to convey concepts, where relative sizes, shapes and other detailed attributes may be illustrated schematically rather than literally or precisely.
[0013] FIGURE 1 is a schematic block diagram of a system for providing credit to underserved borrowers as found in the Merrill Application.
[0014] FIGURE 2 is a diagram of a system for building and validating a credit scoring function in accordance with a preferred embodiment of the present invention.
[0015] FIGURE 3 depicts an overall flowchart illustrating an exemplary embodiment of a method by which raw data is processed to build and validate a credit scoring function.
[0016] FIGURE 4 depicts an overall flowchart illustrating an exemplary embodiment of a preferred method for building and validating a credit scoring function.
[0017] FIGURE 5 depicts a flowchart illustrating an exemplary embodiment of a method for recognizing significant transformations.
[0018] FIGURE 6 depicts a flowchart illustrating an exemplary embodiment of a method for building and validating scoring functions based on the selected target.
[0019] FIGURE 7 is an example of the computerized screen of the personal information that may be requested by a lender from a borrower as found in the preferred embodiment of the present invention.
DEFINITIONS
[0020] The following definitions are not intended to alter the plain and ordinary meaning of the terms below but are instead intended to aid the reader in explaining the inventive concepts below:
[0021] As used herein, the term "BORROWER DEVICE" shall generally refer to a desktop computer, laptop computer, notebook computer, tablet computer, mobile device such as a smart phone or personal digital assistant, smart TV, gaming console, streaming video player, or any other suitable networking device having a web browser or stand-alone application configured to interface with and/or distribute the borrower profile to the CENTRAL COMPUTER, USER DEVICE, and/or one or more components of the preferred system 10.
[0022] As used herein, the term "USER DEVICE" shall generally refer to a desktop computer, laptop computer, notebook computer, tablet computer, mobile device such as a smart phone or personal digital assistant, smart TV, gaming console, streaming video player, or any other suitable networking device having a web browser or stand-alone application configured to interface with and/or receive any or all data to/from the CENTRAL COMPUTER, BORROWER DEVICE, and/or one or more components of the preferred system 10.
[0023] As used herein, the term "CENTRAL COMPUTER" shall generally refer to one or more sub-components or machines configured for receiving, manipulating, configuring, analyzing, synthesizing, communicating, and/or processing data associated with the borrower (including for example: a format processing unit 40, a variable processing unit 50, an ensemble module 60, a model processing unit 70, a data compiler 80, and a communications hub 90 - see Merrill Application). Any of the foregoing sub-components or machines can optionally be integrated into a single operating unit, or distributed throughout multiple hardware entities through networked or cloud-based resources. Moreover, the central computer may be configured to interface with and/or receive any or all data to/from the USER DEVICE, BORROWER DEVICE, and/or one or more components of the preferred system 10 as shown in FIGURE 1, which is described in more detail in the Merrill Application, incorporated by reference in its entirety.
[0024] As used herein, the term "PROPRIETARY DATA" shall generally refer to data acquired by payment of a fee through privately or governmentally owned data stores (including, without limitation, through feeds, databases, or files containing data). One example of proprietary data may include data produced by a credit rating agency during a so-called credit check. Another example is aggregations of publicly-available data over time or from multiple sources.
[0025] As used herein, the term "PUBLIC DATA" shall generally refer to data available for free or at a nominal cost through one or more search strings, automated crawls, or scrapes using any suitable searching, crawling, or scraping process, program, or protocol. One example of public data may include data produced by an internet search of a borrower's
[0026] As used herein, the term "SOCIAL NETWORK DATA" shall generally refer to any data related to a borrower profile and/or any blogs, posts, tweets, links, friends, likes, connections, followers, followings, pins (collectively, a borrower's social graph) on a social network. Additionally, the social network data can include any social graph information for any or all members of the borrower's social network, thereby encompassing one or more degrees of separation between the borrower profile and the data extracted from the social network data. The social network data may be available for free or at a nominal cost through direct or indirect access to one or more social networking and/or blogging websites, including for example Google+, Facebook, Twitter, LinkedIn, Pinterest, Tumblr, Blogspot, WordPress, and Myspace.
[0027] As used herein, the term "BORROWER'S DATA" shall generally refer to the borrower's data in his or her application for lending as entered into by the borrower, or on the borrower's behalf, in the BORROWER DEVICE, USER DEVICE, or CENTRAL COMPUTER. By way of example, this data may include the borrower's social security number, driver's license number, date of birth, or other information requested by a lender. An example of a lender's computer application may be seen in FIGURE 7.
[0028] As used herein, the term "RAW DATASETS" shall generally refer to BORROWER'S DATA, PROPRIETARY DATA, PUBLIC DATA, and SOCIAL NETWORK DATA, individually, collectively, or in one or more combinations. Raw datasets preferably function to accumulate, store, maintain, and/or make available biographical, financial, and/or social data relating to the borrower.
[0029] As used herein, the term "NETWORK" shall generally refer to any suitable combination of the global Internet, a wide area network (WAN), a local area network (LAN), and/or a near field network, as well as any suitable networking software, firmware, hardware, routers, modems, cables, transceivers, antennas, and the like. Some or all of the components of the preferred system 10 can access the network through wired or wireless means, and using any suitable communication protocol/s, layers, addresses, types of media, application programming interface/s, and/or supporting communications hardware, firmware, and/or software.
[0030] As used herein and in the claims, the singular forms "a," "an," and "the" include plural references unless the context clearly dictates otherwise.
[0031] Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
[0032] The following description of the preferred embodiments of the invention is not intended to limit the invention to these preferred embodiments, but rather to enable any person skilled in the art to make and use this invention. Although any methods, materials, and devices similar or equivalent to those described herein can be used in the practice or testing of embodiments, the preferred methods, materials, and devices are now described.
[0033] The present invention relates to improved methods and systems for scoring borrower credit, which includes individuals and other types of entities including, but not limited to, corporations, companies, small businesses, and trusts, and any other recognized financial entity.
System:
[0034] As shown in FIGURE 2, a preferred operating environment for building and validating a credit scoring function in accordance with a preferred embodiment can generally include a BORROWER DEVICE 12, a USER DEVICE 30, a CENTRAL COMPUTER 20, a NETWORK 4[illegible], and one or more data sources, including for example BORROWER'S DATA 13, PROPRIETARY DATA 14, PUBLIC DATA 16, and SOCIAL NETWORK DATA 18. The preferred system 10 can include at least a CENTRAL COMPUTER 20 and/or a USER DEVICE 30, which (individually or collectively) function to provide a borrower with access to credit based on a novel and unique set of metrics derived from a plurality of novel and distinct sources. In particular, the preferred system 10 functions to determine the creditworthiness of borrowers, including the underbanked, by accessing, evaluating, measuring, quantifying, and utilizing a measure of risk based on the novel and unique methodology described below as well as in the system and method identified in the Merrill Application, incorporated in its entirety by reference.
[0035] More specifically, this invention relates to the preferred methodology for building and validating a credit scoring function that takes place within the CENTRAL COMPUTER 20 and/or a USER DEVICE 30, after all RAW DATASETS are temporarily gathered or otherwise downloaded from the BORROWER DEVICE 12, CENTRAL COMPUTER 20, USER DEVICE 30, and/or one or more data sources, including for example BORROWER'S DATA 13, PROPRIETARY DATA 14, PUBLIC DATA 16, and SOCIAL NETWORK DATA 18.
Method Overview:
[0036] FIGURE 3 provides a flowchart illustrating one preferred method by which the RAW DATASETS 100 (called "Raw Data" in the figure) are processed to build and validate a credit scoring function.
[0037] In the first step, the RAW DATASETS 100 are generated in response to receipt of a borrower's profile from one or more of the following: BORROWER'S DATA 13, PROPRIETARY DATA 14, PUBLIC DATA 16, and SOCIAL NETWORK DATA 18. For example, the RAW DATASETS 100 may include classic financial data of the borrower's profile including items such as their FICO score, current salary, length of most recent employment, and the number of bankruptcies. Additionally, the RAW DATASETS 100 may include other unique aspects of the borrower, such as the number of Internet domains owned, organizations the borrower has been or currently is involved with, how many lawsuits the borrower has been named in, the number of friends the borrower has, the psychological characteristics based on his or her interests, and other non-traditional aspects of the borrower's identity and history. Other examples include:
network friend list postings
[0038] By way of example and as used throughout this application, a small sampling of the RAW DATASETS 100 for fictitious borrower Ms. A (a creditworthy applicant) and fictitious borrower Mr. B (a rejected applicant), who reside and work near Represa, California, are:
Category | Data item | Ms. A | Mr. B
Social | Applicant and [illegible] Number | One (1) registered | Four (4) registered
Effort | Applicant's [illegible] understanding lender's products and application process | Total time [illegible] during application: 45 minutes | Total time [illegible] during application: 7 minutes
Effort | Lender documents accessed (including 3 loan application forms) | 15 | 3
[0039] Second, the RAW DATASETS are transformed into a plurality of variables (transformed data 120) in their most useful form. For example, a "current income" variable could either be left in its native form or converted into a scale (0 = no income; 1 = $1 - $5,000; 2 = $5,001 - $20,000; etc.), or transformed to the percentile rank of the estimated income when compared to the DMA area where the applicant lives. Alternatively, the data for an address could be converted into latitude and longitude pairs (e.g., 300 Prison Road, Represa, CA 95671 transformed to Lat. 38.6931632; Lon. -121.1616148), and thereafter use orthodromic distances to determine the likelihood that two listed addresses are in fact the same address. If the application is submitted by web site, then browser-related behavioral measurements, such as the number of pages viewed by the applicant and the amount of time the applicant spent on the actual application pages, can also be used as numerical signals related to creditworthiness.
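By way of illustration only, the income scale and the orthodromic distance comparison described above might be sketched as follows in Python. The function names, the haversine formula as the distance method, the mean Earth radius, and the 0.5 km "same address" threshold are all assumptions made for this sketch and are not taken from the specification.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch


def orthodromic_km(lat1, lon1, lat2, lon2):
    """Great-circle (orthodromic) distance in kilometres, haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def likely_same_address(coord_a, coord_b, threshold_km=0.5):
    """Hypothetical rule: two geocoded addresses are treated as the same
    place when they lie within threshold_km of each other."""
    return orthodromic_km(*coord_a, *coord_b) <= threshold_km


def income_band(income):
    """Scale from the text: 0 = no income, 1 = $1-$5,000, 2 = $5,001-$20,000.
    Higher bands are not given in the text, so None is returned for them."""
    if income <= 0:
        return 0
    if income <= 5000:
        return 1
    if income <= 20000:
        return 2
    return None
```

In such a sketch, two address strings that geocode to nearly identical coordinates would be flagged as one location, which is the situation described earlier for the two Represa addresses.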
[0040] Thereafter, a computer (such as the CENTRAL COMPUTER 20 in FIGURE 2) shall independently process each of the plurality of variables using one or more algorithms (statistical, financial, machine learning, etc.) to generate a plurality of independent decision sets describing specific aspects of a borrower (Meta-Variables 140). Assuming 40 variables in the RAW DATASETS, it is possible to generate (40^2) = 1,600 potential comparisons of two discrete variables, (40^3) = 64,000 well-formed expressions using three variables, and (40^4) = 2,560,000 well-formed expressions using four variables, and so forth. Clearly, the number of transformed data 120 variables will grow exponentially in relation to the number of variables in the RAW DATASETS.
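The counts given above can be reproduced with a short Python sketch. It assumes that each expression is an ordered selection of variables with repetition allowed, since that is the reading under which 40 variables yield 40^k expressions of k variables; the function names are hypothetical. Under that reading the count is exponential in the expression size k.

```python
from itertools import product


def expression_count(num_variables, arity):
    """Number of ordered selections, with repetition, of `arity` variables."""
    return num_variables ** arity


def ordered_tuples(variable_names, arity):
    """Lazily enumerate the variable tuples that the count refers to."""
    return product(variable_names, repeat=arity)


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, expression_count(40, k))
```

Enumerating the tuples lazily, rather than materializing them, is one way such a search space could be walked without holding millions of candidates in memory.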
[0041] By way of example, the borrower's "current income" could be compared to the average income in Represa for others who work in the same profession. Similarly, the records of Applicant A's behavior during the application process show significant care and effort invested in the application, while the records of Applicant B's behavior during the application process show a careless and slapdash approach to credit. This could be transformed into an ordinal variable on a 0-2 scale, where 0 indicates little or no care during the application process and 2 indicates meticulous attention to detail during the application process. Applicant A would receive a high score such as 2 and Applicant B would receive a far lower one.
[0042] One purpose of meta-variables is to measure creditworthiness. However, that is not their only function. For example, meta-variables are very useful at the intermediate stage of constructing a credit scoring function. There are three broad reasons that it is a good idea to build intermediate meta-variables when constructing a scoring function. First, the effort required to select the parameters that define a scoring function grows much faster than the number of parameters does. For a regression model, for instance, the amount of time to select n parameters grows as the cube of n. This means that the amount of computation required to directly estimate more than a few hundred parameters is impractical. By contrast, if those parameters are covered by a smaller collection of meta-variables, the amount of time required to select the parameters is much smaller. Second, the smaller number of parameters tends to make the behavior of the final scoring function more reliable: as a rule, optimization systems with more degrees of freedom (parameters) require more information about the world in the process of parametric selection than do models with fewer degrees of freedom. Using meta-variables reduces the number of parameters upon which the model depends. Third, and finally, meta-variables are reusable - if a meta-variable provides useful information to one scoring function, it will often provide useful information to other scoring functions, even if the risks being evaluated by those others are only tangentially related to the one for which the meta-variable was originally defined.
[0043] Meta-variables may also be used to perform a "veracity check" of the borrower. For example, Mr. B in the above example would not pass the "veracity check" since his reported income is 50% more than other individuals who work in the same profession in the same geographic area. Similarly, Ms. A would get a score of 2 on the "careful customer" test, which would usually be a signal indicating creditworthiness, in contrast to Mr. B, who would get a 0 on the same "careful customer" check, which would usually be a signal indicating less creditworthiness. Finally, Ms. A would typically get a high score on a "personal stability" scale, having been consistently reachable at a small number of addresses or phone numbers, where Mr. B would typically get a lower score on the same scale.
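A minimal Python sketch of the "veracity check" and "careful customer" meta-variables is given below for illustration. The 25% income tolerance and the cut-offs of 30 minutes and 10 documents are hypothetical values chosen for the sketch; the specification does not state the thresholds actually used.

```python
def veracity_check(reported_income, peer_average_income, tolerance=0.25):
    """Pass when reported income is at most `tolerance` (as a fraction)
    above the peer average. The 25% default is an assumed value."""
    if peer_average_income <= 0:
        raise ValueError("peer_average_income must be positive")
    excess = (reported_income - peer_average_income) / peer_average_income
    return excess <= tolerance


def careful_customer_score(minutes_on_application, documents_accessed):
    """Ordinal 0-2 score: 0 = little or no care, 2 = meticulous attention.
    Both cut-offs are hypothetical."""
    score = 0
    if minutes_on_application >= 30:
        score += 1
    if documents_accessed >= 10:
        score += 1
    return score
```

With these assumed cut-offs, an applicant who spends 45 minutes and opens 15 documents scores 2, while one who spends 7 minutes and opens 3 documents scores 0, and an income reported at 50% above the peer average fails the check.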
[0044] Moreover, statistical analyses of meta-variables are instructive as to which "signals" are to be measured, and what weight is to be assigned to each. For example, consistency of residence might generate a signal, while plurality of addresses might generate no signal. The preferred embodiments of the present invention are likewise instructive as to that determination. Indeed, constructing meta-variables may not be a fully automated process, but rather a heuristic one, calling for expert skill. In general, however, the process of constructing a meta-variable proceeds as outlined next. (This document restricts its examples to the construction of meta-variables related to loan risk assessment, but the methodology is more generally applicable.) First, a data analyst identifies a class of applications that have some common property; among loan applications, this might be a set of applications which have higher or lower risk than average. The putative "personal stability" and "careful customer" examples above could easily be recognized: an analyst might notice that people who move very rarely are better credit risks and that people who move frequently are poorer credit risks. This class can be identified by a wide collection of techniques, ranging from manual examination of applications and outcomes to find features which "split risk" to complex statistical techniques in which clustering analysis is used on applications which were predicted incorrectly by an established scoring procedure to find "predictive subsets".
[0045] The purpose of a meta-variable is to create a real-valued score which separates members of these classes from non-members. This is typically performed by using a basic machine learning process to assemble one or more relatively simple expressions which "separate the classes". Such an expression might be the output of a linear regression across a small constellation of measured signals, possibly including already-known meta-variables, or a small classification or regression tree applied to a similar constellation of signals. The critical features that make one of these meta-variables something other than a true scoring function are (1) prizing simplicity and stability over accuracy - a meta-variable doesn't need to be always right by itself, but must instead be a reliable signal which can be depended upon even if the environment changes; and (2) aiming to provide correlative signals related to a portion of the scoring problem instead of trying to directly provide a final value.
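As a hedged illustration of a "relatively simple expression" of this kind, the Python sketch below fits an ordinary least-squares line to a single signal and uses the fitted line as a real-valued meta-variable. The single-signal restriction, the helper names, and the toy data (number of address changes against membership in a "stable" class) are all invented for the example and do not come from the specification.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("signal has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_metavariable(xs, ys):
    """Return a function mapping a signal value to a real-valued score."""
    slope, intercept = fit_line(xs, ys)
    return lambda x: intercept + slope * x


# Toy, purely illustrative data: address changes vs. class membership (1/0).
moves = [0, 1, 1, 2, 4, 5, 6]
in_stable_class = [1, 1, 1, 1, 0, 0, 0]
personal_stability = make_metavariable(moves, in_stable_class)
```

The resulting score is deliberately crude: it only needs to rank rarely-moving applicants above frequently-moving ones, in keeping with the emphasis on simplicity and stability over accuracy.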
[0046] A single class of documents or applications can easily lead to several meta-variables, each of which measures a "different" aspect of the class. Similarly, a single document can serve as an exemplar in multiple classes; in fact, by so serving, such a document provides direction about how meta-variables should be assembled into a final scoring function.
[0047] In the preferred method, the fourth step includes feeding the Meta-Variables into statistical, financial, and other algorithms each with a different predictive "skill" (Models 160). By way of example, a predicted payback model may easily add simple meta-variables such as the ratio between the requested "loan value" to "current income," or it may take the form of complex algorithms such as a borrower's social or financial volatility indices. For instance, one can use traditional machine learning techniques, such as regression models, classification trees, neural networks, or support vector machines to build scoring systems on the basis of the past performance data, producing a variety of complex algorithms for quantifying aggregate risk.
[0048] Finally, each of the Models may then "vote" their individual importance, which then may be assembled into a final score (Score 180). There are many ways to assemble scores using machine learning or statistical algorithms, but, for clarity, we provide a simple example. In this trivial example, the score provided by each model could be transformed onto a percentile scale, and the median value of all the assigned scores could be computed. For instance, we could use a group of models, one ("Model I") based on a random forest of classification trees, another ("Model II") based on a logistic regression, and a third ("Model III") based on a neural network trained with back-propagation, and aggregate their results by averaging. This is complicated by the fact that the different models naturally return values on very different ranges, and so it is preferable to pre-normalize their scores before averaging them.
[0049] For clarity, assume that Model I returns 0.76 for Ms. A, Model II returns 0.023, and Model III returns 0.95. Assume further that these normalize to 83/100, 95/100, and 80/100, respectively. Then the aggregate score for Ms. A would be the average of these values, or 86/100. For contrast, assume that Model I returns 0.50 for Mr. B, Model II returns 0.006, and Model III returns 0.80, and that these normalize to 55/100, 48/100, and 62/100, respectively. In that case, the final score for Mr. B would be 55/100, the average of the three values. If one decided to grant a loan to an applicant only if their aggregate score was at least 80, then Ms. A would be offered a loan, and Mr. B would be denied a loan.
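The arithmetic of this example can be checked with a brief Python sketch. The normalized values are those given above; the percentile-rank helper, the function names, and the reading of the loan cut-off as 80 are assumptions of the sketch, and the reference population needed to normalize a raw model output is not specified in the text.

```python
from statistics import mean, median


def percentile_rank(value, reference_values):
    """Percentage of reference outputs at or below `value` (one assumed way
    of placing a raw model output on a 0-100 percentile scale)."""
    at_or_below = sum(1 for r in reference_values if r <= value)
    return 100.0 * at_or_below / len(reference_values)


def aggregate(normalized_scores, how="mean"):
    """Combine per-model scores that are already on a common 0-100 scale."""
    if how == "mean":
        return mean(normalized_scores)
    if how == "median":
        return median(normalized_scores)
    raise ValueError("how must be 'mean' or 'median'")


def offer_loan(score, cutoff=80):
    """Decision rule from the example; the cutoff of 80 is assumed here."""
    return score >= cutoff


ms_a = aggregate([83, 95, 80])  # normalized Model I, II, III scores for Ms. A
mr_b = aggregate([55, 48, 62])  # normalized Model I, II, III scores for Mr. B
```

Averaging and taking the median are both shown because the text mentions each; for these six values either choice leads to the same loan decisions.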
[0050] As shown in the overview in FIGURE 3, in the preferred method, data contained in the RAW DATASETS 100 is gathered, cleansed, transformed into its most useful form, combined into meta-variables defining specific aspects of the borrower, fed into different models, and finally assembled into a score for a final creditworthiness decision. The following topics will be addressed in greater detail below: how the preferred method examines the broad categories of transformations which are available, how to select those which will be useful, how to enumerate computational strategies for handling the resulting flood of information, and how to point out the targets which are feasibly useful due to the greater amount of computation that may be performed. The training and validation process for risk measuring functions based on these inputs and targets follows.
Detailed Method:
[0051] As shown in FIGURE 4, the preferred method for building and validating a credit scoring function involves the following steps: (a) recognizing significant transformations 200; (b) choosing an appropriate target for a scoring function 300; and (c) building and validating scoring functions based on the selected target 400.
[0052] As shown in FIGURE 5, the preferred method for recognizing significant transformations 200 commences with feeding the RAW DATASETS 100 into the following transformation processes: (a) an automatic search for continuous transformations 220; (b) straightforward functional transformations 240; and (c) complex functional transformations 260, which likely results in the creation of new transformed variables 120 and/or new meta variables 140.
[0053] The automatic search for continuous transformations 220 includes the application of standard variable interpretation methods, such as: (a) factorization for string variables with relatively few distinct values, followed by translation of those terms into indicator categories when fill-in is necessary; (b) conversion to doubles, for variables which may represent Boolean terms; (c) translation of dates into offsets relative to one or more base time stamps; and (d) translation of addresses or other geo-location data into a standard form, such as latitude-longitude representation. The application of the automatic search for continuous transformations 220 usually results in the creation of transformed variables 120 and/or meta variables 140. However, if the automatic search for continuous transformations 220 determines that one or more of the variables in the RAW DATASETS 100 does not require manipulation, the data may not be transformed, and instead be passed through in its native format. For example, one can view the standard quartet of payment patterns (weekly, bi-weekly, semi-monthly and monthly) as a factor variable with four levels, or as a set of four binary variables of which one is one and the other three are zero. Either of these interpretations is a standard, mechanically implementable example of this kind of transformation.
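By way of illustration only, the first three interpretation methods above (factorization, conversion of Booleans to doubles, and translation of dates into offsets) may be sketched in Python as follows. All function names are hypothetical, and the choice of 1.0 for a true value is an assumption of this sketch; other conventions are equally possible.

```python
from datetime import date

PAYMENT_LEVELS = ["paid weekly", "paid biweekly", "paid semimonthly", "paid monthly"]

def to_factor(value, levels=PAYMENT_LEVELS):
    """Map a string to an integer level from 1 to len(levels)."""
    return levels.index(value) + 1

def to_indicators(value, levels=PAYMENT_LEVELS):
    """Map a string to a tuple of binary indicators, exactly one of which is 1."""
    return tuple(1 if level == value else 0 for level in levels)

def bool_to_double(flag):
    """Represent a Boolean as a double (1.0 for True is an assumed convention)."""
    return 1.0 if flag else 0.0

def date_offset(d, base=date(1900, 1, 1)):
    """Number of days elapsed between a base time stamp and the date d."""
    return (d - base).days

print(to_factor("paid semimonthly"))       # 3
print(to_indicators("paid biweekly"))      # (0, 1, 0, 0)
print(date_offset(date(1960, 10, 18)))     # 22205
```

Whether the integer levels or the indicator tuples are preferable depends on the downstream model: integer levels impose an ordering, whereas indicators do not.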
[0054] For instance, a variable that can assume the values "paid weekly", "paid biweekly", "paid semimonthly" or "paid monthly" could be transformed into four integral values from 1 to 4, or into four quadruples, (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), and (0, 0, 0, 1), respectively, depending on how the values would be used later on. The values "True" and "False" can be transformed into 0.0 and 1.0. Dates can be transformed to date offsets (e.g., the date October 18, 1960 could be represented as "Day 22205 since January 1, 1900"). Finally, the address 300 Prison Road, Represa, CA 95671 can be converted to geographical coordinates: 38.6[?]31° N, 121.1617° W, which can be determined to be 2353.62 miles from 38.8977° N, 77.0366° W (the geographical coordinates of 1600 Pennsylvania Avenue, Washington, DC). Given the distance, a computer could conclude, automatically, that someone residing at the first address was very unlikely to work at the second. (A human who saw these two addresses would know that someone who resides at 300 Prison Road is an inmate at California's oldest maximum-security prison, and would be unlikely to work at the White House. Computers don't have the cultural knowledge necessary to draw that conclusion.)
[0055] The resulting transformed variables 120 and/or meta variables 140 created by the automatic search for continuous transformations 220 are then fed into straightforward functional transformations 240, examples of which include: (a) translation of singletons or small groups into outcome-related metrics, such as the inferred probability of success or the expected value of some outcome variable (e.g., expected payoff of a single loan given a particular value of the variable); and (b) simple functional transformations of a variable (e.g., if a single field contains the count of events of a particular type, then that field will often follow a Poisson distribution. If so, then the square root of that field will closely follow a Gaussian distribution with a known mean and variance.). Moreover, the straightforward functional transformations 240 can employ other statistical algorithms as predictors, including for example a Mahalanobis distance measure (such as a traditional Euclidean distance measure, a high-order distance measure, a Hamming distance measure), a non-normally distributed distance measure, and/or a Cosine transform. The application of straightforward functional transformations 240 usually results in the creation of additional transformed variables 120 and/or meta variables 140. However, if the straightforward functional transformations 240 determine that one or more of the variables in the RAW DATASETS 100 does not require manipulation, the data may not be transformed, and instead be passed through in its native format.
[0056] For instance, consider the distance example given before. One could imagine transforming that distance into a measure of the probability that someone with a given distance between home and work would pay off a loan. Presumably, that probability would be lower for someone who lived and worked at the same location, would rise for a while, and would then tend to fall. In the intermediary step of performing a straightforward functional transformation 240, the preferred embodiment of the present invention would look at all the address data for the borrower and determine whether the borrower is indeed likely to live and work within a commutable distance, and verify the data set of addresses to work with.
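By way of illustration only, the home-to-work distance underlying this example may be computed with the haversine great-circle formula, sketched in Python below. The Earth radius constant, the rounded home coordinates, the function names, and the 100-mile commutability cutoff are all assumptions of this sketch and are not values taken from the described system.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def is_commutable(home, work, max_miles=100.0):
    """Hypothetical rule: home and work are commutable if within max_miles."""
    return haversine_miles(*home, *work) <= max_miles

home = (38.69, -121.16)     # rounded, assumed coordinates near Represa, CA
work = (38.8977, -77.0366)  # 1600 Pennsylvania Avenue, Washington, DC

print(round(haversine_miles(*home, *work)), is_commutable(home, work))
```

A fixed cutoff is the crudest possible choice; the rise-then-fall relationship between distance and repayment probability described above would instead be estimated from past loan outcomes.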
[0057] Finally, the resulting transformed variables 120 and/or meta variables 140 created by either the automatic search for continuous transformations 220 or the straightforward functional transformations 240 are then fed into complex functional transformations 260, examples of which include: (a) transformations of singletons or small groups using carefully selected and/or constructed functions; (b) distances between pairs of items (i.e., the absolute value of a difference for numerical fields, the Euclidean or taxi-cab distance for points in space, or even a string edit distance for textual fields (the last of which is of great value when dealing with user input, in order to differentiate between errors and fraud)); (c) ratios of items (e.g., the ratio of debt service load to household disposable income); (d) other geometric transformations (e.g., the area of a k-simplex of suitable clusters of measures, a generalization of distance, and/or other complex measures of stability as a function of address can be computed); and (e) custom-constructed functional transformations of data. The application of complex functional transformations 260 usually results in the creation of additional transformed variables 120 and/or meta variables 140. However, if the complex functional transformations 260 determine that one or more of the variables in the RAW DATASETS 100 does not require manipulation, the data may not be transformed, and instead be passed through in its native format.
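By way of illustration only, the pairwise distances of item (b) and the ratio of item (c) above may be sketched in Python as follows. The function names are hypothetical, and the Levenshtein distance is used here as one common choice of string edit distance; the source does not say which edit distance is intended.

```python
def abs_difference(a, b):
    """Distance between two numerical fields."""
    return abs(a - b)

def euclidean(p, q):
    """Straight-line distance between two points in space."""
    return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

def taxicab(p, q):
    """Taxi-cab (Manhattan) distance between two points in space."""
    return sum(abs(x - y) for x, y in zip(p, q))

def edit_distance(s, t):
    """Levenshtein distance: fewest single-character insertions, deletions
    or substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def ratio(numerator, denominator):
    """Ratio of two items, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

print(edit_distance("123 Main Street", "123 Main Stret"))  # 1
print(ratio(500, 2000))                                    # 0.25
```

A small edit distance between two versions of a user-entered field is more suggestive of a typing error, while a large one may warrant closer review.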
[0058] Again, referring to the example two paragraphs above, wherein meta-variables could be used in transforming that distance into a measure of the probability that someone with a given distance between home and work would pay off a loan, the final intermediary step is the complex functional transformations 260 to determine the employment stability of the borrower. To the extent that the number of places someone has lived in a given period tends to obey a Poisson distribution with mean proportional to the number of jobs that person has held, transforming the pair of items consisting of the number of recent jobs and the number of recent addresses by taking the square root of both turns them into a set of pairs which are related by a linear relationship plus a univariate Normal distribution with variance ¼. This, in turn, allows us to easily distinguish people who've "just had a lot of jobs" from people who've had "more addresses than one would expect given the number of jobs they've held."
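By way of illustration only, the variance-stabilizing effect of the square root on Poisson counts, and its use to flag an unexpectedly large number of addresses, may be sketched in Python as follows. The sampler, the function names, and the proportionality constant addresses_per_job are assumptions of this sketch; in practice the constant would be estimated from data.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(0)
for lam in (20, 50):
    roots = [math.sqrt(poisson_sample(lam, rng)) for _ in range(20000)]
    # The variance of the square roots stays roughly 1/4 regardless of lam.
    print(lam, round(variance(roots), 3))

def address_excess(jobs, addresses, addresses_per_job=1.0):
    """Standardized surplus of addresses over what the job count predicts.

    Assumes addresses ~ Poisson(addresses_per_job * jobs), so that the
    square-root residual has a standard deviation of about 0.5.
    """
    return (math.sqrt(addresses) - math.sqrt(addresses_per_job * jobs)) / 0.5
```

Under these assumptions a large positive value of address_excess marks a borrower with more addresses than the number of jobs would explain, while a value near zero marks one who has simply had many jobs.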
[0059] Creating custom-constructed functional transformations of data is closely related to large data analysis. Depending on the size of the RAW DATASETS 100, the number of well-formed expressions (i.e., transformed variables 120 and/or meta variables 140) defining a function of a single variable may be extremely large, while the number of well-formed expressions defining a function of several variables grows exponentially. For example, if there are 40 variables in the RAW DATASETS 100, there are (40²) = 1,600 potential differences, (40³) = 64,000 well-formed expressions using three variables in a "ratio of a single variable to the difference of two others," and (40⁴) = 2,560,000 well-formed expressions of the form "ratio of the difference between two variables to the difference between two, potentially different, variables." With a larger set of variables, the growth is much faster. Searching such a space is itself a difficult optimization problem, both because of the size of the space and, more importantly, because most functions are not relevant to determining creditworthiness.
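By way of illustration only, the counts above follow from filling each variable slot of an expression template independently from the 40 available variables, as the following Python sketch shows. The function name count_expressions is hypothetical.

```python
from itertools import product

def count_expressions(n_variables, slots):
    """Ways to fill `slots` ordered positions from n_variables, repeats allowed."""
    return n_variables ** slots

for slots in (2, 3, 4):
    print(slots, count_expressions(40, slots))
# 2 1600
# 3 64000
# 4 2560000

# Cross-check the formula by explicit enumeration on a small case.
assert sum(1 for _ in product(range(5), repeat=3)) == count_expressions(5, 3)
```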
[0060] Notwithstanding, there are a number of preferred methods for automatically searching such a space, including without limitation: brute force; simple hill-climbing (in which a computer starts with a random example function and incrementally modifies it to build a "better function"); simulated annealing, a modification of hill-climbing that is guaranteed to always find the best possible triple, given time; general methods recognized in set theory; or other discrete search methods.
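By way of illustration only, simple hill-climbing over triples of variable indices may be sketched in Python as follows. The representation of a candidate function as a triple of indices, the function names, and in particular toy_score are assumptions of this sketch; toy_score is an artificial objective with a single peak and is not a measure of credit risk.

```python
import random

def neighbors(triple, n_variables):
    """All triples that differ from `triple` in exactly one position."""
    for pos in range(3):
        for v in range(n_variables):
            if v != triple[pos]:
                yield triple[:pos] + (v,) + triple[pos + 1:]

def hill_climb(score, n_variables, rng, max_steps=1000):
    """Start from a random triple and keep moving to the best-scoring
    neighbor until no neighbor improves on the current triple."""
    current = tuple(rng.randrange(n_variables) for _ in range(3))
    for _ in range(max_steps):
        best = max(neighbors(current, n_variables), key=score)
        if score(best) <= score(current):
            break  # local optimum reached
        current = best
    return current

def toy_score(triple):
    """Artificial objective that peaks at the triple (7, 2, 5)."""
    return -sum(abs(a - b) for a, b in zip(triple, (7, 2, 5)))

print(hill_climb(toy_score, n_variables=40, rng=random.Random(0)))  # (7, 2, 5)
```

On an objective with several peaks, this procedure can stop at a local optimum, which is the weakness that simulated annealing is intended to address.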
[0061] Still, these methods may not predefine what a "better transformation" is, or how to measure how much better one transformation is than another. Thus, implementing such a search generally calls for both the definition of "better" for the purposes of risk evaluation and the selection of a computational architecture within which such a search can be performed. This problem is more appropriately referred to as "choosing the appropriate target for a scoring function."
[0062] Referring back to FIGURE 4, once the final set of meta variables 140 is created, as described above, they are then run through a process of choosing an appropriate target for a scoring function 300 by which risk is measured. The preferred method of selection may be accomplished by a machine learning algorithm to select one or more meta variables 140 which are deemed "better" or the "best" predictors of risk through logistic regression, polynomial regression, or a variety of other general and robust optimization schemes. Traditionally, the models have targeted "default rate", thus simply predicting the probability of future loan default based on the fraction of loans which defaulted over time. However, given the robust computational power of most modern computers, new model predictors may be preferable in evaluating borrower risk. For example, one could attempt to predict the interval between the time of a missed payment and the time that a loan is "cured" by the borrower making the delayed payment. However, the results produced by this model are not bounded, and can be quite ill-behaved. But, by including smoothing and regularization terms in the objective function being optimized, scores may be fitted tightly, resulting in a reliable risk function that generalizes well to new loans.
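By way of illustration only, the effect of a regularization term in the objective function may be sketched in Python with a one-feature ridge (L2-penalized) fit. The function name, the single-feature form, and the sample data are assumptions of this sketch; smoothing terms are not shown, and the source does not state which regularizer is used.

```python
def fit_ridge_1d(xs, ys, penalty):
    """Return the w minimizing sum((y - w*x)**2) + penalty * w**2.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Hypothetical data: a feature value and the days taken to cure a missed
# payment, with one extreme, ill-behaved observation.
feature = [1, 2, 3, 4]
days_to_cure = [3, 5, 40, 9]

for penalty in (0, 10, 30):
    print(penalty, round(fit_ridge_1d(feature, days_to_cure, penalty), 3))
```

Increasing the penalty shrinks the fitted coefficient toward zero, which limits how strongly a few extreme observations of an unbounded target can drive the scoring function.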
[0063] Once a target model (or models) to predict risk has been selected (e.g., the models 160 as shown in FIGURE 3), the final step is determining what part of the scoring function should be optimized, and how (the method of "building and validating a scoring function based on the selected target" 400 as shown in FIGURE 4).
[0064] As further shown in FIGURE 6, the preferred method for building and validating a scoring function 400 includes training a scoring function 420 and feature selection 440.
[0065] Given a set of thousands of past loans, their outcomes, and a set of features as described above, one could, in principle, use something as simple as linear regression to use any set of numeric features arising from the previous transformations to predict outcomes. One could then analyze the resulting model using standard statistical procedures to find a submodel that is not only accurate, but also very stable. This model could then be used to predict performance on new loans, allowing one to use this function to decide whether to grant loans to them.
[0066] The preferred method of training a scoring function 420 is by using a statistical or machine learning algorithm. These algorithms often encounter problems with generalization: the more closely a scoring function can fit the data used to "train" it, the less well it will do on data upon which it wasn't trained. While there exist a number of methods of solving the "generalization" problem, three are preferable: (a) penalty terms: by penalizing the scoring function for being too unstable, the result forces the selected function to be more stable off the trained dataset; (b) aggregation: by building a scoring function from the average of several simpler scoring functions, the result is a better tradeoff between flexibility and predictability; and (c) test set reservation: by reserving a portion of the training data and using it only to evaluate the scoring function, one can estimate the performance on untrained data by measuring performance on that reserved set which is, by virtue of having been withheld, untrained data. An alternative method for resolving the "generalization" problem may be yielded by using more subtle techniques, such as cross-validation, bootstrap aggregation (bagging), and similar methods, to make better use of the available training data.
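By way of illustration only, aggregation and test set reservation may be combined as in the following Python sketch, which averages simple line fits over bootstrap resamples and measures error on a reserved portion of the data. The function names, the choice of a straight-line base model, and the synthetic data are assumptions of this sketch.

```python
import random

def fit_line(points):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def bagged_predictor(points, n_models, rng):
    """Average the predictions of simple models fit to bootstrap resamples."""
    models = [fit_line([rng.choice(points) for _ in points]) for _ in range(n_models)]
    def predict(x):
        return sum(a + b * x for a, b in models) / len(models)
    return predict

def reserved_set_mse(points, reserve_fraction, rng):
    """Train on part of the data; report mean squared error on the rest."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    n_reserved = max(1, round(len(shuffled) * reserve_fraction))
    train, reserved = shuffled[:-n_reserved], shuffled[-n_reserved:]
    predict = bagged_predictor(train, n_models=25, rng=rng)
    return sum((y - predict(x)) ** 2 for x, y in reserved) / len(reserved)

rng = random.Random(0)
# Hypothetical synthetic data: outcome = 2 * feature + 1 plus unit Gaussian noise.
loans = [(x, 2 * x + 1 + rng.gauss(0, 1)) for x in range(50)]
print(round(reserved_set_mse(loans, 0.2, rng), 2))
```

Because the reserved points play no part in fitting, the reported error estimates performance on loans the scoring function has not seen.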
[0067] For instance, given a set of thousands of past loans, one could train up a model on all of these, and try to use that model as a scoring function in the future. Alternatively, one can split this set up into several pieces and train only on some of them. One can then evaluate the performance of the model on some or all of the other portions of the training set, and by this means estimate what performance will be on novel loan applications. By selectively retaining or rejecting signals, one can adjust the behavior of the scoring function to maximize this generalization.
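By way of illustration only, splitting the past loans into several pieces and evaluating on the pieces not used for training (k-fold cross-validation) may be sketched in Python as follows. The function names, the trivial mean-outcome model, and the sample outcomes are assumptions of this sketch.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size

def cross_validated_error(items, k, fit, error):
    """Average held-out error of models trained on the other k-1 folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(items), k):
        model = fit([items[i] for i in train_idx])
        scores.append(error(model, [items[i] for i in test_idx]))
    return sum(scores) / len(scores)

def fit_mean(outcomes):
    """Trivial model: predict the mean outcome of the training loans."""
    return sum(outcomes) / len(outcomes)

def mse(prediction, outcomes):
    """Mean squared error of a constant prediction."""
    return sum((y - prediction) ** 2 for y in outcomes) / len(outcomes)

# Hypothetical fraction of principal repaid on ten past loans.
repaid = [0.9, 1.0, 0.0, 1.0, 0.8, 1.0, 0.2, 1.0, 1.0, 0.7]
print(round(cross_validated_error(repaid, 5, fit_mean, mse), 3))
```

Every loan is held out exactly once, so all of the available data contributes to the estimate of performance on novel applications.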
[0068] As shown in FIGURE 6, the second challenge that arises is determining which variables in the RAW DATASETS 100, transformed data 120, and meta variables 140 should be selected for training a scoring function 420 (the so-called "feature selection" 440 problem). Amongst a number of methods, two non-mutually exclusive methods are preferable: (a) per feature information measurement; and (b) two level optimization.
[0069] Per feature information measurement may include one or more fast but crude training methods (such as Breiman's "Random Forest") applied to a large set of variables. Thereafter, a preferred method may include performing the equivalent of an ANOVA on the resulting scoring function to extract those variables which provide the most information, and thereafter restrict the scope of the final scoring function to only use those "most important" variables.
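By way of illustration only, ranking features by a per feature information measure and keeping the most informative ones may be sketched in Python as follows. The absolute Pearson correlation is substituted here as a much cruder stand-in for the Random Forest training and ANOVA-style analysis described above; the function names, feature names, and data are all assumptions of this sketch.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, or 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx and syy else 0.0

def top_features(columns, outcome, keep):
    """Rank named columns by |correlation| with the outcome; keep the best."""
    ranked = sorted(columns,
                    key=lambda name: abs(pearson(columns[name], outcome)),
                    reverse=True)
    return ranked[:keep]

# Hypothetical data for eight past loans (1 = repaid, 0 = defaulted).
outcome = [1, 1, 0, 1, 0, 0, 1, 1]
columns = {
    "income": [50, 60, 30, 55, 35, 20, 65, 70],
    "noise": [3, 1, 2, 3, 1, 2, 2, 2],
    "late_payments": [0, 1, 5, 0, 4, 6, 1, 0],
}
print(top_features(columns, outcome, keep=2))  # ['late_payments', 'income']
```

A univariate measure such as this one ignores interactions between features, which is one reason a tree-based method may be preferred for the crude first pass.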
[0070] Two level optimization may include the discrete search methods listed above or Holland's Genetic Algorithms. Such functions serve to combine the training and feature selection processes and perform them simultaneously. For example, a Genetic Algorithms implementation would use chromosomes which represented feature sets and would evolve those feature sets to get the best possible generalization on a reserved testing set. As such, the result may permit the use of arbitrarily complicated features while controlling for variability.
[0071] All of the above described methods for the preferred method for building and validating a scoring function 400 may utilize significant processing power. In order to reduce processing time, these methods may be decomposed into layers of "embarrassingly parallel tasks," which have no interdependence among or between themselves. For example, the scoring of each individual model in the population of a Genetic Algorithms feature selection process is independent of all the others, and thus may run more efficiently on separate machines. Likewise the gathering of selection results may also be assembled on a separate computer to build the next generation of models.
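By way of illustration only, a genetic search over feature sets scored on a reserved testing set may be sketched in Python as follows. The nearest-centroid classifier, the truncation selection and single-point crossover scheme, all parameter values, and the synthetic data (in which only the first feature is informative) are assumptions of this sketch and are not prescribed by the described method.

```python
import random

def centroid_accuracy(mask, train, test):
    """Fitness: accuracy on the reserved test set of a nearest-centroid
    classifier restricted to the features switched on in `mask`."""
    cols = [i for i, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    def predict(x):
        return min((0, 1), key=lambda lab: sum(
            (x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])))
    return sum(predict(x) == y for x, y in test) / len(test)

def evolve(n_features, fitness, rng, pop_size=20, generations=15, mutation=0.1):
    """Evolve bit-mask chromosomes; each bit says whether a feature is used."""
    pop = [tuple(rng.randint(0, 1) for _ in range(n_features))
           for _ in range(pop_size)]
    for _ in range(generations):
        # Each fitness evaluation is independent of the others, so this map
        # could be distributed across separate workers or machines.
        scored = sorted(zip(map(fitness, pop), pop), reverse=True)
        parents = [chrom for _, chrom in scored[: pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # single-point crossover
            child = a[:cut] + b[cut:]
            children.append(tuple(bit ^ (rng.random() < mutation) for bit in child))
        pop = parents + children
    return max(pop, key=fitness)

rng = random.Random(0)

def make_rows(n):
    """Synthetic rows: feature 0 tracks the label, features 1-4 are noise."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [3.0 * y + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)]
        rows.append((x, y))
    return rows

train, test = make_rows(200), make_rows(100)
best = evolve(5, lambda m: centroid_accuracy(m, train, test), rng)
print(best, centroid_accuracy(best, train, test))
```

Because the reserved set guides the search, a further untouched set would be needed to obtain an unbiased estimate of the final model's performance.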
[0072] Any of the above-described processes and methods may be implemented by any now or hereafter known computing device. For example, the methods may be implemented in such a device via computer-readable instructions embodied in a computer-readable medium such as a computer memory, computer storage device or carrier signal.
[0073] The preceding described embodiments of the invention are provided as illustrations and descriptions. They are not intended to limit the invention to the precise form described. In particular, it is contemplated that functional implementation of the invention described herein may be implemented equivalently in hardware, software, firmware, and/or other available functional components or building blocks, and that networks may be wired, wireless, or a combination of wired and wireless. Other variations and embodiments are possible in light of the above teachings, and it is thus intended that the scope of the invention not be limited by this Detailed Description, but rather by the Claims following.