aildnont/HIFIS-model

Machine learning models for prediction of chronic homelessness using the HIFIS application.
The purpose of this project is to deliver a machine learning solution to assist in identifying individuals at risk of chronic homelessness. A model was built for the Homeless Prevention division of the City of London, Ontario, Canada. This work was led by the Municipal Artificial Intelligence Applications Lab out of the Information Technology Services division. For more information on the results of the London project, review our pre-print article. This repository contains the code used to train a neural network model to classify clients in the city's Homeless Individuals and Families Information System (HIFIS) database as either at risk or not at risk of chronic homelessness within a specified predictive horizon. In an effort to build AI ethically and anticipate forthcoming federal and provincial regulation of automated decision-making systems, this repository applies interpretability and bias-reducing methods to explain the model's predictions. The model also employs functionality to enable ease of client record removal, entire feature removal and audit trails to facilitate appeals and other data governance processes. This repository is intended to serve as a turnkey template for other municipalities using the HIFIS application and HIFIS database schema who wish to explore the application of this model in their own locales. This model was built using data from HIFIS 4.0.57.30.
- Getting Started
- Use Cases
  i) Train a model and visualize results
  ii) Train multiple models and save the best one
  iii) Prediction horizon search experiment
  iv) LIME explanations
  v) Random hyperparameter search
  vi) Batch predictions from raw data
  vii) Cross validation
  viii) Exclusion of sensitive features
  ix) Client clustering experiment (using K-Prototypes)
- Time Series Forecasting Model
  i) Time series data
  ii) RNN-MLP Hybrid Model
  iii) Time series LIME explanations
  iv) Steps to use
- Troubleshooting
- Project Structure
- Project Config
- Azure Machine Learning Pipelines
  i) Additional steps for Azure
- Contact
- Clone this repository (for help see this tutorial).
- Install the necessary dependencies (listed in requirements.txt). To do this, open a terminal in the root directory of the project and run the following:
$ pip install -r requirements.txt
- Open retrieve_raw_data.ps1 for editing. Replace "[Instance Name goes here]" with your HIFIS database instance name. Execute retrieve_raw_data.ps1. A file named "HIFIS_Clients.csv" should now be within the data/raw/ folder. See HIFIS_Clients_example.csv for an example of the column names in our "HIFIS_Clients.csv" (note that the data is fabricated; this file is included for illustrative purposes).
- Check that the features in HIFIS_Clients.csv match those listed in config.yml. If necessary, update the feature classifications in this file (for help see Project Config).
- Execute preprocess.py to transform the data into the format required by the machine learning model. Preprocessed data will be saved within data/processed/.
- Execute train.py to train the neural network model on your preprocessed data. The trained model weights will be saved within results/models/, and the filename will resemble the following structure: modelyyyymmdd-hhmmss.h5, where yyyymmdd-hhmmss is the current time. The TensorBoard log files will be saved within results/logs/training/.
- In config.yml, set MODEL_TO_LOAD within PATHS to the path of the model weights file that was generated in the previous step (for help see Project Config). Execute lime_explain.py to generate interpretable explanations for the model's predictions on the test set. A spreadsheet of predictions and explanations will be saved within results/experiments/.
- Once you have HIFIS_Clients.csv sitting in the raw data folder (data/raw/), execute preprocess.py. See Getting Started for help obtaining HIFIS_Clients.csv.
- Ensure the data has been preprocessed properly. That is, verify that data/processed/ contains both HIFIS_Processed.csv and HIFIS_Processed_OHE.csv. The latter is identical to the former, except that its single-valued categorical features have been one-hot encoded.
- In config.yml, set EXPERIMENT within TRAIN to 'single_train'.
- Execute train.py. The trained model's weights will be located in results/models/, and its filename will resemble the following structure: modelyyyymmdd-hhmmss.h5, where yyyymmdd-hhmmss is the current time. The model's logs will be located in results/logs/training/, and its directory name will be the current time in the same format. These logs contain information about the experiment, such as metrics throughout the training process on the training and validation sets, and performance on the test set. The logs can be visualized by running TensorBoard locally. See below for an example of a plot from a TensorBoard log file depicting loss on the training and validation sets vs. epoch. Plots depicting the change in performance metrics throughout the training process (such as the example below) are available in the SCALARS tab of TensorBoard.

You can also visualize the trained model's performance on the test set. See below for an example of the ROC curve and confusion matrix based on test set predictions. In our implementation, these plots are available in the IMAGES tab of TensorBoard.

The diagram below depicts an overview of the model's architecture. We call this model "HIFIS MLP", as it is an example of a multilayer perceptron. NODES0 and NODES1 correspond to hyperparameters configurable in config.yml.
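For orientation, here is a minimal tf.keras sketch of an MLP in the spirit of "HIFIS MLP". The layer sizes and hyperparameter values below are illustrative assumptions; the authoritative definition lives in src/models/models.py and is configured through config.yml.

```python
# Illustrative sketch only; hyperparameter values are placeholders.
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

def build_mlp_sketch(n_features, nodes0=80, nodes1=60, layers=3, dropout=0.35, l2_lambda=0.01):
    inputs = Input(shape=(n_features,))
    x = Dense(nodes0, activation='relu', kernel_regularizer=l2(l2_lambda))(inputs)
    x = Dropout(dropout)(x)
    for _ in range(layers - 1):                  # additional hidden layers of size NODES1
        x = Dense(nodes1, activation='relu', kernel_regularizer=l2(l2_lambda))(x)
        x = Dropout(dropout)(x)
    outputs = Dense(1, activation='sigmoid')(x)  # probability of chronic homelessness
    return Model(inputs, outputs)
```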
Not every model trained will perform at the same level on the test set. This procedure enables you to train multiple models and save the one that scored the best result on the test set for a particular metric that you care about optimizing.

- Follow steps 1 and 2 in Train a model and visualize results.
- In config.yml, set EXPERIMENT within TRAIN to 'multi_train'.
- Decide which metrics you would like to optimize and in what order. In config.yml, set METRIC_PREFERENCE within TRAIN to your chosen metrics, in order from most to least important. For example, if you decide to select the model with the best recall on the test set, set the first element in this field to 'recall'.
- Decide how many models you wish to train. In config.yml, set NUM_RUNS within TRAIN to your chosen number of training sessions. For example, if you wish to train 10 models, set this field to 10.
- Execute train.py. The weights of the model that had the best performance on the test set for the metric you specified will be located in results/models/training/, and its filename will resemble the following structure: modelyyyymmdd-hhmmss.h5, where yyyymmdd-hhmmss is the current time. The model's logs will be located in results/logs/training/, and its directory name will be the current time in the same format. A simplified sketch of this best-model selection logic follows these steps.
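The sketch below is a hypothetical illustration of how the best of several runs could be chosen according to METRIC_PREFERENCE; the metric values and file names are made up, and the real selection happens inside train.py.

```python
# Hypothetical illustration; metric values and paths are invented.
metric_preference = ['recall', 'auc']   # most to least important, as in config.yml

runs = [
    {'path': 'results/models/training/model_a.h5', 'recall': 0.81, 'auc': 0.90},
    {'path': 'results/models/training/model_b.h5', 'recall': 0.81, 'auc': 0.93},
]

# Rank runs by the preferred metrics in order; ties on the first metric are
# broken by the second, and so on.
best = max(runs, key=lambda r: tuple(r[m] for m in metric_preference))
print(best['path'])   # model_b.h5 wins: equal recall, higher AUC
```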
Since the predictions made by this model are to be used by a government institution to benefit vulnerable members of society, it is imperative that the model's predictions can be explained, both to help ensure that the model is making responsible predictions and to assure transparency and accountability in government decision-making processes. Since this model is a neural network, it is difficult to decipher which rules or heuristics it is employing to make its predictions. Interpretability in machine learning is a growing concern, especially with applications in the healthcare and social services domains. We used Local Interpretable Model-Agnostic Explanations (i.e. LIME) to explain the predictions of the neural network classifier that we trained. We used the implementation available in the authors' GitHub repository. LIME perturbs the features in an example and fits a linear model to approximate the neural network at the local region in the feature space surrounding the example. It then uses the linear model to determine which features were most contributory to the model's prediction for that example. By applying LIME to our trained model, we can conduct informed feature engineering based on any obviously inconsequential features we see (e.g. EyeColour) or insights from domain experts. We can also tell if the model is learning any unintended bias and eliminate that bias through additional feature engineering. A minimal sketch of this workflow is shown below, followed by the steps to apply LIME to explain the model's predictions on examples in the test set.
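As a rough illustration of the workflow described above, here is a sketch using the standard lime package API. The repository actually ships modified copies of lime_base.py, lime_tabular.py and submodular_pick.py, and the data, feature names and prediction function below are stand-ins rather than the project's own objects.

```python
# Sketch only; the toy data and predict_fn stand in for real preprocessed HIFIS
# data and the trained neural network.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

X_train = np.random.rand(100, 3)     # placeholder preprocessed training data
feature_names = ['TotalStays', 'CurrentWeightKG', 'ReasonForService_Streets']

def predict_fn(x):                   # placeholder wrapper around model.predict()
    p = 1 / (1 + np.exp(-x.sum(axis=1)))
    return np.column_stack([1 - p, p])   # class probabilities, shape (n, 2)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=['not at risk', 'at risk'],
    categorical_features=[2],        # index of the Boolean/categorical column
    kernel_width=1.5,                # cf. KERNEL_WIDTH in config.yml
    discretize_continuous=True)

explanation = explainer.explain_instance(X_train[0], predict_fn,
                                         num_features=3, num_samples=1000)
print(explanation.as_list())         # [(feature rule, weight), ...]
```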
- Having previously run train.py, ensure that data/processed/ contains both Train_Set.csv and Test_Set.csv.
- In config.yml, set MODEL_TO_LOAD within PATHS to the path of the model weights file (.h5 file) that you wish to use for prediction.
- By setting the appropriate value in the EXPERIMENT field of LIME in config.yml, you can select to either (i) perform a LIME experiment on the test set, (ii) perform a submodular pick, or (iii) run LIME on 1 test set example. Once the appropriate field is set, execute lime_explain.py.
  - By setting the EXPERIMENT field of LIME in config.yml to 'lime_experiment', you will run LIME on all examples in the test set, create a .csv file of the results, and produce a visualization of the average explainable feature rules. The .csv file will be located in results/experiments/, and will be called lime_experimentyyyymmdd-hhmmss.csv, where yyyymmdd-hhmmss is the current time. The visualization will be located in documents/generated_images/, and will be called LIME_Explanations_yyyymmdd-hhmmss.png.
  - By setting the EXPERIMENT field of LIME in config.yml to 'submodular_pick', you will run the submodular pick algorithm (as described in the LIME paper) to pick and amalgamate a set of explanations of training set examples that attempts to explain the model's functionality as a whole. Global surrogate explanations and weights will be saved to a .csv file and depicted in a visualization. The .csv file will be located in results/experiments/, and will be called lime_submodular_pick.csv. Subsequent submodular picks will be appended to this file with timestamps. The visualization will be located in documents/generated_images/, and will be called LIME_Submodular_Pick_yyyymmdd-hhmmss.png.
  - By setting the EXPERIMENT field of LIME in config.yml to 'explain_client', you will run LIME on the example in the test set whose ClientID is that which you passed to the function. You will have to set the Client ID of a client who is in the test set in the main function of lime_explain.py (it currently reads client_id = lime_dict['Y_TEST'].index[0][0], which selects the first ClientID in the test set). An image will be generated that depicts the top explainable features that the model used to make its prediction. The image will be automatically saved in documents/generated_images/, and its filename will resemble the following: Client_client_id_exp_yyyymmdd-hhmmss.png. See below for an example of this graphic.
- Interpret the output of the LIME explainer. LIME partitions features into classes or ranges and reports the features most contributory to a prediction. A feature explanation is considered to be a value (or range of values) of a feature and its associated weight in the prediction. In the example portrayed by the bar graph below, the fact that TotalStays was greater than 4 but less than or equal to 23 contributed negatively with a magnitude of about 0.22 to a positive prediction (meaning it contributed with a magnitude of 0.22 toward a negative prediction). As another example, the rule "ReasonForService_Streets=1" indicates that at some point the client has a record that cites their reason for service as "Streets" (=1 indicates that a Boolean feature is present, and =0 indicates that a Boolean feature is not present) and that this explanation contributed with a weight of about 0.02 toward a positive prediction. As one last example, consider that this client's AboriginalIndicator value is "Yes - Tribe Not Known", which contributed with a weight of about 0.04 toward a negative prediction.
NOTE: Many clients have incomplete records. To represent missing values, default values are inserted into the dataset. You may see these values when examining LIME explanations.
- Missing records for numerical features are given a value of -1
- Missing records for categorical features are given a value of "Unknown"
Hyperparameter tuning is an important part of the standard machine learning workflow. We chose to conduct a series of random hyperparameter searches. The results of one search informed the next, leading us to eventually settle on the hyperparameters currently set in the TRAIN and NN sections of config.yml. We applied TensorBoard visualization to aid the random hyperparameter search. With the help of the HParams Dashboard, one can see the effect of different combinations of hyperparameters on the model's test set performance metrics.

In our random hyperparameter search, we study the effects of x random combinations of hyperparameters by training the model y times for each of the x combinations and recording the results. See the steps below on how to conduct a random hyperparameter search. Note that if you are not planning on changing the hyperparameters or their ranges, you may skip steps 2-4, as a default set of hyperparameter ranges is already defined in code.

- In the HP subsection of the TRAIN section of config.yml, set the number of random combinations of hyperparameters you wish to study and the number of times you would like to train the model for each combination (see Project Config for help).
COMBINATIONS: 60
REPEATS: 2
- Set the ranges of hyperparameters you wish to study in the HP subsection of the TRAIN section of config.yml. The config file already has a comprehensive set of hyperparameter ranges defined (as shown below), so you may not need to change anything in this step. Consider whether your hyperparameter ranges are continuous (i.e. real) or discrete and whether any need to be investigated on the logarithmic scale.
NODES0: [80, 100]          # Discrete range
NODES1: [60, 80]           # Discrete range
LAYERS: [2, 3, 4]          # Discrete range
DROPOUT: [0.2, 0.4]        # Real range
LR: [-3.0, -3.0]           # Real range on logarithmic scale (10^x)
OPTIMIZER: ['adam', 'sgd'] # Discrete range
BETA_1: [-1.0, -1.0]       # 1st moment for Adam. Real range on log scale (1 - 10^x)
BETA_2: [-3.0, -3.0]       # 2nd moment for Adam. Real range on log scale (1 - 10^x)
L2_LAMBDA: [-2.0, -2.0]    # Real range on log scale (10^x)
BATCH_SIZE: [128, 256]     # Discrete range
POS_WEIGHT: [0.6, 0.7]     # Weight multiplier for positive class. Real range
IMB_STRATEGY: ['class_weight', 'smote', 'random_oversample']  # Discrete range
- Within the random_hparam_search() function defined in train.py, ensure your hyperparameters are added as HParam objects to the list of hyperparameters being considered. The below code is already included in train.py; only change it if you have changed the default hyperparameter ranges in config.yml.
HPARAMS.append(hp.HParam('NODES0', hp.Discrete(hp_ranges['NODES0'])))
HPARAMS.append(hp.HParam('NODES1', hp.Discrete(hp_ranges['NODES1'])))
HPARAMS.append(hp.HParam('LAYERS', hp.Discrete(hp_ranges['LAYERS'])))
HPARAMS.append(hp.HParam('DROPOUT', hp.RealInterval(hp_ranges['DROPOUT'][0], hp_ranges['DROPOUT'][1])))
HPARAMS.append(hp.HParam('L2_LAMBDA', hp.RealInterval(hp_ranges['L2_LAMBDA'][0], hp_ranges['L2_LAMBDA'][1])))
HPARAMS.append(hp.HParam('LR', hp.RealInterval(hp_ranges['LR'][0], hp_ranges['LR'][1])))
HPARAMS.append(hp.HParam('BETA_1', hp.RealInterval(hp_ranges['BETA_1'][0], hp_ranges['BETA_1'][1])))
HPARAMS.append(hp.HParam('BETA_2', hp.RealInterval(hp_ranges['BETA_2'][0], hp_ranges['BETA_2'][1])))
HPARAMS.append(hp.HParam('OPTIMIZER', hp.Discrete(hp_ranges['OPTIMIZER'])))
HPARAMS.append(hp.HParam('BATCH_SIZE', hp.Discrete(hp_ranges['BATCH_SIZE'])))
HPARAMS.append(hp.HParam('POS_WEIGHT', hp.RealInterval(hp_ranges['POS_WEIGHT'][0], hp_ranges['POS_WEIGHT'][1])))
HPARAMS.append(hp.HParam('IMB_STRATEGY', hp.Discrete(hp_ranges['IMB_STRATEGY'])))
- If you have added other hyperparameters not already listed in config.yml, ensure that you set them based on the random combination in either model.py or train.py.
- In config.yml, set EXPERIMENT within the TRAIN section to 'hparam_search'.
- Execute train.py. The experiment's logs will be located in results/logs/hparam_search/, and the directory name will be the current time in the following format: yyyymmdd-hhmmss. These logs contain information on test set metrics for models trained with different combinations of hyperparameters. The logs can be visualized by running TensorBoard locally. See below for an example of a view offered by the HParams dashboard of TensorBoard. Each point represents 1 training run. The graph compares values of hyperparameters to test set metrics. A simplified sketch of how a single combination is sampled and logged follows.
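The following is a minimal sketch, not the repository's exact code, of sampling one random combination and recording it with the TensorBoard HParams API. The hyperparameter names and log directory mirror those used above, while the metric value is a placeholder.

```python
# Minimal sketch; the full search loop lives in random_hparam_search() in train.py.
import random
import tensorflow as tf
from tensorboard.plugins.hparams import api as hp

HP_NODES0 = hp.HParam('NODES0', hp.Discrete([80, 100]))
HP_DROPOUT = hp.HParam('DROPOUT', hp.RealInterval(0.2, 0.4))

combination = {
    HP_NODES0: random.choice(HP_NODES0.domain.values),        # discrete range
    HP_DROPOUT: random.uniform(HP_DROPOUT.domain.min_value,
                               HP_DROPOUT.domain.max_value),  # real range
}

with tf.summary.create_file_writer('results/logs/hparam_search/run1').as_default():
    hp.hparams(combination)                    # record this combination of hyperparameters
    # ... train a model with `combination`, then log its test set metrics, e.g.:
    tf.summary.scalar('recall', 0.80, step=1)  # placeholder metric value
```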
Once a trained model is produced, the user may wish to obtain predictions and explanations for all clients currently in the HIFIS database. As clients' life situations change over time, their records in the HIFIS database change as well. Thus, it is useful to rerun predictions for clients every so often. If you wish to track changes in predictions and explanations for particular clients over time, you can choose to append timestamped predictions to a file containing previous timestamped predictions. The steps below detail how to run prediction for all clients, given raw data from HIFIS and a trained model.

- Ensure that you have already run lime_explain.py after training your model, as it will have generated and saved a LIME Explainer object at data/interpretability/lime_explainer.pkl.
- Ensure that you have HIFIS_Clients.csv located within the raw data folder (data/raw/). See Getting Started for help obtaining HIFIS_Clients.csv.
- In config.yml, set MODEL_TO_LOAD within PATHS to the path of the model weights file (.h5 file) that you wish to use for prediction.
- By changing the value of the EXPERIMENT field of the PREDICTION section of config.yml to your desired experiment, you can opt to either (i) save predictions to a new file or (ii) append predictions and their corresponding timestamps to a file containing past predictions.
  - Setting the EXPERIMENT field of the PREDICTION section of config.yml to 'batch_prediction' will preprocess raw client data, run prediction for all clients, and run LIME to explain these predictions. Results will be saved in a .csv file, which will be located in results/predictions/, and will be called predictionsyyyymmdd-hhmmss.csv, where yyyymmdd-hhmmss is the current time.
  - Setting the EXPERIMENT field of the PREDICTION section of config.yml to 'trending_prediction' will produce predictions and explanations in the same manner as described in (i), but will include timestamps for when the predictions were made. The results will be appended to a file called trending_predictions.csv, located within results/predictions/. This file contains predictions made at previous times, enabling the user to compare the change in predictions and explanations for particular clients over time. A sketch of this append step is shown after these instructions.
- Execute predict.py.
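As a hypothetical illustration of the trending-prediction append described above (the column names and values here are invented; predict.py implements the real logic):

```python
# Hypothetical sketch: column names and values are invented.
import os
from datetime import datetime
import pandas as pd

new_preds = pd.DataFrame({'ClientID': [101, 102],
                          'At risk': [0, 1],
                          'Probability': [0.12, 0.87]})
new_preds['Timestamp'] = datetime.now().strftime('%Y%m%d-%H%M%S')

path = 'results/predictions/trending_predictions.csv'
# Append to the running file, writing the header only if the file is new.
new_preds.to_csv(path, mode='a', header=not os.path.exists(path), index=False)
```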
Cross validation helps us select a model that is as unbiased as possible towards any particular dataset. By using cross validation, we can be increasingly confident in the external validity of our results. This repository offers a means to run cross validation for both models defined. We include cross validation as a training experiment.

Note that the cross validation experiment differs for the HIFIS-MLP model and the HIFIS-RNN-MLP. K-fold cross validation is used in the case of HIFIS-MLP, whereas nested cross validation with day-forward chaining is used for HIFIS-RNN-MLP. The difference in our cross validation algorithms is a result of the different types of data used for both scenarios. In HIFIS-MLP, data is randomly partitioned by ClientID into training/validation/test sets. Since HIFIS-RNN-MLP learns from time series data, the validation and test sets are taken to be the second-most and most recent partitions of data respectively (and are therefore not random). Nested cross validation is a commonly used strategy for time series data.
To run cross validation, see the steps below:
- Follow steps 1 and 2 in Train a model and visualize results.
- In config.yml, set EXPERIMENT within TRAIN to 'cross_validation'.
- Execute train.py. A model will be trained for each train/test fold. A CSV will be generated that reports the performance metrics on the test sets for each fold, along with the mean and standard deviation of metrics across all folds. The file will be located in results/experiments/, and its filename will resemble the following structure: kFoldCVyyyymmdd-hhmmss.csv if you are training the HIFIS-MLP model (which is the default). If you are training the HIFIS-RNN-MLP model, the file will be called nestedCVyyyymmdd-hhmmss.csv. A simplified illustration of the day-forward chaining splits used for the time series model follows.
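The sketch below illustrates the idea of day-forward chaining on a date-indexed DataFrame; the column name and fold logic are simplified assumptions, not the exact implementation in train.py.

```python
# Simplified illustration of day-forward chaining; 'Date' is an assumed column name.
import numpy as np
import pandas as pd

def day_forward_chaining(df, n_folds, date_col='Date'):
    """Yield (train, test) pairs where each fold trains on all earlier dates."""
    date_blocks = np.array_split(np.sort(df[date_col].unique()), n_folds + 1)
    for i in range(1, n_folds + 1):
        train_dates = np.concatenate(date_blocks[:i])   # everything before the test block
        test_dates = date_blocks[i]                     # the next block of dates
        yield (df[df[date_col].isin(train_dates)],
               df[df[date_col].isin(test_dates)])
```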
The prediction horizon (N) is defined as the amount of time from now at which the model makes its predictions. In our case, the prediction horizon is how far in the future (in weeks) the model is predicting risk of chronic homelessness. For example, if N = 26 weeks, then the model is predicting whether or not a client will be at risk of chronic homelessness in 26 weeks. While developing this model, we noticed that the model's performance is inversely correlated with the prediction horizon. The Prediction Horizon Search Experiment conducts cross validation at multiple values of N. For each value of N, the data is retrospectively preprocessed by cutting off the most recent N weeks of records (a minimal sketch of this cutoff follows the steps below). The relationships of N and several model metrics are graphed for the user to deliver insight on the impact of N and to support a business decision as to which value yields optimal results. See below for instructions on how to run a Prediction Horizon Search Experiment.
- In the HORIZON_SEARCH section of config.yml, set N_MIN, N_MAX, N_INTERVAL and RUNS_PER_N according to your organization's needs (see Project Config for help).
- Run src/horizon_search.py. This may take several minutes to hours, depending on your hardware and the settings from the previous step.
- A .csv representation of experiment results will be available within results/experiments/, called horizon_searchyyyymmdd-hhmmss.csv, where yyyymmdd-hhmmss is the current time. A graphical representation of the results will be available within documents/generated_images/, called horizon_experiment_yyyymmdd-hhmmss.png. See below for an example of this visualization.
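As referenced above, the retrospective cutoff applied for a candidate horizon N can be pictured roughly as follows; the date column name is an assumption, and the actual preprocessing is handled by horizon_search.py and preprocess.py.

```python
# Rough sketch; 'DateStart' is an assumed column name for record dates.
import pandas as pd

def cutoff_for_horizon(df, n_weeks, date_col='DateStart'):
    """Drop the most recent n_weeks of records so that ground truth can be
    computed n_weeks 'in the future' relative to the remaining data."""
    cutoff = df[date_col].max() - pd.Timedelta(weeks=n_weeks)
    return df[df[date_col] <= cutoff]
```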
Depending on your organization's circumstances, you may wish to exclude specific HIFIS features that you consider sensitive for legal, ethical or social reasons. A simple way to ensure that a feature does not cause bias in the model is to avoid including it as a feature of the model at all. This project supports the exclusion of specific features by dropping them at the start of data preprocessing (a minimal sketch of this step follows the list below). See the steps below for details on how to accomplish this:

- Having completed steps 1-3 in Getting Started, open the raw HIFIS data (i.e. "HIFIS_Clients.csv"), which should be within the data/raw/ folder.
- Features that the model will be trained on correspond to the column names of this file. Decide which features you wish to exclude from model training. Take note of these column names.
- Open config.yml for editing. Add the column names that you decided on during step 2 to the list of features in the FEATURES_TO_DROP_FIRST field of the DATA section of config.yml (for more info see Project Config).
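A minimal sketch of what this exclusion amounts to, assuming the DATA > FEATURES_TO_DROP_FIRST key described in this README; the real logic lives in preprocess.py.

```python
# Minimal sketch, assuming the DATA > FEATURES_TO_DROP_FIRST key in config.yml.
import pandas as pd
import yaml

with open('config.yml', 'r') as f:
    cfg = yaml.safe_load(f)

df = pd.read_csv('data/raw/HIFIS_Clients.csv')
# Drop sensitive/excluded columns before any other preprocessing takes place.
df = df.drop(columns=cfg['DATA']['FEATURES_TO_DROP_FIRST'], errors='ignore')
```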
We were interested in investigating whether HIFIS client data could be clustered. Since HIFIS consists of numerical and categorical data, and in the spirit of minimizing time complexity, k-prototypes was selected as the clustering algorithm. We wish to acknowledge Nico de Vos, as we made use of his implementation of k-prototypes. This experiment runs k-prototypes a series of times and selects the best clustering (i.e. the one with the least average dissimilarity between clients and their assigned clusters' centroids). Our intention is that by examining clusters and obtaining their LIME explanations, one can gain further insight into patterns in the data and how the model behaves in different scenarios. By following the steps below, you can cluster clients into a desired number of clusters, examine the clusters' centroids, and view LIME explanations of these centroids.
- Ensure that you have already run lime_explain.py after training your model, as it will have generated and saved a LIME Explainer object at data/interpretability/lime_explainer.pkl.
- Ensure that you have a .csv file of preprocessed data located within the processed data folder (data/processed/). See Getting Started for help.
- Set the EXPERIMENT field of the K-PROTOTYPES section of config.yml to 'cluster_clients'.
- Run cluster.py. Consult Project Config before changing the default clustering parameters in config.yml. A minimal sketch of the underlying k-prototypes call appears after the list of outputs below.
- 3 separate files will be saved once clustering is complete:
  - A spreadsheet depicting cluster assignments by Client ID will be located at results/experiments/, and it will be called client_clusters_yyyymmdd-hhmmss.csv (where yyyymmdd-hhmmss is the current time).
  - Cluster centroids will be saved to a spreadsheet. Centroids have the same features as clients, and their LIME explanations will be appended to the end by default. The spreadsheet will be located at results/experiments/, and it will be called cluster_centroids_yyyymmdd-hhmmss.csv.
  - A graphic depicting the LIME explanations of all centroids will be located at documents/generated_images/, and it will be called centroid_explanations_yyyymmdd-hhmmss.png.
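For reference, the core clustering call looks roughly like the sketch below, which uses the kmodes package (Nico de Vos's implementation); the toy data and parameter values are illustrative, and cluster.py wraps the real call with config-driven parameters.

```python
# Sketch only; the toy data stands in for the preprocessed client records.
import numpy as np
from kmodes.kprototypes import KPrototypes

X = np.array([[10, 'Canadian citizen'], [2, 'Refugee'],
              [35, 'Canadian citizen'], [1, 'Refugee']], dtype=object)
cat_idxs = [1]                       # indices of categorical columns

kproto = KPrototypes(n_clusters=2,   # cf. K in config.yml
                     n_init=5,       # cf. N_RUNS: keep the best of several runs
                     n_jobs=1,       # cf. N_JOBS
                     verbose=0)
clusters = kproto.fit_predict(X, categorical=cat_idxs)

centroids = kproto.cluster_centroids_   # centroids of the learned clusters
total_cost = kproto.cost_               # dissimilarity of clients to assigned centroids
```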
A tradeoff for the efficiency of k-prototypes is the fact that the number of clusters must be specified a priori. In an attempt to determine the optimal number of clusters, the Average Silhouette Method was implemented. During this procedure, different clusterings are computed for a range of values of k. A graph is produced that plots the average Silhouette Score versus k. The higher the average Silhouette Score, the more optimal k is. To run this experiment, see the steps below.
- Follow steps 1 and 2 as outlined above in the clustering instructions.
- Set the EXPERIMENT field of the K-PROTOTYPES section of config.yml to 'silhouette_analysis'. At this time, you may wish to change the K_MIN and K_MAX fields of the K-PROTOTYPES section from their defaults. Values of k in the integer range of [K_MIN, K_MAX] will be investigated.
- Run cluster.py. An image depicting a graph of average Silhouette Score versus k will be saved to documents/generated_images/, and it will be called silhouette_plot_yyyymmdd-hhmmss.png. When visualizing this graph, note that a larger average Silhouette Score implies a higher-quality clustering.
Later research involved developing a model that reformulates client service usage as time series features. The motivation behind the investigation of time series forecasting was to discover whether capturing service usage changes over time would better reflect the episodic nature of homelessness, thereby improving model predictions for clients at different times. The hope was that time series service features would give further context to a client's story that could be formalized as an example. This section describes the changes in the features, dataset, model, and explanations.

Features that describe client service usage (e.g. stays, case management, food bank visits) are quantified over time. In the original HIFIS MLP model, these features were totalled up to the date of the example. Here, we define a timestep (by default, 30 days) and include total service usage over a timestep as a feature. The input sequence length (i.e. T_X) defines how many of the most recent timesteps to include in a single client record. For instance, suppose that the timestep is 30 days and T_X is 6. A single client record will contain total service usage features for that client at a particular timestamp, as well as time series service features corresponding to total service usage during each of the 6 most recent timesteps.
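The binning into timesteps can be pictured roughly as in the sketch below; the column names and dates are invented, and preprocess.py contains the actual implementation.

```python
# Rough sketch; column names and dates are invented.
import pandas as pd

TIME_STEP, T_X = 30, 6                    # timestep length (days), input sequence length
record_date = pd.Timestamp('2020-04-01')  # the date this client record is constructed for

stays = pd.DataFrame({'ClientID': [1, 1, 1, 1],
                      'StayDate': pd.to_datetime(['2020-01-05', '2020-02-10',
                                                  '2020-03-01', '2020-03-20'])})

# 0 = the most recent 30-day timestep, 1 = the one before it, and so on.
timesteps_ago = (record_date - stays['StayDate']).dt.days // TIME_STEP
recent = stays.assign(TimestepsAgo=timesteps_ago).query('TimestepsAgo < @T_X')
stay_counts = recent.groupby(['ClientID', 'TimestepsAgo']).size()
# stay_counts[(1, 2)] -> total stays 2 timesteps ago, i.e. the "(-2)30-Day_Stay" feature
```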
An additional benefit of formulating time series client records was that the aggregate dataset can contain records for clients at different dates. Therefore, the dataset is indexed by ClientID and Date. During data preprocessing, records are calculated for each client at different dates. The result is a significantly larger dataset than was used for the MLP model.

Since the dataset contains different records for each client at various dates, the training, validation and test sets are partitioned by Date, instead of randomly by ClientID. The test and validation sets comprise the most recent and second most recent records for all clients. The training set is taken to be all records with dates earlier than those in the test and validation sets. This way, the model is tested on the most recent client data and is likely to perform well on new client data. Due to the non-random partitions of the training, validation and test sets, the cross validation experiment used in this repository for time series data is nested cross validation with day-forward chaining (see Cross Validation for more info).
The time series forecasting model is different from the first model described. The first iteration of the HIFIS model was a multilayer perceptron (MLP). The time series forecasting model we developed has a hybrid recurrent neural network (RNN) and MLP architecture. This model, dubbed the "RNN-MLP model", captures the state of the time series features by incorporating an LSTM layer through which the time series features are fed. The static features (i.e. non-time-series features) are concatenated with the output of the LSTM layer and fed into an MLP, whose output is the model's decision. Examples of static features include demographic attributes. Total service features are also included in this group, as they capture a client's service usage since the beginning of their inclusion in HIFIS. See below for a diagram summarizing the RNN-MLP's architecture.
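The architecture described above can be sketched in tf.keras as follows; the input dimensions and layer sizes are illustrative assumptions, and the authoritative definition is in src/models/models.py.

```python
# Illustrative sketch; input dimensions and layer sizes are assumptions.
from tensorflow.keras.layers import LSTM, Concatenate, Dense, Input
from tensorflow.keras.models import Model

T_X, N_TS_FEATS, N_STATIC_FEATS = 6, 4, 100

ts_input = Input(shape=(T_X, N_TS_FEATS), name='time_series_features')
static_input = Input(shape=(N_STATIC_FEATS,), name='static_features')

lstm_out = LSTM(32)(ts_input)                  # summarizes the service usage sequence
x = Concatenate()([lstm_out, static_input])    # join sequence summary with static features
x = Dense(80, activation='relu')(x)
x = Dense(60, activation='relu')(x)
output = Dense(1, activation='sigmoid', name='p_chronic_homelessness')(x)

model = Model(inputs=[ts_input, static_input], outputs=output)
```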
Explanations are computed for the RNN-MLP model in the same way as they were for the original MLP model, except that they are computed for predictions for a particular client at a particular date. We found that time series features (especially those of total stays during different timesteps) appeared more often as being important in explanations. The time series features are named for their position in the input sequence and the duration of the timestep. For example, a feature called "(-2)30-Day_Stay" indicates the total number of stays that a client had 2 timesteps ago, where the timestep duration is 30 days. Additionally, stable explanations take longer to compute for the RNN-MLP models.
- In config.yml, set MODEL_DEF within TRAIN to 'hifis_rnn_mlp'.
- Follow steps 1-4 in Train a model and visualize results to preprocess the raw data and train a model.
- See the steps in LIME Explanations for instructions on how to run different explainability experiments. If you wish to explain a single client, be sure that you pass a value for the date parameter to explain_single_client() in the yyyy-mm-dd format.
Below are some common error scenarios that you may experience if you apply this repository to your municipality's HIFIS database, along with possible solutions.
- KeyError: "['MyFeature'] not found in axis"
  Commonly occurs if you have specified a feature in one of the lists of the DATA section of config.yml that does not exist as a column in the CSV of raw data extracted from the HIFIS database. This could occur if one of the default features in those lists does not exist in your HIFIS database. To fix this, remove the feature (in this case called 'MyFeature') from the appropriate list in config.yml.
- A feature in your database is missing in the preprocessed data CSV
  All features are classified as either noncategorical or categorical. You must ensure that the lists defined at DATA > NONCATEGORICAL_FEATURES and DATA > CATEGORICAL_FEATURES (in config.yml) include the column names in your raw data that you wish to use as features for the model.
- Incorrect designation of a feature as noncategorical vs. categorical
  The lists defined in config.yml at DATA > NONCATEGORICAL_FEATURES and DATA > CATEGORICAL_FEATURES must correctly classify features as noncategorical or categorical. Strange errors may be encountered during execution of vec_multi_value_cat_features() during preprocessing if these are set incorrectly. Remember that categorical features can take on one of a finite number of possible values, whereas noncategorical features are numeric variables whose domains exist within the real numbers. For example, 'Citizenship' is a categorical feature and 'CurrentWeightKG' is a noncategorical feature.
- File "preprocess.py", line 443, in assemble_time_sequences IndexError: list index out of range
  This error can indicate that your raw data does not go as far back in time as needed to produce preprocessed data, given the parameters that you set. It is important to ensure that the sum of the prediction horizon and the length of time covered by each time series example is less than the total time over which raw data was collected. For example, if the prediction horizon is 26 weeks (found at DATA > N_WEEKS in config.yml), the time step length is 30 days (found at DATA > TIME_SERIES > TIME_STEP), and the input sequence length is 6 (found at DATA > TIME_SERIES > T_X), then you should have at least 26×7 + 30×6 = 362 days of raw HIFIS data. Note that this is the bare minimum; you should have significantly more days of HIFIS data than this in order to train an effective model. Finally, if preprocessing causes your resultant dataset to be small, you may encounter poor results when training a model; therefore, consider that the absence of this error does not guarantee that you have enough raw data.
- OSError: SavedModel file does not exist at: results/models/model.h5/{saved_model.pbtxt|saved_model.pb}
  This common error, experienced when attempting to load model weights from a non-existent file path, may occur in lime_explain.py, predict.py, or cluster.py. The model weights' path is set at the PATHS > MODEL_TO_LOAD field of config.yml. You must change its default value from 'model.h5' to the filename of a model weights file that exists in results/models/. Note that trained models are automatically saved with a filename following the convention 'modelyyyymmdd-hhmmss.h5', where yyyymmdd-hhmmss is a datetime.
- Out-of-memory error during LIME submodular pick
  Submodular pick involves generating LIME explanations for a large number of examples in the training set. The LIME package used in this repository tends to use more memory than is required to temporarily store the list of explanations accumulated during a submodular pick experiment. Generating too many explanations during this experiment can cause out-of-memory errors, depending on available RAM. The only known fix is to decrease the fraction of training set examples to use during submodular pick. This value may be found at the LIME > SP > SAMPLE_FRACTION field of config.yml. To illustrate this issue, we had to set this value to 0.2 when running submodular pick on a training set of approximately 90000 records on a virtual machine with 56 GiB of RAM.
The project looks similar to the directory structure below. Disregard any .gitkeep files, as their only purpose is to force Git to track empty directories. Disregard any __init__.py files, as they are empty files that enable Python to recognize certain directories as packages.
├── azure                         <- folder containing Azure ML pipelines
├── data
│   ├── interpretability          <- Generated feature information
│   ├── processed                 <- Products of preprocessing
│   ├── raw                       <- Raw data from SQL query
│   └── transformers              <- Serialized sklearn transformers
|
├── documents
|   ├── generated_images          <- Visualizations of model performance, experiments
|   └── readme_images             <- Image assets for README.md
├── results
│   ├── experiments               <- Experiment results
│   ├── logs                      <- TensorBoard logs
│   ├── models                    <- Trained model weights
│   └── predictions               <- Model predictions and explanations
|
├── src
│   ├── custom                    <- Custom TensorFlow components
|   |   └── metrics.py            <- Definition of custom TensorFlow metrics
│   ├── data                      <- Data processing
|   |   ├── queries
|   |   |   ├── client_export.sql <- SQL query to get raw data from HIFIS database
|   |   |   └── SPDAT_export.sql  <- SQL query to get SPDAT data from database
|   |   ├── preprocess.py         <- Main preprocessing script
|   |   └── spdat.py              <- SPDAT data preprocessing script
│   ├── interpretability          <- Model interpretability scripts
|   |   ├── cluster.py            <- Script for learning client clusters
|   |   ├── lime_base.py          <- Modified version of file taken from lime package
|   |   ├── lime_explain.py       <- Script for generating LIME explanations
|   |   ├── lime_tabular.py       <- Modified version of file taken from lime package
|   |   └── submodular_pick.py    <- Modified version of file taken from lime package
│   ├── models                    <- TensorFlow model definitions
|   |   └── models.py             <- Script containing model definition
|   ├── visualization             <- Visualization scripts
|   |   └── visualize.py          <- Script for visualizing model performance metrics
|   ├── horizon_search.py         <- Script for comparing different prediction horizons
|   ├── predict.py                <- Script for prediction on raw data using trained models
|   └── train.py                  <- Script for training model on preprocessed data
|
├── .gitignore                    <- Files to be ignored by git.
├── config.yml                    <- Values of several constants used throughout project
├── config_private.yml            <- Private information, e.g. database keys (not included in repo)
├── LICENSE                       <- Project license
├── README.md                     <- Project description
├── requirements.txt              <- Lists all dependencies and their respective versions
└── retrieve_raw_data.ps1         <- Powershell script that executes SQL queries to get raw data from HIFIS database
Many of the components of this project are ready for use on your HIFIS data. However, this project contains several configurable variables that are defined in the project config file: config.yml. When loaded into Python scripts, the contents of this file become a dictionary through which the developer can easily access its members.

For user convenience, the config file is organized into the major steps in our model development pipeline. Many fields need not be modified by the typical user, but others may be modified to suit the user's specific goals. A brief sketch of how scripts consume this file is shown below, followed by a summary of its major configurable elements.
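A minimal sketch of loading config.yml with PyYAML; the key names referenced here appear elsewhere in this README, but the project's own loading helper may differ in detail.

```python
# Minimal sketch; the project's own config-loading helper may differ in detail.
import yaml

with open('config.yml', 'r') as f:
    cfg = yaml.safe_load(f)        # nested dictionary mirroring the YAML structure

model_path = cfg['PATHS']['MODEL_TO_LOAD']
n_weeks = cfg['DATA']['N_WEEKS']   # prediction horizon, in weeks
experiment = cfg['TRAIN']['EXPERIMENT']
```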
- RAW_DATA: Path to the .csv file generated by running ClientExport.sql
- RAW_SPDAT_DATA: Path to the .json file containing client SPDAT data
- MODEL_WEIGHTS: Base path at which to save the trained model's weights
- MODEL_TO_LOAD: Path to the trained model's weights that you would like to load for prediction
- N_WEEKS: The number of weeks in the future the model will be predicting the probability of chronic homelessness (i.e. the predictive horizon)
- GROUND_TRUTH_DATE: Date at which to compute ground truth (i.e. state of chronic homelessness) for clients. Set to either 'today' or a date with the following format: 'yyyy-mm-dd'.
- CHRONIC_THRESHOLD: Number of stays per year for a client to be considered chronically homeless
- CLIENT_EXCLUSIONS: A list of Client IDs (integers) that specifies clients who did not provide consent to be included in this project. Records belonging to these clients will be automatically removed from the raw dataset prior to preprocessing.
- FEATURES_TO_DROP_FIRST: Features you would like to exclude entirely from the model. For us, this list evolved through trial and error. For example, after running LIME to produce prediction explanations, we realized that features in the database that should have no impact on the ground truth (e.g. EyeColour) were appearing in some explanations; thus, they were added to this list so that these problematic correlations and inferences would not be made by the model. Incidentally, this iterative feature engineering using LIME (explainable AI) to identify bad correlations is the foundation of ensuring a machine learning model is free of bias and that its predictions are valuable.
- IDENTIFYING_FEATURES_TO_DROP_LAST: A list of features that are used to preprocess data but are eventually excluded from the preprocessed data, as the model cannot consume them. You will not likely have to edit this unless you have additional data features which are not noted in our config file.
- TIMED_FEATURES_TO_DROP_LAST: A list of features containing dates that are used in preprocessing but are eventually excluded from the preprocessed data. Add any features describing a start or end date to this list (e.g. 'LifeEventStartDate', 'LifeEventEndDate')
- TIMED_EVENT_FEATURES: A dictionary where each key is a timestamp feature (e.g. 'LifeEventStartDate') and every value is a list of features that are associated with the timestamp. For example, the 'LifeEvent' feature is associated with the 'LifeEventStartDate' feature. For paired start and end features, include 1 entry in this dictionary for the associated features. For example, you need to include only one of 'LifeEventStartDate' and 'LifeEventEndDate' as a key, along with ['LifeEvent'] as the associated value.
- TIMED_EVENTS: A list of columns that correspond to significant events in a client's history (e.g. health issues, life events, service restrictions)
- TIMED_SERVICE_FEATURES: Services received by a client over a period of time to include as features. The feature value is calculated by summing the days over which the service is received. Note that the services offered in your locale may be different, necessitating modification of this list.
- COUNTED_SERVICE_FEATURES: Services received by a client at a particular time to include as features. The feature value is calculated by summing the number of times that the service was accessed. Note that the services offered in your locale may be different, necessitating modification of this list.
- KFOLDS: Number of folds for k-fold cross validation
- TIME_SERIES: Parameters associated with time series data
  - TIME_STEP: Length of a time step, in days
  - T_X: Length of the input sequence fed to the LSTM layer; also described as the number of past time steps to include with each example
  - YEARS_OF_DATA: Number of recent years over which to construct time series records. Recall that time series examples are indexed by ClientID and Date.
  - FOLDS: Number of folds for nested cross validation with day-forward chaining
- SPDAT: Parameters associated with the addition of client SPDAT data
  - INCLUDE_SPDATS: Boolean variable indicating whether to include SPDAT data during preprocessing
  - SPDAT_CLIENTS_ONLY: Boolean variable indicating whether to include only clients who have a SPDAT in the preprocessed dataset
  - SPDAT_DATA_ONLY: Boolean variable indicating whether to include only answers to SPDAT questions in the preprocessed dataset
- HIFIS_MLP: Contains definitions of configurable hyperparameters associated with the HIFIS MLP model architecture. The values currently in this section were the optimal values for our dataset informed by a random hyperparameter search.
- HIFIS_RNN_MLP: Contains definitions of configurable hyperparameters associated with the HIFIS RNN-MLP model architecture. This model is to be trained with time series data. The values currently in this section were the optimal values for our dataset informed by a random hyperparameter search.
- EXPERIMENT: The type of training experiment you would like to perform if executing train.py. Choices are 'single_train', 'multi_train', or 'hparam_search'.
- MODEL_DEF: The model architecture to train. Set to 'hifis_mlp' to train the HIFIS MLP model, or set to 'hifis_rnn_mlp' to train the HIFIS RNN-MLP hybrid model. Also dictates how the raw HIFIS data will be preprocessed.
- TRAIN_SPLIT, VAL_SPLIT, TEST_SPLIT: Fraction of the data allocated to the training, validation and test sets respectively. These fields must collectively sum to 1.
- EPOCHS: Number of epochs to train the model for
- BATCH_SIZE: Mini-batch size during training
- POS_WEIGHT: Coefficient by which to multiply the positive class' weight during computation of the loss function. The negative class' weight is multiplied by (1 - POS_WEIGHT). Increasing this number tends to increase recall and decrease precision.
- IMB_STRATEGY: Class imbalance strategy to employ. In our dataset, the ratio of positive to negative ground truth was very low, prompting the use of these strategies. Set to either 'class_weight', 'random_oversample', 'smote', or 'adasyn'.
- METRIC_PREFERENCE: A list of metrics in order of importance (from left to right) to guide selection of the best model after training multiple models in series (i.e. the 'multi_train' experiment in train.py)
- NUM_RUNS: The number of times to train a model in the 'multi_train' experiment
- THRESHOLDS: A single float or list of floats in the range [0, 1] defining the classification threshold. Affects precision and recall metrics.
- HP: Parameters associated with the random hyperparameter search
  - METRICS: List of metrics on the validation set to monitor in the hyperparameter search. Can be any combination of {'accuracy', 'loss', 'recall', 'precision', 'auc'}
  - COMBINATIONS: Number of random combinations of hyperparameters to try in the hyperparameter search
  - REPEATS: Number of times to repeat training per combination of hyperparameters
  - RANGES: Ranges defining possible values that hyperparameters may take. Be sure to check train.py to ensure that your ranges are defined correctly as real or discrete intervals (see Random Hyperparameter Search for an example).
Note that the following fields have separate values for the HIFIS_MLP and HIFIS_RNN_MLP architectures: KERNEL_WIDTH, FEATURE_SELECTION, NUM_FEATURES, NUM_SAMPLES, and MAX_DISPLAYED_RULES.

- KERNEL_WIDTH: Affects the size of the neighbourhood around which LIME samples for a particular example. In our experience, setting this within the continuous range of [1.0, 2.0] is large enough to produce stable explanations, but small enough to avoid producing explanations that approach a global surrogate model. This field is numeric; however, it will be set to a default kernel width if a string is set.
- FEATURE_SELECTION: The strategy to select features for LIME explanations. Read the LIME creators' documentation for more information.
- NUM_FEATURES: The number of features to include in a LIME explanation
- NUM_SAMPLES: The number of samples used to fit a linear model when explaining a prediction using LIME
- MAX_DISPLAYED_RULES: The maximum number of explanations to be included in a global surrogate visualization
- SP: Parameters associated with submodular pick
  - SAMPLE_FRACTION: A float in the range [0.0, 1.0] that specifies the fraction of samples from the training and validation sets to generate candidate explanations for. Alternatively, set to 'all' to sample the entire training and validation sets.
  - NUM_EXPLANATIONS: The desired number of explanations that maximize explanation coverage
- EXPERIMENT: The type of LIME interpretability experiment you would like to perform if executing lime_explain.py. Choices are 'explain_client', 'lime_experiment', or 'submodular_pick'.
- N_MIN: Smallest prediction horizon to use in the prediction horizon search experiment (in weeks)
- N_MAX: Largest prediction horizon to use in the prediction horizon search experiment (in weeks)
- N_INTERVAL: Size of the increment by which to increase the prediction horizon when iterating through possible prediction horizons (in weeks)
- THRESHOLD: Classification threshold for prediction
- CLASS_NAMES: Identifiers for the classes predicted by the neural network, as included in the prediction spreadsheet.
- EXPERIMENT: The type of prediction experiment you would like to perform if executing predict.py. Choices are 'batch_prediction' or 'trending_prediction'.
- K: Desired number of client clusters when running k-prototypes
- N_RUNS: Number of attempts at running k-prototypes. Best clusters are saved.
- N_JOBS: Number of parallel compute jobs to create for k-prototypes. This is useful when N_RUNS is high and you want to decrease the total runtime of clustering. Before increasing this number, check how many CPUs are available to you.
- K_MIN: Minimum value of k to investigate in the Silhouette Analysis experiment
- K_MAX: Maximum value of k to investigate in the Silhouette Analysis experiment
- EXPERIMENT: The type of k-prototypes clustering experiment you would like to perform if executing cluster.py. Choices are 'cluster_clients' or 'silhouette_analysis'.
We deployed our model training and batch prediction functionality to Azure cloud computing services. To do this, we created Jupyter notebooks to define and run Azure machine learning pipelines as experiments in Azure, and Python scripts corresponding to pipeline steps. We included these files in the azure/ folder, in case they may benefit any parties hoping to use this project. Note that Azure is not required to run HIFIS-v2 code, as all Python files necessary to get started are in the src/ folder. If you plan on using the Azure machine learning pipelines defined in the azure/ folder, there are a few steps you will need to follow first:
- Obtain an active Azure subscription.
- Ensure you have installed the latest versions of the azureml-sdk and azureml_widgets pip packages.
- In the Azure portal, create a resource group.
- In the Azure portal, create a machine learning workspace, and set its resource group to be the one you created in the previous step. When you open your workspace in the portal, there will be a button near the top of the page that reads "Download config.json". Click this button to download the config file for the workspace, which contains confidential information about your workspace and subscription. Once the config file is downloaded, rename it to ws_config.json. Move this file to the azure/ folder in the HIFIS-v2 repository. For reference, the contents of ws_config.json resemble the following:
{ "subscription_id": "your-subscription-id", "resource_group": "name-of-your-resource-group", "workspace_name": "name-of-your-machine-learning-workspace"}
Matt Ross
Manager, Artificial Intelligence
Information Technology Services, City Manager’s Office
City of London
Suite 300 - 201 Queens Ave, London, ON. N6A 1J1
Blake VanBerlo
Data Scientist
City of London Municipal Artificial Intelligence Applications Lab
C: blake@vanberloconsulting.com