
Human-explainable AI.


BCG-X-Official/facet


FACET is an open source library for human-explainable AI. It combines sophisticated model inspection and model-based simulation to enable better explanations of your supervised machine learning models.

FACET is composed of the following key components:

  • Model Inspection
  • Model Simulation
  • Enhanced Machine Learning Workflow


Installation

FACET supports both PyPI and Anaconda. We recommend installing FACET into a dedicated environment.

Anaconda

    conda create -n facet
    conda activate facet
    conda install -c bcg_gamma -c conda-forge gamma-facet

Pip

macOS and Linux:

    python -m venv facet
    source facet/bin/activate
    pip install gamma-facet

Windows:

    python -m venv facet
    facet\Scripts\activate.bat
    pip install gamma-facet
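After installing, it can be useful to confirm that the package is importable from the active environment. The following is a minimal sketch using only the Python standard library; the helper name `is_installed` is hypothetical, and the module name `facet` is taken from the import statements used in the quickstart.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found in the current environment."""
    return importlib.util.find_spec(module_name) is not None


# 'facet' is the import name used throughout the quickstart
print("facet importable:", is_installed("facet"))
```

If this prints `False`, the most likely cause is that the dedicated environment has not been activated in the current shell.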

Quickstart

The following quickstart guide provides a minimal example workflow to get you up and running with FACET. For additional tutorials and the API reference, see the FACET documentation.

Changes and additions to new versions are summarized in the release notes.

Enhanced Machine Learning Workflow

To demonstrate the model inspection capability of FACET, we first create a pipeline to fit a learner. In this simple example we will use the diabetes dataset which contains age, sex, BMI and blood pressure along with 6 blood serum measurements as features. This dataset was used in this publication. A transformed version of this dataset is also available on scikit-learn here.

In this quickstart we will train a Random Forest regressor using 5-fold CV repeated 10 times to predict disease progression after one year. With the use of sklearndf we can create a pandas DataFrame compatible workflow. In addition, FACET provides enhancements to keep track of our feature matrix and target vector using a sample object (Sample), and to easily compare hyperparameter configurations and even multiple learners with the LearnerSelector.

    # standard imports
    import pandas as pd
    from sklearn.model_selection import RepeatedKFold, GridSearchCV

    # some helpful imports from sklearndf
    from sklearndf.pipeline import RegressorPipelineDF
    from sklearndf.regression import RandomForestRegressorDF

    # relevant FACET imports
    from facet.data import Sample
    from facet.selection import LearnerSelector, ParameterSpace

    # declaring url with data
    data_url = 'https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data'

    # importing data from url
    diabetes_df = pd.read_csv(data_url, delimiter='\t').rename(
        # renaming columns for better readability
        columns={
            'S1': 'TC',   # total serum cholesterol
            'S2': 'LDL',  # low-density lipoproteins
            'S3': 'HDL',  # high-density lipoproteins
            'S4': 'TCH',  # total cholesterol / HDL
            'S5': 'LTG',  # possibly log of serum triglycerides level
            'S6': 'GLU',  # blood sugar level
            'Y': 'Disease_progression'  # measure of progress since 1yr of baseline
        }
    )

    # create FACET sample object
    diabetes_sample = Sample(
        observations=diabetes_df,
        target_name="Disease_progression"
    )

    # create a (trivial) pipeline for a random forest regressor
    rnd_forest_reg = RegressorPipelineDF(
        regressor=RandomForestRegressorDF(n_estimators=200, random_state=42)
    )

    # define parameter space for models which are "competing" against each other
    rnd_forest_ps = ParameterSpace(rnd_forest_reg)
    rnd_forest_ps.regressor.min_samples_leaf = [8, 11, 15]
    rnd_forest_ps.regressor.max_depth = [4, 5, 6]

    # create repeated k-fold CV iterator
    rkf_cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

    # rank your candidate models by performance
    selector = LearnerSelector(
        searcher_type=GridSearchCV,
        parameter_space=rnd_forest_ps,
        cv=rkf_cv,
        n_jobs=-3,
        scoring="r2"
    ).fit(diabetes_sample)

    # get summary report
    selector.summary_report()

sphinx/source/_images/ranker_summary.png

We can see based on this minimal workflow that a value of 11 for minimum samples in the leaf and 5 for maximum tree depth was the best performing of the nine considered combinations (three values for each hyperparameter). This approach easily extends to additional hyperparameters for the learner, and to multiple learners.
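To get a feel for the cost of such a search, the size of the grid and the number of cross-validated model fits can be counted up front. The sketch below uses only the standard library and mirrors the values from the quickstart (three values per hyperparameter, 5 splits, 10 repeats); the variable names are illustrative, and any final refit of the best candidate by the searcher is not counted.

```python
from itertools import product

# hyperparameter values as declared on the parameter space in the quickstart
param_grid = {
    "min_samples_leaf": [8, 11, 15],
    "max_depth": [4, 5, 6],
}

# repeated k-fold settings from the quickstart
n_splits, n_repeats = 5, 10

# every combination of hyperparameter values is one candidate model
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# each candidate is fitted once per split and repeat
n_fits = len(candidates) * n_splits * n_repeats

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

Adding a third hyperparameter with three values would triple both numbers, which is worth keeping in mind before extending the parameter space.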

Model Inspection

FACET implements several model inspection methods for scikit-learn estimators. FACET enhances model inspection by providing global metrics that complement the local perspective of SHAP (see [arXiv:2107.12436] for a formal description).

The key global metrics for each pair of features in a model are:

  • Synergy

    The degree to which the model combines information from one feature with another to predict the target. For example, let's assume we are predicting cardiovascular health using age and gender and the fitted model includes a complex interaction between them. This means these two features are synergistic for predicting cardiovascular health. Further, both features are important to the model and removing either one would significantly impact performance. Let's assume age brings more information to the joint contribution than gender. This asymmetric contribution means the synergy for (age, gender) is less than the synergy for (gender, age). To think about it another way, imagine the prediction is a coordinate you are trying to reach. From your starting point, age gets you much closer to this point than gender, however, you need both to get there. Synergy reflects the fact that gender gets more help from age (higher synergy from the perspective of gender) than age does from gender (lower synergy from the perspective of age) to reach the prediction. This leads to an important point: synergy is a naturally asymmetric property of the global information two interacting features contribute to the model predictions. Synergy is expressed as a percentage ranging from 0% (full autonomy) to 100% (full synergy).

  • Redundancy

    The degree to which a feature in a model duplicates the information of a second feature to predict the target. For example, let's assume we had house size and number of bedrooms for predicting house price. These features capture similar information as the more bedrooms the larger the house and likely a higher price on average. The redundancy for (number of bedrooms, house size) will be greater than the redundancy for (house size, number of bedrooms). This is because house size "knows" more of what number of bedrooms does for predicting house price than vice-versa. Hence, there is greater redundancy from the perspective of number of bedrooms. Another way to think about it is removing house size will be more detrimental to model performance than removing number of bedrooms, as house size can better compensate for the absence of number of bedrooms. This also implies that house size would be a more important feature than number of bedrooms in the model. The important point here is that like synergy, redundancy is a naturally asymmetric property of the global information feature pairs have for predicting an outcome. Redundancy is expressed as a percentage ranging from 0% (full uniqueness) to 100% (full redundancy).

    # fit the model inspector
    from facet.inspection import LearnerInspector
    inspector = LearnerInspector(
        pipeline=selector.best_estimator_,
        n_jobs=-3
    ).fit(diabetes_sample)

Synergy

    # visualise synergy as a matrix
    from pytools.viz.matrix import MatrixDrawer
    synergy_matrix = inspector.feature_synergy_matrix()
    MatrixDrawer(style="matplot%").draw(synergy_matrix, title="Synergy Matrix")

sphinx/source/_images/synergy_matrix.png

For any feature pair (A, B), the first feature (A) is the row, and the second feature (B) the column. For example, looking across the row for LTG (possibly the log of serum triglycerides level) there is hardly any synergy with other features in the model (≤ 1%). However, looking down the column for LTG (i.e., from the perspective of other features relative to LTG) we find that many features (the rows) are aided by synergy with LTG (up to 27% in the case of LDL). We conclude that:

  • LTG is a strongly autonomous feature, displaying minimal synergy with other features for predicting disease progression after one year.
  • The contribution of other features to predicting disease progression after one year is partly enabled by the presence of LTG.

High synergy between pairs of features must be considered carefully when investigating impact, as the values of both features jointly determine the outcome. It would not make much sense to consider LDL without the context provided by LTG given close to 27% synergy of LDL with LTG for predicting progression after one year.
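The row/column convention is easy to mix up, so here is a small illustration of how an asymmetric pair is read. This is only a sketch with a plain nested dictionary, not the object returned by the inspector; the two values are rough stand-ins for the figures quoted above (about 27% for LDL with LTG, at most 1% for LTG with LDL), and the helper names are hypothetical.

```python
# synergy[row][column]: synergy from the perspective of the row feature
# (illustrative values approximating those discussed in the text)
synergy = {
    "LDL": {"LTG": 0.27},
    "LTG": {"LDL": 0.01},
}


def synergy_of(feature: str, with_feature: str) -> float:
    """Look up the row for `feature` and the column for `with_feature`."""
    return synergy[feature][with_feature]


def is_asymmetric(a: str, b: str, tolerance: float = 0.05) -> bool:
    """Flag pairs whose two perspectives differ by more than `tolerance`."""
    return abs(synergy_of(a, b) - synergy_of(b, a)) > tolerance


print("LDL aided by LTG:", synergy_of("LDL", "LTG"))
print("LTG aided by LDL:", synergy_of("LTG", "LDL"))
print("asymmetric pair:", is_asymmetric("LDL", "LTG"))
```

The same lookup convention applies to the redundancy matrix: the row is the feature whose perspective is taken.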

Redundancy

    # visualise redundancy as a matrix
    redundancy_matrix = inspector.feature_redundancy_matrix()
    MatrixDrawer(style="matplot%").draw(redundancy_matrix, title="Redundancy Matrix")

sphinx/source/_images/redundancy_matrix.png

For any feature pair (A, B), the first feature (A) is the row, and the second feature (B) the column. For example, if we look at the feature pair (LDL, TC) from the perspective of LDL (Low-Density Lipoproteins), then we look up the row for LDL and the column for TC and find 38% redundancy. This means that 38% of the information in LDL to predict disease progression is duplicated in TC. This redundancy is the same when looking "from the perspective" of TC for (TC, LDL), but need not be symmetrical in all cases (see LTG vs. TCH).

If we look at TCH, it has between 22% and 32% redundancy with each of LTG and HDL, but the same does not hold between LTG and HDL, meaning TCH shares different information with each of the two features.

Clustering redundancy

As detailed above, redundancy and synergy for a feature pair are measured from the "perspective" of one of the features in the pair, and so yield two distinct values. However, a symmetric version can also be computed that not only provides a simplified perspective but also allows the use of (1 - metric) as a feature distance. With this distance, hierarchical single-linkage clustering is applied to create a dendrogram visualization. This helps to identify groups of low-distance features which activate "in tandem" to predict the outcome. Such information can then be used to either reduce clusters of highly redundant features to a subset or highlight clusters of highly synergistic features that should always be considered together.

Let's look at the example for redundancy.

    # visualise redundancy using a dendrogram
    from pytools.viz.dendrogram import DendrogramDrawer
    redundancy = inspector.feature_redundancy_linkage()
    DendrogramDrawer().draw(data=redundancy, title="Redundancy Dendrogram")

sphinx/source/_images/redundancy_dendrogram.png

Based on the dendrogram we can see that the feature pairs (LDL, TC) and (HDL, TCH) each represent a cluster in the dendrogram and that LTG and BMI have the highest importance. As potential next actions we could explore the impact of removing TCH, and one of TC or LDL to further simplify the model and obtain a reduced set of independent features.
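To illustrate how such clusters arise from a (1 - metric) distance, the following is a minimal pure-Python sketch of single-linkage clustering. It is not FACET's implementation: the symmetric redundancy values are illustrative assumptions (only the 38% for LDL and TC is quoted in this guide; the 30% for HDL and TCH and the 5% default are made up for the example), and the function names are hypothetical.

```python
from itertools import combinations

features = ["LDL", "TC", "HDL", "TCH"]

# illustrative symmetric redundancy per unordered feature pair
redundancy = {
    frozenset({"LDL", "TC"}): 0.38,   # value quoted in the text
    frozenset({"HDL", "TCH"}): 0.30,  # assumed for illustration
}
DEFAULT_REDUNDANCY = 0.05  # assumed for all remaining pairs


def distance(a: str, b: str) -> float:
    """Feature distance defined as 1 - symmetric redundancy."""
    return 1.0 - redundancy.get(frozenset({a, b}), DEFAULT_REDUNDANCY)


def single_linkage(items):
    """Repeatedly merge the two clusters with the smallest minimum pairwise distance."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        # single linkage: distance between clusters is the closest pair of members
        d, i, j = min(
            (min(distance(a, b) for a in clusters[i] for b in clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        merges.append((clusters[i], clusters[j], round(d, 2)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = single_linkage(features)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
```

Pairs with high redundancy are merged first, at a low distance, which is what places them in the same low branch of a dendrogram.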

Please see the API reference for more detail.

Model Simulation

Taking the BMI feature as an example of an important and highly independent feature, we do the following for the simulation:

  • We use FACET's ContinuousRangePartitioner to split the range of observed values of BMI into intervals of equal size. Each partition is represented by the central value of that partition.
  • For each partition, the simulator creates an artificial copy of the original sample assuming the variable to be simulated has the same value across all observations, which is the value representing the partition. Using the best estimator acquired from the selector, the simulator now re-predicts all targets using the model trained on the full sample and determines the uplift of the target variable resulting from this.
  • The FACET SimulationDrawer allows us to visualise the result, both in a matplotlib and a plain-text style.

    # FACET imports
    from facet.validation import BootstrapCV
    from facet.simulation import UnivariateUpliftSimulator
    from facet.data.partition import ContinuousRangePartitioner
    from facet.simulation.viz import SimulationDrawer

    # create bootstrap CV iterator
    bscv = BootstrapCV(n_splits=1000, random_state=42)

    SIM_FEAT = "BMI"
    simulator = UnivariateUpliftSimulator(
        model=selector.best_estimator_,
        sample=diabetes_sample,
        n_jobs=-3
    )

    # split the simulation range into equal sized partitions
    partitioner = ContinuousRangePartitioner()

    # run the simulation
    simulation = simulator.simulate_feature(
        feature_name=SIM_FEAT,
        partitioner=partitioner
    )

    # visualise results
    SimulationDrawer().draw(data=simulation, title=SIM_FEAT)

sphinx/source/_images/simulation_output.png

We would conclude from the figure that higher values of BMI are associated with an increase in disease progression after one year, and that for a BMI of 28 and above, there is a significant increase in disease progression after one year of at least 26 points.
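The simulation steps described in this section can be sketched without FACET to make the mechanics concrete. The example below is an assumption-laden toy: the observations, the linear `toy_model`, and the helper names are all hypothetical; partitions are plain equal-width intervals; and uplift is taken as the change in mean prediction relative to the unmodified sample, which may differ from how FACET defines its baseline.

```python
from statistics import mean

# hypothetical toy sample; a real sample would hold the diabetes observations
observations = [
    {"BMI": 20.0, "AGE": 40.0},
    {"BMI": 25.0, "AGE": 50.0},
    {"BMI": 30.0, "AGE": 60.0},
    {"BMI": 35.0, "AGE": 70.0},
]


def toy_model(row):
    """Stand-in for a fitted estimator: predicts a target from one observation."""
    return 4.0 * row["BMI"] + 0.5 * row["AGE"]


def equal_width_centres(values, n_partitions):
    """Split the observed range into equal-width partitions; return their central values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_partitions
    return [lo + width * (k + 0.5) for k in range(n_partitions)]


def simulate_feature(rows, feature, centres, predict):
    """For each partition, fix the feature to its central value and re-predict."""
    baseline = mean(predict(row) for row in rows)
    uplift = {}
    for centre in centres:
        # artificial copy of the sample with the feature held constant
        simulated = [{**row, feature: centre} for row in rows]
        uplift[centre] = mean(predict(row) for row in simulated) - baseline
    return uplift


centres = equal_width_centres([row["BMI"] for row in observations], n_partitions=3)
uplift = simulate_feature(observations, "BMI", centres, toy_model)
for centre, change in uplift.items():
    print(f"BMI = {centre}: mean prediction changes by {change:+.1f}")
```

Because only the simulated feature is overwritten, every other column keeps its observed values, so the result reflects what the model predicts when that one feature is held constant across the sample.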

Contributing

FACET is stable and is being supported long-term.

Contributions to FACET are welcome and appreciated. For any bug reports or feature requests/enhancements please use the appropriate GitHub form, and if you wish to do so, please open a PR addressing the issue.

We do ask that you discuss any major changes with us first, via an issue or using our team email: FacetTeam@bcg.com.

For further information on contributing please see our contribution guide.

License

FACET is licensed under Apache 2.0 as described in the LICENSE file.

Acknowledgements

FACET is built on top of two popular packages for Machine Learning:

  • The scikit-learn learners and pipelining make up the implementation of the underlying algorithms. Moreover, we tried to design the FACET API to align with the scikit-learn API.
  • The SHAP implementation is used to estimate the Shapley vectors which FACET then decomposes into synergy, redundancy, and independence vectors.

BCG GAMMA

If you would like to know more about the team behind FACET please see the about us page.

We are always on the lookout for passionate and talented data scientists to join the BCG GAMMA team. If you would like to know more you can find out about BCG GAMMA, or have a look at career opportunities.

