MSKCC-Epi-Bio/decisioncurveanalysisPublic

NotificationsYou must be signed in to change notification settings
Fork1
Star19

Should the famhistory model in the tutorial not use probabilities in stead of binary coding#12

AnsweredbyVickersA

PaulHiemstra asked this question inQ&A

PaulHiemstra

Nov 29, 2023

· 3 comments

AnsweredbyVickersAReturn to top

Discussion options

PaulHiemstra
Nov 29, 2023

I was following the tutorial on DCA (Python), and tried to independently reproduce the results in order to understand what was happening under the hood. I used the definition of net benefit I found in the simple step-by-step paper of Vickers (2019):

So I set out to calculate the required stats, but could not get the right outcomes for thefamhistory logistical regression model:

The fix was to not get the probabilities from the logistical regression model (mod1_results.predict()), but to compare the raw binary famhistory data to the threshold:

And input these predictions into the sensitivity and specificity calculations. The Python package even has a specific input argument to force the probabilities to be calculated (models_to_prob), but this is explicitly not used in the tutorial.

I have the idea I am missing something obvious here. Could you help me out?

You must be logged in to vote

Answered by VickersA

Nov 30, 2023

This is really a philosophical issue, and one that might be obscured a bit by the rather artificial nature of the didactic example. A binary test can be viewed in two ways: 1) treat if test is positive, don't treat if negative; 2) if test is positive, cite positive predictive value, if negative, cite 1 - negative predictive value and then treat or not depending on threshold probability. There is no right or wrong answer here, but the first approach is taken in most of the decision curve analysis methodologic literature and is hard wired into the code: probability is assumed to be 1 if test is positive and 0 if negative. But as you nicely demonstrated, it is easy to use approach (2) if tha…

View full answer

Replies: 3 comments

Comment options

VickersA
Nov 30, 2023
Maintainer

You must be logged in to vote

0 replies

Answer selected byPaulHiemstra

Comment options

PaulHiemstra
Nov 30, 2023
Author

Thanks for responding! So I think what you are saying that you try and emulate how the end-user would interpret the model. That makes things clear for me.

For reference for other people, this code reproduces the net_benefit curves for both the raw and probability options. The code is not meant to be flexible as the code in the package, so I think it is easier to see what is going on exactly (and I hope I did everything correctly :)):

from sklearn.metrics import recall_scoreimport pandas as pdimport numpy as npimport statsmodels.api as smdef net_benefit(sensitivity, specificity, prevalence, threshold):    return (sensitivity * prevalence) - ((1 - specificity) * (1 - prevalence)) * (threshold / (1 - threshold))df_cancer_dx = pd.read_csv('https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv')# Simple model with only family historymod1 = sm.GLM.from_formula('cancer ~ famhistory', data=df_cancer_dx, family=sm.families.Binomial())mod1_results = mod1.fit()# More complex model with family history, age, and markermod2 = sm.GLM.from_formula('cancer ~ marker + age + famhistory', data=df_cancer_dx, family=sm.families.Binomial())mod2_results = mod2.fit()e = 1e-6   # needed because 0 or 1 does not lead to correct TP/FP valuesthold = np.linspace(0 + e, 0.35, 100)true = df_cancer_dx['cancer']prevalence_cancer = df_cancer_dx['cancer'].sum() / len(df_cancer_dx['cancer'])  # helper function to get sensitivity and specificity for a given predictiondef get_sensitivity_specificity(pred):    sensitivity = recall_score(true, pred)    specificity = recall_score(np.logical_not(true), np.logical_not(pred))    return sensitivity, specificity# Helper function to calculate net benefit for a given input vector of prediction probabilitiesdef calc_net_benefit(input_prob):    sensitivity, specificity = zip(*[get_sensitivity_specificity((input_prob >= prob).astype(int)) for prob in thold])    return net_benefit(np.array(sensitivity), np.array(specificity), prevalence_cancer, thold)# Calculate net benefit for all optionsresult = pd.DataFrame(index=thold)result['raw']      = calc_net_benefit(df_cancer_dx['famhistory'])result['proba']    = calc_net_benefit(mod1_results.predict())result['complex']  = calc_net_benefit(mod2_results.predict())result['everyone'] = calc_net_benefit(np.ones(len(df_cancer_dx['cancer'])))result['no one']   = calc_net_benefit(np.zeros(len(df_cancer_dx['cancer'])))result.plot()

Nice to see:

The curves match the output of the package
The proba curve of the famhistory model follows the max of the none, all and famhistory on binary probability.

edit: I expanded the code example to include all the other benefit curves mentioned in the tutorial.

You must be logged in to vote

0 replies

Comment options

VickersA
Nov 30, 2023
Maintainer

That is great. An alternative approach would be the following (written in Stata)
*Create a new variable for family history giving risks for positive and negative rather than zero and one
generate famhistorypv=.
*Get PPV: this is the mean of cancer if family history is positive (i.e. =1)
summarize cancer if famhistory==1
*Make the new variable the PPV if family history is positive
replace famhistorypv=r(mean) if famhistory==1
*Get 1-NPPV: this is the mean of cancer if family history is negative (i.e. =0)
summarize cancer if famhistory==0
*Make the new variable the PPV if family history is positive
replace famhistorypv=r(mean) if famhistory==0

You must be logged in to vote

0 replies

Movatterモバイル変換

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Should the famhistory model in the tutorial not use probabilities in stead of binary coding#12

Uh oh!

{{title}}

Uh oh!

PaulHiemstra
Nov 29, 2023

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

VickersA
Nov 30, 2023
Maintainer

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

PaulHiemstra
Nov 30, 2023
Author

Uh oh!

{{title}}

Uh oh!

VickersA
Nov 30, 2023
Maintainer

Select a reply

Uh oh!

Movatterモバイル変換

Should the famhistory model in the tutorial not use probabilities in stead of binary coding#12

Uh oh!

PaulHiemstraNov 29, 2023

Replies: 3 comments

Uh oh!

VickersANov 30, 2023 Maintainer

Uh oh!

Uh oh!

PaulHiemstraNov 30, 2023 Author

Uh oh!

VickersANov 30, 2023 Maintainer

Uh oh!

PaulHiemstra
Nov 29, 2023

VickersA
Nov 30, 2023
Maintainer

PaulHiemstra
Nov 30, 2023
Author

VickersA
Nov 30, 2023
Maintainer