Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Should the famhistory model in the tutorial not use probabilities in stead of binary coding#12

AnsweredbyVickersA
PaulHiemstra asked this question inQ&A
Discussion options

I was following the tutorial on DCA (Python), and tried to independently reproduce the results in order to understand what was happening under the hood. I used the definition of net benefit I found in the simple step-by-step paper of Vickers (2019):

image

So I set out to calculate the required stats, but could not get the right outcomes for thefamhistory logistical regression model:

image

The fix was to not get the probabilities from the logistical regression model (mod1_results.predict()), but to compare the raw binary famhistory data to the threshold:

image

And input these predictions into the sensitivity and specificity calculations. The Python package even has a specific input argument to force the probabilities to be calculated (models_to_prob), but this is explicitly not used in the tutorial.

I have the idea I am missing something obvious here. Could you help me out?

You must be logged in to vote

This is really a philosophical issue, and one that might be obscured a bit by the rather artificial nature of the didactic example. A binary test can be viewed in two ways: 1) treat if test is positive, don't treat if negative; 2) if test is positive, cite positive predictive value, if negative, cite 1 - negative predictive value and then treat or not depending on threshold probability. There is no right or wrong answer here, but the first approach is taken in most of the decision curve analysis methodologic literature and is hard wired into the code: probability is assumed to be 1 if test is positive and 0 if negative. But as you nicely demonstrated, it is easy to use approach (2) if tha…

Replies: 3 comments

Comment options

This is really a philosophical issue, and one that might be obscured a bit by the rather artificial nature of the didactic example. A binary test can be viewed in two ways: 1) treat if test is positive, don't treat if negative; 2) if test is positive, cite positive predictive value, if negative, cite 1 - negative predictive value and then treat or not depending on threshold probability. There is no right or wrong answer here, but the first approach is taken in most of the decision curve analysis methodologic literature and is hard wired into the code: probability is assumed to be 1 if test is positive and 0 if negative. But as you nicely demonstrated, it is easy to use approach (2) if that is what you prefer.

You must be logged in to vote
0 replies
Answer selected byPaulHiemstra
Comment options

Thanks for responding! So I think what you are saying that you try and emulate how the end-user would interpret the model. That makes things clear for me.

For reference for other people, this code reproduces the net_benefit curves for both the raw and probability options. The code is not meant to be flexible as the code in the package, so I think it is easier to see what is going on exactly (and I hope I did everything correctly :)):

from sklearn.metrics import recall_scoreimport pandas as pdimport numpy as npimport statsmodels.api as smdef net_benefit(sensitivity, specificity, prevalence, threshold):    return (sensitivity * prevalence) - ((1 - specificity) * (1 - prevalence)) * (threshold / (1 - threshold))df_cancer_dx = pd.read_csv('https://raw.githubusercontent.com/ddsjoberg/dca-tutorial/main/data/df_cancer_dx.csv')# Simple model with only family historymod1 = sm.GLM.from_formula('cancer ~ famhistory', data=df_cancer_dx, family=sm.families.Binomial())mod1_results = mod1.fit()# More complex model with family history, age, and markermod2 = sm.GLM.from_formula('cancer ~ marker + age + famhistory', data=df_cancer_dx, family=sm.families.Binomial())mod2_results = mod2.fit()e = 1e-6   # needed because 0 or 1 does not lead to correct TP/FP valuesthold = np.linspace(0 + e, 0.35, 100)true = df_cancer_dx['cancer']prevalence_cancer = df_cancer_dx['cancer'].sum() / len(df_cancer_dx['cancer'])  # helper function to get sensitivity and specificity for a given predictiondef get_sensitivity_specificity(pred):    sensitivity = recall_score(true, pred)    specificity = recall_score(np.logical_not(true), np.logical_not(pred))    return sensitivity, specificity# Helper function to calculate net benefit for a given input vector of prediction probabilitiesdef calc_net_benefit(input_prob):    sensitivity, specificity = zip(*[get_sensitivity_specificity((input_prob >= prob).astype(int)) for prob in thold])    return net_benefit(np.array(sensitivity), np.array(specificity), prevalence_cancer, thold)# Calculate net benefit for all optionsresult = pd.DataFrame(index=thold)result['raw']      = calc_net_benefit(df_cancer_dx['famhistory'])result['proba']    = calc_net_benefit(mod1_results.predict())result['complex']  = calc_net_benefit(mod2_results.predict())result['everyone'] = calc_net_benefit(np.ones(len(df_cancer_dx['cancer'])))result['no one']   = calc_net_benefit(np.zeros(len(df_cancer_dx['cancer'])))result.plot()

image

Nice to see:

  • The curves match the output of the package
  • The proba curve of the famhistory model follows the max of the none, all and famhistory on binary probability.

edit: I expanded the code example to include all the other benefit curves mentioned in the tutorial.

You must be logged in to vote
0 replies
Comment options

That is great. An alternative approach would be the following (written in Stata)
*Create a new variable for family history giving risks for positive and negative rather than zero and one
generate famhistorypv=.
*Get PPV: this is the mean of cancer if family history is positive (i.e. =1)
summarize cancer if famhistory==1
*Make the new variable the PPV if family history is positive
replace famhistorypv=r(mean) if famhistory==1
*Get 1-NPPV: this is the mean of cancer if family history is negative (i.e. =0)
summarize cancer if famhistory==0
*Make the new variable the PPV if family history is positive
replace famhistorypv=r(mean) if famhistory==0

You must be logged in to vote
0 replies
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Category
Q&A
Labels
None yet
2 participants
@PaulHiemstra@VickersA

[8]ページ先頭

©2009-2025 Movatter.jp