I've been fitting an SVC to experimental data and came across an issue where RocCurveDisplay was sometimes plotting the curves upside down. I know this can happen if the positive label gets mixed up between fitting and plotting, but here it is happening randomly. It only happens when I pass probability=True to the model, which the docs say requires randomness to compute the probabilities.

I did a bit more investigation to get to a much smaller code example. I found that passing in a RandomState gives repeatable behaviour: with some random states the ROC curve is plotted correctly, and with others it is plotted wrongly. From this I found that predict_proba seems to do one of two things depending on the random state, and this leads to one of the two ROC curves.

I couldn't get any further because the behaviour seems very data dependent: just deleting a few data rows made the problem go away, although I was able to manually truncate my data to 2 decimal places without anything changing. Sorting the data by increasing value also seems to be necessary.

Here is an illustration: the first column is a 'correct' ROC curve and the corresponding probabilities, and the second is a 'wrong' one. Note that the y-scale of the probabilities is very different.
Any comments welcome, especially if I am doing something daft here. Thanks.

Here is the code, from a Jupyter notebook:

    from matplotlib import pyplot as plt
    from sklearn.metrics import RocCurveDisplay
    from sklearn.svm import SVC
    import numpy as np
    from numpy.random import RandomState
    import pandas as pd

    rndst1 = RandomState(1)
    rndst2 = RandomState(2)

    def load_data():
        negs = [
            0.14, 0.59, 1.13, 2.60, 2.92, 2.98, 3.99, 4.08, 4.43, 7.73, 10.98,
        ]
        poss = [
            1.84, 2.15, 2.73, 3.46, 3.59, 3.63, 3.67, 3.75, 4.49, 5.22, 5.33,
            5.35, 5.51, 5.69, 5.72, 5.90, 5.98, 6.29, 7.96, 7.98, 8.21, 8.62,
            9.27, 10.88, 11.84, 13.11, 19.12, 20.09, 21.99, 25.00, 35.00,
        ]
        return pd.DataFrame(
            {
                "Class": np.concat([["NEG"] * len(negs), ["POS"] * len(poss)]),
                "ScoreA": np.concat([negs, poss]),
            }
        )

    model1 = SVC(probability=True, kernel="rbf", random_state=rndst1)
    model2 = SVC(probability=True, kernel="rbf", random_state=rndst2)

    data = load_data()
    data.sort_values(by="ScoreA", inplace=True)
    Xs = data[["ScoreA"]]
    ys = data["Class"]

    model1.fit(Xs, ys)
    model2.fit(Xs, ys)

    probs1 = model1.predict_proba(Xs)[:, 1]
    probs2 = model2.predict_proba(Xs)[:, 1]
    probs = pd.DataFrame({"probs1": probs1, "probs2": probs2})
    display(probs.describe().style.format(precision=2))

    fig, axs = plt.subplots(2, 2, figsize=(10, 10))
    fig.tight_layout()
    RocCurveDisplay.from_predictions(y_pred=probs1, y_true=ys, pos_label="POS", ax=axs[0, 0])
    RocCurveDisplay.from_predictions(y_pred=probs2, y_true=ys, pos_label="POS", ax=axs[0, 1])

    xv = np.arange(len(probs1))
    axs[1, 0].plot(xv, probs1)
    axs[1, 1].plot(xv, probs2)
I am using scikit-learn 1.5.2, numpy 2.1.3 and Python 3.12.10.
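
To look at more than two random states without plotting each one, the comparison can be reduced to a single number per seed. The following is only a minimal sketch, not the notebook code above: it assumes plain integer seeds instead of RandomState objects, uses numpy arrays instead of a DataFrame, and the helper name auc_by_seed is made up for illustration. For each seed it computes the ROC AUC from the POS column of predict_proba and, for reference, from decision_function on the same fitted model. A seed whose two values disagree would correspond to an "upside down" curve.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

negs = [0.14, 0.59, 1.13, 2.60, 2.92, 2.98, 3.99, 4.08, 4.43, 7.73, 10.98]
poss = [
    1.84, 2.15, 2.73, 3.46, 3.59, 3.63, 3.67, 3.75, 4.49, 5.22, 5.33,
    5.35, 5.51, 5.69, 5.72, 5.90, 5.98, 6.29, 7.96, 7.98, 8.21, 8.62,
    9.27, 10.88, 11.84, 13.11, 19.12, 20.09, 21.99, 25.00, 35.00,
]

# Same data as in the question, sorted by increasing score.
scores = np.array(negs + poss)
labels = np.array(["NEG"] * len(negs) + ["POS"] * len(poss))
order = np.argsort(scores, kind="stable")
X = scores[order].reshape(-1, 1)
y = labels[order]
is_pos = y == "POS"


def auc_by_seed(seeds):
    """Hypothetical helper: return (seed, AUC from predict_proba, AUC from decision_function)."""
    rows = []
    for seed in seeds:
        model = SVC(probability=True, kernel="rbf", random_state=seed).fit(X, y)
        # Look up the POS column explicitly rather than assuming it is column 1.
        pos_col = list(model.classes_).index("POS")
        auc_proba = roc_auc_score(is_pos, model.predict_proba(X)[:, pos_col])
        auc_dec = roc_auc_score(is_pos, model.decision_function(X))
        rows.append((seed, auc_proba, auc_dec))
    return rows


results = auc_by_seed(range(10))
for seed, auc_proba, auc_dec in results:
    print(f"seed={seed}  auc(predict_proba)={auc_proba:.3f}  auc(decision_function)={auc_dec:.3f}")
```

The range of seeds is arbitrary; which seeds (if any) show a mismatch may depend on the scikit-learn version, so no expected output is given here.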