Hardcore. Stochastic wandering.
My own investigation
When I started all this commotion, my intuition was as simple as:
๐ tie in score โ stochastic wandering on rectangular grid โ 0s and 1s for up/right moves โ binomial counting โ we can calculate everything
Indeed, if we want to make p steps up and n steps right, we have C(p+n, p) routes - school program. To get distribution over AUC you have to write a function which calculates it from "001101" sequence. Suddenly Young (or Ferrers) diagrams appeared. Bars under the curve increase their height, therefore each zero-one sequence encodes a correct Young diagram. Height of column is number of ones to the left of our current position, we are to sum it up when hit zero. So, AUC numerator is sum of #(all ones to the left of each zero) (column based count) or sum of #(all zeroes to the right of each one) (row based count).
If you meditate long enough, you can understand that AUC numerator is exactly number of correctly ordered pairs (1 on the left side of 0). AUC itself is this number divided by p*n.
Here my mathematical skills left me alone and I dove into the topic with ChatGPT.
Vibe science
ChatGPT gave me closed-form expression for AUC numerator variance so quickly, that I decided that this result is trivial. Big mistake. I found myself in the middle of complex and beautiful mathematics. Vast collection of names and facts started to bury me. So I picked up the most bright and interesting one. Let the show start!
๐ There is a mathematical object which addresses exactly the question we stated: how many paths there are with n negative samples, p positive samples and AUC numerator = q.
๐ Its name - Gaussian (q-binomial) coefficient
๐ Math expression is on the picture
๐ Gaussian coefficient is a polynomial and numbers of paths are coefficients of this polynomial
๐ It's a q-analog of binomial coefficient: when the polynomial variable is set to 1, gaussian coefficient becomes ordinary binomial coefficient
๐ Distribution is unimodal, with maximum around pn/2
๐ Center of distribution: q = pn/2
๐ Variance of q: Var(q) = pn(p+n+1)/12
๐ Variance of AUC: Var(AUC) = (p+n+1)/(12pn)
๐ Mode of distribution corresponds to the central coefficient(s) of the Gaussian coefficient. Sometimes it is one coefficient, sometimes there is a plateau.
Practical implications
๐น There are some packages which can calculate confidence intervals for ROC AUC, they are not popular
๐น The approach can be useful for feature selection and experiments with low-cardinality-output models
That's almost it. I think it would be useful to collect all references I found in one more post.
