Intro
In previous posts I explained why using gradient-boosted decision trees “as is” is problematic for credit scoring and gave a glimpse of the factors used in this area. Today let’s discuss the approach Equifax used to create datasets—and how to apply extrapolating gradient-boosted decision trees to this problem.

Dataset
About the target. We check whether an applicant paid their loan back within three months: paid back → “0”, didn’t → “1”. That creates an extra headache: by the time the target is known, the dataset is already three months out of date.

Equifax had several million applications in an Oracle database. As I recall, there were hundreds of thousands of unique applicants. Our task was to run a massive GROUP BY over the database and compute per-applicant counts such as the number of unique addresses, phone numbers, and so on. That wasn’t the last step: from each “raw” factor we also derived a set of binary factors, the so-called “flags,” by fixing thresholds on it.
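The two-step feature construction can be sketched in pandas (the real pipeline was a PL/SQL script over Oracle; the column names and flag thresholds below are hypothetical stand-ins):

```python
import pandas as pd

# Toy applications table; columns are made-up stand-ins
# for the real Oracle schema.
apps = pd.DataFrame({
    "applicant_id": [1, 1, 1, 2, 2],
    "address":      ["a", "a", "b", "c", "c"],
    "phone":        ["p1", "p2", "p2", "p3", "p3"],
})

# "Raw" factors: per-applicant counts of unique values.
raw = apps.groupby("applicant_id").agg(
    n_addresses=("address", "nunique"),
    n_phones=("phone", "nunique"),
).reset_index()

# Derived binary "flags": raw factors frozen at fixed thresholds
# (the thresholds here are invented for the example).
raw["flag_many_addresses"] = (raw["n_addresses"] >= 2).astype(int)
raw["flag_many_phones"] = (raw["n_phones"] >= 2).astype(int)

print(raw)
```

Applicant 1 above has two distinct addresses and two distinct phones, so both flags fire; applicant 2 has one of each, so neither does.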

Map-Reduce over Oracle
It turned out that the speed of the aggregation depended non-linearly on the chunk size, so I wrote the longest PL/SQL script of my life: about 400 lines. It pulled chunks of roughly 100,000 records, processed them, and stored the intermediate results; once the whole table had been processed, the chunks were joined together. MapReduce over Oracle, so to speak. The computation took about five days. Coming from Yandex’s MapReduce platform, I was in quiet disbelief: a GROUP BY over a couple of million records should take five minutes, not five days. I still think we lack tools for meso-scale data, something between a pandas DataFrame and full-blown MapReduce, but that’s a topic for another post.
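The chunked pattern can be sketched in Python (the original was PL/SQL; this toy is my reconstruction, not the production code). The subtle part, and the reason the chunks had to be joined at the end, is that distinct counts are not additive across chunks: each chunk must emit partial *sets*, which are merged before counting.

```python
from collections import defaultdict

def map_chunk(chunk):
    """Map step: collect per-applicant sets of values from one chunk."""
    partial = defaultdict(set)
    for applicant_id, phone in chunk:
        partial[applicant_id].add(phone)
    return partial

def reduce_partials(partials):
    """Reduce step: union the per-chunk sets, then count distinct values.
    Counting inside each chunk and summing would double-count values
    that appear in several chunks."""
    merged = defaultdict(set)
    for partial in partials:
        for applicant_id, phones in partial.items():
            merged[applicant_id] |= phones
    return {a: len(p) for a, p in merged.items()}

records = [(1, "p1"), (1, "p2"), (2, "p3"), (1, "p2"), (2, "p3")]
CHUNK = 2  # the real script used chunks of ~100,000 rows
chunks = [records[i:i + CHUNK] for i in range(0, len(records), CHUNK)]
counts = reduce_partials(map_chunk(c) for c in chunks)
print(counts)  # {1: 2, 2: 1}
```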

Obstacles
With this dataset in hand, we started experimenting. A few moments stood out:

  • I found a bug in XGBoost (something about importing data from Python arrays).
  • GBDT on the binary flag features performed no better than linear regression: the thresholds are frozen into the flags, so boosting cannot search for better split points, which it can do on the raw features.
  • Predictions were unstable; quality degraded quickly over time.
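The frozen-threshold point from the second bullet can be illustrated with a tiny split search (a toy, not the production model): on a raw count a tree can choose among all midpoints between distinct values, while a pre-binarized flag leaves exactly one candidate split, frozen wherever the flag’s threshold happened to be set.

```python
def candidate_thresholds(x):
    """Midpoints between consecutive distinct values: the split points
    a decision tree can actually choose from."""
    xs = sorted(set(x))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

def split_error(x, y, t):
    """Misclassification error of the split x <= t, with majority
    voting on each side."""
    left = [yi for xi, yi in zip(x, y) if xi <= t]
    right = [yi for xi, yi in zip(x, y) if xi > t]
    err = 0
    for side in (left, right):
        if side:
            err += min(sum(side), len(side) - sum(side))
    return err / len(y)

# Hypothetical raw feature: risk jumps at 3+ unique phone numbers.
x_raw = [1, 1, 2, 2, 3, 3, 4, 5]
y =     [0, 0, 0, 0, 1, 1, 1, 1]

# On the raw feature the tree finds the perfect split at 2.5.
best = min(candidate_thresholds(x_raw), key=lambda t: split_error(x_raw, y, t))
print(best, split_error(x_raw, y, best))  # 2.5 0.0

# A flag frozen at ">= 4" leaves one candidate split, and a worse one.
x_flag = [int(v >= 4) for v in x_raw]
print(candidate_thresholds(x_flag))        # [0.5]
print(split_error(x_flag, y, 0.5))         # 0.25
```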

Unfortunately, overcoming these obstacles didn’t build a strong relationship with my manager. On the contrary: it led to the termination of my employment at Equifax at the end of the probationary period.

The birth of the approach
By the end of that period I had an idea: if the system is unstable over time, we can leverage that by trying to learn temporal dependencies. That’s exactly the idea behind Gradient-Boosted Decision Trees with Extrapolation. I remember the moment clearly—it was Friday the 13th, April 13, 2018.

To complete the dataset description: a binary target with a ~1% positive rate; a dozen or two “raw” features such as “number of unique phone numbers in previous applications”; and about 75 binary “flag” features derived from those raw features.
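For readers who want to play with the idea, a synthetic dataset matching that shape is easy to generate (everything here is invented: the feature names, thresholds, and distributions are placeholders, not the real Equifax data):

```python
import random

random.seed(13)

N = 10_000
rows = []
for _ in range(N):
    # Hypothetical raw factors in the spirit of the real ones.
    n_phones = random.randint(1, 6)
    n_addresses = random.randint(1, 4)
    # ~1% positive rate, as in the real target.
    target = int(random.random() < 0.01)
    rows.append({
        "n_phones": n_phones,
        "n_addresses": n_addresses,
        # Flags: fixed-threshold binarizations of the raw factors.
        "flag_phones_ge_3": int(n_phones >= 3),
        "flag_addresses_ge_2": int(n_addresses >= 2),
        "target": target,
    })

rate = sum(r["target"] for r in rows) / N
print(f"positive rate: {rate:.3%}")
```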

I said goodbye to that dataset, but the idea was too interesting to drop—I was eager to implement it.