The same post in a slightly different format. Let's try it and see what works better for us.

Designing an MSE Dataset

Intro
Today I want to break our strict posting sequence and jump to a current problem.

I want to open up my GitHub project and give users a nice dataset to test our method on. There’s already a good dataset for the log-loss function, but I was struggling to design one for MSE. So here we are: we have a model that can group objects and find the best linear combination of basis functions to minimize a loss. How do we create a nice, interesting dataset for such a model? Buckle up and let’s dive in.

DIY Dataset

First of all, I started with a small bit of handcrafting and wrote “Extra Boost” with dots on a piece of paper. Then I uploaded it to ChatGPT and started a vibe-coding session. In no time I turned the dots into a scatter plot. But, as you know, screen and math coordinate systems differ in the direction of the Y-axis, so my title was upside down.

It seems trivial—barely worth mentioning. And, probably, it is in the world of traditional programming. But in vibe-coding it became quite a nuisance. Like two silly servants in an old Soviet cartoon, the model did two things at once: it changed the dataset and flipped the sign of Y. As a result, I spent quite a while staring at an upside-down title. After a while I figured out what was going on and got a normal scatter plot.
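For the record, the fix itself is tiny. Here is a minimal sketch, assuming the digitized dots arrive as (x, y) pixel coordinates with the origin in the top-left corner. The function name `to_math_coords` and the sample points are made up for illustration; they are not taken from the project.

```python
import numpy as np

def to_math_coords(points_px, height=None):
    """Flip the Y-axis of pixel coordinates so the shape is upright on a math plot.

    points_px: iterable of (x, y) pairs in screen coordinates (Y grows downward).
    height:    image height in pixels; if omitted, the largest Y value is used
               (an assumption that only shifts the result vertically).
    """
    pts = np.asarray(points_px, dtype=float)
    if height is None:
        height = pts[:, 1].max()
    flipped = pts.copy()
    flipped[:, 1] = height - pts[:, 1]  # flip Y exactly once, leave X untouched
    return flipped

# Hypothetical dots: two at the top of the screen, one further down.
dots_px = [(0, 0), (10, 0), (0, 20)]
dots = to_math_coords(dots_px)
```

The point of keeping the flip in one small function is that it happens in exactly one place, so it cannot silently be applied twice or get mixed up with changes to the data.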

“And now what?” — probably the question you have in your head right now. Don’t worry—I had the same question. Then I started to think about properties of my approach:

it can group objects by static features;
it can find the best fit for several basis functions.

So I needed some interesting basis functions. OK: [1, t, sin(kt), cos(kt)]. The constant lets you move the plot up and down, which is always useful. t is for trends. The pair sin(kt) and cos(kt) lets us place a harmonic component where we want it: a·sin(kt) + b·cos(kt) is a single sine wave of frequency k, and the amplitudes a and b set both its height and its phase, so with the right amplitudes you can shift it left or right.
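To make the phase-shift claim concrete, here is a small sketch using plain NumPy least squares on a single group of points. This is not the model from the project, only an illustration of the basis. The helper name `basis` and all numeric values (k, A, phi, the trend coefficients) are assumptions chosen for the example.

```python
import numpy as np

def basis(t, k=1.0):
    """Design matrix with columns [1, t, sin(kt), cos(kt)]."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([np.ones_like(t), t, np.sin(k * t), np.cos(k * t)])

# A shifted sine is a linear combination of sin and cos:
#   A*sin(kt + phi) = (A*cos(phi))*sin(kt) + (A*sin(phi))*cos(kt)
t = np.linspace(0.0, 10.0, 200)
k, A, phi = 2.0, 1.5, 0.7                        # illustrative values
y = 3.0 + 0.5 * t + A * np.sin(k * t + phi)      # offset + trend + shifted wave

# Ordinary least squares minimizes the mean squared error over the basis.
coef, *_ = np.linalg.lstsq(basis(t, k), y, rcond=None)

# Recover the amplitude and the phase from the two harmonic coefficients.
amp = np.hypot(coef[2], coef[3])
phase = np.arctan2(coef[3], coef[2])
```

With noise-free data the fit should return the offset and slope used to build y, and `amp` and `phase` should match A and phi, which is the sense in which two amplitudes are enough to slide the wave left or right.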

Let’s stop here. In the next post I’ll explain where these basis functions show up in our “Extra Boost” title.