Yeah, there's no way to measure the complexity of the changes, which means I can't assess the statistical significance of these results. And you're correct that there's also no way to tie this to more cart additions.
But the thing that keeps me coming back to tests like these is how repeatable they are: rerunning the test with the same parameters gives extremely similar results every time.
The largest experiment I've run tested variants at n=500, and I was able to take away another level of insight from it because I added a qualitative follow-up single ease question (SEQ). This measured how easy or difficult participants perceived the task to be, and thankfully (hah) the worst time-to-click performers also had the worst SEQ results.
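To make that cross-check concrete, here's a minimal Python sketch of one way to compare the two measures per variant. The variant names and numbers are made up for illustration, and Spearman rank correlation via scipy is just one way to see whether the time-to-click and SEQ rankings line up, not necessarily how you'd have to do it:

```python
# Hypothetical example: do slower time-to-click variants also score worse on the
# follow-up single ease question (SEQ, 1 = very difficult, 7 = very easy)?
from statistics import mean
from scipy.stats import spearmanr

# Per-participant results grouped by design variant:
# (time_to_click_seconds, seq_score) -- all values are invented.
results = {
    "variant_a": [(3.1, 6), (2.8, 7), (3.5, 6)],
    "variant_b": [(6.4, 3), (7.0, 2), (5.9, 4)],
    "variant_c": [(4.2, 5), (4.8, 5), (4.0, 6)],
}

# Summarise each variant by its mean time-to-click and mean SEQ score.
mean_ttc = {v: mean(t for t, _ in obs) for v, obs in results.items()}
mean_seq = {v: mean(s for _, s in obs) for v, obs in results.items()}

variants = sorted(results)
ttc = [mean_ttc[v] for v in variants]
seq = [mean_seq[v] for v in variants]

# A negative Spearman rho means slower variants tend to get lower (worse)
# perceived-ease scores, i.e. the two measures point the same way.
rho, p_value = spearmanr(ttc, seq)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

for v in variants:
    print(f"{v}: mean time-to-click {mean_ttc[v]:.1f}s, mean SEQ {mean_seq[v]:.1f}")
```

With only a handful of variants the p-value won't mean much; the point is just to eyeball whether the two rankings agree before deciding which designs to carry forward.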
So while I agree with your statements, I think there's definitely a level of insight you can draw from this testing methodology, especially when you're trying to narrow down a field of 10 design variants before committing to an A/B test, or something like that.
Quick, dirty, maybe not 100% accurate, but it provides good enough insight to steer your optimisation efforts.