Question about the comparison

Hi, thanks for the work. From my point of view, Qwen2VL-Ins and the other base models have not been trained on SAT or similar data at all. Is training directly on SAT with R1 really a convincing comparison?