-
Notifications
You must be signed in to change notification settings - Fork 4k
[c++] Add Bagging by Query for Lambdarank #6623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…hub.com/Microsoft/LightGBM into bagging/bagging-by-query-for-lambdarank
| void GetGradients(const double* scores, const data_size_t /*num_sampled_queries*/, const data_size_t* /*sampled_query_indices*/, score_t* gradients, score_t* hessians) const override { | ||
| LaunchGetGradientsKernel(scores, gradients, hessians); | ||
| SynchronizeCUDADevice(__FILE__, __LINE__); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
@guolinke Could you please help to review this when you have time? Thanks. |
StrikerRUS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some very minor suggestions from me below:
| assert ndcg_score_bagging_by_query >= ndcg_score - 0.1 | ||
| assert ndcg_score_no_bagging_by_query >= ndcg_score - 0.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR's description states that bagging_by_query=True should improve metrics, but I don't see any comparison of bagging_by_query=True and bagging_by_query=False here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I found the result can be random when the dataset is small. For example, on CPU, bagging_by_query=True gets higher NDCG with the toy test dataset (even higher than the case without bagging), while with GPU bagging_by_query=True could get worse results compared with bagging_by_query=False. But when the dataset is large, for example, with MS LTR dataset, the results are less random, and bagging_by_query=True should improve performance, as in the description of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition, we also see a significant improvement in training speed with bagging_by_query=True.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see. So this test is for something like "bagging_by_query=True doesn't break training".
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
StrikerRUS
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
|
This pull request has been automatically locked since there has not been any recent activity since it was closed. |
Add bagging by query instead of by items in lambdarank, suggested by @metpavel. This should be more reasonable for bagging in ranking tasks. For a comparison of performance, on MS LTR dataset:
With
bagging_freq=1andbagging_fraction=0.1, ifbagging_by_query=trueand if
bagging_by_query=falseWithout bagging