[python-package] scikit-learn fit() methods: add eval_X, eval_y, deprecate eval_set #6857
base: master
Conversation
Force-pushed: 197a3bd to 5dd3171
jameslamb left a comment
Thanks! It looks like you're struggling to get this passing CI, so I'm going to put this into "draft" for now. @ me any time here if you need help with development, and we can open this back up for review once CI is passing.
I saw you had multiple commits responding to linting errors... here's how to run those locally for faster feedback:
# (or conda, whatever you want)
pip install pre-commit
pre-commit run --all-files

And here's how to build locally and run the tests:
# step 1: compile lib_lightgbm
# (only need to do this once, because you're not making any C/C++ changes)
cmake -B build -S .
cmake --build build --target _lightgbm -j4
# step 2: install the Python package, re-using it
# (do this every time you change Python code in the library)
sh build-python.sh install --precompile
# step 3: run the scikit-learn tests
pytest tests/python_package_test/test_sklearn.py
@jameslamb Thanks for your suggestions.
Making both options available for a time and raising a deprecation warning when …
I'm sorry, but I cannot invest much time in this right now (for example, looking into whether this would introduce inconsistencies with …).
No, please. As I said in scikit-learn/scikit-learn#28901 (comment), removing …
@jameslamb I'm sorry, I really need a maintainer's help. The tests in …
@jameslamb Thank you so much. Pinning …
The remaining CI failures seem unrelated.
@lorentzenchr if you are interested in continuing this I'd be happy to help with reviews. I'm supportive of adding this, for better compatibility with newer versions of scikit-learn.
@jameslamb Yes, I'd like to finish this. Your review would be great. Is there anything you need from me before you can start reviewing?
Great! I'd been waiting to review until you were done adding whatever tests you wanted. If you'd like a review before then, update this to latest master.
Force-pushed: 48bd40f to 69371c6
@jameslamb I rebased on master. CI failures seem unrelated (only the R package, which this PR does not touch). So, ready for review from my side.
jameslamb left a comment
Thanks very much! Overall this is looking good. I left a couple of suggestions I'd like to see implemented.
After that I'd be happy to merge this; I'd really like to see it in the next lightgbm release.
cc @jmoralez I'd love a review from you too if you have time
eval_set: Optional[List[_LGBM_ScikitValidSet]] = None,
eval_names: Optional[List[str]] = None,
eval_X: Optional[Union[_LGBM_ScikitMatrixLike, Tuple[_LGBM_ScikitMatrixLike]]] = None,
eval_y: Optional[Union[_LGBM_LabelType, Tuple[_LGBM_LabelType]]] = None,
Putting new arguments in the middle of the signature like this is a breaking change from the perspective of anyone calling these functions with positional arguments.
Could you please move these new arguments to the end of the signature and make them keyword-only?
Like this:
def fit(
self,
X,
# ... other arguments that are already there ...
*,
eval_X: Optional[Union[_LGBM_ScikitMatrixLike, Tuple[_LGBM_ScikitMatrixLike]]] = None,
    eval_y: Optional[Union[_LGBM_LabelType, Tuple[_LGBM_LabelType]]] = None,

Making all new arguments added here keyword-only would allow us to change the ordering in the future without breaking anyone, because we'll know there cannot be any code calling them positionally.
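The effect of the bare `*` marker can be shown with a small self-contained sketch (this `fit` is a toy stand-in for illustration only, not LightGBM's actual signature):

```python
def fit(X, y=None, *, eval_X=None, eval_y=None):
    # Everything after the bare `*` is keyword-only: callers cannot bind
    # eval_X/eval_y positionally, so their order in the signature can be
    # changed later without breaking any existing caller.
    return {"eval_X": eval_X, "eval_y": eval_y}

# Keyword usage works:
result = fit([[1.0]], [0], eval_X=[[2.0]], eval_y=[1])

# Positional usage of a keyword-only argument raises TypeError:
try:
    fit([[1.0]], [0], [[2.0]], [1])
except TypeError:
    print("positional call rejected")
```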
    eval_set = list(zip(eval_X, eval_y))
else:
    eval_set = [(eval_X, eval_y)]
return eval_set
Can you please add test cases covering all these if-else branches? e.g. using pytest.raises() for the error cases and checking that the expected error type + message are raised?
There are notes in #7031 on how to generate a coverage report with pytest-cov for LightGBM's python tests if you want to use that to double-check.
    This is deprecated, use `eval_X` and `eval_y` instead.
eval_names : list of str, or None, optional (default=None)
    Names of eval_set.
eval_sample_weight : {eval_sample_weight_shape}
Was this change a mistake? I don't see any code changes to the handling of eval_sample_weight in this PR.
Right now the doc for this says the following:
LightGBM/python-package/lightgbm/sklearn.py, lines 1085 to 1087 at 544d439:
eval_sample_weight_shape="list of array (same types as ``sample_weight`` supports), or None, optional (default=None)",
eval_init_score_shape="list of array (same types as ``init_score`` supports), or None, optional (default=None)",
eval_group_shape="list of array (same types as ``group`` supports), or None, optional (default=None)",
If you're changing this because you found that "list" is overly specific and that other iterables like tuple could be provided, then I'm ok with this change, but please:
- make it in the formatting variable eval_sample_weight_shape instead of tacking on "or tuple" here
- double-check if eval_init_score and eval_group could also be modified that way
- add tests in test_sklearn.py covering tuples being provided for those inputs
X, y = make_synthetic_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
cbs = [lgb.early_stopping(2)]
gbm1 = lgb.LGBMRegressor()
gbm1.fit(X_train, y_train, eval_set=[(X_test, y_test)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor()
gbm2.fit(X_train, y_train, eval_X=X_test, eval_y=y_test, callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
# 2 evaluation sets
n = X_test.shape[0]
X_test1, X_test2 = X_test[: n // 2], X_test[n // 2 :]
y_test1, y_test2 = y_test[: n // 2], y_test[n // 2 :]
gbm1 = lgb.LGBMRegressor()
gbm1.fit(X_train, y_train, eval_set=[(X_test1, y_test1), (X_test2, y_test2)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor()
gbm2.fit(X_train, y_train, eval_X=(X_test1, X_test2), eval_y=(y_test1, y_test2), callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
assert gbm1.evals_result_["valid_0"]["l2"][0] == pytest.approx(gbm2.evals_result_["valid_0"]["l2"][0])
Suggested change:
X, y = make_synthetic_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
params = {
    "deterministic": True,
    "force_row_wise": True,
    "n_jobs": 1,
    "seed": 708,
}
cbs = [lgb.early_stopping(2)]
gbm1 = lgb.LGBMRegressor(**params)
gbm1.fit(X_train, y_train, eval_set=[(X_test, y_test)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor(**params)
gbm2.fit(X_train, y_train, eval_X=X_test, eval_y=y_test, callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
# 2 evaluation sets
n = X_test.shape[0]
X_test1, X_test2 = X_test[: n // 2], X_test[n // 2 :]
y_test1, y_test2 = y_test[: n // 2], y_test[n // 2 :]
gbm1 = lgb.LGBMRegressor(**params)
gbm1.fit(X_train, y_train, eval_set=[(X_test1, y_test1), (X_test2, y_test2)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor(**params)
gbm2.fit(X_train, y_train, eval_X=(X_test1, X_test2), eval_y=(y_test1, y_test2), callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
assert gbm1.evals_result_["valid_0"]["l2"][0] == pytest.approx(gbm2.evals_result_["valid_0"]["l2"][0])
Let's please pass some parameter values to guarantee a deterministic fit, so the test won't fail due to random differences.
As discussed in scikit-learn/scikit-learn#28901 (comment), this PR adds eval_X and eval_y in order to make LGBM estimators compatible with scikit-learn's (as of version 1.6) Pipeline(..., transform_input=["eval_X"]). See also scikit-learn/scikit-learn#27124.