Conversation

@lorentzenchr (Contributor) commented on Mar 9, 2025

As discussed in scikit-learn/scikit-learn#28901 (comment), this PR adds eval_X and eval_y to the scikit-learn estimators' fit() methods, in order to make LGBM estimators compatible with scikit-learn's Pipeline(..., transform_input=["eval_X"]) (available as of scikit-learn 1.6).

See also scikit-learn/scikit-learn#27124.
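
For context, a minimal usage sketch of what this enables (hypothetical: it assumes this PR's eval_X/eval_y API, scikit-learn >= 1.6, and that the estimator requests the new fit parameters via metadata routing):

import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import lightgbm as lgb

sklearn.set_config(enable_metadata_routing=True)

X, y = np.random.rand(500, 5), np.random.rand(500)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

# set_fit_request(eval_X=..., eval_y=...) assumes the new parameters added by this PR
reg = lgb.LGBMRegressor().set_fit_request(eval_X=True, eval_y=True)
pipe = Pipeline(
    [("scaler", StandardScaler()), ("lgbm", reg)],
    # transform eval_X with the already-fitted scaler before it reaches LGBMRegressor.fit()
    transform_input=["eval_X"],
)
pipe.fit(X_train, y_train, eval_X=X_valid, eval_y=y_valid)

This is the motivation discussed in the linked issues: transform_input operates on array-like fit parameters such as eval_X, not on eval_set's list of (X, y) tuples.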

@jameslamb (Collaborator) left a comment

Thanks! It's looking like you're struggling to get this passing CI, so I'm going to put this into "draft" for now. @ me any time here if you need help with development, and we can open this back up for review once CI is passing.

I saw you had multiple commits responding to linting errors... here's how to run those checks locally for faster feedback:

# (or conda, whatever you want)
pip install pre-commit
pre-commit run --all-files

And here's how to build locally and run the tests:

# step 1: compile lib_lightgbm
# (only need to do this once, because you're not making any C/C++ changes)
cmake -B build -S .
cmake --build build --target _lightgbm -j4

# step 2: install the Python package, re-using the compiled library
# (do this every time you change Python code in the library)
sh build-python.sh install --precompile

# step 3: run the scikit-learn tests
pytest tests/python_package_test/test_sklearn.py

@jameslamb marked this pull request as draft on March 9, 2025 23:15
@lorentzenchr (Contributor, Author):

@jameslamb Thanks for your suggestions.
Could you already comment on the deprecation strategy, i.e. raising a warning?
Then, should I adapt all the (scikit-learn API) Python tests and replace eval_set with the new eval_X, eval_y (thereby avoiding the warnings in the tests)?

@jameslamb (Collaborator):

Could you already comment on the deprecation strategy, i.e. raising a warning?

Making both options available for a time and raising a deprecation warning when eval_set is non-empty seems fine to me, if we decide to move forward with this. I'd also support a runtime error when both eval_set and eval_X are non-empty, to avoid taking on the complexity of merging those two inputs.
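
To make that concrete, the resolution logic could look roughly like this (a hedged sketch only; the helper name _resolve_eval_set and the exact warning/error types and messages are assumptions, not this PR's actual code):

import warnings

def _resolve_eval_set(eval_set, eval_X, eval_y):
    # raise instead of trying to merge the two ways of passing validation data
    if eval_set is not None and (eval_X is not None or eval_y is not None):
        raise ValueError("Pass either eval_set (deprecated) or eval_X/eval_y, not both.")
    if eval_set is not None:
        warnings.warn(
            "'eval_set' is deprecated, use 'eval_X' and 'eval_y' instead.",
            FutureWarning,
        )
        return eval_set
    if eval_X is None:
        return None
    if isinstance(eval_X, tuple):
        # several validation sets: pair the i-th features with the i-th labels
        return list(zip(eval_X, eval_y))
    # a single validation set
    return [(eval_X, eval_y)]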

I'm sorry but I cannot invest much time in this right now (for example, looking into whether this would introduce inconsistencies with HistGradientBoostingClassifier, XGBoost, or CatBoost). If you want to see this change, getting CI working and then opening it up for others to review is probably the best path.

should I adapt all the (scikit-learn API) Python tests and replace eval_set with the new eval_X, eval_y (thereby avoiding the warnings in the tests)?

No, please don't. As I said in scikit-learn/scikit-learn#28901 (comment), removing eval_set from LightGBM's scikit-learn estimators would be highly disruptive and would require a long deprecation cycle (more than a year, in my opinion). Throughout that time, we need to continue to test it at least as thoroughly as we have been.

@lorentzenchr (Contributor, Author):

@jameslamb I'm sorry, I really need a maintainer's help. The tests in tests/python_package_test/test_dask.py fail even on the master branch, locally on my computer. I tried different versions of dask, numpy, scipy, and scikit-learn, with no success.
TL;DR: the failure seems to be present on master and is not caused by this PR.

Details
pytest -x tests/python_package_test/test_dask.py::test_ranker
...
>           dask_ranker = dask_ranker.fit(dX, dy, sample_weight=dw, group=dg)

tests/python_package_test/test_dask.py:714: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:1566: in fit
    self._lgb_dask_fit(
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:1068: in _lgb_dask_fit
    model = _train(
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:811: in _train
    results = client.gather(futures_classifiers)
../python3_lgbm/lib/python3.11/site-packages/distributed/client.py:2565: in gather
    return self.sync(
../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:205: in _train_part
    data = _concat([x["data"] for x in list_of_parts])

...

 def _concat(seq: List[_DaskPart]) -> _DaskPart:
        if isinstance(seq[0], np.ndarray):
            return np.concatenate(seq, axis=0)
        elif isinstance(seq[0], (pd_DataFrame, pd_Series)):
            return concat(seq, axis=0)
        elif isinstance(seq[0], ss.spmatrix):
            return ss.vstack(seq, format="csr")
        else:
>           raise TypeError(
                f"Data must be one of: numpy arrays, pandas dataframes, sparse matrices (from scipy). Got {type(seq[0]).__name__}."
            )
E           TypeError: Data must be one of: numpy arrays, pandas dataframes, sparse matrices (from scipy). Got tuple.

../python3_lgbm/lib/python3.11/site-packages/lightgbm/dask.py:159: TypeError

@jameslamb (Collaborator):

What versions of dask / distributed do you have installed?

Searching for that error in the issue tracker here turns up a match: #6739.

I suspect you need to pin to dask<2024.12 in your environment, as we do in CI here (#6742).
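
For example (the exact pins used in CI are in #6742; treat this as an approximation):

# pin dask (and distributed, which is released in lockstep with dask) below 2024.12
pip install 'dask<2024.12' 'distributed<2024.12'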

@lorentzenchr (Contributor, Author):

@jameslamb Thank you so much. Pinning dask<2024.12 worked fine.

@lorentzenchr marked this pull request as ready for review on March 10, 2025 21:34
@lorentzenchr (Contributor, Author):

The remaining CI failures seem unrelated.
TODO for myself: Improve test coverage a bit.

@jameslamb changed the title from "ENH add eval_X, eval_y, deprecate eval_set" to "[python-package] scikit-learn fit() methods: add eval_X, eval_y, deprecate eval_set" on Mar 10, 2025
@jameslamb (Collaborator):

TODO for myself: Improve test coverage a bit.

@lorentzenchr if you are interested in continuing this I'd be happy to help with reviews. I'm supportive of adding this, for better compatibility with newer versions of scikit-learn.

@lorentzenchr (Contributor, Author):

@jameslamb Yes, I'd like to finish this. Your review would be great. Is there anything you need from me before you can start reviewing?

@jameslamb (Collaborator):

Great! I'd been waiting to review until you were done adding whatever tests you wanted.

If you'd like a review before then, update this to latest master and get CI passing (especially the check that LightGBM works with its oldest supported scikit-learn version), then @ me and I'll be happy to provide some feedback.

@lorentzenchr (Contributor, Author):

@jameslamb I rebased onto master. The CI failures seem unrelated (only R-package jobs, which this PR does not touch). So it is ready for review from my side.

@jameslamb (Collaborator) left a comment

Thanks very much! Overall this is looking good. I left a couple of suggestions I'd like to see implemented.

After that I'd be happy to merge this; I'd really like to see it in the next lightgbm release.

cc @jmoralez I'd love a review from you too if you have time

eval_set: Optional[List[_LGBM_ScikitValidSet]] = None,
eval_names: Optional[List[str]] = None,
eval_X: Optional[Union[_LGBM_ScikitMatrixLike, Tuple[_LGBM_ScikitMatrixLike]]] = None,
eval_y: Optional[Union[_LGBM_LabelType, Tuple[_LGBM_LabelType]]] = None,
@jameslamb (Collaborator):

Putting new arguments in the middle of the signature like this is a breaking change from the perspective of anyone calling these functions with positional arguments.

Could you please move these new arguments to the end of the signature and make them keyword-only?

Like this:

def fit(
    self,
    X,
    # ... other arguments that are already there ...
    *,
    eval_X: Optional[Union[_LGBM_ScikitMatrixLike, Tuple[_LGBM_ScikitMatrixLike]]] = None,
    eval_y: Optional[Union[_LGBM_LabelType, Tuple[_LGBM_LabelType]]] = None,

Making all new arguments added here keyword-only would allow us to change the ordering in the future without breaking anyone because we'll know there cannot be any code calling them positionally.

    eval_set = list(zip(eval_X, eval_y))
else:
    eval_set = [(eval_X, eval_y)]
return eval_set
@jameslamb (Collaborator):

Can you please add test cases covering all of these if-else branches, e.g. using pytest.raises() for the error cases and checking that the expected error type and message are raised?

There are notes in #7031 on how to generate a coverage report with pytest-cov for LightGBM's python tests if you want to use that to double-check.
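
For example, the mutually-exclusive case could be covered along these lines (a sketch that re-uses test_sklearn.py's existing imports and the make_synthetic_regression helper; the error type and message here are assumptions and must match whatever the implementation actually raises):

def test_fit_raises_when_both_eval_set_and_eval_X_are_passed():
    X, y = make_synthetic_regression()
    gbm = lgb.LGBMRegressor()
    with pytest.raises(ValueError, match="eval_set"):
        gbm.fit(X, y, eval_set=[(X, y)], eval_X=X, eval_y=y)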

This is deprecated, use `eval_X` and `eval_y` instead.
eval_names : list of str, or None, optional (default=None)
Names of eval_set.
eval_sample_weight : {eval_sample_weight_shape}
@jameslamb (Collaborator):

Was this change a mistake? I don't see any code changes to the handling of eval_sample_weight in this PR.

Right now the doc for this says the following:

eval_sample_weight_shape="list of array (same types as ``sample_weight`` supports), or None, optional (default=None)",
eval_init_score_shape="list of array (same types as ``init_score`` supports), or None, optional (default=None)",
eval_group_shape="list of array (same types as ``group`` supports), or None, optional (default=None)",

If you're changing this because you found that "list" is overly specific and that other iterables like tuple could be provided, then I'm ok with this change but please:

  • make it in the formatting variable eval_sample_weight_shape instead of tacking on "or tuple" here (see the sketch after this list)
  • double-check if eval_init_score and eval_group could also be modified that way
  • add tests in test_sklearn.py covering tuples being provided for those inputs
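
For the first bullet, that change might just be a one-line edit to the shared docstring variable (wording here is a suggestion only, based on the current text quoted above):

eval_sample_weight_shape="list or tuple of array (same types as ``sample_weight`` supports), or None, optional (default=None)",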

Comment on lines +2066 to +2084
X, y = make_synthetic_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
cbs = [lgb.early_stopping(2)]
gbm1 = lgb.LGBMRegressor()
gbm1.fit(X_train, y_train, eval_set=[(X_test, y_test)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor()
gbm2.fit(X_train, y_train, eval_X=X_test, eval_y=y_test, callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))

# 2 evaluation sets
n = X_test.shape[0]
X_test1, X_test2 = X_test[: n // 2], X_test[n // 2 :]
y_test1, y_test2 = y_test[: n // 2], y_test[n // 2 :]
gbm1 = lgb.LGBMRegressor()
gbm1.fit(X_train, y_train, eval_set=[(X_test1, y_test1), (X_test2, y_test2)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor()
gbm2.fit(X_train, y_train, eval_X=(X_test1, X_test2), eval_y=(y_test1, y_test2), callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
assert gbm1.evals_result_["valid_0"]["l2"][0] == pytest.approx(gbm2.evals_result_["valid_0"]["l2"][0])
@jameslamb (Collaborator):

Suggested change
X, y = make_synthetic_regression()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
params = {
    "deterministic": True,
    "force_row_wise": True,
    "n_jobs": 1,
    "seed": 708,
}
cbs = [lgb.early_stopping(2)]
gbm1 = lgb.LGBMRegressor(**params)
gbm1.fit(X_train, y_train, eval_set=[(X_test, y_test)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor(**params)
gbm2.fit(X_train, y_train, eval_X=X_test, eval_y=y_test, callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
# 2 evaluation sets
n = X_test.shape[0]
X_test1, X_test2 = X_test[: n // 2], X_test[n // 2 :]
y_test1, y_test2 = y_test[: n // 2], y_test[n // 2 :]
gbm1 = lgb.LGBMRegressor(**params)
gbm1.fit(X_train, y_train, eval_set=[(X_test1, y_test1), (X_test2, y_test2)], callbacks=cbs)
gbm2 = lgb.LGBMRegressor(**params)
gbm2.fit(X_train, y_train, eval_X=(X_test1, X_test2), eval_y=(y_test1, y_test2), callbacks=cbs)
np.testing.assert_allclose(gbm1.predict(X), gbm2.predict(X))
assert gbm1.evals_result_["valid_0"]["l2"][0] == pytest.approx(gbm2.evals_result_["valid_0"]["l2"][0])

Let's please pass some parameter values to guarantee a deterministic fit, so the test won't fail due to random differences.
