[P] ibu-boost: a GBDT library where splits are *absolutely* rejected, not just relatively ranked
I built a small gradient-boosted tree library based on the screening transform from "Screening Is Enough" (Nakanishi 2026, arXiv:2604.01178). The paper was originally written for Transformers, but the core idea — replacing relative comparison with absolute-threshold rejection — maps naturally onto GBDT split selection.
Disclaimer: I'm not affiliated with the paper's author. This is an independent implementation that applies the screening idea to GBDTs.
The idea in one paragraph
Every GBDT implementation picks the split with the highest gain among all candidates. This means the tree always splits, even if the best candidate is nearly useless. min_gain_to_split is the standard workaround, but it's an arbitrary hyperparameter that needs tuning per dataset.
ibu-boost replaces this with a screening transform:
```
raw_gain  = G_L^2/(H_L+λ) + G_R^2/(H_R+λ) - G_total^2/(H_total+λ)
norm_gain = raw_gain / H_total       # N-invariant, O(1) regardless of dataset size
s = 1 - exp(-norm_gain / τ)          # bounded similarity in [0, 1)
ρ = max(1 - r*(1-s), 0)^2            # Trim-and-Square
```
If max(ρ) == 0 across all (feature, bin) candidates, the node becomes a leaf automatically — no split is issued. There is no min_gain_to_split to tune.
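The transform above can be sketched in a few lines of NumPy. This is an illustrative reference, not the library's internal code; the τ and r values are placeholders, not tuned defaults:

```python
import numpy as np

def screening_scores(raw_gain, H_total, tau=1.0, r=4.0):
    """Map raw split gains to acceptance scores rho in [0, 1].

    raw_gain : per-candidate gains, one per (feature, bin) pair
    H_total  : sum of Hessians at the node (makes the gain N-invariant)
    tau, r   : temperature / acceptance width -- illustrative values, not tuned
    """
    norm_gain = raw_gain / H_total                     # N-invariant gain
    s = 1.0 - np.exp(-norm_gain / tau)                 # bounded similarity in [0, 1)
    return np.maximum(1.0 - r * (1.0 - s), 0.0) ** 2   # Trim-and-Square

# Two weak candidates and one strong one: the weak gains are *absolutely*
# rejected (rho == 0); if every candidate were weak, the node would become a leaf.
rho = screening_scores(np.array([0.5, 5.0, 200.0]), H_total=100.0)
print(rho)  # first two entries are exactly 0.0, last is positive
```

Note that the rejection is a hard zero, not a small number: any candidate with s below the trim point 1 - 1/r contributes exactly ρ = 0, which is what makes the "all candidates rejected → leaf" rule well defined.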
The threshold behaviour is controlled by s_w (the temperature τ above) and s_r (the acceptance width r), both stored in log-space; they are fixed scalars for now and will become learnable in a future release.
What's implemented
- Two tree types: non-oblivious (standard per-node splits) and oblivious (CatBoost-style symmetric splits — all nodes at the same depth share one split)
- Gradient boosting with MSE regression and binary log-loss
- Missing value handling: XGBoost-style learned default direction per split
- Triton GPU kernels: fused histogram scatter + screening transform, batched multi-node dispatch, full on-device gradient normalisation
- ScreeningDiagnostics: accept_rate per round — a built-in health check for over/under-rejection
- ScreeningParamSearch: K-fold grid search over (s_w, s_r)
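The XGBoost-style missing-value handling in the list above can be sketched in NumPy. This is a hypothetical helper for illustration, not ibu-boost's internal API: evaluate the candidate split twice, once with all missing samples routed left and once routed right, and keep the direction with the higher gain.

```python
import numpy as np

def split_gain(G_L, H_L, G_R, H_R, lam=1.0):
    """Standard second-order gain for a (left, right) partition."""
    G, H = G_L + G_R, H_L + H_R
    return G_L**2 / (H_L + lam) + G_R**2 / (H_R + lam) - G**2 / (H + lam)

def learn_default_direction(g, h, go_left, missing):
    """Hypothetical sketch of XGBoost's learned default direction:
    route the missing-value samples left or right, whichever gains more.

    g, h     : per-sample gradients / Hessians
    go_left  : bool mask, True for non-missing samples that satisfy the split
    missing  : bool mask, True for samples whose feature value is missing
    """
    known = ~missing
    G_L = g[known & go_left].sum();  H_L = h[known & go_left].sum()
    G_R = g[known & ~go_left].sum(); H_R = h[known & ~go_left].sum()
    G_m, H_m = g[missing].sum(), h[missing].sum()
    gain_left  = split_gain(G_L + G_m, H_L + H_m, G_R, H_R)
    gain_right = split_gain(G_L, H_L, G_R + G_m, H_R + H_m)
    return ("left", gain_left) if gain_left >= gain_right else ("right", gain_right)

# Example: the missing sample carries positive gradient, so routing it left
# (with the other positive-gradient samples) yields the higher gain.
g = np.array([1.0, 1.0, -1.0, -1.0, 1.0]); h = np.ones(5)
go_left = np.array([True, True, False, False, False])
missing = np.array([False, False, False, False, True])
print(learn_default_direction(g, h, go_left, missing))  # direction is 'left'
```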
Benchmark (California Housing, 100 rounds, oblivious tree)
| Model | RMSE | Train time |
|---|---|---|
| LightGBM (default) | 0.4711 ± 0.0042 | — |
| ibu-boost (CPU) | 0.5286 ± 0.0039 | 5.34 s |
| ibu-boost (RTX 4060 Ti) | 0.5286 ± 0.0039 | 1.70 s (3.15x) |
Gap to LightGBM is ~12% RMSE. Honest take: this is an early alpha. Part of the gap comes from s_w/s_r being fixed scalars — once they become learnable (Phase 2), the threshold should adapt per dataset. But I also suspect the gap will persist on small, clean datasets like California Housing where over-splitting isn't a real problem. The hypothesis is that absolute rejection pays off more on high-dimensional or noisy data where standard GBDTs tend to overfit via spurious splits. I haven't tested this rigorously yet — if you have a go-to tabular benchmark suite, I'd love to hear about it.
Kernel-level speedup (N=65536, F=8, B=255): 51x over NumPy reference.
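For context on what that kernel computes, here is a (deliberately unfused) NumPy reference for the histogram scatter, where `np.add.at` plays the role of the sample-parallel `atomic_add` on the GPU. Shapes and names are illustrative, not the library's actual kernel signature:

```python
import numpy as np

def histogram_scatter(bins, g, h, n_bins=255):
    """NumPy reference for the gradient-histogram scatter.

    bins : (N, F) int array of per-sample, per-feature bin indices
    g, h : (N,) gradient / Hessian vectors
    Returns (F, n_bins) histograms of summed gradients and Hessians.
    """
    N, F = bins.shape
    G = np.zeros((F, n_bins))
    H = np.zeros((F, n_bins))
    for f in range(F):
        np.add.at(G[f], bins[:, f], g)  # unbuffered scatter-add, like atomic_add
        np.add.at(H[f], bins[:, f], h)
    return G, H
```

Unlike floating-point `atomic_add` on the GPU, `np.add.at` accumulates in a fixed order, which is why a NumPy reference like this is useful as a determinism baseline when testing the Triton kernel.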
Install
```
pip install ibu-boost              # NumPy reference only
pip install "ibu-boost[triton]"    # + Triton GPU kernels (Linux / Windows CUDA)
```
Quick start
```python
from ibu_boost import ScreeningBooster

model = ScreeningBooster(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    tree_type="oblivious",  # CatBoost-style symmetric splits
    device="cuda",          # requires [triton] extra
)
model.fit(X_train, y_train)
print(f"Accept rate: {model.mean_accept_rate():.1%}")  # screening health check
```
What I'd like feedback on
- Screening calibration: Does the absolute-rejection idea feel useful in practice, or does it just move the tuning problem from min_gain_to_split to (s_w, s_r)?
- Benchmark suggestions: Which tabular datasets or benchmark suites would best stress-test the "auto-stop on noise" property?
- Triton kernel design: The histogram scatter uses sample-parallel atomic_add, which is non-deterministic. Any tips on deterministic alternatives that don't kill throughput?
Happy to discuss the theory or implementation details.