Reweighting methods

This page describes the reweighting backends implemented in mcreweight and how each method computes new MC event weights. Where relevant, differences from hep_ml.reweight are noted.

Overview

mcreweight exposes nine user-facing training modes. They fall into four main families:

hep_ml-native methods

GB: direct use of hep_ml.reweight.GBReweighter.
Folding: direct use of hep_ml.reweight.FoldingReweighter around GB.

ONNX-exportable gradient-boosting methods

ONNXGB: custom tree-based reweighter that reproduces the signed-weight logic of hep_ml while remaining exportable to ONNX.
ONNXFolding: K-fold ensemble of ONNXGB models.

Iterative classifier-ratio methods

XGB: iterative reweighter that trains an xgboost.XGBClassifier at each stage and converts classifier probabilities into multiplicative weight updates.
XGBFolding: K-fold ensemble of XGB models.
NN: iterative reweighter that uses a sklearn.neural_network.MLPClassifier at each stage.
NNFolding: K-fold ensemble of NN models.

Histogram method

Bins: N-dimensional histogram ratio reweighter with neighbor smoothing.

Quick selection guide:

GB / Folding: closest to the original hep_ml package;
ONNXGB / ONNXFolding: same boosting logic as hep_ml but ONNX-exportable;
XGB / NN: iterative classifier-ratio correction;
Folding / ONNXFolding / XGBFolding / NNFolding: K-fold variants to reduce bias;
Bins: non-parametric histogram-ratio baseline, best for low dimensions.

All methods follow the same high-level workflow:

split MC and data into training and testing subsets;
fit the selected reweighter on the training subset;
predict new MC weights;
optionally clip very large predicted weights to the 99th percentile (see below);
save both the trained model and the produced weight arrays.

Clipping behavior differs by method. For GB, ONNXGB, Bins, Folding, and ONNXFolding, clipping is applied only when --clip-weights (YAML: reweighting.clip_weights) is enabled. For XGB, NN, XGBFolding, and NNFolding, clipping is always applied as part of the iterative update.
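As a minimal sketch of the optional clipping step, assuming that "clipping to the 99th percentile" means capping every weight above that percentile at the percentile value. The function name clip_weights and the toy weights are illustrative and are not taken from mcreweight:

```python
import numpy as np

def clip_weights(weights, percentile=99.0):
    """Cap weights above the given percentile at the percentile value."""
    weights = np.asarray(weights, dtype=float)
    cap = np.percentile(weights, percentile)
    return np.minimum(weights, cap)

rng = np.random.default_rng(0)
w = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # toy weights with a long tail
w_clipped = clip_weights(w)
```

Only the upper tail is modified; weights below the cap are returned unchanged.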

The training entry points live in src/mcreweight/train.py and the ONNX-based implementations live in src/mcreweight/models/onnxreweighter.py and src/mcreweight/models/onnxfolding.py.

Method-by-method behavior

GB

GB is a thin wrapper around hep_ml.reweight.GBReweighter. All loss and tree-update logic comes from hep_ml; the trained object is serialized with joblib and weights are predicted via hep_ml’s own predict_weights. Use this when compatibility with the original hep_ml implementation is the primary requirement.

ONNXGB

ONNXGB reimplements the GBReweighter logic with plain scikit-learn regression trees so that every stage can be exported to ONNX. It is not a generic classifier-to-ratio method: it mirrors the signed-weight boosting strategy of hep_ml directly.

At each stage, MC and data are concatenated, a regression tree is fit on signed residuals (MC label 1, data label 0, with per-class weight normalization), and the leaf values are replaced with the log ratio of target to original weighted occupancies. The final event weight is original_weight * exp(score).

The leaf update is regularized as follows:

\[\Delta_{\mathrm{leaf}} = \log\left(w_{\mathrm{target}} + \lambda\right) - \log\left(w_{\mathrm{original}} + \lambda\right),\]

where \(\lambda\) is loss_regularization. Adding \(\lambda\) prevents infinite updates in empty or nearly empty leaves and keeps the correction well-behaved.
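The effect of the regularization can be checked numerically. The sketch below evaluates the leaf formula above with NumPy; the helper name leaf_update, the default value 5.0, and the toy occupancies are illustrative assumptions rather than the package's code:

```python
import numpy as np

def leaf_update(w_target, w_original, loss_regularization=5.0):
    """Regularized log ratio of weighted leaf occupancies."""
    w_target = np.asarray(w_target, dtype=float)
    w_original = np.asarray(w_original, dtype=float)
    return np.log(w_target + loss_regularization) - np.log(w_original + loss_regularization)

# Three toy leaves: well populated, empty on the MC side, empty on both sides.
delta = leaf_update([120.0, 30.0, 0.0], [100.0, 0.0, 0.0])
```

Without the regularization term the second leaf would receive an infinite update and the third an undefined one; with it, all three values stay finite.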

The key differences from the other methods:

vs. GB: same intent, different implementation. ONNXGB uses scikit-learn trees instead of the external hep_ml estimator, enabling ONNX export;
vs. XGB/NN: keeps the signed-weight boosting logic rather than converting classifier probabilities into log-ratio updates.

XGB

XGB estimates the density ratio between data and MC through an iterative sequence of binary classifiers, rather than reproducing the hep_ml loss. A single classifier often captures only the dominant separation; by refitting after each weight update, the method progressively corrects the residual mismatch in the already-reweighted sample.

At each iteration \(t\):

MC events carry their current weights \(w_t(x)\);
data events keep fixed target weights;
an xgboost.XGBClassifier is trained to distinguish MC from data;
its output probability \(p_t(x)\) for the MC class is converted into a log-ratio correction;
MC weights are updated multiplicatively.

The stage update is

\[\delta_t(x) = \log\frac{1 - p_t(x)}{p_t(x)},\]

followed by clipping and learning-rate damping:

\[F_{t+1}(x) = F_t(x) + \eta \cdot \mathrm{clip}\left(\delta_t(x), -c, c\right),\]

and the final weights are

\[w(x) = \exp\left(\mathrm{clip}(\log w_0(x) + F(x), -m, m)\right),\]

where \(\eta\) = mixing_learning_rate, \(c\) = clip_delta, and \(m\) = max_log_weight.

Intuitively, \(\delta_t(x)\) is positive when the classifier finds the event more data-like (weight should increase) and negative when it finds it more MC-like (weight should decrease). The learning rate \(\eta\) and the clip bounds prevent any single stage from making an extreme correction.

At each stage scale_pos_weight is updated to reflect the current weighted class balance, and negative training weights are clipped to zero for estimator compatibility.
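The update equations above can be sketched in a few lines of NumPy. This is an illustration of the formulas only, not the package's implementation: the function names, the default values of clip_delta and max_log_weight, and the small probability guard against log(0) are assumptions, and the initial weights are taken to be strictly positive.

```python
import numpy as np

def stage_update(F, p_mc, eta=0.1, clip_delta=2.0):
    """One damped, clipped log-ratio step; p_mc is the MC-class probability."""
    p = np.clip(p_mc, 1e-12, 1.0 - 1e-12)  # assumption: guard against log(0)
    delta = np.log((1.0 - p) / p)
    return F + eta * np.clip(delta, -clip_delta, clip_delta)

def final_weights(w0, F, max_log_weight=5.0):
    """Combine initial weights with the accumulated score, capping the log weight."""
    return np.exp(np.clip(np.log(w0) + F, -max_log_weight, max_log_weight))

w0 = np.ones(3)
F = np.zeros(3)
p_mc = np.array([0.2, 0.5, 0.999])  # data-like, ambiguous, strongly MC-like
F = stage_update(F, p_mc)
w = final_weights(w0, F)
```

In this toy stage the first event is upweighted, the second is left unchanged, and the third is downweighted by the clipped amount rather than by the full classifier log-odds.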

NN

NN uses exactly the same iterative log-ratio update as XGB, with an sklearn.neural_network.MLPClassifier as the stage classifier instead of XGBClassifier. All clipping and damping parameters work identically.

If the installed scikit-learn version does not accept sample_weight in MLPClassifier, the implementation falls back to unweighted stage fits and prints a warning. Use this method when smooth, non-tree decision boundaries are preferred.

Bins

Bins computes the density ratio as a direct N-dimensional histogram ratio in transformed feature space:

fit the configured feature transform on the combined MC+data sample;
define per-variable bin edges from target-data quantiles;
fill weighted MC and data histograms;
smooth both histograms by averaging with immediate neighbors;
compute H_data / H_mc with epsilon regularization to avoid division by zero;
assign each event the ratio value of its bin.
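A one-dimensional version of these steps can be sketched with NumPy. The feature transform is omitted, the function name bins_ratio_1d is hypothetical, and the exact form of the epsilon regularization and of the edge handling in the smoothing are assumptions made for illustration:

```python
import numpy as np

def bins_ratio_1d(mc, data, mc_w, data_w, n_bins=10, eps=1e-6):
    """Per-event histogram ratio with quantile edges and neighbor smoothing."""
    # Bin edges from target-data quantiles; outer edges widened to cover MC too.
    edges = np.quantile(data, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0] = min(edges[0], mc.min())
    edges[-1] = max(edges[-1], mc.max())
    h_mc, _ = np.histogram(mc, bins=edges, weights=mc_w)
    h_data, _ = np.histogram(data, bins=edges, weights=data_w)

    def smooth(h):
        # Average each bin with its immediate neighbors (edge bins are repeated).
        padded = np.pad(h, 1, mode="edge")
        return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

    ratio = (smooth(h_data) + eps) / (smooth(h_mc) + eps)
    # Look up the bin of every MC event and return the ratio of that bin.
    idx = np.clip(np.searchsorted(edges, mc, side="right") - 1, 0, n_bins - 1)
    return ratio[idx]

rng = np.random.default_rng(1)
mc = rng.normal(0.0, 1.0, 20_000)
data = rng.normal(0.3, 1.0, 20_000)
ratio = bins_ratio_1d(mc, data, np.ones_like(mc), np.ones_like(data))
```

With unit starting weights the returned ratio is the new MC weight; for weighted MC it would multiply the existing weight.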

This is the most transparent method in the package. Because bin counts grow exponentially with the number of dimensions, it is only reliable for a small number of training variables. In practice it is strongest in one or two dimensions, can still be useful up to roughly four with enough population, and should otherwise be treated as a rough baseline rather than the default choice.

Folding variants

The Folding variants (Folding, ONNXFolding, XGBFolding, NNFolding) wrap a base reweighter in a K-fold procedure. Each fold is trained on n_folds - 1 subsets and applied to the held-out subset, so that every event receives a weight from a model that was not trained on it. This reduces the bias that arises when weights are predicted on the same data used for training.
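The bookkeeping behind out-of-fold prediction can be sketched as follows. This is a simplified illustration: only the MC sample is split, the fit callable and the constant-scale stand-in model are hypothetical placeholders for a real reweighter, and the balanced fold assignment is an assumption.

```python
import numpy as np

def out_of_fold_weights(mc, data, fit, n_folds=3, seed=0):
    """Predict each MC event with the model that did not see it in training."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(mc)) % n_folds  # balanced random folds
    weights = np.empty(len(mc))
    for k in range(n_folds):
        held_out = fold_id == k
        model = fit(mc[~held_out], data)         # train on the other n_folds - 1 subsets
        weights[held_out] = model(mc[held_out])  # apply to the held-out subset only
    return weights, fold_id

# Hypothetical stand-in for a real reweighter: a constant normalization factor.
def fit_constant(mc_train, data_train):
    scale = len(data_train) / len(mc_train)
    return lambda x: np.full(len(x), scale)

mc = np.arange(9, dtype=float)
data = np.arange(12, dtype=float)
weights, fold_id = out_of_fold_weights(mc, data, fit_constant)
```

Every entry of weights is filled exactly once, by the model whose training subset excluded that event.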

The folding variants differ in how fold predictions are aggregated:

hep_ml folding (Folding)

Delegates to hep_ml.reweight.FoldingReweighter; predictions are effectively out-of-fold when the same dataset is passed back in order.

mcreweight ONNX folding (ONNXFolding, XGBFolding, NNFolding)

Trains one model per fold and combines predictions across folds. Available aggregation modes:

weighted_geometric (default): geometric mean weighted by the inverse of each fold’s validation error;
geometric: unweighted geometric mean;
median: per-event median across folds.
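The three aggregation modes can be sketched with NumPy. The function name aggregate_folds is hypothetical, the fold importances are assumed to be inverse validation errors normalized to sum to one, and all weights are assumed positive so that their logarithm is defined:

```python
import numpy as np

def aggregate_folds(fold_weights, mode="weighted_geometric", val_errors=None):
    """Combine per-fold weight predictions of shape (n_folds, n_events)."""
    fold_weights = np.asarray(fold_weights, dtype=float)
    if mode == "median":
        return np.median(fold_weights, axis=0)
    log_w = np.log(fold_weights)
    if mode == "geometric":
        return np.exp(log_w.mean(axis=0))
    if mode == "weighted_geometric":
        # Fold importance: inverse validation error, normalized to sum to one.
        alpha = 1.0 / np.asarray(val_errors, dtype=float)
        alpha /= alpha.sum()
        return np.exp(alpha @ log_w)
    raise ValueError(f"unknown aggregation mode: {mode}")

preds = np.array([[1.0, 2.0], [4.0, 2.0], [2.0, 2.0]])  # 3 folds, 2 events
geo = aggregate_folds(preds, mode="geometric")
wgeo = aggregate_folds(preds, val_errors=[0.1, 0.2, 0.1])
med = aggregate_folds(preds, mode="median")
```

In this toy case the fold with the largest validation error predicts the outlying weight for the first event, so the weighted geometric mean is pulled less toward it than the unweighted one.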

Data visualization and diagnostics

The training and application pipelines produce a set of standard plots under plots/. These figures are meant to answer slightly different questions:

are MC and data already mismatched before training;
does reweighting improve the agreement on the variables used for training;
does the improvement transfer to variables that were not used for training;
are the learned weights numerically well behaved;
can an independent classifier still distinguish reweighted MC from data;
where in phase space the remaining mismodelling is concentrated;
which input variables drive the learned correction.

Input and monitoring distributions

The one-dimensional histogram outputs are the most direct validation plots.

input_features_training.png and input_features_testing.png: These show the distributions of the training variables before reweighting, separately for the train and test splits. They are the baseline mismatch plots. Large pull structures here indicate the differences that the reweighter is expected to learn.
input_features_training_transformed.png and input_features_testing_transformed.png: These show the same variables after the optional preprocessing transform (for example yeo-johnson or quantile). They are useful to verify what representation the ONNX-capable methods actually see during training.
other_vars_training.png and other_vars_testing.png: These correspond to the monitoring variables, called other_vars in the output filenames. They are not used to train the reweighter. Instead they are held out as a transfer test: if reweighting improves these variables too, the correction is more likely to reflect genuine phase-space mismodelling rather than simple overfitting of the training inputs.
input_features_<method>_weighted.png: These show the training variables after applying the weights predicted by a given method. This is the main post-training check. A good result is one in which the reweighted MC moves closer to the data histogram and the pull panel becomes more centered around zero.
other_vars_<method>_weighted.png: These show the same post-training comparison for the monitoring variables. Improvements here are especially informative because these variables were not part of the direct optimization target.

When applying an already trained model, the corresponding output names are input_features_reweighted.png and other_vars_reweighted.png. They play the same role, but now for the separately processed output sample.

Correlation matrices

corr_mc.png and corr_data.png display the pairwise correlation matrices of the training variables before reweighting.

These plots are useful because one-dimensional agreement is not enough: two samples can match marginal distributions and still differ strongly in their joint structure. The correlation matrices give a compact first view of whether important linear relationships differ between MC and data before training.

Weight distributions

weight_distributions.png shows the distribution of the predicted event weights for each trained method.

This plot is primarily a stability diagnostic:

a narrow distribution centered near one usually indicates a mild correction;
a broad tail can be acceptable, but may signal that the method must strongly upweight a small region of phase space;
extremely long tails or spikes at very large weights are warning signs for statistical instability and for downstream analyses that reuse the weights.

Classifier-based diagnostics

Several plots are built from a fresh classifier trained after reweighting to separate reweighted MC from data. These are not the reweighters themselves. They are a common external probe of how distinguishable the two samples remain.

roc_curve.png: This shows the ROC curve of that diagnostic classifier for each method. If reweighting is effective, the classifier should struggle to separate the two samples, and the curve should move closer to the diagonal. Equivalently, the AUC should move closer to 0.5.
classifier_output.png: This shows the classifier-score distributions for reweighted MC and for data. It is often easier to interpret than the ROC curve because it directly shows whether the diagnostic classifier assigns similar scores to both samples. The plot also reports a weighted KS statistic, which summarizes the mismatch between the two score distributions.

The term “output distribution” in this context therefore refers to the distribution of the diagnostic classifier output score, not to the final physics variables themselves.

2D score and pull maps

score_map_<method>.png

This plot shows the mean diagnostic-classifier score in two-dimensional bins of all pairs of training variables. It answers the question: in which regions of phase space does the diagnostic classifier still find the reweighted MC more MC-like or more data-like? Structured hot spots indicate localized residual mismodelling even when one-dimensional projections look acceptable.

pull_map_<method>.png

This plot shows the two-dimensional pull,

\[\frac{\rho_{\mathrm{data}} - \rho_{\mathrm{MC}}} {\sqrt{\sigma^2_{\mathrm{data}} + \sigma^2_{\mathrm{MC}}}},\]

in bins of every pair of training variables. A value near zero means local agreement within uncertainty, while large positive or negative values point to regions where the reweighted MC is still under- or over-populated with respect to data.
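For a single pair of variables, the pull can be sketched as below. The normalization of the densities to unit total weight and the use of the per-bin sum of squared weights as the variance are assumptions made for this illustration; the function name pull_map is hypothetical.

```python
import numpy as np

def pull_map(mc_xy, data_xy, mc_w, data_w, bins=10):
    """2D pull between weighted data and MC densities for one variable pair."""
    both = np.vstack([mc_xy, data_xy])
    # Shared bin edges so that both histograms use the same grid.
    _, x_edges, y_edges = np.histogram2d(both[:, 0], both[:, 1], bins=bins)
    edges = [x_edges, y_edges]

    def density_and_var(xy, w):
        s, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=edges, weights=w)
        s2, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=edges, weights=w**2)
        norm = w.sum()
        return s / norm, s2 / norm**2  # assumption: sum of w^2 as bin variance

    rho_mc, var_mc = density_and_var(mc_xy, mc_w)
    rho_data, var_data = density_and_var(data_xy, data_w)
    sigma = np.sqrt(var_data + var_mc)
    # Bins that are empty in both samples are left at NaN.
    return np.divide(rho_data - rho_mc, sigma,
                     out=np.full_like(sigma, np.nan), where=sigma > 0)

rng = np.random.default_rng(2)
xy = rng.normal(size=(5000, 2))
pulls = pull_map(xy, xy, np.ones(5000), np.ones(5000), bins=8)
```

Comparing a sample with itself, as in the toy call, gives a pull of zero in every populated bin.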

The difference between the two diagnostics is:

the score map is classifier-based and tells you where residual separation is still easy for a learned discriminator;
the pull map is histogram-based and tells you where the weighted local event densities still disagree.

SHAP feature-importance plots

feature_importance_<method>.png shows SHAP summary values for non-folding methods when shap: true is enabled.

SHAP stands for SHapley Additive exPlanations. In this context it measures how much each input variable contributes to the model’s predicted log weight for an event, relative to a reference expectation.

The SHAP beeswarm plot should be read as follows:

each point is one event;
the horizontal position is the SHAP value, meaning the signed contribution of that feature to increasing or decreasing the predicted log weight;
the color encodes whether the feature value itself is low or high;
features higher in the plot have larger overall impact on the model output.

These plots do not by themselves tell you whether a model is “good” or “bad”. They tell you which variables the reweighter is using most strongly to build its correction and in which direction they influence the learned weights.

Loss function and update mechanics

Two distinct loss families are used across the methods.

`hep_ml`-style signed boosting

Used by GB and reimplemented by ONNXGB.

The goal is to fit an additive model for \(\log(p_{\text{data}}/p_{\text{MC}})\), the logarithm of the density ratio. At each stage:

event weights are normalized separately per class;
the current event importance is updated as sample_weight * exp(y * score), where y is +1 for MC and -1 for data;
trees are fit on the absolute normalized weights;
leaf values are rewritten from the ratio of target to original weighted occupancies.

The tree structure captures where the samples differ; the leaf rewrite converts that structure into a direct density-ratio correction.
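The weight bookkeeping of one stage can be sketched as follows. This illustrates the per-class normalization and the importance update only; the tree fit and leaf rewrite are omitted, and the function names are hypothetical:

```python
import numpy as np

def normalize_per_class(sample_weight, y_signed):
    """Scale the weights of each class so that they sum to one."""
    w = np.array(sample_weight, dtype=float)
    for label in (+1, -1):  # +1 for MC, -1 for data
        mask = y_signed == label
        w[mask] /= w[mask].sum()
    return w

def stage_importance(norm_weight, y_signed, score):
    """Absolute event importance used as the tree sample weight."""
    return np.abs(norm_weight * np.exp(y_signed * score))

y = np.array([1, 1, -1, -1, -1])
w = np.array([1.0, 3.0, 2.0, 2.0, 4.0])
norm = normalize_per_class(w, y)
score = np.full(5, 0.5)  # toy current model output
imp = stage_importance(norm, y, score)
```

A positive score raises the importance of MC events and lowers that of data events, matching the sign convention in which the final MC weight is original_weight * exp(score).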

Classifier-ratio iterative updates

Used by XGB and NN.

Instead of a custom boosting loss, these methods solve a weighted binary classification problem between MC and data at each stage and convert the classifier output into a log density-ratio estimate. The sign is intuitive:

p(x) > 0.5 for the MC class → event looks too MC-like → weight decreases;
p(x) < 0.5 for the MC class → event looks more data-like → weight increases.

The three numerical controls in the update equations serve distinct purposes:

clip_delta: prevents any single stage from making an overconfident jump;
max_log_weight: caps the total accumulated log-weight globally;
mixing_learning_rate: dampens each stage correction to stabilize training.

Because each stage is trained on the currently reweighted MC, it targets only the residual mismatch left by previous updates, rather than re-learning the same dominant discrepancy.

Validation and early stopping

The iterative ONNX methods (ONNXGB, XGB, NN) add a validation loop that is not part of the original hep_ml API.

At each stage, the mean weighted Kolmogorov-Smirnov distance across all training variables is computed on a held-out validation subset. Training stops early when this mean KS fails to improve for reweight_early_stopping_rounds consecutive checks.

This provides a physics-motivated stopping criterion: the model stops when it no longer reduces observable MC-to-data mismatches, not just when classifier loss plateaus.
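A possible form of this stopping criterion is sketched below. The weighted KS distance is computed as the maximum gap between two weighted empirical CDFs; the helper names and the interpretation of the patience counter (the best value being at least patience checks old) are assumptions rather than the package's exact logic.

```python
import numpy as np

def weighted_ks(x_mc, x_data, w_mc, w_data):
    """Maximum distance between two weighted empirical CDFs."""
    grid = np.sort(np.concatenate([x_mc, x_data]))

    def cdf(x, w):
        order = np.argsort(x)
        cum = np.concatenate([[0.0], np.cumsum(w[order]) / w.sum()])
        # Number of sample points <= each grid value indexes the padded CDF.
        return cum[np.searchsorted(x[order], grid, side="right")]

    return np.max(np.abs(cdf(x_mc, w_mc) - cdf(x_data, w_data)))

def mean_ks(mc, data, w_mc, w_data):
    """Average the weighted KS distance over all feature columns."""
    return np.mean([weighted_ks(mc[:, j], data[:, j], w_mc, w_data)
                    for j in range(mc.shape[1])])

def should_stop(history, patience):
    """True when the best mean KS is at least `patience` checks old."""
    best = int(np.argmin(history))
    return len(history) - 1 - best >= patience

ks_disjoint = weighted_ks(np.array([0.0, 1.0, 2.0]), np.array([3.0, 4.0, 5.0]),
                          np.ones(3), np.ones(3))
```

Here history would hold the mean KS value recorded at each validation check, and patience plays the role of reweight_early_stopping_rounds.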

Optuna hyperparameter optimization

When n_trials > 0, mcreweight runs an Optuna study before the final training step. Tuning is supported for GB, ONNXGB, XGB, and NN.

For each trial, the package trains the candidate reweighter, predicts new MC weights, then measures how well a fresh classifier can still separate the reweighted MC from data. The objective is the AUC of that diagnostic classifier: lower is better, since a well-reweighted sample should be harder to distinguish from data. Studies use Optuna’s TPE sampler with seed=42, and the study direction is minimize.

Cached studies

Optuna studies are cached under weightsdir as:

optuna_study_<classifier_type>_<flattened_training_vars>.pkl

If that file already exists, the study is loaded instead of recomputed.

Seed trials

Before optimization starts, one manually chosen initial trial is enqueued:

GB: gb_n_estimators=100, gb_learning_rate=0.1, gb_max_depth=5, min_samples_leaf=200, subsample=1.0
ONNXGB: gb_n_estimators=100, gb_learning_rate=0.1, gb_max_depth=4, min_samples_leaf=200, loss_regularization=5.0, subsample=1.0
XGB: n_iterations=5, mixing_learning_rate=0.1, xgb_learning_rate=0.1, max_depth=6, subsample=0.9, reg_alpha=1.0, reg_lambda=5.0
NN: n_iterations=5, mixing_learning_rate=0.1, hidden1=64, hidden2=32, alpha=1e-4, nn_learning_rate_init=1e-3, batch_size=1024

Search spaces

The current Optuna intervals are:

Shared iterative parameters for `XGB` and `NN`

n_iterations: integer in [5, 25]
mixing_learning_rate: log-uniform float in [0.05, 0.3]

`GB` search space

gb_n_estimators: integer in [50, 150] with step 10
gb_learning_rate: log-uniform float in [0.05, 0.3]
gb_max_depth: integer in [3, 8] with step 1
min_samples_leaf: integer in [200, 1200] with step 200
subsample: float in [0.3, 1.0] with step 0.1

`ONNXGB` search space

gb_n_estimators: integer in [50, 150] with step 10
gb_learning_rate: log-uniform float in [0.05, 0.3]
gb_max_depth: integer in [3, 8] with step 1
min_samples_leaf: integer in [200, 1200] with step 200
loss_regularization: log-uniform float in [1.0, 20.0]
subsample: float in [0.3, 1.0] with step 0.1

`XGB` base-estimator search space

xgb_learning_rate: log-uniform float in [0.05, 0.3]
max_depth: integer in [4, 8] with step 1
subsample: float in [0.6, 1.0] with step 0.1
reg_alpha: float in [0.0, 5.0] with step 0.5
reg_lambda: float in [1.0, 10.0] with step 1

`NN` base-estimator search space

hidden1: integer in [32, 128] with step 16
hidden2: integer in [16, 64] with step 16
alpha: log-uniform float in [1e-6, 1e-2]
nn_learning_rate_init: log-uniform float in [1e-4, 5e-3]
batch_size: categorical choice among 256, 512, 1024
max_iter: integer in [50, 180] with step 10

How tuned parameters are reused

After the study finishes, the final training functions read study.best_params and map them onto the concrete training backends:

GB uses the tuned boosting parameters directly in hep_ml.reweight.GBReweighter;
ONNXGB runs its own native Optuna objective with ONNXGBReweighter and reuses the tuned tree/update parameters directly in the final ONNX-exportable training pass;
XGB combines tuned iterative parameters with tuned XGBoost base-estimator parameters in ONNXIXGBReweighter;
NN combines tuned iterative parameters with tuned MLP base-estimator parameters in ONNXINNReweighter.

Feature transformations

All custom ONNX-capable methods can apply an optional feature transform before training:

quantile;
yeo-johnson;
signed-log;
scaler.

The transform is always fitted once on the combined MC+data sample, then reused for both training and inference. This is important because it prevents the MC and data samples from being mapped into different feature spaces.
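The point about a shared feature space can be illustrated with a standard scaler fitted on the stacked sample. The class CombinedScaler, the unweighted fit, and the assumed signed-log form sign(x) * log1p(|x|) are illustrative choices, not the package's definitions:

```python
import numpy as np

def signed_log(x):
    """Sign-preserving log compression (assumed form: sign(x) * log1p(|x|))."""
    return np.sign(x) * np.log1p(np.abs(x))

class CombinedScaler:
    """Standard scaler whose mean and width come from MC and data together."""

    def fit(self, mc, data):
        both = np.vstack([mc, data])
        self.mean_ = both.mean(axis=0)
        self.std_ = both.std(axis=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

rng = np.random.default_rng(3)
mc = rng.normal(0.0, 1.0, size=(1000, 2))
data = rng.normal(1.0, 2.0, size=(1000, 2))
scaler = CombinedScaler().fit(mc, data)  # fitted once, on both samples
mc_t, data_t = scaler.transform(mc), scaler.transform(data)
```

Because both samples pass through the same fitted parameters, the MC-to-data differences survive the transform; fitting one scaler per sample would instead remove the shift and width mismatch that the reweighter is supposed to learn.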

Main differences with `hep_ml`

Relative to the algorithms documented at hep_ml.reweight, the main differences in mcreweight are:

GB and Folding are direct hep_ml wrappers, while ONNXGB, XGB, NN, and the ONNX folding classes are package-native implementations.
ONNXGB aims at behavioral compatibility with hep_ml.reweight.GBReweighter but is implemented with exportable stage trees so the trained model can be served through ONNX Runtime.
XGB and NN are not hep_ml algorithms. They use iterative classifier-based log-ratio updates instead of the custom signed boosting loss described for GBReweighter in hep_ml.
Bins is conceptually close to hep_ml.reweight.BinsReweighter, but the smoothing implementation differs: hep_ml documents a Gaussian filter, while mcreweight uses repeated averaging with immediate neighbors.
The ONNX folding classes use built-in fold scoring and support weighted geometric aggregation. The hep_ml folding interface instead exposes a user-provided vote function.
The iterative ONNX methods add validation-driven early stopping based on the mean weighted KS distance across features. This is not part of the hep_ml.reweight API.
mcreweight standardizes model persistence across methods and saves ONNX-exported stage models for deployment, which is outside the scope of the hep_ml reweighter documentation.

Which method to use

As a rule of thumb:

use GB if you want the closest behavior to the original hep_ml implementation;
use ONNXGB if you want similar boosting logic but need ONNX export;
use XGB if you want a powerful tree-based iterative classifier reweighter;
use NN if a neural iterative classifier is a better inductive bias for the problem;
use folding variants when you will predict on the same sample used for training and want less biased event-by-event weights;
use Bins only for low-dimensional problems where interpretability matters more than flexibility.