Baseline Models
- class neer_match_utilities.baseline_models.GradientBoostingModel(model=<factory>)[source]
Gradient boosting baseline on similarity features using scikit-learn.
Designed as an alternative to the DL/NS models in neer_match, using a tree-based GradientBoostingClassifier on top of similarity features produced by AlternativeModels.
It supports:
- evaluation with TP, FP, TN, FN, Accuracy, Recall, Precision, F1, MCC,
- a simple summary() reporting feature importances.
Notes
Unlike Logit/Probit, this model has no statistical inference (SE/p-values).
Works well with nonlinearities and interactions in similarity features.
- __init__(model=<factory>)
- best_threshold(df_val, match_col='match', feature_prefix='col_', metric='mcc', thresholds=None, store_treshold=True)[source]
Find the classification threshold that maximizes a metric on validation data.
- Parameters:
df_val (pd.DataFrame) – Validation DataFrame produced by AlternativeModels.pairwise_similarity_dataframe().
match_col (str, default "match") – Target column.
feature_prefix (str, default "col_") – Feature column prefix.
metric ({"mcc","f1"}, default "mcc") – Metric to maximize.
thresholds (np.ndarray or None) – Threshold grid. If None, uses np.linspace(0.01, 0.99, 99).
- Return type:
tuple[float, dict]
- Returns:
best_t (float) – Threshold achieving the best metric on df_val.
best_stats (dict) – Evaluation dict (TP/FP/TN/FN/Accuracy/Recall/Precision/F1/MCC) at best_t.
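The threshold search described above can be pictured as a plain grid sweep. The following is a minimal sketch, not the library's implementation: the helper names mcc and sweep_threshold are hypothetical, the rule "predict a match when probability >= threshold" is an assumption, and the grid is built in pure Python to mirror np.linspace(0.01, 0.99, 99).

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / den if den else 0.0


def sweep_threshold(y_true, proba, thresholds=None):
    """Return (best_t, best_score) maximizing MCC over a threshold grid."""
    if thresholds is None:
        # Same 99-point grid as np.linspace(0.01, 0.99, 99).
        thresholds = [i / 100 for i in range(1, 100)]
    best_t, best_score = None, -math.inf
    for t in thresholds:
        # Assumption: a pair is predicted as a match when proba >= t.
        y_pred = [1 if p >= t else 0 for p in proba]
        score = mcc(y_true, y_pred)
        if score > best_score:  # strict ">" keeps the lowest tied threshold
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation labels and predicted match probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.5, 0.6, 0.7, 0.9]
best_t, best_score = sweep_threshold(y_true, proba)
```

Because several thresholds often yield identical predictions, ties are common; this sketch keeps the lowest tied threshold, which may differ from how best_threshold resolves ties.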
- evaluate(df, match_col='match', feature_prefix='col_', threshold=0.5)[source]
Evaluate the model on a pairwise similarity DataFrame.
Returns a dict:
TP, FP, TN, FN (integers)
Accuracy, Recall, Precision, F1, MCC (floats)
- Return type:
dict
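To make the returned keys concrete, here is a minimal sketch of how such a dict could be computed from true labels and predicted probabilities. The function evaluate_at_threshold is hypothetical and independent of the library; the ">=" comparison and the zero-denominator convention are assumptions.

```python
import math


def evaluate_at_threshold(y_true, proba, threshold=0.5):
    """Build a TP/FP/TN/FN/Accuracy/Recall/Precision/F1/MCC dict."""
    # Assumption: a pair counts as a predicted match when proba >= threshold.
    y_pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def safe_div(num, den):
        # Convention: report 0.0 when a denominator is zero.
        return num / den if den else 0.0

    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "Accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "Recall": recall,
        "Precision": precision,
        "F1": safe_div(2 * precision * recall, precision + recall),
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }


# Toy labels and predicted match probabilities.
y_true = [1, 0, 1, 1, 0, 0]
proba = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
stats = evaluate_at_threshold(y_true, proba, threshold=0.5)
```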
- fit(df, match_col='match', feature_prefix='col_', use_class_weight=False)[source]
Fit gradient boosting on a pairwise similarity DataFrame.
- Parameters:
df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.
match_col (str, default "match") – Name of the binary target column.
feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).
use_class_weight (bool, default False) – If True, uses inverse-frequency sample weights to upweight matches. Useful if you fit on a very imbalanced dataset.
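The documentation does not spell out the weighting formula behind use_class_weight. One common inverse-frequency scheme, shown here purely as an assumption, assigns each row the weight n / (n_classes * count_of_its_class), so the rarer match class receives larger weights. The helper inverse_frequency_weights is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(y):
    """Per-row weights proportional to the inverse frequency of each label."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    # Assumed formula ("balanced" heuristic): n / (k * count[label]).
    return [n / (k * counts[label]) for label in y]


# Imbalanced toy target: 8 non-matches, 2 matches.
y = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(y)
```

Weights of this kind can be passed to scikit-learn's GradientBoostingClassifier.fit through its sample_weight argument; whether this class does exactly that internally is not confirmed here.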
- Return type:
- class neer_match_utilities.baseline_models.LogitMatchingModel[source]
Logistic regression baseline on similarity features using statsmodels.
This class is designed as an alternative to the DL/NS models in neer_match, using statsmodels’ Logit on top of the similarity features produced by AlternativeModels.
It supports:
- evaluation with TP, FP, TN, FN, Accuracy, Recall, Precision, F1, MCC,
- full inference via summary().
- __init__()
- fit(df, match_col='match', feature_prefix='col_')[source]
Fit logistic regression on a pairwise similarity DataFrame.
- Parameters:
df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.
match_col (str, default "match") – Name of the binary target column.
feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).
- Return type:
- class neer_match_utilities.baseline_models.ProbitMatchingModel[source]
Probit regression baseline on similarity features using statsmodels.
Same interface as LogitMatchingModel, but using a normal CDF link.
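The difference between the two links can be illustrated directly: logit maps the linear index through the logistic CDF, probit through the standard normal CDF. This is a small standard-library sketch with hypothetical function names, not code from the package.

```python
import math


def logistic_cdf(z):
    """Logit link: P(match) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    """Probit link: P(match) = Phi(z), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Compare both links at the same values of the linear index z.
for z in (-2.0, 0.0, 2.0):
    print(f"z={z:+.1f}  logit={logistic_cdf(z):.3f}  probit={normal_cdf(z):.3f}")
```

Both links give 0.5 at z = 0, but the normal CDF approaches 0 and 1 faster, so fitted coefficients from the two models are on different scales and should not be compared directly.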
- __init__()
- fit(df, match_col='match', feature_prefix='col_')[source]
Fit probit regression on a pairwise similarity DataFrame.
- Parameters:
df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.
match_col (str, default "match") – Name of the binary target column.
feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).
- Return type:
- class neer_match_utilities.baseline_models.SuggestMixin[source]
Adds a NeerMatch-like .suggest(left, right, count, verbose) API to baseline models.
Requires:
self.predict_proba(df_pairs) implemented
self.similarity_map set to a SimilarityMap (or dict) describing features to compute
- suggest(left, right, *, count=10, verbose=0, left_id_col=None, right_id_col=None)[source]
Return top-k candidate matches per left record (like neer_match DL models).
- Return type:
DataFrame
- Output columns:
left: integer row index into left (0..len(left)-1)
right: integer row index into right (0..len(right)-1)
prediction: match probability
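As a rough illustration of the output shape, the sketch below selects the top-k right candidates per left record from a matrix of match probabilities. The function top_k_suggestions and the nested-list input are hypothetical; the actual suggest method computes probabilities from left and right via the similarity map and returns a DataFrame.

```python
import heapq


def top_k_suggestions(scores, count=10):
    """scores[i][j] is the match probability of left row i and right row j."""
    rows = []
    for left_idx, row in enumerate(scores):
        # Keep the `count` highest-probability right candidates for this row.
        best = heapq.nlargest(count, enumerate(row), key=lambda pair: pair[1])
        for right_idx, prob in best:
            rows.append({"left": left_idx, "right": right_idx, "prediction": prob})
    return rows


# Two left records scored against three right records.
scores = [
    [0.1, 0.9, 0.3],
    [0.8, 0.2, 0.7],
]
suggestions = top_k_suggestions(scores, count=2)
```

The resulting list of dicts uses the same three column names and could be wrapped in a pandas DataFrame if a tabular result is needed.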