Baseline Models

class neer_match_utilities.baseline_models.GradientBoostingModel(model=<factory>)[source]

Gradient boosting baseline on similarity features using scikit-learn.

Designed as an alternative to the DL/NS models in neer_match, using a tree-based GradientBoostingClassifier on top of similarity features produced by AlternativeModels.

It supports:

- evaluation with TP, FP, TN, FN, Accuracy, Recall, Precision, F1, and MCC,
- a simple summary() reporting feature importances.

Notes

Unlike Logit/Probit, this model provides no statistical inference (standard errors or p-values).
As a tree ensemble, it can capture nonlinearities and interactions among similarity features without explicit feature engineering.

__init__(model=<factory>)

best_threshold(df_val, match_col='match', feature_prefix='col_', metric='mcc', thresholds=None, store_treshold=True)[source]

Find the classification threshold that maximizes a metric on validation data.

Parameters:

df_val (pd.DataFrame) – Validation DataFrame produced by AlternativeModels.pairwise_similarity_dataframe().
match_col (str, default "match") – Target column.
feature_prefix (str, default "col_") – Feature column prefix.
metric ({"mcc","f1"}, default "mcc") – Metric to maximize.
thresholds (np.ndarray or None) – Threshold grid. If None, uses np.linspace(0.01, 0.99, 99).

Return type:

tuple[float, dict]

Returns:

best_t (float) – Threshold achieving the best metric on df_val.
best_stats (dict) – Evaluation dict (TP/FP/TN/FN/Accuracy/Recall/Precision/F1/MCC) at best_t.
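The search described above can be sketched in plain Python. This is an illustration of the idea, not the library's implementation: the helper names mcc and sweep_threshold are hypothetical, and both the `>=` decision rule and the tie-breaking (first threshold reaching the maximum wins) are assumptions.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a marginal count is zero.
    return (tp * tn - fp * fn) / den if den else 0.0


def sweep_threshold(y_true, y_prob, thresholds=None):
    """Return (best_threshold, best_mcc) over a grid of thresholds."""
    if thresholds is None:
        # Same grid as np.linspace(0.01, 0.99, 99), without NumPy.
        thresholds = [i / 100 for i in range(1, 100)]
    best_t, best_score = None, -math.inf
    for t in thresholds:
        # Assumed decision rule: predict a match when probability >= t.
        y_pred = [1 if p >= t else 0 for p in y_prob]
        score = mcc(y_true, y_pred)
        if score > best_score:  # strict ">" keeps the first maximizer
            best_t, best_score = t, score
    return best_t, best_score


best_t, best_score = sweep_threshold([0, 0, 1, 1], [0.1, 0.4, 0.6, 0.9])
```

Tuning the threshold on validation data rather than on the training set avoids picking a cutoff that only reflects overfitted probabilities.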

evaluate(df, match_col='match', feature_prefix='col_', threshold=0.5)[source]

Evaluate the model on a pairwise similarity DataFrame.

Returns a dict:

TP, FP, TN, FN (integers)
Accuracy, Recall, Precision, F1, MCC (floats)

Return type:

dict
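For reference, the reported metrics follow from the four confusion counts. The sketch below shows the standard formulas in plain Python; confusion_metrics is a hypothetical helper, and the `>=` thresholding rule and the zero-division convention (return 0.0) are assumptions rather than documented behavior of evaluate().

```python
import math


def confusion_metrics(y_true, y_prob, threshold=0.5):
    """Confusion counts and derived metrics for binary 0/1 labels."""
    # Assumed decision rule: predict a match when probability >= threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

    def safe_div(num, den):
        # Assumed convention: undefined ratios are reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "Accuracy": safe_div(tp + tn, len(y_true)),
        "Recall": recall,
        "Precision": precision,
        "F1": f1,
        "MCC": safe_div(tp * tn - fp * fn, mcc_den),
    }


stats = confusion_metrics([1, 1, 0, 0, 1, 0], [0.9, 0.4, 0.6, 0.1, 0.8, 0.2])
```

In record linkage, non-matches usually far outnumber matches, so Accuracy alone can look high for a model that finds few true matches; F1 and MCC are more informative in that setting.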

fit(df, match_col='match', feature_prefix='col_', use_class_weight=False)[source]

Fit gradient boosting on a pairwise similarity DataFrame.

Parameters:

df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.
match_col (str, default "match") – Name of the binary target column.
feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).
use_class_weight (bool, default False) – If True, uses inverse-frequency sample weights to upweight matches. Useful if you fit on a very imbalanced dataset.

Return type:

GradientBoostingModel
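To illustrate what inverse-frequency sample weights look like, here is a minimal sketch using the common "balanced" heuristic, n_samples / (n_classes * class_count). The helper name is hypothetical, and whether fit() uses this exact scaling is an assumption.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """One weight per sample, inversely proportional to its class frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    # Rare classes (typically matches) receive larger weights.
    return [n_samples / (n_classes * counts[y]) for y in labels]


# Three non-matches and one match: the match is weighted three times as much.
weights = inverse_frequency_weights([0, 0, 0, 1])
```

With this scaling the weights sum to the number of samples, so the overall loss magnitude stays comparable to the unweighted fit.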

predict_proba(df, feature_prefix='col_')[source]

Predict match probabilities for a pairwise similarity DataFrame.

Returns the probability for the positive class (match = 1).

Return type:

ndarray

summary(top_k=20)[source]

Return a simple “summary” as a DataFrame of feature importances.

Parameters:

top_k (int, default 20) – Number of most important features to return.

Returns:

Columns: feature, importance

Return type:

pd.DataFrame
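The ranking behind such a summary can be sketched as a sort followed by a cut at top_k. This is a plain-Python illustration that returns a list of dicts instead of a DataFrame; the helper name and the example feature names are hypothetical.

```python
def top_features(names, importances, top_k=20):
    """Rank features by importance (descending) and keep the top_k."""
    ranked = sorted(zip(names, importances), key=lambda fi: fi[1], reverse=True)
    return [{"feature": f, "importance": v} for f, v in ranked[:top_k]]


# Hypothetical feature names following the "col_" prefix convention.
rows = top_features(
    ["col_name_jaro", "col_city_exact", "col_year_diff"],
    [0.2, 0.7, 0.1],
    top_k=2,
)
```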

class neer_match_utilities.baseline_models.LogitMatchingModel[source]

Logistic regression baseline on similarity features using statsmodels.

This class is designed as an alternative to the DL/NS models in neer_match, using statsmodels’ Logit on top of the similarity features produced by AlternativeModels.

It supports:

- evaluation with TP, FP, TN, FN, Accuracy, Recall, Precision, F1, and MCC,
- full inference via summary().

__init__()

fit(df, match_col='match', feature_prefix='col_')[source]

Fit logistic regression on a pairwise similarity DataFrame.

Parameters:

df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.
match_col (str, default "match") – Name of the binary target column.
feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).

Return type:

LogitMatchingModel

class neer_match_utilities.baseline_models.ProbitMatchingModel[source]

Probit regression baseline on similarity features using statsmodels.

Same interface as LogitMatchingModel, but using a normal CDF link.
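The difference between the two links is only the function that maps the linear predictor to a probability. A minimal sketch with the standard library (function names are illustrative, not part of the package):

```python
import math


def logistic_cdf(z):
    """Logit link: logistic CDF of the linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    """Probit link: standard normal CDF of the linear predictor z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Both map 0 to 0.5; the normal CDF has thinner tails, so it approaches
# 0 and 1 faster as |z| grows.
p_logit = logistic_cdf(2.0)
p_probit = normal_cdf(2.0)
```

Because the two curves are scaled differently, coefficients from the two models are not directly comparable, although their predicted probabilities are often close.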

__init__()

fit(df, match_col='match', feature_prefix='col_')[source]

Fit probit regression on a pairwise similarity DataFrame.

Parameters:

df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.
match_col (str, default "match") – Name of the binary target column.
feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).

Return type:

ProbitMatchingModel

class neer_match_utilities.baseline_models.SuggestMixin[source]

Adds a NeerMatch-like .suggest(left, right, count, verbose) API to baseline models.

Requires:

self.predict_proba(df_pairs) implemented
self.similarity_map set to a SimilarityMap (or dict) describing features to compute

suggest(left, right, *, count=10, verbose=0, left_id_col=None, right_id_col=None)[source]

Return the top-k candidate matches (k = count) for each left record, like the neer_match DL models.

Return type:

DataFrame

Output columns:

left: integer row index into left (0..len(left)-1)
right: integer row index into right (0..len(right)-1)
prediction: match probability
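The shape of this output can be illustrated with a small sketch that selects the highest-scoring right rows for each left row from a score matrix. It is not the library's code: top_k_pairs is a hypothetical helper, the score matrix stands in for the result of predict_proba on all pairs, and a list of dicts replaces the DataFrame.

```python
import heapq


def top_k_pairs(scores, count=10):
    """scores[i][j] is the match probability for left row i and right row j."""
    rows = []
    for i, row in enumerate(scores):
        # Keep the `count` right indices with the highest probability.
        best = heapq.nlargest(count, enumerate(row), key=lambda jp: jp[1])
        rows.extend({"left": i, "right": j, "prediction": p} for j, p in best)
    return rows


suggestions = top_k_pairs([[0.1, 0.9, 0.5], [0.8, 0.2, 0.3]], count=2)
```

Each left record contributes at most count rows, so the result has at most len(left) * count rows.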