Baseline Models

class neer_match_utilities.baseline_models.GradientBoostingModel(model=<factory>)[source]

Gradient boosting baseline on similarity features using scikit-learn.

Designed as an alternative to the DL/NS models in neer_match, using a tree-based GradientBoostingClassifier on top of similarity features produced by AlternativeModels.

It supports:

  • evaluation with TP, FP, TN, FN, Accuracy, Recall, Precision, F1, and MCC,

  • a simple summary() reporting feature importances.

Notes

  • Unlike Logit/Probit, this model provides no statistical inference (no standard errors or p-values).

  • Tree ensembles can capture nonlinearities and interactions among similarity features without explicit feature engineering.

__init__(model=<factory>)

best_threshold(df_val, match_col='match', feature_prefix='col_', metric='mcc', thresholds=None, store_treshold=True)[source]

Find the classification threshold that maximizes a metric on validation data.

Parameters:
  • df_val (pd.DataFrame) – Validation DataFrame produced by AlternativeModels.pairwise_similarity_dataframe().

  • match_col (str, default "match") – Target column.

  • feature_prefix (str, default "col_") – Feature column prefix.

  • metric ({"mcc","f1"}, default "mcc") – Metric to maximize.

  • thresholds (np.ndarray or None) – Threshold grid. If None, uses np.linspace(0.01, 0.99, 99).

Return type:

tuple[float, dict]

Returns:

  • best_t (float) – Threshold achieving the best metric on df_val.

  • best_stats (dict) – Evaluation dict (TP/FP/TN/FN/Accuracy/Recall/Precision/F1/MCC) at best_t.
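The threshold search described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `mcc_at` and `sweep_thresholds` are hypothetical, the `proba >= threshold` decision rule and the tie-breaking (first threshold wins) are assumptions, and the default grid is the same 99 points from 0.01 to 0.99 that the `thresholds` parameter documents.

```python
import math


def mcc_at(y_true, proba, threshold):
    """Matthews correlation coefficient at one threshold (0.0 if undefined)."""
    # Assumption: a pair counts as a predicted match when proba >= threshold.
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
    tn = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 0)
    fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def sweep_thresholds(y_true, proba, thresholds=None):
    """Return (best_t, best_score) over a threshold grid, maximizing MCC."""
    if thresholds is None:
        # 99 evenly spaced points: 0.01, 0.02, ..., 0.99
        thresholds = [i / 100 for i in range(1, 100)]
    best_t, best_score = None, float("-inf")
    for t in thresholds:
        score = mcc_at(y_true, proba, t)
        if score > best_score:  # strict comparison: the first maximizer is kept
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation data (made up for illustration).
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.05, 0.20, 0.45, 0.40, 0.70, 0.90]
best_t, best_score = sweep_thresholds(y_val, p_val)
print(best_t, best_score)
```

Because the metric is often flat over a range of thresholds, the selected value depends on how ties are broken; choosing the threshold on validation data rather than test data keeps the test metrics unbiased.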

evaluate(df, match_col='match', feature_prefix='col_', threshold=0.5)[source]

Evaluate the model on a pairwise similarity DataFrame.

Returns a dict:

  • TP, FP, TN, FN (integers)

  • Accuracy, Recall, Precision, F1, MCC (floats)

Return type:

dict
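For reference, the reported metrics follow from the four confusion counts. The sketch below uses the standard definitions; the function name `confusion_stats`, the `proba >= threshold` rule, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not confirmed library behavior.

```python
import math


def confusion_stats(y_true, proba, threshold=0.5):
    """Confusion counts and derived metrics at a probability threshold."""
    # Assumption: predict a match when proba >= threshold.
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 1)
    fp = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 1)
    tn = sum(1 for t, q in zip(y_true, pred) if t == 0 and q == 0)
    fn = sum(1 for t, q in zip(y_true, pred) if t == 1 and q == 0)
    n = tp + fp + tn + fn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "Accuracy": (tp + tn) / n if n else 0.0,
        "Recall": recall, "Precision": precision, "F1": f1, "MCC": mcc,
    }


# Toy labels and probabilities (made up for illustration).
y_true = [1, 0, 1, 0, 0, 1]
proba = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
stats = confusion_stats(y_true, proba, threshold=0.5)
print(stats)
```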

fit(df, match_col='match', feature_prefix='col_', use_class_weight=False)[source]

Fit gradient boosting on a pairwise similarity DataFrame.

Parameters:
  • df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.

  • match_col (str, default "match") – Name of the binary target column.

  • feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).

  • use_class_weight (bool, default False) – If True, uses inverse-frequency sample weights to upweight matches. Useful if you fit on a very imbalanced dataset.

Return type:

GradientBoostingModel
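The source does not state the exact formula behind use_class_weight. One common inverse-frequency scheme, shown here purely as an assumption, weights each row by n / (k * count of its class), where n is the number of rows and k the number of classes. The helper name `inverse_frequency_weights` is hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-row weights so that each class contributes equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # Assumed formula: weight of class c = n / (k * count_c).
    return [n / (k * counts[y]) for y in labels]


# Toy labels: 8 non-matches and 2 matches (made up for illustration).
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)
print(weights[0], weights[-1])
```

With this scheme the rare match rows receive a larger weight than the frequent non-match rows, and the weights sum to n. Weights of this kind can be passed to scikit-learn's GradientBoostingClassifier.fit through its sample_weight argument.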

predict_proba(df, feature_prefix='col_')[source]

Predict match probabilities for a pairwise similarity DataFrame.

Returns the probability for the positive class (match = 1).

Return type:

ndarray

summary(top_k=20)[source]

Return a simple “summary” as a DataFrame of feature importances.

Parameters:

top_k (int, default 20) – Number of most important features to return.

Returns:

Columns: feature, importance

Return type:

pd.DataFrame
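Conceptually, the summary is a ranking of features by importance, truncated to top_k. A minimal sketch follows, assuming a descending sort; the feature names and importance values are made up, and plain dicts stand in for the DataFrame rows.

```python
def top_features(importances, top_k=20):
    """Return the top_k (feature, importance) records, most important first."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [{"feature": f, "importance": v} for f, v in ranked[:top_k]]


# Hypothetical importances for four similarity features.
importances = {
    "col_name": 0.55,
    "col_year": 0.10,
    "col_address": 0.30,
    "col_phone": 0.05,
}
table = top_features(importances, top_k=2)
print(table)
```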

class neer_match_utilities.baseline_models.LogitMatchingModel[source]

Logistic regression baseline on similarity features using statsmodels.

This class is designed as an alternative to the DL/NS models in neer_match, using statsmodels’ Logit on top of the similarity features produced by AlternativeModels.

It supports:

  • evaluation with TP, FP, TN, FN, Accuracy, Recall, Precision, F1, and MCC,

  • full inference via summary().

__init__()

fit(df, match_col='match', feature_prefix='col_')[source]

Fit logistic regression on a pairwise similarity DataFrame.

Parameters:
  • df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.

  • match_col (str, default "match") – Name of the binary target column.

  • feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).

Return type:

LogitMatchingModel

class neer_match_utilities.baseline_models.ProbitMatchingModel[source]

Probit regression baseline on similarity features using statsmodels.

Same interface as LogitMatchingModel, but using a normal CDF link.
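The only difference between the two models is the link that maps the linear predictor z to a probability. The sketch below compares the logistic CDF with the standard normal CDF using only the standard library; the function names are illustrative and not part of the package.

```python
import math


def logistic_cdf(z):
    """Logit link inverse: P(match) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def normal_cdf(z):
    """Probit link inverse: standard normal CDF, written via erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Both map 0 to 0.5; the normal CDF approaches 0 and 1 faster in the tails.
for z in (-2.0, 0.0, 2.0):
    print(z, logistic_cdf(z), normal_cdf(z))
```

Because the normal CDF has thinner tails, probit coefficients are typically smaller in magnitude than logit coefficients fitted to the same data, while the predicted probabilities are usually close.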

__init__()

fit(df, match_col='match', feature_prefix='col_')[source]

Fit probit regression on a pairwise similarity DataFrame.

Parameters:
  • df (pd.DataFrame) – (Possibly subsampled) DataFrame produced by AlternativeModels.

  • match_col (str, default "match") – Name of the binary target column.

  • feature_prefix (str, default "col_") – Prefix of feature columns (similarity features).

Return type:

ProbitMatchingModel

class neer_match_utilities.baseline_models.SuggestMixin[source]

Adds a NeerMatch-like .suggest(left, right, count, verbose) API to baseline models.

Requires:

  • self.predict_proba(df_pairs) is implemented

  • self.similarity_map is set to a SimilarityMap (or dict) describing the features to compute

suggest(left, right, *, count=10, verbose=0, left_id_col=None, right_id_col=None)[source]

Return top-k candidate matches per left record (like neer_match DL models).

Return type:

DataFrame

Output columns:
  • left: integer row index into left (0..len(left)-1)

  • right: integer row index into right (0..len(right)-1)

  • prediction: match probability
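The shape of the output can be illustrated with a small top-k selection over a matrix of match probabilities. This is a sketch under assumptions, not the mixin's code: the function name `top_k_per_left` is hypothetical, the score matrix is made up, and a list of dicts stands in for the returned DataFrame.

```python
import heapq


def top_k_per_left(scores, count=10):
    """scores[i][j] is the match probability of left row i and right row j.

    Returns records with the documented columns: left, right, prediction.
    """
    rows = []
    for i, row in enumerate(scores):
        # Keep the `count` highest-scoring right candidates for this left row.
        best = heapq.nlargest(count, enumerate(row), key=lambda jp: jp[1])
        for j, p in best:
            rows.append({"left": i, "right": j, "prediction": p})
    return rows


# Two left records scored against three right records (toy values).
scores = [
    [0.1, 0.9, 0.4],
    [0.8, 0.2, 0.3],
]
suggestions = top_k_per_left(scores, count=2)
print(suggestions)
```

The left and right values are positional row indices, so they can be used to look up the original records with positional indexing on the two input tables.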