Baseline Training

class neer_match_utilities.baseline_training.BaselineTrainingPipe(model_name, similarity_map, training_data, testing_data, validation_data=None, id_left_col='id', id_right_col='id', matches_id_left='left', matches_id_right='right', matches_are_indices=True, model_kind='gb', mismatch_share_fit=1.0, random_state=42, shuffle_fit=True, threshold=0.5, tune_threshold=True, tune_metric='mcc', base_dir=None, export_model=True, export_stats=True, reload_sanity_check=True)

Orchestrates training, evaluation, and export for baseline (non-DL) models:

  - LogitMatchingModel (statsmodels)

  - ProbitMatchingModel (statsmodels)

  - GradientBoostingModel (sklearn)

Pipeline steps

  1. Build full pairwise similarity DataFrames for train/val/test

  2. Subsample non-matches for fitting (optional)

  3. Fit selected baseline model

  4. Choose threshold (optional; recommended for GB)

  5. Evaluate on full train + test

  6. Save model via ModelBaseline.save(…)

  7. Export performance.csv + similarity_map.dill via Training.performance_statistics_export(…)
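Step 4 can be illustrated with a small, library-independent sketch. It assumes that threshold tuning means scanning a grid of cut-offs on held-out scores and keeping the one with the highest Matthews correlation coefficient (the default tune_metric is 'mcc'). The helper names mcc and tune_threshold, the grid, and the toy data are hypothetical and are not part of neer_match_utilities; the actual pipeline may tune differently.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is set to 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def tune_threshold(y_true, scores, grid=None):
    """Return the (threshold, mcc) pair with the highest MCC on the grid.

    Hypothetical helper: ties are resolved in favour of the lowest threshold.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    best_t, best_score = 0.5, -1.0
    for t in grid:
        preds = [1 if s >= t else 0 for s in scores]
        score = mcc(y_true, preds)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation pairs: 1 = match, 0 = non-match (made-up scores).
y_val = [0, 0, 0, 0, 1, 1]
p_val = [0.05, 0.10, 0.20, 0.30, 0.35, 0.90]

best_threshold, best_mcc = tune_threshold(y_val, p_val)
default_mcc = mcc(y_val, [1 if s >= 0.5 else 0 for s in p_val])
```

In this toy data one true match scores only 0.35, so the default threshold of 0.5 misses it, while a tuned cut-off between 0.30 and 0.35 separates the classes. This is the kind of case where tuning tends to help for models whose scores are not well calibrated around 0.5.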

__init__(model_name, similarity_map, training_data, testing_data, validation_data=None, id_left_col='id', id_right_col='id', matches_id_left='left', matches_id_right='right', matches_are_indices=True, model_kind='gb', mismatch_share_fit=1.0, random_state=42, shuffle_fit=True, threshold=0.5, tune_threshold=True, tune_metric='mcc', base_dir=None, export_model=True, export_stats=True, reload_sanity_check=True)
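The parameters mismatch_share_fit, random_state, and shuffle_fit appear to control step 2, the optional subsampling of non-matches before fitting. The sketch below shows one plausible reading, assuming that mismatch_share_fit is the fraction of non-matching pairs kept, that all matching pairs are always retained, and that shuffle_fit permutes the resulting rows. The function subsample_non_matches and the toy data are hypothetical and do not reflect the library's internals.

```python
import random


def subsample_non_matches(pairs, labels, mismatch_share=1.0,
                          random_state=42, shuffle=True):
    """Keep all matches and a random share of the non-matches.

    Hypothetical helper illustrating the assumed semantics of
    mismatch_share_fit, random_state, and shuffle_fit.
    """
    rng = random.Random(random_state)  # seeded for reproducibility
    match_idx = [i for i, y in enumerate(labels) if y == 1]
    non_match_idx = [i for i, y in enumerate(labels) if y == 0]

    # Number of non-matches to keep, e.g. 0.25 * 8 = 2.
    n_keep = round(len(non_match_idx) * mismatch_share)
    kept = match_idx + rng.sample(non_match_idx, n_keep)

    if shuffle:
        rng.shuffle(kept)   # randomise row order for fitting
    else:
        kept.sort()         # preserve the original row order
    return [pairs[i] for i in kept], [labels[i] for i in kept]


# Toy data: 2 matching pairs and 8 non-matching pairs.
pairs = [f"pair_{i}" for i in range(10)]
labels = [1, 1] + [0] * 8

fit_pairs, fit_labels = subsample_non_matches(pairs, labels, mismatch_share=0.25)
```

With mismatch_share=0.25 the fitting set here holds both matches plus two of the eight non-matches. Under this reading, evaluation in step 5 would still use the full, unsampled pair sets.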