Alternative Classification Models

This document covers all classification models available in neer_match_utilities for entity matching. The examples assume you have already prepared your data as described in the basic training pipeline guide.
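
Throughout, smap / similarity_map refers to the similarity map prepared in that guide, and the data tuples contain pandas DataFrames. The sketch below only illustrates the input shapes that the parameters used in the examples assume (ID columns named company_id, a matches frame with left/right columns); all column names and values are placeholders.

import pandas as pd

# Illustrative input shapes only -- adapt column names and values to your data.
left_train = pd.DataFrame({
    "company_id": [101, 102, 103],
    "name": ["ACME Corp", "Globex", "Initech"],
})
right_train = pd.DataFrame({
    "company_id": [201, 202, 203],
    "name": ["ACME Corporation", "Globex GmbH", "Umbrella"],
})

# With matches_are_indices=True the matches frame holds row positions;
# with matches_are_indices=False it holds values from the ID columns.
matches_train = pd.DataFrame({
    "left": [0, 1],   # rows 0 and 1 of left_train ...
    "right": [0, 1],  # ... match rows 0 and 1 of right_train
})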

Baseline Models

Baseline models (Logit, Probit, Gradient Boosting) use the BaselineTrainingPipe class. They are faster to train than the neural network model and serve as good benchmarks.

Logit Model

Logistic regression using statsmodels. A simple, interpretable baseline.

from neer_match_utilities.baseline_training import BaselineTrainingPipe

training_pipeline = BaselineTrainingPipe(
    # Required
    model_name='my_logit_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),

    # Model type
    model_kind="logit",  # "logit" | "probit" | "gb"

    # ID columns (must match your data)
    id_left_col="company_id",
    id_right_col="company_id",

    # How matches dataframe is structured
    matches_id_left="left",       # column name for left IDs in matches
    matches_id_right="right",     # column name for right IDs in matches
    matches_are_indices=True,     # True if matches contain row indices, False if IDs

    # Sampling: fraction of non-matches to use during fitting
    mismatch_share_fit=1.0,       # 1.0 = use all, 0.1 = use 10%
    random_state=42,
    shuffle_fit=True,

    # Prediction threshold
    threshold=0.5,
    tune_threshold=False,         # Logit/Probit: typically use 0.5

    # Optional: validation data for threshold tuning
    validation_data=None,         # (left_val, right_val, matches_val)

    # Export settings
    base_dir=None,                # defaults to current directory
    export_model=True,
    export_stats=True,
    reload_sanity_check=True,     # verify model can be reloaded
)

training_pipeline.execute()

Probit Model

Probit regression using statsmodels. Similar to the Logit model, but it uses the standard normal cumulative distribution function as the link.
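
A minimal, library-independent comparison of the two link functions (standard definitions, not neer_match_utilities code):

import numpy as np
from scipy.stats import norm

z = np.linspace(-3, 3, 7)           # the linear index x'beta
p_logit = 1.0 / (1.0 + np.exp(-z))  # logistic CDF (Logit)
p_probit = norm.cdf(z)              # standard normal CDF (Probit)
# Both map the index to a probability in (0, 1); only the tails differ slightly.

The pipeline configuration mirrors the Logit setup, only with model_kind set to "probit":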

training_pipeline = BaselineTrainingPipe(
    model_name='my_probit_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),

    model_kind="probit",

    id_left_col="company_id",
    id_right_col="company_id",
    matches_id_left="left",
    matches_id_right="right",
    matches_are_indices=True,

    mismatch_share_fit=1.0,
    threshold=0.5,
    tune_threshold=False,

    export_model=True,
    export_stats=True,
)

training_pipeline.execute()

Gradient Boosting Model

Gradient Boosting using scikit-learn. More powerful but less interpretable.

training_pipeline = BaselineTrainingPipe(
    model_name='my_gb_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),

    model_kind="gb",

    id_left_col="company_id",
    id_right_col="company_id",
    matches_id_left="left",
    matches_id_right="right",
    matches_are_indices=True,

    # Sampling (GB often works well with subsampling)
    mismatch_share_fit=0.5,       # use 50% of non-matches

    # Threshold tuning (recommended for GB)
    tune_threshold=True,          # automatically find best threshold
    tune_metric="mcc",            # "mcc" or "f1"
    validation_data=(left_val, right_val, matches_val),  # required for tuning

    export_model=True,
    export_stats=True,
)

training_pipeline.execute()
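
Conceptually, threshold tuning scans candidate cutoffs on the validation data and keeps the one that maximizes the chosen metric. The sketch below only illustrates that idea with scikit-learn metrics; it is not the pipeline's internal routine, and val_probs / val_labels are placeholder arrays.

import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

def pick_threshold(val_probs, val_labels, metric="mcc"):
    """Return the cutoff that maximizes MCC or F1 on validation predictions."""
    score = matthews_corrcoef if metric == "mcc" else f1_score
    grid = np.linspace(0.05, 0.95, 19)
    return max(grid, key=lambda t: score(val_labels, val_probs >= t))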

Deep Learning Model (ANN)

The neural network model uses TrainingPipe and supports two-stage training with customizable loss functions.

from neer_match_utilities.training import TrainingPipe

training_pipeline = TrainingPipe(
    # Required
    model_name='my_ann_model',
    similarity_map=similarity_map,  # dict format, not SimilarityMap object
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),

    # ID columns
    id_left_col="company_id",
    id_right_col="company_id",

    # Network architecture
    initial_feature_width_scales=10,  # width multiplier for feature networks
    feature_depths=2,                  # depth of feature networks
    initial_record_width_scale=10,     # width multiplier for record network
    record_depth=4,                    # depth of record network

    # Stage 1: Soft-F1 pretraining
    stage_1=True,
    epochs_1=50,
    mismatch_share_1=0.01,            # fraction of non-matches per epoch
    stage1_loss="soft_f1",            # "soft_f1" | "binary_crossentropy" | callable

    # Stage 2: Focal loss fine-tuning
    stage_2=True,
    epochs_2=30,
    mismatch_share_2=0.1,
    gamma=2.0,                        # focal loss focusing parameter
    max_alpha=0.9,                    # max weight for positive class

    # Batch size control
    no_tm_pbatch=8,                   # target positives per batch

    # Export
    save_architecture=False,          # requires graphviz binaries
)

training_pipeline.execute()

Key Parameters Explained

Network Architecture:

  • initial_feature_width_scales: Controls the width of feature-specific networks. Higher values create wider networks (see the conceptual sketch after this list).

  • feature_depths: Number of layers in each feature network.

  • initial_record_width_scale: Controls the width of the final record-comparison network.

  • record_depth: Number of layers in the record network.
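
Purely as a mental model (an assumption for intuition, not how TrainingPipe actually builds its layers), you can think of each width scale as multiplying an input width and each depth as stacking that many layers:

# Conceptual illustration only -- NOT the library's internal layer construction.
def layer_plan(n_inputs, width_scale, depth):
    """Hypothetical plan: 'depth' dense layers, each 'width_scale' times the input width."""
    return [n_inputs * width_scale] * depth

print(layer_plan(n_inputs=3, width_scale=10, depth=2))  # a feature network, e.g. [30, 30]
print(layer_plan(n_inputs=5, width_scale=10, depth=4))  # the record network, e.g. [50, 50, 50, 50]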

Training Stages:

  • stage_1: Pretraining phase using soft-F1 loss to learn basic matching patterns.

  • stage_2: Fine-tuning phase using focal loss to focus on hard examples.

  • You can disable either stage by setting it to False.

Sampling:

  • mismatch_share_1/2: Fraction of non-matches to sample per epoch. Lower values speed up training but may reduce quality (see the sketch after this list).

  • no_tm_pbatch: Target number of positive pairs per batch. The actual batch size is calculated automatically.
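
To make the sampling parameter concrete, the standalone sketch below subsamples non-matching pairs with pandas; it only illustrates what a mismatch share of 0.1 means and is not the pipeline's internal sampler.

import pandas as pd

# One row per candidate pair with a binary label (illustrative data).
pairs = pd.DataFrame({"is_match": [1, 1] + [0] * 98})

mismatch_share = 0.1  # keep every match, but only 10% of the non-matches
matches = pairs[pairs["is_match"] == 1]
non_matches = pairs[pairs["is_match"] == 0].sample(frac=mismatch_share, random_state=42)
epoch_sample = pd.concat([matches, non_matches]).sample(frac=1.0, random_state=42)
print(epoch_sample["is_match"].value_counts())  # 2 matches vs. roughly 10 non-matches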

Focal Loss (Stage 2):

  • gamma: Focusing parameter. Higher values focus more on hard examples (typical: 1.0-3.0); see the loss sketch after this list.

  • max_alpha: Maximum class weight for positives. Balances class imbalance.
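
For reference, assuming the standard binary focal loss of Lin et al. (2017), gamma and the positive-class weight alpha enter the per-example loss as follows (illustrative NumPy, not the pipeline's implementation):

import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.9):
    """Standard binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    eps = 1e-7
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)             # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # positive class gets weight alpha
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# Easy examples (p_t close to 1) are down-weighted by the (1 - p_t)**gamma factor.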

Single-Stage Training

You can run only one training stage:

# Only Stage 1 (faster, simpler)
training_pipeline = TrainingPipe(
    model_name='my_model_stage1_only',
    similarity_map=similarity_map,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),
    id_left_col="company_id",
    id_right_col="company_id",

    stage_1=True,
    epochs_1=100,
    mismatch_share_1=0.05,
    no_tm_pbatch=8,

    stage_2=False,  # Skip stage 2
)

training_pipeline.execute()

Model Comparison

Model     Linear   Speed    Interpretability
Logit     Yes      Fast     High
Probit    Yes      Fast     High
GB        No       Medium   Low
ANN       No       Slow     Low

Note on performance: Model performance depends heavily on the use case, dataset characteristics, and hyperparameter tuning. ANNs are more prone to getting stuck in local minima, so their results can be more volatile across runs. Always compare the ANN against the baseline models on your own data.