# Alternative Classification Models

This document covers all classification models available in `neer_match_utilities` for entity matching. The examples assume you have already prepared your data as described in the [basic training pipeline](basic_training_pipeline.md).

## Baseline Models

Baseline models (Logit, Probit, Gradient Boosting) use the `BaselineTrainingPipe` class. They are faster to train and serve as good benchmarks.

### Logit Model

Logistic regression using statsmodels. A simple, interpretable baseline.

``` python
from neer_match_utilities.baseline_training import BaselineTrainingPipe

training_pipeline = BaselineTrainingPipe(
    # Required
    model_name='my_logit_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),

    # Model type
    model_kind="logit",            # "logit" | "probit" | "gb"

    # ID columns (must match your data)
    id_left_col="company_id",
    id_right_col="company_id",

    # How the matches dataframe is structured
    matches_id_left="left",        # column name for left IDs in matches
    matches_id_right="right",      # column name for right IDs in matches
    matches_are_indices=True,      # True if matches contain row indices, False if IDs

    # Sampling: fraction of non-matches to use during fitting
    mismatch_share_fit=1.0,        # 1.0 = use all, 0.1 = use 10%
    random_state=42,
    shuffle_fit=True,

    # Prediction threshold
    threshold=0.5,
    tune_threshold=False,          # Logit/Probit: typically use 0.5

    # Optional: validation data for threshold tuning
    validation_data=None,          # (left_val, right_val, matches_val)

    # Export settings
    base_dir=None,                 # defaults to current directory
    export_model=True,
    export_stats=True,
    reload_sanity_check=True,      # verify model can be reloaded
)

training_pipeline.execute()
```

### Probit Model

Probit regression using statsmodels. Similar to the Logit model, but uses the cumulative normal distribution as the link function.

``` python
training_pipeline = BaselineTrainingPipe(
    model_name='my_probit_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),
    model_kind="probit",
    id_left_col="company_id",
    id_right_col="company_id",
    matches_id_left="left",
    matches_id_right="right",
    matches_are_indices=True,
    mismatch_share_fit=1.0,
    threshold=0.5,
    tune_threshold=False,
    export_model=True,
    export_stats=True,
)

training_pipeline.execute()
```

### Gradient Boosting Model

Gradient boosting using scikit-learn. More powerful but less interpretable.

``` python
training_pipeline = BaselineTrainingPipe(
    model_name='my_gb_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),
    model_kind="gb",
    id_left_col="company_id",
    id_right_col="company_id",
    matches_id_left="left",
    matches_id_right="right",
    matches_are_indices=True,

    # Sampling (GB often works well with subsampling)
    mismatch_share_fit=0.5,        # use 50% of non-matches

    # Threshold tuning (recommended for GB)
    tune_threshold=True,           # automatically find the best threshold
    tune_metric="mcc",             # "mcc" or "f1"
    validation_data=(left_val, right_val, matches_val),  # required for tuning

    export_model=True,
    export_stats=True,
)

training_pipeline.execute()
```
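**How threshold tuning works (sketch):** With `tune_threshold=True`, the pipeline searches for the decision threshold that maximises the chosen `tune_metric` on the validation data. The exact routine is internal to `BaselineTrainingPipe`; the sketch below only illustrates the general idea with a hypothetical `sweep_threshold` helper and made-up validation arrays (`y_val`, `p_val`) that are not part of the library.

``` python
# Illustration only: NOT the library's internal routine, just the idea behind
# tune_threshold/tune_metric. Inputs are hypothetical numpy arrays:
#   y_val - true 0/1 labels for validation pairs
#   p_val - predicted match probabilities for the same pairs
import numpy as np
from sklearn.metrics import matthews_corrcoef, f1_score

def sweep_threshold(y_val, p_val, metric="mcc", grid=None):
    """Return the threshold on p_val that maximises the chosen metric."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)   # candidate thresholds
    score_fn = matthews_corrcoef if metric == "mcc" else f1_score
    scores = [score_fn(y_val, (p_val >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return grid[best], scores[best]

# Example with random data (replace with your own validation labels/probabilities)
rng = np.random.default_rng(42)
y_val = rng.integers(0, 2, size=200)
p_val = np.clip(y_val * 0.6 + rng.normal(0.3, 0.2, size=200), 0, 1)
best_t, best_score = sweep_threshold(y_val, p_val, metric="mcc")
print(f"best threshold: {best_t:.2f}, MCC: {best_score:.3f}")
```

MCC is often preferred over F1 when non-matches vastly outnumber matches, because it also accounts for true negatives.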
## Deep Learning Model (ANN)

The neural network model uses `TrainingPipe` and supports two-stage training with customizable loss functions.

``` python
from neer_match_utilities.training import TrainingPipe

training_pipeline = TrainingPipe(
    # Required
    model_name='my_ann_model',
    similarity_map=similarity_map,       # dict format, not SimilarityMap object
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),

    # ID columns
    id_left_col="company_id",
    id_right_col="company_id",

    # Network architecture
    initial_feature_width_scales=10,     # width multiplier for feature networks
    feature_depths=2,                    # depth of feature networks
    initial_record_width_scale=10,       # width multiplier for record network
    record_depth=4,                      # depth of record network

    # Stage 1: Soft-F1 pretraining
    stage_1=True,
    epochs_1=50,
    mismatch_share_1=0.01,               # fraction of non-matches per epoch
    stage1_loss="soft_f1",               # "soft_f1" | "binary_crossentropy" | callable

    # Stage 2: Focal loss fine-tuning
    stage_2=True,
    epochs_2=30,
    mismatch_share_2=0.1,
    gamma=2.0,                           # focal loss focusing parameter
    max_alpha=0.9,                       # max weight for positive class

    # Batch size control
    no_tm_pbatch=8,                      # target positives per batch

    # Export
    save_architecture=False,             # requires graphviz binaries
)

training_pipeline.execute()
```

### Key Parameters Explained

**Network Architecture:**

- `initial_feature_width_scales`: Controls the width of the feature-specific networks. Higher values create wider networks.
- `feature_depths`: Number of layers in each feature network.
- `initial_record_width_scale`: Controls the width of the final record-comparison network.
- `record_depth`: Number of layers in the record network.

**Training Stages:**

- `stage_1`: Pretraining phase using soft-F1 loss to learn basic matching patterns.
- `stage_2`: Fine-tuning phase using focal loss to focus on hard examples.
- You can disable either stage by setting it to `False`.

**Sampling:**

- `mismatch_share_1/2`: Fraction of non-matches to sample per epoch. Lower values speed up training but may reduce quality.
- `no_tm_pbatch`: Target number of positive pairs per batch. The actual batch size is calculated automatically.

**Focal Loss (Stage 2):**

- `gamma`: Focusing parameter. Higher values focus more on hard examples (typical range: 1.0-3.0).
- `max_alpha`: Maximum class weight for positives. Balances class imbalance.

A minimal sketch of both the soft-F1 and focal losses is given at the end of this document.

### Single-Stage Training

You can run only one training stage:

``` python
# Only Stage 1 (faster, simpler)
training_pipeline = TrainingPipe(
    model_name='my_model_stage1_only',
    similarity_map=similarity_map,
    training_data=(left_train, right_train, matches_train),
    testing_data=(left_test, right_test, matches_test),
    id_left_col="company_id",
    id_right_col="company_id",
    stage_1=True,
    epochs_1=100,
    mismatch_share_1=0.05,
    no_tm_pbatch=8,
    stage_2=False,                       # Skip stage 2
)

training_pipeline.execute()
```

## Model Comparison

| Model  | Linear | Speed  | Interpretability |
|--------|--------|--------|------------------|
| Logit  | Yes    | Fast   | High             |
| Probit | Yes    | Fast   | High             |
| GB     | No     | Medium | Low              |
| ANN    | No     | Slow   | Low              |

**Note on performance:** Model performance depends heavily on the specific use case, dataset characteristics, and hyperparameter tuning. ANNs are more prone to getting stuck in local minima, making their results more volatile across runs. Always compare against the baseline models for your specific use case.
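**Loss functions (sketch):** For intuition about the Stage 1 and Stage 2 parameters above: a soft-F1 loss replaces hard true/false-positive counts with sums of predicted probabilities so that F1 becomes differentiable, and the focal loss down-weights pairs the model already classifies confidently (via `gamma`) while re-weighting the positive class with an alpha term (which `max_alpha` appears to cap; that reading of the parameter is an assumption). The numpy sketch below shows the standard forms of both losses, not the library's implementation.

``` python
# Illustration only: standard soft-F1 and focal-loss forms, written with numpy
# for readability. The pipeline's internal implementations may differ in detail.
import numpy as np

def soft_f1_loss(y_true, p_pred, eps=1e-8):
    """1 - soft F1, where counts are replaced by sums of probabilities."""
    tp = np.sum(p_pred * y_true)           # "soft" true positives
    fp = np.sum(p_pred * (1 - y_true))     # "soft" false positives
    fn = np.sum((1 - p_pred) * y_true)     # "soft" false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - soft_f1                   # minimise 1 - F1

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.9, eps=1e-8):
    """Binary focal loss: alpha weights positives, gamma down-weights easy pairs."""
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)    # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class weight
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t + eps))
```

With `gamma=0` and `alpha=0.5`, the focal loss reduces to a rescaled binary cross-entropy; increasing `gamma` shifts the gradient toward hard, misclassified pairs.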