Training

class neer_match_utilities.training.Training(similarity_map, df_left=pd.DataFrame(), df_right=pd.DataFrame(), id_left='id', id_right='id')

A class for managing and evaluating training processes, including reordering matches, evaluating performance metrics, and exporting models.

Inherits:

SuperClass : Base class providing shared attributes and methods.

evaluate_dataframe(evaluation_test, evaluation_train)

Combines and evaluates test and training performance metrics.

Parameters:
  • evaluation_test (dict) – Dictionary containing performance metrics for the test dataset.

  • evaluation_train (dict) – Dictionary containing performance metrics for the training dataset.

Returns:

A DataFrame with accuracy, precision, recall, F-score, and a timestamp for both test and training datasets.

Return type:

pd.DataFrame

matches_reorder(matches, matches_id_left, matches_id_right)

Reorders a matches DataFrame to include indices from the left and right DataFrames instead of their original IDs.

Parameters:
  • matches (pd.DataFrame) – DataFrame containing matching pairs.

  • matches_id_left (str) – Column name in the matches DataFrame corresponding to the left IDs.

  • matches_id_right (str) – Column name in the matches DataFrame corresponding to the right IDs.

Returns:

A DataFrame with columns left and right, representing the indices of matching pairs in the left and right DataFrames.

Return type:

pd.DataFrame
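
The ID-to-index translation described above can be pictured with a small plain-Python sketch. It is not the library's implementation: the ID columns are modeled as lists rather than DataFrames, positions are assumed to be 0-based row positions, and the helper name `reorder_matches` is hypothetical.

```python
def reorder_matches(left_ids, right_ids, matches):
    """Translate (left_id, right_id) pairs into positional index pairs.

    Hypothetical stand-in for Training.matches_reorder: `left_ids` and
    `right_ids` play the role of the ID columns of the left and right
    DataFrames, and `matches` is a list of (left_id, right_id) tuples.
    """
    # Map each ID to its 0-based row position (an assumption of this sketch).
    left_pos = {record_id: i for i, record_id in enumerate(left_ids)}
    right_pos = {record_id: i for i, record_id in enumerate(right_ids)}
    return [
        {"left": left_pos[left_id], "right": right_pos[right_id]}
        for left_id, right_id in matches
    ]


left_ids = ["a7", "b2", "c9"]
right_ids = ["x1", "y5", "z3"]
matches = [("c9", "x1"), ("a7", "z3")]

reordered = reorder_matches(left_ids, right_ids, matches)
print(reordered)  # [{'left': 2, 'right': 0}, {'left': 0, 'right': 2}]
```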

performance_statistics_export(model, model_name, target_directory, evaluation_train={}, evaluation_test={})

Exports the trained model, similarity map, and evaluation metrics to the specified directory.

Parameters:

  • model (Model object) – The trained model to export.

  • model_name (str) – Name of the model, used as the name of the export subdirectory.

  • target_directory (Path) – The target directory where the model will be exported.

  • evaluation_train (dict, optional) – Performance metrics for the training dataset (default is {}).

  • evaluation_test (dict, optional) – Performance metrics for the test dataset (default is {}).

Returns:

None

Notes:

  • The method creates a subdirectory named after model_name inside target_directory.

  • If evaluation_train and evaluation_test are provided, their metrics are saved as a CSV file.

  • Similarity maps are serialized using dill and saved in the export directory.
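
The export layout described in these notes might look roughly like the sketch below. It is an illustration under stated assumptions, not the method's actual code: the file names `performance.csv` and `similarity_map.pkl` are hypothetical, the standard-library `pickle` stands in for dill, the example similarity map is a placeholder, and saving the model itself is omitted.

```python
import csv
import pickle
import tempfile
from datetime import datetime
from pathlib import Path


def export_sketch(similarity_map, model_name, target_directory,
                  evaluation_train=None, evaluation_test=None):
    """Illustrative export: subdirectory, metrics CSV, serialized similarity map."""
    # Subdirectory named after model_name inside target_directory.
    export_dir = Path(target_directory) / model_name
    export_dir.mkdir(parents=True, exist_ok=True)

    # Metrics are written only when both dictionaries are provided.
    if evaluation_train and evaluation_test:
        stamp = datetime.now().isoformat(timespec="seconds")
        fields = ["dataset"] + sorted(evaluation_train) + ["timestamp"]
        # "performance.csv" is a hypothetical file name.
        with open(export_dir / "performance.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields)
            writer.writeheader()
            writer.writerow({"dataset": "train", **evaluation_train, "timestamp": stamp})
            writer.writerow({"dataset": "test", **evaluation_test, "timestamp": stamp})

    # The library uses dill; pickle is a stdlib stand-in with the same dump() call.
    with open(export_dir / "similarity_map.pkl", "wb") as fh:
        pickle.dump(similarity_map, fh)
    return export_dir


with tempfile.TemporaryDirectory() as tmp:
    out = export_sketch(
        {"name": ["jaro_winkler"]},  # placeholder similarity map
        "demo_model",
        tmp,
        evaluation_train={"accuracy": 0.99, "recall": 0.90},
        evaluation_test={"accuracy": 0.98, "recall": 0.85},
    )
    exported = sorted(p.name for p in out.iterdir())
    export_dir_name = out.name
    print(exported)
```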

class neer_match_utilities.training.TrainingPipe(model_name, epochs_1, mismatch_share_1, epochs_2, mismatch_share_2, no_tm_pbatch, gamma, max_alpha, training_data, testing_data, similarity_map, initial_feature_width_scales=10, feature_depths=2, initial_record_width_scale=10, record_depth=4)

Orchestrates the full training and evaluation process of a deep learning record-linkage model using a user-supplied similarity map and preprocessed data.

The class handles both training phases (soft-F1 pretraining and focal-loss fine-tuning), dynamic learning-rate scheduling, and automatic weight-decay adaptation. It also exports checkpoints, final models, and evaluation statistics for reproducibility.

Parameters:

  • model_name (str) – Name assigned to the trained model. A corresponding subdirectory is created under the project directory to store checkpoints and exports.

  • test_ratio (float) – Proportion of data reserved for testing, used for logging and performance tracking. This parameter does not appear in the constructor signature shown above.

  • epochs_1 (int) – Number of training epochs during the first phase (soft-F1 pretraining).

  • mismatch_share_1 (float) – Fraction of all possible negative (non-matching) pairs used during Round 1 training.

  • epochs_2 (int) – Number of training epochs during the second phase (focal-loss fine-tuning).

  • mismatch_share_2 (float) – Fraction of all possible negative pairs used during Round 2 training.

  • no_tm_pbatch (int) – Target number of positive (matching) pairs per batch. Used to adapt batch size dynamically via the required_batch_size heuristic.

  • gamma (float) – Focusing parameter of the focal loss (applies in Round 2). Larger values emphasize hard-to-classify examples.

  • max_alpha (float) – Maximum weighting factor for the positive class in the focal loss. Prevents instability for extremely imbalanced datasets.

  • training_data (tuple or dict) – Preprocessed training data in one of the following formats:
    • Tuple: (left_train, right_train, matches_train)

    • Dict: {"left": left_train, "right": right_train, "matches": matches_train}

    Each element must be a pandas.DataFrame containing an id_unique column.

  • testing_data (tuple or dict) – Preprocessed testing data in one of the following formats:
    • Tuple: (left_test, right_test, matches_test)

    • Dict: {"left": left_test, "right": right_test, "matches": matches_test}

    Each element must be a pandas.DataFrame containing an id_unique column.

  • similarity_map (dict) – User-defined similarity configuration mapping variable names to similarity measures. Must follow the format accepted by SimilarityMap.

Returns:

None

Notes:

  • The pipeline assumes that the data have already been preprocessed, formatted, and tokenized.

  • Round 1 (soft-F1 phase) initializes the model and emphasizes balanced learning across classes.

  • Round 2 (focal-loss phase) refines the model to focus on hard-to-classify examples.

  • Dynamic heuristics are used to automatically infer:
    • Batch size (via expected positive density)

    • Peak learning rate (scaled with batch size, positives per batch, and parameter count)

    • Weight decay (adjusted based on model size and learning rate)

  • Model checkpoints, histories, and evaluation reports are stored in subdirectories named after the provided model_name.

  • The final model, similarity map, and performance metrics are exported to disk using the Training.performance_statistics_export method for reproducibility.
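
One plausible reading of the batch-size heuristic mentioned in these notes is sketched below. The source does not give the formula used by required_batch_size, so everything here is an assumption: the number of sampled negatives is taken to be mismatch_share times the count of non-matching pairs, and the batch size is chosen so that the expected number of positives per batch reaches no_tm_pbatch. The function name is hypothetical.

```python
import math


def batch_size_for_target_positives(n_left, n_right, n_matches,
                                    mismatch_share, no_tm_pbatch):
    """Smallest batch size whose expected positive count reaches the target.

    Assumed model (not taken from the library): negatives are a
    `mismatch_share` fraction of all non-matching left/right pairs.
    """
    n_neg = mismatch_share * (n_left * n_right - n_matches)
    # Expected share of positives among the sampled pairs.
    positive_density = n_matches / (n_matches + n_neg)
    return math.ceil(no_tm_pbatch / positive_density)


size = batch_size_for_target_positives(
    n_left=1000, n_right=1000, n_matches=800,
    mismatch_share=0.1, no_tm_pbatch=8,
)
print(size)  # 1008
```

With these illustrative numbers, roughly 0.8% of sampled pairs are matches, so about a thousand pairs are needed per batch to expect eight positives.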

class WarmupCosine(peak_lr, warmup_steps, total_steps, min_lr_ratio=0.1, name=None)

__init__(peak_lr, warmup_steps, total_steps, min_lr_ratio=0.1, name=None)

classmethod from_config(config)

Instantiates a LearningRateSchedule from its config.

Parameters:

config – Output of get_config().

Returns:

A LearningRateSchedule instance.
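
The source does not spell out the schedule that WarmupCosine computes. The sketch below shows the conventional shape suggested by its name and parameters, as an assumption: a linear warmup from 0 to peak_lr over warmup_steps, followed by a cosine decay that reaches min_lr_ratio × peak_lr at total_steps. The function name is hypothetical and the class may differ in details such as the value at step 0.

```python
import math


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    min_lr = peak_lr * min_lr_ratio
    decay_steps = max(1, total_steps - warmup_steps)
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = min(1.0, max(0.0, (step - warmup_steps) / decay_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


lr_start = warmup_cosine_lr(0, 1e-3, 100, 1000)
lr_mid_warmup = warmup_cosine_lr(50, 1e-3, 100, 1000)
lr_peak = warmup_cosine_lr(100, 1e-3, 100, 1000)
lr_end = warmup_cosine_lr(1000, 1e-3, 100, 1000)
print(lr_start, lr_mid_warmup, lr_peak, lr_end)
```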

__init__(model_name, epochs_1, mismatch_share_1, epochs_2, mismatch_share_2, no_tm_pbatch, gamma, max_alpha, training_data, testing_data, similarity_map, initial_feature_width_scales=10, feature_depths=2, initial_record_width_scale=10, record_depth=4)

neer_match_utilities.training.alpha_balanced(left, right, matches, mismatch_share=1.0, max_alpha=0.95)

Compute α so that α·N_pos = (1 − α)·N_neg, that is, α = N_neg / (N_pos + N_neg).

Parameters:
  • left (pandas.DataFrame)

  • right (pandas.DataFrame)

  • matches (pandas.DataFrame)

  • mismatch_share (float, default=1.0)

  • max_alpha (float, default=0.95)

Returns:

α in [0,1] for focal loss (positive-class weight).

Return type:

float
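
Solving the balance condition for α gives α = N_neg / (N_pos + N_neg). The sketch below works this through on plain counts. How the function derives N_neg and applies its keyword arguments is not stated in the source; the sketch assumes N_neg is mismatch_share times the number of non-matching left/right pairs and that the result is capped at max_alpha. The name `alpha_from_counts` is hypothetical.

```python
def alpha_from_counts(n_left, n_right, n_matches,
                      mismatch_share=1.0, max_alpha=0.95):
    """Positive-class weight alpha with alpha * N_pos == (1 - alpha) * N_neg.

    Assumptions of this sketch: negatives are a `mismatch_share` fraction of
    all non-matching pairs, and alpha is capped at `max_alpha`.
    """
    n_pos = n_matches
    n_neg = mismatch_share * (n_left * n_right - n_pos)
    alpha = n_neg / (n_pos + n_neg)
    return min(alpha, max_alpha)


# 10 x 10 candidate pairs, 20 matches, 80 non-matches: alpha = 80 / 100.
alpha_small = alpha_from_counts(10, 10, 20)
print(alpha_small)  # 0.8, since 0.8 * 20 == 0.2 * 80

# Heavy imbalance: the uncapped value is about 0.9992, so the cap applies.
alpha_large = alpha_from_counts(1000, 1000, 800)
print(alpha_large)  # 0.95
```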

neer_match_utilities.training.combined_loss(weight_f1=0.5, epsilon=1e-07, alpha=0.99, gamma=1.5)

Combined loss: weighted sum of Soft F1 loss and Focal Loss for imbalanced binary classification.

This loss blends the advantages of a differentiable F1-based objective (which balances precision and recall) with the sample-focusing property of Focal Loss (which down-weights easy examples). By tuning weight_f1, you can interpolate between solely optimizing for F1 score (when weight_f1 = 1.0) and solely focusing on hard examples via focal loss (when weight_f1 = 0.0).

Parameters:
  • weight_f1 (float, default=0.5) – Mixing coefficient in [0, 1].
    • weight_f1 = 1.0: optimize only Soft F1 loss.

    • weight_f1 = 0.0: optimize only Focal Loss.

    • Intermediate values blend the two objectives proportionally.

  • epsilon (float, default=1e-7) – Small stabilizer for Soft F1 calculation. Must be > 0.

  • alpha (float, default=0.99) – Balancing factor for Focal Loss, weighting the positive (minority) class. Must lie in [0, 1].

  • gamma (float, default=1.5) – Focusing parameter for Focal Loss. Must be >= 0.
    • gamma = 0 reduces to weighted BCE.

    • Larger gamma emphasizes harder (misclassified) examples.

Returns:

A function loss(y_true, y_pred) that computes

\[\text{CombinedLoss} = \text{weight\_f1} \cdot \text{SoftF1}(y, \hat{y};\,\varepsilon) + (1 - \text{weight\_f1}) \cdot \text{FocalLoss}(y, \hat{y};\,\alpha, \gamma).\]

Minimizing this combined loss encourages both a high F1 score and focus on hard-to-classify samples.

Return type:

callable

Raises:

ValueError – If weight_f1 is not in [0, 1], or if epsilon <= 0, or if alpha is not in [0, 1], or if gamma < 0.

Notes

  • Soft F1 loss: \(1 - \text{SoftF1}\), where

    \[\text{SoftF1} = \frac{2\,TP + \varepsilon}{2\,TP + FP + FN + \varepsilon}.\]

    Here TP, FP, and FN are soft counts computed from probabilities.

  • Focal Loss down-weights well-classified examples to focus learning on difficult cases.
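
The interpolation controlled by weight_f1 can be checked with simple arithmetic. The sketch below takes the two component loss values as plain numbers (the values 0.2 and 0.04 are arbitrary placeholders, not results from the library) and applies the weighted sum from the formula above. The helper name `blend_losses` is hypothetical.

```python
import math


def blend_losses(soft_f1_value, focal_value, weight_f1=0.5):
    """Weighted sum of a Soft F1 loss value and a Focal Loss value."""
    if not 0.0 <= weight_f1 <= 1.0:
        raise ValueError("weight_f1 must lie in [0, 1]")
    return weight_f1 * soft_f1_value + (1.0 - weight_f1) * focal_value


soft_f1_value, focal_value = 0.2, 0.04  # placeholder component losses

only_f1 = blend_losses(soft_f1_value, focal_value, weight_f1=1.0)
only_focal = blend_losses(soft_f1_value, focal_value, weight_f1=0.0)
half = blend_losses(soft_f1_value, focal_value, weight_f1=0.5)

print(only_f1, only_focal)       # the endpoints recover each component
print(math.isclose(half, 0.12))  # midpoint: 0.5 * 0.2 + 0.5 * 0.04
```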

References

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV.

  • Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.

Examples

import tensorflow as tf
from neer_match_utilities.training import combined_loss

loss_fn = combined_loss(weight_f1=0.5, epsilon=1e-6, alpha=0.25, gamma=2.0)

y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)

value = loss_fn(y_true, y_pred)
print("Combined loss:", float(value.numpy()))

neer_match_utilities.training.focal_loss(alpha=0.99, gamma=1.5)

Focal Loss function for binary classification tasks.

Focal Loss is designed to address class imbalance by assigning higher weights to the minority class and focusing the model’s learning on hard-to-classify examples. It reduces the loss contribution from well-classified examples, making it particularly effective for imbalanced datasets.

Parameters:
  • alpha (float, optional, default=0.99) –

    Weighting factor for the positive class (minority class).

    • Must be in the range [0, 1].

    • A higher value increases the loss contribution from the positive class (underrepresented class) relative to the negative class (overrepresented class).

  • gamma (float, optional, default=1.5) –

    Focusing parameter that reduces the loss contribution from easy examples.

    • gamma = 0: No focusing; the loss reduces to alpha-weighted binary cross-entropy (with alpha = 0.5, standard binary cross-entropy scaled by 0.5).

    • gamma > 0: Focuses more on hard-to-classify examples.

    • Larger values emphasize harder examples more strongly.

Returns:

loss – A loss function that computes the focal loss given the true labels (y_true) and predicted probabilities (y_pred).

Return type:

callable

Raises:

ValueError – If alpha is not in the range [0, 1].

Notes

  • The positive class (minority or underrepresented class) is weighted by alpha.

  • The negative class (majority or overrepresented class) is automatically weighted by 1 - alpha.

  • Ensure alpha is set appropriately to reflect the level of imbalance in the dataset.

References

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV.

Explanation of Key Terms

  • Positive Class (Underrepresented):

    • Refers to the class with fewer examples in the dataset.

    • Typically weighted by alpha, which should be greater than 0.5 in highly imbalanced datasets.

  • Negative Class (Overrepresented):

    • Refers to the class with more examples in the dataset.

    • Its weight is automatically 1 - alpha.
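
The weighting described above can be made concrete with a plain-Python version of the binary focal loss from Lin et al. (2017). This is a sketch rather than the library's TensorFlow code: the mean reduction over examples and the clipping constant `eps` are assumptions, and the function name is hypothetical.

```python
import math


def focal_loss_value(y_true, y_pred, alpha=0.99, gamma=1.5, eps=1e-7):
    """Mean binary focal loss over paired labels and predicted probabilities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite (assumed clipping)
        # Positive examples are weighted by alpha, negatives by 1 - alpha;
        # the (1 - p_t) ** gamma factor shrinks the loss of easy examples.
        pos_term = -alpha * y * (1.0 - p) ** gamma * math.log(p)
        neg_term = -(1.0 - alpha) * (1.0 - y) * p ** gamma * math.log(1.0 - p)
        total += pos_term + neg_term
    return total / len(y_true)


# A confidently correct positive contributes less than an uncertain one.
easy = focal_loss_value([1.0], [0.9])
hard = focal_loss_value([1.0], [0.6])
print(easy < hard)  # True

# With gamma = 0 and alpha = 0.5 the loss is half the binary cross-entropy.
plain = focal_loss_value([1.0, 0.0], [0.9, 0.2], alpha=0.5, gamma=0.0)
bce = -(math.log(0.9) + math.log(0.8)) / 2
print(math.isclose(plain, 0.5 * bce))  # True
```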

neer_match_utilities.training.soft_f1_loss(epsilon=1e-07)

Soft F1 Loss for imbalanced binary classification tasks.

Soft F1 Loss provides a differentiable approximation of the F1 score, combining precision and recall into a single metric. By optimizing this loss, models are encouraged to balance false positives and false negatives, which is especially useful when classes are imbalanced.

Parameters:

epsilon (float, optional, default=1e-7) – Small constant added to numerator and denominator to avoid division by zero and stabilize training. Must be > 0.

Returns:

loss – A loss function that takes true labels (y_true) and predicted probabilities (y_pred) and returns 1 - soft_f1, so that minimizing this loss maximizes the soft F1 score.

Return type:

callable

Raises:

ValueError – If epsilon is not strictly positive.

Notes

  • True positives (TP), false positives (FP), and false negatives (FN) are computed in a “soft” (differentiable) manner by summing over probabilities rather than thresholded predictions.

  • Soft F1 = (2·TP + ε) / (2·TP + FP + FN + ε).

  • Loss = 1 − Soft F1, which ranges from 0 (perfect) to 1 (worst).

References

  • Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.

Explanation of Key Terms

  • True Positives (TP): Sum of predicted probabilities for actual positive examples.

  • False Positives (FP): Sum of predicted probabilities assigned to negative examples.

  • False Negatives (FN): Sum of (1 − predicted probability) for positive examples.

  • ε (epsilon): Stabilizer to prevent division by zero when TP, FP, and FN are all zero.
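
The soft counts defined above can be computed by hand for a small example. The sketch below applies the formulas from the Notes in plain Python, without TensorFlow; the function name is hypothetical, and it treats all inputs as one flat batch, which is an assumption about how the library reduces over examples.

```python
import math


def soft_f1_loss_value(y_true, y_pred, epsilon=1e-7):
    """1 - Soft F1, with TP, FP and FN summed over predicted probabilities."""
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    tp = sum(y * p for y, p in zip(y_true, y_pred))          # soft true positives
    fp = sum((1 - y) * p for y, p in zip(y_true, y_pred))    # soft false positives
    fn = sum(y * (1 - p) for y, p in zip(y_true, y_pred))    # soft false negatives
    soft_f1 = (2 * tp + epsilon) / (2 * tp + fp + fn + epsilon)
    return 1.0 - soft_f1


# TP = 0.9 + 0.7 = 1.6, FP = 0.2, FN = 0.1 + 0.3 = 0.4
# Soft F1 = 3.2 / 3.8, so the loss is about 0.158.
loss = soft_f1_loss_value([1, 0, 1], [0.9, 0.2, 0.7])
print(round(loss, 3))  # 0.158
```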

Examples

```python
import tensorflow as tf
from neer_match_utilities.training import soft_f1_loss

loss_fn = soft_f1_loss(epsilon=1e-6)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
loss_value = loss_fn(y_true, y_pred)
print(loss_value.numpy())  # e.g. 0.1…
```