Training
- class neer_match_utilities.training.Training(similarity_map, df_left=pd.DataFrame(), df_right=pd.DataFrame(), id_left='id', id_right='id')[source]
A class for managing and evaluating training processes, including reordering matches, evaluating performance metrics, and exporting models.
Inherits:
SuperClass : Base class providing shared attributes and methods.
- evaluate_dataframe(evaluation_test, evaluation_train)[source]
Combines and evaluates test and training performance metrics.
- Parameters:
evaluation_test (dict) – Dictionary containing performance metrics for the test dataset.
evaluation_train (dict) – Dictionary containing performance metrics for the training dataset.
- Returns:
A DataFrame with accuracy, precision, recall, F-score, and a timestamp for both test and training datasets.
- Return type:
pd.DataFrame
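A minimal usage sketch, assuming the metric dictionaries carry the metrics listed above (accuracy, precision, recall, F-score); the exact key names expected by evaluate_dataframe are an assumption here:
```python
from neer_match_utilities.training import Training

# Hypothetical metric dictionaries; key names are assumed from the columns described above.
evaluation_train = {"accuracy": 0.97, "precision": 0.91, "recall": 0.88, "F-score": 0.89}
evaluation_test = {"accuracy": 0.95, "precision": 0.88, "recall": 0.84, "F-score": 0.86}

training = Training(similarity_map=similarity_map)  # similarity_map defined elsewhere
metrics_df = training.evaluate_dataframe(
    evaluation_test=evaluation_test,
    evaluation_train=evaluation_train,
)
print(metrics_df)  # one row per dataset plus a timestamp column
```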
- matches_reorder(matches, matches_id_left, matches_id_right)[source]
Reorders a matches DataFrame to include indices from the left and right DataFrames instead of their original IDs.
- Parameters:
matches (pd.DataFrame) – DataFrame containing matching pairs.
matches_id_left (str) – Column name in the matches DataFrame corresponding to the left IDs.
matches_id_right (str) – Column name in the matches DataFrame corresponding to the right IDs.
- Returns:
A DataFrame with columns left and right, representing the indices of matching pairs in the left and right DataFrames.
- Return type:
pd.DataFrame
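A hedged sketch of the reordering step with small illustrative DataFrames; the column names id_l and id_r are chosen only for this example:
```python
import pandas as pd
from neer_match_utilities.training import Training

df_left = pd.DataFrame({"id": ["L1", "L2", "L3"], "name": ["alpha", "beta", "gamma"]})
df_right = pd.DataFrame({"id": ["R1", "R2"], "name": ["alpha", "gama"]})
matches = pd.DataFrame({"id_l": ["L1", "L3"], "id_r": ["R1", "R2"]})

training = Training(
    similarity_map=similarity_map,  # defined elsewhere
    df_left=df_left,
    df_right=df_right,
    id_left="id",
    id_right="id",
)
# Returns a DataFrame with columns `left` and `right` holding positional indices.
reordered = training.matches_reorder(matches, matches_id_left="id_l", matches_id_right="id_r")
```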
- performance_statistics_export(model, model_name, target_directory, evaluation_train=None, evaluation_test=None, export_model=False)[source]
Exports performance metrics and the similarity map. Optionally also saves the model:
DL/NS models via Model.save
baseline models via ModelBaseline.save
By default no model is saved, because export_model defaults to False.
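A hedged call sketch following the signature above; model, evaluation_train, and evaluation_test are assumed to come from a prior training run, and the target directory is illustrative:
```python
# Exports metrics and the similarity map, and (optionally) the trained model.
training.performance_statistics_export(
    model=model,                       # trained DL/NS or baseline model (assumed to exist)
    model_name="my_model",
    target_directory="exports/",       # illustrative path
    evaluation_train=evaluation_train,
    evaluation_test=evaluation_test,
    export_model=True,                 # leave at False (default) to skip saving the model
)
```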
- class neer_match_utilities.training.TrainingPipe(model_name, training_data, testing_data, similarity_map, *, initial_feature_width_scales=10, feature_depths=2, initial_record_width_scale=10, record_depth=4, id_left_col='id_unique', id_right_col='id_unique', no_tm_pbatch=None, save_architecture=False, stage_1=True, stage_2=True, epochs_1=None, mismatch_share_1=None, stage1_loss=soft_f1_loss(), epochs_2=None, mismatch_share_2=None, gamma=None, max_alpha=None)[source]
Orchestrates the full training and evaluation process of a deep learning record-linkage model using a user-supplied similarity map and preprocessed data.
The class handles both training phases (soft-F1 pretraining and focal-loss fine-tuning), dynamic learning-rate scheduling, and automatic weight-decay adaptation. It also exports checkpoints, final models, and evaluation statistics for reproducibility.
Parameters:
- model_name : str
Name assigned to the trained model. A corresponding subdirectory is created under the project directory to store checkpoints and exports.
- training_data : tuple or dict
Preprocessed training data in one of the following formats: a tuple (left_train, right_train, matches_train), or a dict {"left": left_train, "right": right_train, "matches": matches_train}.
- testing_data : tuple or dict
Preprocessed testing data in one of the following formats: a tuple (left_test, right_test, matches_test), or a dict {"left": left_test, "right": right_test, "matches": matches_test}.
- similarity_map : dict
User-defined similarity configuration mapping variable names to similarity measures. Must follow the format accepted by SimilarityMap.
- id_left_col : str, optional
Name of the unique identifier column in the left DataFrames (left_train and left_test). The ID column is used internally to index entities and to align training labels. Defaults to “id_unique”.
- id_right_col : str, optional
Name of the unique identifier column in the right DataFrames (right_train and right_test). The ID column is used internally to index entities and to align training labels. Defaults to “id_unique”.
- no_tm_pbatch : int, optional
Target number of positive (matching) pairs per batch. Used to adapt batch size dynamically via the required_batch_size heuristic.
- save_architecture : bool, optional
Whether to save an image of the model architecture alongside the weights when exporting. Requires the Graphviz binaries to be installed; without them, the export fails. Defaults to False.
- stage_1 : bool, optional
Whether to run the first training stage (soft-F1 pretraining). If False, the arguments epochs_1 and mismatch_share_1 are not required and will be ignored.
- stage_2 : bool, optional
Whether to run the second training stage (focal-loss fine-tuning). If False, the arguments epochs_2, mismatch_share_2, no_tm_pbatch, gamma, and max_alpha are not required and will be ignored.
- epochs_1 : int, optional
Number of training epochs during the first phase (soft-F1 pretraining). Required only when stage_1=True.
- mismatch_share_1 : float, optional
Fraction of all possible negative (non-matching) pairs used during Round 1. Required only when stage_1=True.
- stage1_loss : str or callable, optional
Loss function used during Stage 1 (pretraining). By default, this is soft_f1_loss(), reproducing the original NeerMatch behavior. The argument accepts either:
- A string specifying a built-in or predefined loss:
"soft_f1" — use the standard soft-F1 loss (default)
"binary_crossentropy" — use wrapped binary crossentropy (internally adapted for NeerMatch’s evaluation loop)
- A callable loss function, allowing full customization:
soft_f1_loss() — explicit soft-F1 loss
focal_loss(alpha=0.25, gamma=2.0) — focal loss with parameters
Any user-defined loss function with signature loss(y_true, y_pred)
- epochs_2 : int, optional
Number of training epochs during the second phase (focal-loss fine-tuning). Required only when stage_2=True.
- mismatch_share_2 : float, optional
Fraction of sampled negative pairs used during Round 2. Required only when stage_2=True.
- gamma : float, optional
Focusing parameter of the focal loss (Round 2). Required only when stage_2=True.
- max_alpha : float, optional
Maximum weighting factor of the positive class for focal loss (Round 2). Required only when stage_2=True.
Returns:
None
Notes:
The pipeline assumes that the data have already been preprocessed, formatted, and tokenized.
Round 1 (soft-F1 phase) initializes the model and emphasizes balanced learning across classes.
Round 2 (focal-loss phase) refines the model to focus on hard-to-classify examples.
- Dynamic heuristics are used to automatically infer:
Batch size (via expected positive density)
Peak learning rate (scaled with batch size, positives per batch, and parameter count)
Weight decay (adjusted based on model size and learning rate)
Model checkpoints, histories, and evaluation reports are stored in subdirectories named after the provided model_name.
The final model, similarity map, and performance metrics are exported to disk using the Training.performance_statistics_export method for reproducibility.
Each training stage can be enabled or disabled independently through the stage_1 and stage_2 flags. If a stage is disabled, its hyperparameters are not required and will be ignored. When only one stage is active, the warm-up pass automatically adapts to the active stage’s mismatch sampling configuration.
- __init__(model_name, training_data, testing_data, similarity_map, *, initial_feature_width_scales=10, feature_depths=2, initial_record_width_scale=10, record_depth=4, id_left_col='id_unique', id_right_col='id_unique', no_tm_pbatch=None, save_architecture=False, stage_1=True, stage_2=True, epochs_1=None, mismatch_share_1=None, stage1_loss=soft_f1_loss(), epochs_2=None, mismatch_share_2=None, gamma=None, max_alpha=None)[source]
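A hedged instantiation sketch following the constructor signature above. The DataFrames and similarity_map are assumed to be preprocessed elsewhere, and whether training launches at construction or via a separate method is not shown here:
```python
from neer_match_utilities.training import TrainingPipe

pipe = TrainingPipe(
    model_name="company_linkage",                             # illustrative name
    training_data=(left_train, right_train, matches_train),   # preprocessed elsewhere
    testing_data=(left_test, right_test, matches_test),
    similarity_map=similarity_map,
    id_left_col="id_unique",
    id_right_col="id_unique",
    no_tm_pbatch=32,          # target number of positive pairs per batch (Stage 2)
    stage_1=True,
    epochs_1=50,
    mismatch_share_1=0.5,
    stage_2=True,
    epochs_2=30,
    mismatch_share_2=0.25,
    gamma=1.5,
    max_alpha=0.95,
)
```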
- neer_match_utilities.training.alpha_balanced(left, right, matches, mismatch_share=1.0, max_alpha=0.95)[source]
Compute α so that α*N_pos = (1-α)*N_neg.
- Parameters:
left (pandas.DataFrame)
right (pandas.DataFrame)
matches (pandas.DataFrame)
- Returns:
α in [0,1] for focal loss (positive-class weight).
- Return type:
float
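For intuition, a standalone sketch of the balance condition α·N_pos = (1−α)·N_neg, which solves to α = N_neg / (N_pos + N_neg). The handling of mismatch_share and max_alpha below is an assumption about how the documented defaults are applied, not the library implementation:
```python
def alpha_balanced_sketch(n_pos: int, n_neg: int,
                          mismatch_share: float = 1.0,
                          max_alpha: float = 0.95) -> float:
    """Illustrative re-derivation of the balance condition only."""
    n_neg_eff = n_neg * mismatch_share        # assumed: share of negatives actually sampled
    alpha = n_neg_eff / (n_pos + n_neg_eff)   # solves alpha * n_pos = (1 - alpha) * n_neg_eff
    return min(alpha, max_alpha)              # assumed: capped at max_alpha

print(alpha_balanced_sketch(n_pos=100, n_neg=9_900))  # ~0.95 (capped)
```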
- neer_match_utilities.training.combined_loss(weight_f1=0.5, epsilon=1e-07, alpha=0.99, gamma=1.5)[source]
Combined loss: weighted sum of Soft F1 loss and Focal Loss for imbalanced binary classification.
This loss blends the advantages of a differentiable F1-based objective (which balances precision and recall) with the sample-focusing property of Focal Loss (which down-weights easy examples). By tuning weight_f1, you can interpolate between solely optimizing for the F1 score (when weight_f1 = 1.0) and solely focusing on hard examples via focal loss (when weight_f1 = 0.0).
- Parameters:
weight_f1 (float, default=0.5) – Mixing coefficient in [0, 1]. weight_f1 = 1.0: optimize only Soft F1 loss. weight_f1 = 0.0: optimize only Focal Loss. Intermediate values blend the two objectives proportionally.
epsilon (float, default=1e-7) – Small stabilizer for the Soft F1 calculation. Must be > 0.
alpha (float, default=0.99) – Balancing factor for Focal Loss, weighting the positive (minority) class. Must lie in [0, 1].
gamma (float, default=1.5) – Focusing parameter for Focal Loss. gamma = 0 reduces to weighted BCE. Larger gamma emphasizes harder (misclassified) examples.
- Returns:
A function loss(y_true, y_pred) that computes
\[\text{CombinedLoss} = \text{weight\_f1} \cdot \text{SoftF1}(y, \hat{y};\,\varepsilon) + (1 - \text{weight\_f1}) \cdot \text{FocalLoss}(y, \hat{y};\,\alpha, \gamma).\]
Minimizing this combined loss encourages both a high F1 score and focus on hard-to-classify samples.
- Return type:
callable
- Raises:
ValueError – If weight_f1 is not in [0, 1], if epsilon <= 0, if alpha is not in [0, 1], or if gamma < 0.
Notes
Soft F1 loss: 1 - \text{SoftF1}, where
\[\text{SoftF1} = \frac{2\,TP + \varepsilon}{2\,TP + FP + FN + \varepsilon}.\]
Here TP, FP, and FN are soft counts computed from probabilities.
Focal Loss down-weights well-classified examples to focus learning on difficult cases.
References
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV.
Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.
Examples
```python
import tensorflow as tf

loss_fn = combined_loss(weight_f1=0.5, epsilon=1e-6, alpha=0.25, gamma=2.0)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
value = loss_fn(y_true, y_pred)
print("Combined loss:", float(value.numpy()))
```
- neer_match_utilities.training.focal_loss(alpha=0.99, gamma=1.5)[source]
Focal Loss function for binary classification tasks.
Focal Loss is designed to address class imbalance by assigning higher weights to the minority class and focusing the model’s learning on hard-to-classify examples. It reduces the loss contribution from well-classified examples, making it particularly effective for imbalanced datasets.
- Parameters:
alpha (float, optional, default=0.99) –
Weighting factor for the positive class (minority class).
Must be in the range [0, 1].
A higher value increases the loss contribution from the positive class (underrepresented class) relative to the negative class (overrepresented class).
gamma (float, optional, default=1.5) –
Focusing parameter that reduces the loss contribution from easy examples.
gamma = 0: no focusing, equivalent to weighted binary cross-entropy loss (if alpha is set to 0.5).
gamma > 0: focuses more on hard-to-classify examples; larger values emphasize harder examples more strongly.
- Returns:
loss – A loss function that computes the focal loss given the true labels (y_true) and predicted probabilities (y_pred).
- Return type:
callable
- Raises:
ValueError – If alpha is not in the range [0, 1].
Notes
The positive class (minority or underrepresented class) is weighted by alpha.
The negative class (majority or overrepresented class) is automatically weighted by 1 - alpha.
Ensure alpha is set appropriately to reflect the level of imbalance in the dataset.
References
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. In ICCV.
Explanation of Key Terms
Positive Class (Underrepresented):
Refers to the class with fewer examples in the dataset.
Typically weighted by alpha, which should be greater than 0.5 in highly imbalanced datasets.
Negative Class (Overrepresented):
Refers to the class with more examples in the dataset.
Its weight is automatically 1 - alpha.
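A small usage sketch in the style of the other examples, using the documented factory signature; the tensor values and the chosen alpha are illustrative only:
```python
import tensorflow as tf

# Build a focal loss weighted heavily toward the (assumed rare) positive class.
loss_fn = focal_loss(alpha=0.9, gamma=1.5)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
value = loss_fn(y_true, y_pred)
print(float(value.numpy()))
```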
- neer_match_utilities.training.soft_f1_loss(epsilon=1e-07)[source]
Soft F1 Loss for imbalanced binary classification tasks.
Soft F1 Loss provides a differentiable approximation of the F1 score, combining precision and recall into a single metric. By optimizing this loss, models are encouraged to balance false positives and false negatives, which is especially useful when classes are imbalanced.
- Parameters:
epsilon (float, optional, default=1e-7) – Small constant added to numerator and denominator to avoid division by zero and stabilize training. Must be > 0.
- Returns:
loss – A loss function that takes true labels (y_true) and predicted probabilities (y_pred) and returns 1 - soft_f1, so that minimizing this loss maximizes the soft F1 score.
- Return type:
callable
- Raises:
ValueError – If epsilon is not strictly positive.
Notes
True positives (TP), false positives (FP), and false negatives (FN) are computed in a “soft” (differentiable) manner by summing over probabilities rather than thresholded predictions.
Soft F1 = (2·TP + ε) / (2·TP + FP + FN + ε).
Loss = 1 − Soft F1, which ranges from 0 (perfect) to 1 (worst).
References
Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.
Explanation of Key Terms
True Positives (TP): Sum of predicted probabilities for actual positive examples.
False Positives (FP): Sum of predicted probabilities assigned to negative examples.
False Negatives (FN): Sum of (1 − predicted probability) for positive examples.
ε (epsilon): Stabilizer to prevent division by zero when TP, FP, and FN are all zero.
Examples
```python
import tensorflow as tf

loss_fn = soft_f1_loss(epsilon=1e-6)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
loss_value = loss_fn(y_true, y_pred)
print(loss_value.numpy())  # e.g. 0.1…
```