Training
- class neer_match_utilities.training.Training(similarity_map, df_left=pd.DataFrame(), df_right=pd.DataFrame(), id_left='id', id_right='id')
A class for managing and evaluating training processes, including reordering matches, evaluating performance metrics, and exporting models.
Inherits:
SuperClass : Base class providing shared attributes and methods.
- evaluate_dataframe(evaluation_test, evaluation_train)
Combines and evaluates test and training performance metrics.
- Parameters:
evaluation_test (dict) – Dictionary containing performance metrics for the test dataset.
evaluation_train (dict) – Dictionary containing performance metrics for the training dataset.
- Returns:
A DataFrame with accuracy, precision, recall, F-score, and a timestamp for both test and training datasets.
- Return type:
pd.DataFrame
- matches_reorder(matches, matches_id_left, matches_id_right)
Reorders a matches DataFrame to include indices from the left and right DataFrames instead of their original IDs.
- Parameters:
matches (pd.DataFrame) – DataFrame containing matching pairs.
matches_id_left (str) – Column name in the matches DataFrame corresponding to the left IDs.
matches_id_right (str) – Column name in the matches DataFrame corresponding to the right IDs.
- Returns:
A DataFrame with columns left and right, representing the indices of matching pairs in the left and right DataFrames.
- Return type:
pd.DataFrame
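The ID-to-index translation described above can be illustrated in plain Python. This is a minimal sketch of the idea, not the library's implementation: the helper name `reorder_matches`, its list-based arguments, and the sample IDs are all hypothetical.

```python
def reorder_matches(left_ids, right_ids, matches):
    """Translate (left_id, right_id) pairs into row-index pairs.

    Hypothetical helper: `left_ids` and `right_ids` list the IDs in row
    order, `matches` is a list of (left_id, right_id) tuples.
    """
    # Look-up tables from original ID to row position
    left_pos = {id_: i for i, id_ in enumerate(left_ids)}
    right_pos = {id_: i for i, id_ in enumerate(right_ids)}
    return [
        {"left": left_pos[l], "right": right_pos[r]}
        for l, r in matches
    ]


left_ids = ["a17", "b03", "c42"]
right_ids = ["x9", "y1", "z5"]
matches = [("c42", "x9"), ("a17", "z5")]

reordered = reorder_matches(left_ids, right_ids, matches)
print(reordered)  # [{'left': 2, 'right': 0}, {'left': 0, 'right': 2}]
```

Each output record holds positions rather than IDs, which mirrors the left and right columns described in the return value.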
- performance_statistics_export(model, model_name, target_directory, evaluation_train={}, evaluation_test={})
Exports the trained model, similarity map, and evaluation metrics to the specified directory.
- Parameters:
model (Model object) – The trained model to export.
model_name (str) – Name of the model, used as the name of the export directory.
target_directory (Path) – The target directory where the model will be exported.
evaluation_train (dict, optional) – Performance metrics for the training dataset (default is {}).
evaluation_test (dict, optional) – Performance metrics for the test dataset (default is {}).
- Returns:
None
Notes:
The method creates a subdirectory named after model_name inside target_directory.
If evaluation_train and evaluation_test are provided, their metrics are saved as a CSV file.
Similarity maps are serialized using dill and saved in the export directory.
- class neer_match_utilities.training.TrainingPipe(model_name, epochs_1, mismatch_share_1, epochs_2, mismatch_share_2, no_tm_pbatch, gamma, max_alpha, training_data, testing_data, similarity_map, initial_feature_width_scales=10, feature_depths=2, initial_record_width_scale=10, record_depth=4)
Orchestrates the full training and evaluation process of a deep learning record-linkage model using a user-supplied similarity map and preprocessed data.
The class handles both training phases (soft-F1 pretraining and focal-loss fine-tuning), dynamic learning-rate scheduling, and automatic weight-decay adaptation. It also exports checkpoints, final models, and evaluation statistics for reproducibility.
- Parameters:
model_name (str) – Name assigned to the trained model. A corresponding subdirectory is created under the project directory to store checkpoints and exports.
test_ratio (float) – Proportion of data reserved for testing, used for logging and performance tracking.
epochs_1 (int) – Number of training epochs during the first phase (soft-F1 pretraining).
mismatch_share_1 (float) – Fraction of all possible negative (non-matching) pairs used during Round 1 training.
epochs_2 (int) – Number of training epochs during the second phase (focal-loss fine-tuning).
mismatch_share_2 (float) – Fraction of all possible negative pairs used during Round 2 training.
no_tm_pbatch (int) – Target number of positive (matching) pairs per batch. Used to adapt batch size dynamically via the required_batch_size heuristic.
gamma (float) – Focusing parameter of the focal loss (applies in Round 2). Larger values emphasize hard-to-classify examples.
max_alpha (float) – Maximum weighting factor for the positive class in the focal loss. Prevents instability for extremely imbalanced datasets.
training_data (tuple or dict) – Preprocessed training data in one of the following formats:
Tuple: (left_train, right_train, matches_train)
Dict: {"left": left_train, "right": right_train, "matches": matches_train}
Each element must be a pandas.DataFrame containing an id_unique column.
testing_data (tuple or dict) – Preprocessed testing data in one of the following formats:
Tuple: (left_test, right_test, matches_test)
Dict: {"left": left_test, "right": right_test, "matches": matches_test}
Each element must be a pandas.DataFrame containing an id_unique column.
similarity_map (dict) – User-defined similarity configuration mapping variable names to similarity measures. Must follow the format accepted by SimilarityMap.
- Returns:
None
Notes:
The pipeline assumes that the data have already been preprocessed, formatted, and tokenized.
Round 1 (soft-F1 phase) initializes the model and emphasizes balanced learning across classes.
Round 2 (focal-loss phase) refines the model to focus on hard-to-classify examples.
Dynamic heuristics are used to automatically infer:
- Batch size (via expected positive density)
- Peak learning rate (scaled with batch size, positives per batch, and parameter count)
- Weight decay (adjusted based on model size and learning rate)
Model checkpoints, histories, and evaluation reports are stored in subdirectories named after the provided model_name.
The final model, similarity map, and performance metrics are exported to disk using the Training.performance_statistics_export method for reproducibility.
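How a target number of positives per batch can translate into a batch size is sketched below. This is only an assumption about the general idea behind a positive-density heuristic, not the actual required_batch_size implementation: the function name `batch_size_for_positives`, its count-based arguments, and the formula itself are hypothetical.

```python
import math


def batch_size_for_positives(n_left, n_right, n_matches, mismatch_share, no_tm_pbatch):
    """Hypothetical sketch: smallest batch size that is expected to contain
    `no_tm_pbatch` matching pairs when pairs are sampled uniformly."""
    n_neg_all = n_left * n_right - n_matches      # all possible non-matching pairs
    n_neg = mismatch_share * n_neg_all            # negatives actually used
    density = n_matches / (n_matches + n_neg)     # expected share of positives
    return math.ceil(no_tm_pbatch / density)


size = batch_size_for_positives(
    n_left=1000, n_right=1000, n_matches=800,
    mismatch_share=0.01, no_tm_pbatch=8,
)
print(size)  # 108
```

Under this assumption, a smaller mismatch share raises the positive density and therefore shrinks the batch needed to see the same number of matches.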
- neer_match_utilities.training.alpha_balanced(left, right, matches, mismatch_share=1.0, max_alpha=0.95)
Compute α so that α*N_pos = (1-α)*N_neg.
- Parameters:
left (pandas.DataFrame)
right (pandas.DataFrame)
matches (pandas.DataFrame)
- Returns:
α in [0,1] for focal loss (positive-class weight).
- Return type:
float
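Solving the balance condition α*N_pos = (1-α)*N_neg for α gives α = N_neg / (N_pos + N_neg). The sketch below works this through on plain counts. It rests on assumptions not stated in the source: that N_neg is mismatch_share times the number of non-matching pairs, and that the result is capped at max_alpha. The function name `alpha_from_counts` is hypothetical and takes counts rather than DataFrames.

```python
def alpha_from_counts(n_left, n_right, n_matches, mismatch_share=1.0, max_alpha=0.95):
    """Hypothetical sketch of a balanced positive-class weight."""
    n_pos = n_matches
    # Assumption: negatives are a sampled share of all non-matching pairs
    n_neg = mismatch_share * (n_left * n_right - n_matches)
    alpha = n_neg / (n_pos + n_neg)   # solves alpha * n_pos == (1 - alpha) * n_neg
    return min(alpha, max_alpha)      # assumption: cap at max_alpha


# Mild imbalance: the cap is not reached
uncapped = alpha_from_counts(10, 10, 8, mismatch_share=0.1, max_alpha=1.0)
# Strong imbalance: the raw value 0.995 is capped at 0.95
capped = alpha_from_counts(100, 100, 50)
print(uncapped, capped)
```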
- neer_match_utilities.training.combined_loss(weight_f1=0.5, epsilon=1e-07, alpha=0.99, gamma=1.5)
Combined loss: weighted sum of Soft F1 loss and Focal Loss for imbalanced binary classification.
This loss blends the advantages of a differentiable F1-based objective (which balances precision and recall) with the sample-focusing property of Focal Loss (which down-weights easy examples). By tuning weight_f1, you can interpolate between solely optimizing for F1 score (when weight_f1 = 1.0) and solely focusing on hard examples via focal loss (when weight_f1 = 0.0).
- Parameters:
weight_f1 (float, default=0.5) – Mixing coefficient in [0, 1]. weight_f1 = 1.0 optimizes only Soft F1 loss; weight_f1 = 0.0 optimizes only Focal Loss; intermediate values blend the two objectives proportionally.
epsilon (float, default=1e-7) – Small stabilizer for Soft F1 calculation. Must be > 0.
alpha (float, default=0.99) – Balancing factor for Focal Loss, weighting the positive (minority) class. Must lie in [0, 1].
gamma (float, default=1.5) – Focusing parameter for Focal Loss. gamma = 0 reduces to weighted BCE; larger gamma emphasizes harder (misclassified) examples.
- Returns:
A function loss(y_true, y_pred) that computes
\[\text{CombinedLoss} = \text{weight\_f1} \cdot \text{SoftF1Loss}(y, \hat{y};\,\varepsilon) + (1 - \text{weight\_f1}) \cdot \text{FocalLoss}(y, \hat{y};\,\alpha, \gamma),\]
where \(\text{SoftF1Loss} = 1 - \text{SoftF1}\). Minimizing this combined loss encourages both a high F1 score and focus on hard-to-classify samples.
- Return type:
callable
- Raises:
ValueError – If weight_f1 is not in [0, 1], or if epsilon <= 0, or if alpha is not in [0, 1], or if gamma < 0.
Notes
Soft F1 loss is \(1 - \text{SoftF1}\), where
\[\text{SoftF1} = \frac{2\,TP + \varepsilon}{2\,TP + FP + FN + \varepsilon}.\]
Here TP, FP, and FN are soft counts computed from probabilities.
Focal Loss down-weights well-classified examples to focus learning on difficult cases.
References
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV.
Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.
Examples
```python
import tensorflow as tf
from neer_match_utilities.training import combined_loss

# Blend Soft F1 loss and Focal Loss with equal weight
loss_fn = combined_loss(weight_f1=0.5, epsilon=1e-6, alpha=0.25, gamma=2.0)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
value = loss_fn(y_true, y_pred)
print("Combined loss:", float(value.numpy()))
```
- neer_match_utilities.training.focal_loss(alpha=0.99, gamma=1.5)
Focal Loss function for binary classification tasks.
Focal Loss is designed to address class imbalance by assigning higher weights to the minority class and focusing the model’s learning on hard-to-classify examples. It reduces the loss contribution from well-classified examples, making it particularly effective for imbalanced datasets.
- Parameters:
alpha (float, optional, default=0.99) –
Weighting factor for the positive class (minority class).
Must be in the range [0, 1].
A higher value increases the loss contribution from the positive class (underrepresented class) relative to the negative class (overrepresented class).
gamma (float, optional, default=1.5) –
Focusing parameter that reduces the loss contribution from easy examples.
gamma = 0: No focusing; the loss reduces to alpha-weighted binary cross-entropy (with alpha = 0.5, both classes are weighted equally).
gamma > 0: Focuses more on hard-to-classify examples.
Larger values emphasize harder examples more strongly.
- Returns:
loss – A loss function that computes the focal loss given the true labels (y_true) and predicted probabilities (y_pred).
- Return type:
callable
- Raises:
ValueError – If alpha is not in the range [0, 1].
Notes
The positive class (minority or underrepresented class) is weighted by alpha.
The negative class (majority or overrepresented class) is automatically weighted by 1 - alpha.
Ensure alpha is set appropriately to reflect the level of imbalance in the dataset.
References
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. In ICCV.
Explanation of Key Terms
Positive Class (Underrepresented):
Refers to the class with fewer examples in the dataset.
Typically weighted by alpha, which should be greater than 0.5 in highly imbalanced datasets.
Negative Class (Overrepresented):
Refers to the class with more examples in the dataset.
Its weight is automatically 1 - alpha.
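The roles of alpha and gamma can be checked numerically with a plain-Python version of the binary focal loss from Lin et al. (2017). This is a sketch under assumptions, not the library's code: the function name `focal_loss_value`, the mean reduction, and the clipping constant `eps` are hypothetical choices.

```python
import math


def focal_loss_value(y_true, y_pred, alpha=0.99, gamma=1.5, eps=1e-7):
    """Mean binary focal loss over paired labels and probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() stays finite
        # Positive examples: weighted by alpha, damped by (1 - p)^gamma
        pos = -alpha * y * (1.0 - p) ** gamma * math.log(p)
        # Negative examples: weighted by 1 - alpha, damped by p^gamma
        neg = -(1.0 - alpha) * (1.0 - y) * p ** gamma * math.log(1.0 - p)
        total += pos + neg
    return total / len(y_true)


y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.7]

no_focus = focal_loss_value(y_true, y_pred, alpha=0.5, gamma=0.0)
focused = focal_loss_value(y_true, y_pred, alpha=0.5, gamma=2.0)
print(no_focus, focused)
```

With gamma = 0 and alpha = 0.5 the value is exactly half the ordinary binary cross-entropy, and raising gamma shrinks the loss because all three sample predictions are already fairly confident.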
- neer_match_utilities.training.soft_f1_loss(epsilon=1e-07)
Soft F1 Loss for imbalanced binary classification tasks.
Soft F1 Loss provides a differentiable approximation of the F1 score, combining precision and recall into a single metric. By optimizing this loss, models are encouraged to balance false positives and false negatives, which is especially useful when classes are imbalanced.
- Parameters:
epsilon (float, optional, default=1e-7) – Small constant added to numerator and denominator to avoid division by zero and stabilize training. Must be > 0.
- Returns:
loss – A loss function that takes true labels (y_true) and predicted probabilities (y_pred) and returns 1 - soft_f1, so that minimizing this loss maximizes the soft F1 score.
- Return type:
callable
- Raises:
ValueError – If epsilon is not strictly positive.
Notes
True positives (TP), false positives (FP), and false negatives (FN) are computed in a “soft” (differentiable) manner by summing over probabilities rather than thresholded predictions.
Soft F1 = (2·TP + ε) / (2·TP + FP + FN + ε).
Loss = 1 − Soft F1, which ranges from 0 (perfect) to 1 (worst).
References
Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.
Explanation of Key Terms
True Positives (TP): Sum of predicted probabilities for actual positive examples.
False Positives (FP): Sum of predicted probabilities assigned to negative examples.
False Negatives (FN): Sum of (1 − predicted probability) for positive examples.
ε (epsilon): Stabilizer to prevent division by zero when TP, FP, and FN are all zero.
Examples
```python
import tensorflow as tf
from neer_match_utilities.training import soft_f1_loss

loss_fn = soft_f1_loss(epsilon=1e-6)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
loss_value = loss_fn(y_true, y_pred)
print(loss_value.numpy())  # e.g. 0.1…
```
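The soft counts and the formula from the Notes can be traced by hand on the same inputs. The following plain-Python sketch applies the documented definitions directly; the function name `soft_f1_loss_value` is hypothetical, and summing over all elements of a single flat batch is an assumption about the reduction.

```python
def soft_f1_loss_value(y_true, y_pred, epsilon=1e-7):
    """1 - soft F1, with TP, FP, FN summed from probabilities."""
    tp = sum(y * p for y, p in zip(y_true, y_pred))          # 0.9 + 0.7
    fp = sum((1 - y) * p for y, p in zip(y_true, y_pred))    # 0.2
    fn = sum(y * (1 - p) for y, p in zip(y_true, y_pred))    # 0.1 + 0.3
    soft_f1 = (2 * tp + epsilon) / (2 * tp + fp + fn + epsilon)
    return 1.0 - soft_f1


loss = soft_f1_loss_value([1, 0, 1], [0.9, 0.2, 0.7], epsilon=1e-6)
print(round(loss, 4))  # 0.1579
```

Here TP = 1.6, FP = 0.2, and FN = 0.4, so the soft F1 is roughly 3.2 / 3.8 ≈ 0.842 and the loss is about 0.158.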