Training
- class neer_match_utilities.training.Training(similarity_map, df_left=pd.DataFrame(), df_right=pd.DataFrame(), id_left='id', id_right='id')
A class for managing and evaluating training processes, including reordering matches, evaluating performance metrics, and exporting models.
Inherits:
SuperClass : Base class providing shared attributes and methods.
- evaluate_dataframe(evaluation_test, evaluation_train)
Combines and evaluates test and training performance metrics.
- Parameters:
evaluation_test (dict) – Dictionary containing performance metrics for the test dataset.
evaluation_train (dict) – Dictionary containing performance metrics for the training dataset.
- Returns:
A DataFrame with accuracy, precision, recall, F-score, and a timestamp for both test and training datasets.
- Return type:
pd.DataFrame
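A minimal sketch of how two metric dictionaries could be stacked into one summary table. The helper name `combine_evaluations`, the metric keys, the `data` and `timestamp` column names, and the numbers are illustrative assumptions, not the library's implementation.

```python
from datetime import datetime

import pandas as pd


def combine_evaluations(evaluation_test, evaluation_train):
    """Stack test and training metrics into a two-row DataFrame (hypothetical helper)."""
    # One shared timestamp so both rows can be traced back to the same run.
    timestamp = datetime.now().isoformat(timespec="seconds")
    rows = [
        {"data": "test", **evaluation_test, "timestamp": timestamp},
        {"data": "train", **evaluation_train, "timestamp": timestamp},
    ]
    return pd.DataFrame(rows)


# Made-up metric values, for illustration only.
evaluation_test = {"accuracy": 0.97, "precision": 0.90, "recall": 0.85, "f1": 0.87}
evaluation_train = {"accuracy": 0.99, "precision": 0.95, "recall": 0.93, "f1": 0.94}

summary = combine_evaluations(evaluation_test, evaluation_train)
print(summary)
```

Keeping one row per dataset makes it easy to append the result of several runs to the same table and compare them by timestamp.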
- matches_reorder(matches, matches_id_left, matches_id_right)
Reorders a matches DataFrame to include indices from the left and right DataFrames instead of their original IDs.
- Parameters:
matches (pd.DataFrame) – DataFrame containing matching pairs.
matches_id_left (str) – Column name in the matches DataFrame corresponding to the left IDs.
matches_id_right (str) – Column name in the matches DataFrame corresponding to the right IDs.
- Returns:
A DataFrame with columns left and right, representing the indices of matching pairs in the left and right DataFrames.
- Return type:
pd.DataFrame
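The ID-to-index translation described above can be sketched with plain pandas. This is an illustration of the idea under stated assumptions, not the method's actual code: the function `reorder_matches`, its explicit `left` and `right` arguments, and the toy data are hypothetical.

```python
import pandas as pd


def reorder_matches(matches, left, right, matches_id_left, matches_id_right,
                    id_left="id", id_right="id"):
    """Translate pairs of IDs into pairs of row indices (illustrative only)."""
    # Lookup tables: ID value -> row index in the respective DataFrame.
    left_lookup = dict(zip(left[id_left], left.index))
    right_lookup = dict(zip(right[id_right], right.index))
    return pd.DataFrame({
        "left": matches[matches_id_left].map(left_lookup).to_numpy(),
        "right": matches[matches_id_right].map(right_lookup).to_numpy(),
    })


# Toy data: record "a" matches "y", and record "c" matches "x".
left = pd.DataFrame({"id": ["a", "b", "c"], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": ["x", "y", "z"], "name": ["Cy", "Ann", "Bob"]})
matches = pd.DataFrame({"left_id": ["a", "c"], "right_id": ["y", "x"]})

reordered = reorder_matches(matches, left, right, "left_id", "right_id")
print(reordered)
```

In this sketch, IDs that do not occur in `left` or `right` would map to NaN, so a real implementation would likely validate the IDs first.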
- performance_statistics_export(model, model_name, target_directory, evaluation_train={}, evaluation_test={})
Exports the trained model, similarity map, and evaluation metrics to the specified directory.
- Parameters:
model (Model object) – The trained model to export.
model_name (str) – Name of the model, used as the export directory name.
target_directory (Path) – The target directory where the model will be exported.
evaluation_train (dict, optional) – Performance metrics for the training dataset (default is {}).
evaluation_test (dict, optional) – Performance metrics for the test dataset (default is {}).
- Returns:
None
Notes
The method creates a subdirectory named after model_name inside target_directory.
If both evaluation_train and evaluation_test are provided, their metrics are saved as a CSV file.
The similarity map is serialized using dill and saved in the export directory.
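The export layout described in the notes can be sketched with the standard library alone. Everything named here is an assumption for illustration: the function `export_sketch`, the file names `similarity_map.pkl` and `performance.csv`, and the use of `pickle` as a stand-in for dill. Saving the model itself is omitted because its format is not described above.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def export_sketch(similarity_map, model_name, target_directory,
                  evaluation_train=None, evaluation_test=None):
    """Write a similarity map and optional metrics into target_directory/model_name."""
    export_dir = Path(target_directory) / model_name
    export_dir.mkdir(parents=True, exist_ok=True)

    # pickle is used here as a stdlib stand-in for dill (assumption).
    with open(export_dir / "similarity_map.pkl", "wb") as fh:
        pickle.dump(similarity_map, fh)

    # Metrics are only written when both dictionaries are non-empty.
    if evaluation_train and evaluation_test:
        fieldnames = ["data"] + sorted(set(evaluation_train) | set(evaluation_test))
        with open(export_dir / "performance.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerow({"data": "train", **evaluation_train})
            writer.writerow({"data": "test", **evaluation_test})
    return export_dir


# Placeholder inputs, written to a temporary directory that is removed afterwards.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = export_sketch(
        similarity_map={"name": ["placeholder_similarity"]},
        model_name="demo_model",
        target_directory=tmp,
        evaluation_train={"accuracy": 0.99},
        evaluation_test={"accuracy": 0.97},
    )
    exported_files = sorted(p.name for p in export_dir.iterdir())

print(exported_files)  # ['performance.csv', 'similarity_map.pkl']
```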
- neer_match_utilities.training.alpha_balanced(left, right, matches, mismatch_share=1.0)
Computes α such that α·N_pos = (1 − α)·N_neg, so that the positive and negative classes carry equal total weight.
- Parameters:
left (pandas.DataFrame)
right (pandas.DataFrame)
matches (pandas.DataFrame)
- Returns:
α in [0,1] for focal loss (positive-class weight).
- Return type:
float
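Solving the balance condition α·N_pos = (1 − α)·N_neg for α gives α = N_neg / (N_pos + N_neg). The sketch below applies that closed form to counts passed in directly. How the library derives N_pos and N_neg from left, right, matches, and mismatch_share is not specified here, so the helper `balanced_alpha` and its example counts are hypothetical.

```python
def balanced_alpha(n_pos, n_neg):
    """Solve alpha * n_pos == (1 - alpha) * n_neg for alpha."""
    if n_pos < 0 or n_neg < 0 or n_pos + n_neg == 0:
        raise ValueError("counts must be non-negative and not both zero")
    # alpha * n_pos = n_neg - alpha * n_neg  =>  alpha = n_neg / (n_pos + n_neg)
    return n_neg / (n_pos + n_neg)


# Made-up counts: 10 matching pairs and 990 non-matching pairs.
alpha = balanced_alpha(n_pos=10, n_neg=990)
print(alpha)
```

The rarer the positive class, the closer α moves to 1, which is the direction one would expect for a positive-class weight.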
- neer_match_utilities.training.combined_loss(weight_f1=0.5, epsilon=1e-07, alpha=0.99, gamma=1.5)
Combined loss: weighted sum of Soft F1 loss and Focal Loss for imbalanced binary classification.
This loss blends the advantages of a differentiable F1-based objective (which balances precision and recall) with the sample-focusing property of Focal Loss (which down-weights easy examples). By tuning weight_f1, you can interpolate between optimizing solely for the F1 score (weight_f1 = 1.0) and focusing solely on hard examples via Focal Loss (weight_f1 = 0.0).
- Parameters:
weight_f1 (float, default=0.5) – Mixing coefficient in [0, 1].
weight_f1 = 1.0: optimize only Soft F1 loss.
weight_f1 = 0.0: optimize only Focal Loss.
Intermediate values blend the two objectives proportionally.
epsilon (float, default=1e-7) – Small stabilizer for the Soft F1 calculation. Must be > 0.
alpha (float, default=0.99) – Balancing factor for Focal Loss, weighting the positive (minority) class. Must lie in [0, 1].
gamma (float, default=1.5) – Focusing parameter for Focal Loss.
gamma = 0: reduces to weighted BCE.
Larger gamma emphasizes harder (misclassified) examples.
- Returns:
A function loss(y_true, y_pred) that computes
\[\text{CombinedLoss} = \text{weight\_f1} \cdot \bigl(1 - \text{SoftF1}(y, \hat{y};\,\varepsilon)\bigr) + (1 - \text{weight\_f1}) \cdot \text{FocalLoss}(y, \hat{y};\,\alpha, \gamma).\]
Minimizing this combined loss encourages both a high F1 score and focus on hard-to-classify samples.
- Return type:
callable
- Raises:
ValueError – If weight_f1 is not in [0, 1], if epsilon <= 0, if alpha is not in [0, 1], or if gamma < 0.
Notes
Soft F1 loss: \(1 - \text{SoftF1}\), where
\[\text{SoftF1} = \frac{2\,TP + \varepsilon}{2\,TP + FP + FN + \varepsilon}.\]
Here TP, FP, and FN are soft counts computed from probabilities.
Focal Loss down-weights well-classified examples to focus learning on difficult cases.
References
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV.
Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.
Examples
```python
import tensorflow as tf
from neer_match_utilities.training import combined_loss

loss_fn = combined_loss(weight_f1=0.5, epsilon=1e-6, alpha=0.25, gamma=2.0)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
value = loss_fn(y_true, y_pred)
print("Combined loss:", float(value.numpy()))
```
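To make the blending and the parameter checks concrete, here is a NumPy-only sketch of the same factory pattern. It is not the library's TensorFlow implementation: the names `soft_f1_loss_np`, `focal_loss_np`, and `combined_loss_np`, the mean reduction, and the clipping bound of 1e-7 are assumptions. The focal term follows the standard form from Lin et al. (2017).

```python
import numpy as np


def soft_f1_loss_np(y_true, y_pred, epsilon):
    """1 - SoftF1, with soft counts summed over probabilities."""
    tp = np.sum(y_true * y_pred)
    fp = np.sum((1.0 - y_true) * y_pred)
    fn = np.sum(y_true * (1.0 - y_pred))
    return float(1.0 - (2.0 * tp + epsilon) / (2.0 * tp + fp + fn + epsilon))


def focal_loss_np(y_true, y_pred, alpha, gamma):
    """Mean of -alpha_t * (1 - p_t)**gamma * log(p_t); mean reduction is an assumption."""
    p = np.clip(y_pred, 1e-7, 1.0 - 1e-7)  # avoid log(0); the bound is an assumption
    p_t = np.where(y_true == 1.0, p, 1.0 - p)
    alpha_t = np.where(y_true == 1.0, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))


def combined_loss_np(weight_f1=0.5, epsilon=1e-7, alpha=0.99, gamma=1.5):
    """Return loss(y_true, y_pred) blending Soft F1 loss and focal loss."""
    if not 0.0 <= weight_f1 <= 1.0:
        raise ValueError("weight_f1 must be in [0, 1]")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be > 0")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0.0:
        raise ValueError("gamma must be >= 0")

    def loss(y_true, y_pred):
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        f1_part = soft_f1_loss_np(y_true, y_pred, epsilon)
        focal_part = focal_loss_np(y_true, y_pred, alpha, gamma)
        return weight_f1 * f1_part + (1.0 - weight_f1) * focal_part

    return loss


y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.7]
only_f1 = combined_loss_np(weight_f1=1.0)(y_true, y_pred)     # pure Soft F1 loss
only_focal = combined_loss_np(weight_f1=0.0)(y_true, y_pred)  # pure focal loss
blended = combined_loss_np(weight_f1=0.5)(y_true, y_pred)     # halfway between the two
print(only_f1, only_focal, blended)
```

Validating the parameters once in the factory, rather than inside `loss`, keeps the per-batch function free of branching.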
- neer_match_utilities.training.focal_loss(alpha=0.99, gamma=1.5)
Focal Loss function for binary classification tasks.
Focal Loss is designed to address class imbalance by assigning higher weights to the minority class and focusing the model’s learning on hard-to-classify examples. It reduces the loss contribution from well-classified examples, making it particularly effective for imbalanced datasets.
- Parameters:
alpha (float, optional, default=0.99) –
Weighting factor for the positive class (minority class).
Must be in the range [0, 1].
A higher value increases the loss contribution from the positive class (underrepresented class) relative to the negative class (overrepresented class).
gamma (float, optional, default=1.5) –
Focusing parameter that reduces the loss contribution from easy examples.
gamma = 0: No focusing; the loss reduces to alpha-weighted binary cross-entropy (with alpha = 0.5, this is standard binary cross-entropy scaled by 0.5).
gamma > 0: Focuses more on hard-to-classify examples.
Larger values emphasize harder examples more strongly.
- Returns:
loss – A loss function that computes the focal loss given the true labels (y_true) and predicted probabilities (y_pred).
- Return type:
callable
- Raises:
ValueError – If alpha is not in the range [0, 1].
Notes
The positive class (minority or underrepresented class) is weighted by alpha.
The negative class (majority or overrepresented class) is automatically weighted by 1 - alpha.
Ensure alpha is set appropriately to reflect the level of imbalance in the dataset.
References
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal Loss for Dense Object Detection. ICCV.
Explanation of Key Terms
Positive Class (Underrepresented):
Refers to the class with fewer examples in the dataset.
Typically weighted by alpha, which should be greater than 0.5 in highly imbalanced datasets.
Negative Class (Overrepresented):
Refers to the class with more examples in the dataset.
Its weight is automatically 1 - alpha.
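The effect of gamma is easiest to see per example. The sketch below uses the standard focal loss term from Lin et al. (2017), -alpha_t · (1 - p_t)^gamma · log(p_t), in NumPy. The function name `focal_terms` and the clipping bound are assumptions, and the library's own implementation may differ in reduction and numerical details.

```python
import numpy as np


def focal_terms(y_true, y_pred, alpha=0.99, gamma=1.5):
    """Per-example focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), 1e-7, 1.0 - 1e-7)
    # p_t is the probability assigned to the true class.
    p_t = np.where(y_true == 1.0, p, 1.0 - p)
    # Positives are weighted by alpha, negatives by 1 - alpha.
    alpha_t = np.where(y_true == 1.0, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


# Two positive examples: one easy (p = 0.9) and one hard (p = 0.1).
y_true = [1, 1]
y_pred = [0.9, 0.1]

no_focus = focal_terms(y_true, y_pred, alpha=0.5, gamma=0.0)  # alpha-weighted BCE
focused = focal_terms(y_true, y_pred, alpha=0.5, gamma=2.0)

# The ratio is the modulating factor (1 - p_t)**gamma for each example.
modulation = focused / no_focus
print(modulation)  # approximately [0.01, 0.81]
```

With gamma = 2, the easy example keeps about 1% of its cross-entropy loss while the hard example keeps about 81%, which is the down-weighting of well-classified examples described above.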
- neer_match_utilities.training.soft_f1_loss(epsilon=1e-07)
Soft F1 Loss for imbalanced binary classification tasks.
Soft F1 Loss provides a differentiable approximation of the F1 score, combining precision and recall into a single metric. By optimizing this loss, models are encouraged to balance false positives and false negatives, which is especially useful when classes are imbalanced.
- Parameters:
epsilon (float, optional, default=1e-7) – Small constant added to numerator and denominator to avoid division by zero and stabilize training. Must be > 0.
- Returns:
loss – A loss function that takes true labels (y_true) and predicted probabilities (y_pred) and returns 1 - soft_f1, so that minimizing this loss maximizes the soft F1 score.
- Return type:
callable
- Raises:
ValueError – If epsilon is not strictly positive.
Notes
True positives (TP), false positives (FP), and false negatives (FN) are computed in a “soft” (differentiable) manner by summing over probabilities rather than thresholded predictions.
Soft F1 = (2·TP + ε) / (2·TP + FP + FN + ε).
Loss = 1 − Soft F1, which ranges from 0 (perfect) to 1 (worst).
References
Bénédict, G., Koops, V., Odijk, D., & de Rijke, M. (2021). SigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. arXiv:2108.10566.
Explanation of Key Terms
True Positives (TP): Sum of predicted probabilities for actual positive examples.
False Positives (FP): Sum of predicted probabilities assigned to negative examples.
False Negatives (FN): Sum of (1 − predicted probability) for positive examples.
ε (epsilon): Stabilizer to prevent division by zero when TP, FP, and FN are all zero.
Examples
```python
import tensorflow as tf
from neer_match_utilities.training import soft_f1_loss

loss_fn = soft_f1_loss(epsilon=1e-6)
y_true = tf.constant([[1, 0, 1]], dtype=tf.float32)
y_pred = tf.constant([[0.9, 0.2, 0.7]], dtype=tf.float32)
loss_value = loss_fn(y_true, y_pred)
print(loss_value.numpy())  # e.g. 0.1…
```
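The soft counts and the resulting loss can be checked by hand in pure Python, using the formula Soft F1 = (2·TP + ε) / (2·TP + FP + FN + ε) stated in the notes. The helper names `soft_counts` and `soft_f1_loss_value` are hypothetical, and the result is in double precision, so it may differ slightly from a float32 TensorFlow computation.

```python
def soft_counts(y_true, y_pred):
    """Soft TP, FP, FN: sums over probabilities instead of thresholded predictions."""
    tp = sum(t * p for t, p in zip(y_true, y_pred))
    fp = sum((1 - t) * p for t, p in zip(y_true, y_pred))
    fn = sum(t * (1 - p) for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def soft_f1_loss_value(y_true, y_pred, epsilon=1e-7):
    """Return 1 - SoftF1 for flat lists of labels and probabilities."""
    if epsilon <= 0:
        raise ValueError("epsilon must be strictly positive")
    tp, fp, fn = soft_counts(y_true, y_pred)
    soft_f1 = (2 * tp + epsilon) / (2 * tp + fp + fn + epsilon)
    return 1.0 - soft_f1


y_true = [1, 0, 1]
y_pred = [0.9, 0.2, 0.7]

tp, fp, fn = soft_counts(y_true, y_pred)
loss_value = soft_f1_loss_value(y_true, y_pred, epsilon=1e-6)
print(round(tp, 3), round(fp, 3), round(fn, 3), round(loss_value, 4))
# 1.6 0.2 0.4 0.1579
```

Here TP = 0.9 + 0.7, FP = 0.2, and FN = 0.1 + 0.3, so Soft F1 is roughly 3.2 / 3.8 and the loss roughly 0.158. When all three counts are zero, ε makes the ratio equal to 1 and the loss 0 instead of dividing by zero.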