Saving a Model
Prepare the Data
This tutorial illustrates how to correctly store a neer-match
model.
Before we get to the saving itself, we have to repeat the steps detailed in Data Preparation for Training:
import random
import pandas as pd
from neer_match.similarity_map import SimilarityMap
from neer_match_utilities.panel import SetupData
from neer_match_utilities.prepare import Prepare
from neer_match_utilities.training import Training
from neer_match_utilities.split import split_test_train
from neer_match_utilities.custom_similarities import CustomSimilarities
random.seed(42)
# Load files
matches = pd.read_csv('matches.csv')
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
# Preparation
left, right, matches = SetupData(matches=matches).data_preparation_cs(
    df_left=left,
    df_right=right,
    unique_id='company_id'
)
# Load the custom similarities for missing values (used as 'notmissing'/'notzero' below) from the utilities package
CustomSimilarities()
# Define the similarity_map: each key is a column present in both data frames; each value lists the similarity functions applied to it
similarity_map = {
    "company_name": [
        "levenshtein",
        "jaro_winkler",
        "prefix",
        "postfix",
        "token_sort_ratio",
        "token_set_ratio",
        "partial_token_set_ratio",
        "partial_token_sort_ratio",
    ],
    "city": [
        "levenshtein",
        "jaro_winkler",
        "notmissing",
    ],
    "industry": [
        "levenshtein",
        "jaro_winkler",
        "notmissing",
    ],
    "purpose": [
        "levenshtein",
        "jaro_winkler",
        "token_sort_ratio",
        "token_set_ratio",
        "partial_token_set_ratio",
        "partial_token_sort_ratio",
        "notmissing",
    ],
    "bs_text": [
        "levenshtein",
        "jaro_winkler",
        "token_sort_ratio",
        "token_set_ratio",
        "partial_token_set_ratio",
        "partial_token_sort_ratio",
        "notmissing",
    ],
    "found_year": [
        "discrete",
        "notzero",
    ],
    "found_date_modified": [
        "discrete",
        "notmissing",
    ],
}
smap = SimilarityMap(similarity_map)
prepare = Prepare(
    similarity_map=similarity_map,
    df_left=left,
    df_right=right,
    id_left='company_id',
    id_right='company_id'
)

left, right = prepare.format(
    fill_numeric_na=False,
    to_numeric=['found_year'],
    fill_string_na=True,
    capitalize=True
)
# Re-Structuring
training = Training(
    similarity_map=similarity_map,
    df_left=left,
    df_right=right,
    id_left='company_id',
    id_right='company_id'
)

matches = training.matches_reorder(
    matches,
    matches_id_left='left',
    matches_id_right='right'
)
# Splitting data
(
    left_train, right_train, matches_train,
    left_validation, right_validation, matches_validation,
    left_test, right_test, matches_test,
) = split_test_train(
    left=left,
    right=right,
    matches=matches,
    test_ratio=0.7,
    validation_ratio=0.0
)
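As a quick sanity check, you can inspect the sizes of the resulting splits; the snippet below is an illustration and assumes the splits are pandas DataFrames, as loaded above:
# With test_ratio=0.7 and validation_ratio=0.0, roughly 70% of the
# matched pairs should land in the test split and the rest in training
print(len(matches_train), len(matches_validation), len(matches_test))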
Train the Model
Once the data is in the correct structure, we can train the model.
Note that this example uses the custom combined_loss function.
This loss function is designed to address class imbalance by assigning
higher weights to the minority class and focusing the model’s learning
on hard-to-classify examples, while simultaneously aligning optimization
with the F1 score to balance precision and recall. By reducing the loss
contribution from well-classified examples, it is particularly effective
for imbalanced datasets. Its design is based on ideas presented in Lin
et al. (2017) as well as
Bénédict et al. (2021).
In detail, the combined_loss is defined as

\[
\mathcal{L} = w_{F1}\,\mathcal{L}_{\mathrm{soft\text{-}}F1} + (1 - w_{F1})\,\mathcal{L}_{\mathrm{focal}},
\]

where \(y \in \{0,1\}\) are true labels and \(\hat{y} \in (0,1)\) are predicted probabilities.

The soft F1 loss is

\[
\mathcal{L}_{\mathrm{soft\text{-}}F1} = 1 - \frac{2\,TP}{2\,TP + FP + FN + \epsilon},
\]

with \(TP\), \(FP\), and \(FN\) computed as soft counts (sums over predicted probabilities) and \(\epsilon > 0\) for numerical stability.

The binary focal loss is

\[
\mathcal{L}_{\mathrm{focal}} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ \alpha\, y_i \,(1 - \hat{y}_i)^{\gamma} \log \hat{y}_i + (1 - \alpha)\,(1 - y_i)\, \hat{y}_i^{\gamma} \log(1 - \hat{y}_i) \Big].
\]

To balance the total weight of positives and negatives in the loss, pick \(\alpha\) such that the summed positive weight equals the summed negative weight:

\[
\alpha\, N_{\mathrm{pos}} = (1 - \alpha)\, N_{\mathrm{neg}},
\]

with

\[
\alpha = \frac{N_{\mathrm{neg}}}{N_{\mathrm{pos}} + N_{\mathrm{neg}}}.
\]
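For intuition, the snippet below sketches this loss in plain TensorFlow. It is a minimal illustration of the formulas above (combined_loss_sketch is a hypothetical name), not the combined_loss implementation shipped with neer_match_utilities:
import tensorflow as tf

def combined_loss_sketch(weight_f1=0.5, epsilon=1e-7, alpha=0.5, gamma=0.5):
    # Keras-style loss combining a soft-F1 term with a binary focal term
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        y_pred = tf.clip_by_value(y_pred, epsilon, 1.0 - epsilon)
        # Soft counts: sums over predicted probabilities
        tp = tf.reduce_sum(y_true * y_pred)
        fp = tf.reduce_sum((1.0 - y_true) * y_pred)
        fn = tf.reduce_sum(y_true * (1.0 - y_pred))
        soft_f1_loss = 1.0 - (2.0 * tp) / (2.0 * tp + fp + fn + epsilon)
        # Focal term: alpha reweights the classes, gamma down-weights
        # well-classified examples
        focal = -(
            alpha * y_true * (1.0 - y_pred) ** gamma
            * tf.math.log(y_pred)
            + (1.0 - alpha) * (1.0 - y_true) * y_pred ** gamma
            * tf.math.log(1.0 - y_pred)
        )
        return weight_f1 * soft_f1_loss + (1.0 - weight_f1) * tf.reduce_mean(focal)
    return loss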
from neer_match.matching_model import DLMatchingModel, NSMatchingModel
from neer_match_utilities.training import combined_loss, alpha_balanced
import tensorflow as tf
import os
# Initialize the model
model = DLMatchingModel(
    similarity_map=smap,
    initial_feature_width_scales=10,
    feature_depths=4,
    initial_record_width_scale=10,
    record_depth=8,
)
# Compile the model
## Calculate the alpha parameter given the dimensions of the training data
mismatch_share = 1.0
alpha = alpha_balanced(left_train, right_train, matches_train, mismatch_share) # ≈ N_neg / (N_pos + N_neg)
## Use exponential decay for the learning rate
initial_lr = 1e-4
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=5000,
    decay_rate=0.96,
    staircase=True
)
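## (Illustration) With staircase=True the learning rate drops by a factor
## of 0.96 every 5000 optimizer steps: lr_schedule(0) = 1e-4,
## lr_schedule(5000) = 9.6e-5, lr_schedule(10000) ≈ 9.22e-5, ...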
model.compile(
    loss=combined_loss(
        weight_f1=0.5,
        epsilon=1e-07,
        alpha=alpha,
        gamma=0.5
    ),
    optimizer=tf.keras.optimizers.Adam(
        learning_rate=lr_schedule,
        clipnorm=0.5
    )
)
# Fit the model
model.fit(
    left_train, right_train, matches_train,
    epochs=20,
    batch_size=64,
    mismatch_share=mismatch_share,
    shuffle=True,
)
Export the Model
Once the model is trained, it can be exported via the Model class. The name parameter specifies the name of the folder in which the model will be stored.
from neer_match_utilities.model import Model
from pathlib import Path
Model.save(
    model=model,
    target_directory=Path(__file__).resolve().parent,
    name='example_model'
)
Export Performance Statistics
In addition, we can evaluate the model on both the training and test datasets and export the corresponding performance statistics into the folder created when the model was exported.
# Evaluate the model
performance_train = model.evaluate(
    left_train,
    right_train,
    matches_train,
    mismatch_share=1.0
)

performance_test = model.evaluate(
    left_test,
    right_test,
    matches_test,
    mismatch_share=1.0
)
# Export performance statistics
training.performance_statistics_export(
    model=model,
    model_name='example_model',
    target_directory=Path(__file__).resolve().parent,
    evaluation_train=performance_train,
    evaluation_test=performance_test,
)