A Minimal Training Pipeline

This document explains the data preparation process for training our matching model. The example data comes from a research project that digitized historic records of German joint-stock companies (Gram et al. 2022). The data contains inconsistencies in spelling, primarily due to variations in abbreviation conventions and OCR errors, across most variables. These challenges make it a compelling real-world use case for entity matching.

The data consists of three files:

left.csv
right.csv
matches.csv

Loading the Data

Training the pipelines requires three datasets:

left (observations from one source or period)
right (observations from another source or period)
matches (a dataframe where each row contains the unique IDs of matching entities from left and right)

import random
import pandas as pd

matches = pd.read_csv('matches.csv')
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')

Preview of the matches data:

matches.head()

	company_id_left	company_id_right
0	1e87fc75b4	0008e07878
1	810c9c3435	8bf51ba8a0
2	571dfb67e2	90b6db7ed3
3	d67d97da08	b0c68f1152
4	22ac99ae20	e9823a3073

Preview of the left dataset:

left.head()

	company_id	oai_identifier	company_name	company_info_1	company_info_2	pdf_page_num	found_year	found_date_modified	register_year	register_date_modified	…	effect_year	item_rank	purpose	city	bs_text	sboard_text	proc_text	capital_text	volume	industry
0	1e87fc75b4	1006345701_18970010	Glückauf-, Actien-Gesellschaft für Braunkohlen…	NaN	NaN	627	1871.0	1871-08-03	NaN	NaN	…	NaN	1.0	Abbau von Braunkohlenlagern u. Brikettfabrikat…	Lichtenau	Grundst cke M Grubenwert M Schachtanlagen M Ge…	sichtsrat Vors Buchh ndler Abel Dietzel Gumper…	NaN	M 660 000 in 386 Priorit tsaktien M 1 500	1	Bergwerke, Hütten- und Salinenwesen.
1	810c9c3435	1006345701_189900031	Deutsch-Oesterreichische Mannesmannröhren-Werke	in Berlin W. u. Düsseldorf mit Zweigniederlass…	NaN	501	1890.0	1890-07-16	NaN	NaN	…	NaN	1.0	Betrieb der Mannesmannröhren-Walzwerke in Rems…	Berlin	Generaldirektion D sseldorf Mobiliar u Utensil…	Vors Direktor Max Steinthal Stellv Karl v d He…	Dr M Fuchs A Krusche Berlin G Hethey N Eich	M 25 900 000 in 23 875 Inhaber Aktien Lit	3	Bergwerke, Hütten- und Salinenwesen.
2	571dfb67e2	1006345701_191900231	Handwerkerbank Spaichingen, Akt.-Ges. in Spaic…	NaN	NaN	345	1889.0	1889-11-24	NaN	NaN	…	NaN	1.0	Betrieb von Bank- und Kommissionsgeschäften in…	Spaichingen	Forderung an Aktion re Immobil Gerichtskosten …	Vors Wilh Lobmiller Stellv Franz Xav Schmid Sa…	NaN	M 600 000 in 600 Aktien M 1000 Urspr M	23	Kredit-Banken und andere Geld-Institute.
3	d67d97da08	1006345701_191300172	Vorschuss-Anstalt für Malchin A.-G.	NaN	Letzte Statutänd. 10./7. 1900. Kapital: M. 900…	165	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	A	Forder Effekten u Hypoth Debit Bankguth Kassa …	W Deutler E Buhr W Fehlow	NaN	NaN	17	Geld-Institute etc.
4	22ac99ae20	1006345701_191200161	Kaisersteinbruch-Actiengesellschaft in Liqu. i…	NaN	NaN	1443	1900.0	1900-03-17	1900.0	1900-04-11	…	1900.0	1.0	Betrieb von Steinhauereien u. aller mit dem Ba…	Köln	Steinbr che Steinhauerei Immobil Mannheim Mobi…	Vors Dr jur P Stephan Rheinbreitbach b Unkel S…	NaN	M 450 000 in 150abgest Vorz Aktien u 300 doppelt	16	Industrie der Steine und Erden.

5 rows × 21 columns

Preview of the right dataset:

right.head()

	company_id	oai_identifier	company_name	company_info_1	company_info_2	pdf_page_num	found_year	found_date_modified	register_year	register_date_modified	…	effect_year	item_rank	purpose	city	bs_text	sboard_text	proc_text	capital_text	volume	industry
0	0008e07878	1006345701_189800021	„Glückauf-’, Act.-Ges. für Braunkohlen-Verwert…	NaN	NaN	1038	1871.0	1871-08-03	NaN	NaN	…	NaN	1.0	Abbau von Braunkohlenlagern u. Brikettfabrikat…	NaN	Grundst cke M Grubenwert M Schachtanlagen M Ge…	Vors Buchh ndler Abel Dietzel Gumpert Lehmann …	NaN	M 660 000 in 386 Vorzugsaktien M 1500 14 Aktien	2	Nachtrag.
1	8bf51ba8a0	1006345701_189900032	Deutsch-Oesterreichische Mannesmannröhren-Werke.	Sitz in Berlin, Generaldirektion in Düsseldorf…	NaN	222	1890.0	1890-07-16	NaN	NaN	…	NaN	1.0	Betrieb der Mannesmannröhren-Walzwerke in Rems…	Berlin	Generaldirektion Grundst ckskonto M Mobilien U…	Vors Bankdirektor Max Steinthal Stellv Bankdir…	Dr M Fuchs A Krusche Berlin G Hethey N Eich	M 25 900 000 in 23 875 Inhaber Aktien Lit	3	Bergwerke, Hütten- und Salinenwesen.
2	90b6db7ed3	1006345701_191900232	Handwerkerbank Spaichingen, Akt.-Ges. in Spaic…	(in Liquidation).	NaN	168	1889.0	1889-11-24	NaN	NaN	…	NaN	1.0	Betrieb von Bank- und Kommissionsgeschäften in…	Spaichingen	NaN	Vors Wilh Lobmiller Stellv Frans Nav Schmid Sa…	NaN	M 600 000 in 600 Aktien M 1000 Urspr M	23	Geld-Institute etc.
3	b0c68f1152	1006345701_191400182	%% für Malchin A.-G. in Malchin.	(In Liquidation.) Letzte Statutänd. 10./7. 190…	NaN	193	NaN	NaN	NaN	NaN	…	NaN	NaN	NaN	Malchin	Forder Effekten u Hypoth Debit Bankguth Kassa …	W Deutler E Buhr W Fehlow	NaN	NaN	18	Kredit-Banken und andere Geld-Institute.
4	e9823a3073	1006345701_190700112	Kaisersteinbruch-Actiengesellschaft in Köln,	Zweiggeschäfte in Berlin u. Hamburg.	NaN	818	1900.0	1900-03-17	1900.0	1900-04-11	…	1900.0	1.0	Betrieb von Steinhauereien u. aller mit dem Ba…	Köln	Steinbr che Steinhauerei Grundst ck Mannheim M…	Vors Rechtsanw Dr jur Max Liertz Stellv Stadtb…	NaN	M 900 000 in 900 Aktien wovon 600 abgest M	11	Industrie der Steine und Erden.

5 rows × 21 columns

Defining Features and Similarity Concepts

The similarity_map defines which similarity concepts (values) to apply to each feature pair (keys). Note that this example uses a minimal similarity map for simplicity rather than optimal performance.

from neer_match.similarity_map import SimilarityMap
from neer_match_utilities.custom_similarities import CustomSimilarities

CustomSimilarities() # Ensures Similarity concepts are always scaled between 0 and 1.

# Define similarity_map

similarity_map = {
    "company_name" : [
        "levenshtein",
        "jaro_winkler",
        "partial_token_sort_ratio",
    ],
    "city" : [
        "levenshtein",
    ],
    "industry" : [
        "levenshtein",
        "jaro_winkler",
        "notmissing",
    ],
}

smap = SimilarityMap(similarity_map)

Harmonizing the data

Left and Right

Next, data formatting can be harmonized using the Prepare class. This class offers flexible arguments for operations such as capitalizing strings, converting values to numeric types, and filling missing values. Additionally, a spaCy pipeline and custom stop words can be specified to remove noise from string variables (see additional functionalities). All operations are applied consistently to both the left and right DataFrames.

from neer_match_utilities.prepare import Prepare

# Initialize the Prepare object

prepare = Prepare(
    similarity_map=similarity_map, 
    df_left=left, 
    df_right=right, 
    id_left='company_id', 
    id_right='company_id',
)

# Get formatted and harmonized datasets

left, right = prepare.format(
    fill_numeric_na=False,
    to_numeric=['found_year'],
    fill_string_na=True, 
    capitalize=True,
    lower_case=False,
)

left.head()

	company_id	company_name	city	industry
0	1e87fc75b4	GLÜCKAUF-, ACTIEN-GESELLSCHAFT FÜR BRAUNKOHLEN…	LICHTENAU	BERGWERKE, HÜTTEN- UND SALINENWESEN.
1	810c9c3435	DEUTSCH-OESTERREICHISCHE MANNESMANNRÖHREN-WERKE	BERLIN	BERGWERKE, HÜTTEN- UND SALINENWESEN.
2	571dfb67e2	HANDWERKERBANK SPAICHINGEN, AKT.-GES. IN SPAIC…	SPAICHINGEN	KREDIT-BANKEN UND ANDERE GELD-INSTITUTE.
3	d67d97da08	VORSCHUSS-ANSTALT FÜR MALCHIN A.-G.	A	GELD-INSTITUTE ETC.
4	22ac99ae20	KAISERSTEINBRUCH-ACTIENGESELLSCHAFT IN LIQU. I…	KÖLN	INDUSTRIE DER STEINE UND ERDEN.

Re-Structuring the `Matches` dataframe

neer-match requires that the matches DataFrame be structured with the indices from the left and right datasets instead of their unique IDs. To convert your matches DataFrame into the required format, you can run:

from neer_match_utilities.training import Training

training = Training(
    similarity_map=similarity_map, 
    df_left=left, 
    df_right=right, 
    id_left='company_id', 
    id_right='company_id',
)

matches = training.matches_reorder(
    matches, 
    matches_id_left='company_id_left', 
    matches_id_right='company_id_right'
)

matches.head()

	left	right
0	0	0
1	1	1
2	2	2
3	3	3
4	4	4

Splitting Data

Subsequently, we need to split the data into training and test sets, each consisting of three DataFrames. The training ratio is given by \(\text{training_ratio} = 1 - (\text{test_ratio} + \text{validation_ratio})\). Note that since validation is not implemented yet, you can set \(\text{validation_ratio} = 0\).

from neer_match_utilities.split import split_test_train

left_train, right_train, matches_train, left_validation, right_validation, matches_validation, left_test, right_test, matches_test = split_test_train(
    left = left,
    right = right,
    matches = matches,
    test_ratio = .5,
    validation_ratio = .0
)

Training and Exporting the Model

For this tutorial, we use a simple Logit model. Other models (ANN, Probit, or GradientBoost) follow a similar syntax and are covered in alternative models.

from neer_match_utilities.baseline_training import BaselineTrainingPipe
import pandas as pd
import os

training_pipeline = BaselineTrainingPipe(
    model_name='demonstration_model',
    similarity_map=smap,
    training_data=(left_train, right_train, matches_train),
    validation_data=(left_validation, right_validation, matches_validation),  # only needed if tune_threshold for GB
    testing_data=(left_test, right_test, matches_test),
    id_left_col="company_id",
    id_right_col="company_id",
    # matches_id_left="left",
    # matches_id_right="right",
    model_kind="logit", # "logit" | "probit" | "gb"
    mismatch_share_fit=1.0,
    # tune_threshold=False, # recommended for "gb"
    # tune_metric="mcc",
)

training_pipeline.execute()

Performance metrics saved to /Users/marli453/develop/py-neer-utilities/docs/source/_static/examples/demonstration_model/performance.csv
Similarity map saved to /Users/marli453/develop/py-neer-utilities/docs/source/_static/examples/demonstration_model/similarity_map.dill
Baseline model saved to /Users/marli453/develop/py-neer-utilities/docs/source/_static/examples/demonstration_model/model

LogitMatchingModel(result=<statsmodels.discrete.discrete_model.BinaryResultsWrapper object at 0x12f679ed0>, feature_cols=['col_city_city_levenshtein', 'col_company_name_company_name_jaro_winkler', 'col_company_name_company_name_levenshtein', 'col_company_name_company_name_partial_token_sort_ratio', 'col_industry_industry_jaro_winkler', 'col_industry_industry_levenshtein'])