# A Minimal Training Pipeline
This document explains how to prepare the data for our matching model
and how to train it. The example data comes from a research project that
digitized historical records of German joint-stock companies [(Gram et
al. 2022)](https://dl.acm.org/doi/10.1145/3531533). Most variables
contain spelling inconsistencies, caused primarily by varying
abbreviation conventions and OCR errors. These challenges make the data
a compelling real-world use case for entity matching.
The data consists of three files:
- *left.csv*
- *right.csv*
- *matches.csv*
## Loading the Data
Training the pipeline requires three datasets:
- `left` (observations from one source or period)
- `right` (observations from another source or period)
- `matches` (a dataframe where each row contains the unique IDs of
matching entities from `left` and `right`)
``` python
import pandas as pd
matches = pd.read_csv('matches.csv')
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
```
Preview of the matches data:
``` python
matches.head()
```
| | company_id_left | company_id_right |
|-----|-----------------|------------------|
| 0 | 1e87fc75b4 | 0008e07878 |
| 1 | 810c9c3435 | 8bf51ba8a0 |
| 2 | 571dfb67e2 | 90b6db7ed3 |
| 3 | d67d97da08 | b0c68f1152 |
| 4 | 22ac99ae20 | e9823a3073 |
Preview of the left dataset:
``` python
left.head()
```
| | company_id | oai_identifier | company_name | company_info_1 | company_info_2 | pdf_page_num | found_year | found_date_modified | register_year | register_date_modified | ... | effect_year | item_rank | purpose | city | bs_text | sboard_text | proc_text | capital_text | volume | industry |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 1e87fc75b4 | 1006345701_18970010 | Glückauf-, Actien-Gesellschaft für Braunkohlen... | NaN | NaN | 627 | 1871.0 | 1871-08-03 | NaN | NaN | ... | NaN | 1.0 | Abbau von Braunkohlenlagern u. Brikettfabrikat... | Lichtenau | Grundst cke M Grubenwert M Schachtanlagen M Ge... | sichtsrat Vors Buchh ndler Abel Dietzel Gumper... | NaN | M 660 000 in 386 Priorit tsaktien M 1 500 | 1 | Bergwerke, Hütten- und Salinenwesen. |
| 1 | 810c9c3435 | 1006345701_189900031 | Deutsch-Oesterreichische Mannesmannröhren-Werke | in Berlin W. u. Düsseldorf mit Zweigniederlass... | NaN | 501 | 1890.0 | 1890-07-16 | NaN | NaN | ... | NaN | 1.0 | Betrieb der Mannesmannröhren-Walzwerke in Rems... | Berlin | Generaldirektion D sseldorf Mobiliar u Utensil... | Vors Direktor Max Steinthal Stellv Karl v d He... | Dr M Fuchs A Krusche Berlin G Hethey N Eich | M 25 900 000 in 23 875 Inhaber Aktien Lit | 3 | Bergwerke, Hütten- und Salinenwesen. |
| 2 | 571dfb67e2 | 1006345701_191900231 | Handwerkerbank Spaichingen, Akt.-Ges. in Spaic... | NaN | NaN | 345 | 1889.0 | 1889-11-24 | NaN | NaN | ... | NaN | 1.0 | Betrieb von Bank- und Kommissionsgeschäften in... | Spaichingen | Forderung an Aktion re Immobil Gerichtskosten ... | Vors Wilh Lobmiller Stellv Franz Xav Schmid Sa... | NaN | M 600 000 in 600 Aktien M 1000 Urspr M | 23 | Kredit-Banken und andere Geld-Institute. |
| 3 | d67d97da08 | 1006345701_191300172 | Vorschuss-Anstalt für Malchin A.-G. | NaN | Letzte Statutänd. 10./7. 1900. Kapital: M. 900... | 165 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | A | Forder Effekten u Hypoth Debit Bankguth Kassa ... | W Deutler E Buhr W Fehlow | NaN | NaN | 17 | Geld-Institute etc. |
| 4 | 22ac99ae20 | 1006345701_191200161 | Kaisersteinbruch-Actiengesellschaft in Liqu. i... | NaN | NaN | 1443 | 1900.0 | 1900-03-17 | 1900.0 | 1900-04-11 | ... | 1900.0 | 1.0 | Betrieb von Steinhauereien u. aller mit dem Ba... | Köln | Steinbr che Steinhauerei Immobil Mannheim Mobi... | Vors Dr jur P Stephan Rheinbreitbach b Unkel S... | NaN | M 450 000 in 150abgest Vorz Aktien u 300 doppelt | 16 | Industrie der Steine und Erden. |
5 rows × 21 columns
Preview of the right dataset:
``` python
right.head()
```
| | company_id | oai_identifier | company_name | company_info_1 | company_info_2 | pdf_page_num | found_year | found_date_modified | register_year | register_date_modified | ... | effect_year | item_rank | purpose | city | bs_text | sboard_text | proc_text | capital_text | volume | industry |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| 0 | 0008e07878 | 1006345701_189800021 | „Glückauf-', Act.-Ges. für Braunkohlen-Verwert... | NaN | NaN | 1038 | 1871.0 | 1871-08-03 | NaN | NaN | ... | NaN | 1.0 | Abbau von Braunkohlenlagern u. Brikettfabrikat... | NaN | Grundst cke M Grubenwert M Schachtanlagen M Ge... | Vors Buchh ndler Abel Dietzel Gumpert Lehmann ... | NaN | M 660 000 in 386 Vorzugsaktien M 1500 14 Aktien | 2 | Nachtrag. |
| 1 | 8bf51ba8a0 | 1006345701_189900032 | Deutsch-Oesterreichische Mannesmannröhren-Werke. | Sitz in Berlin, Generaldirektion in Düsseldorf... | NaN | 222 | 1890.0 | 1890-07-16 | NaN | NaN | ... | NaN | 1.0 | Betrieb der Mannesmannröhren-Walzwerke in Rems... | Berlin | Generaldirektion Grundst ckskonto M Mobilien U... | Vors Bankdirektor Max Steinthal Stellv Bankdir... | Dr M Fuchs A Krusche Berlin G Hethey N Eich | M 25 900 000 in 23 875 Inhaber Aktien Lit | 3 | Bergwerke, Hütten- und Salinenwesen. |
| 2 | 90b6db7ed3 | 1006345701_191900232 | Handwerkerbank Spaichingen, Akt.-Ges. in Spaic... | (in Liquidation). | NaN | 168 | 1889.0 | 1889-11-24 | NaN | NaN | ... | NaN | 1.0 | Betrieb von Bank- und Kommissionsgeschäften in... | Spaichingen | NaN | Vors Wilh Lobmiller Stellv Frans Nav Schmid Sa... | NaN | M 600 000 in 600 Aktien M 1000 Urspr M | 23 | Geld-Institute etc. |
| 3 | b0c68f1152 | 1006345701_191400182 | %% für Malchin A.-G. in Malchin. | (In Liquidation.) Letzte Statutänd. 10./7. 190... | NaN | 193 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | Malchin | Forder Effekten u Hypoth Debit Bankguth Kassa ... | W Deutler E Buhr W Fehlow | NaN | NaN | 18 | Kredit-Banken und andere Geld-Institute. |
| 4 | e9823a3073 | 1006345701_190700112 | Kaisersteinbruch-Actiengesellschaft in Köln, | Zweiggeschäfte in Berlin u. Hamburg. | NaN | 818 | 1900.0 | 1900-03-17 | 1900.0 | 1900-04-11 | ... | 1900.0 | 1.0 | Betrieb von Steinhauereien u. aller mit dem Ba... | Köln | Steinbr che Steinhauerei Grundst ck Mannheim M... | Vors Rechtsanw Dr jur Max Liertz Stellv Stadtb... | NaN | M 900 000 in 900 Aktien wovon 600 abgest M | 11 | Industrie der Steine und Erden. |
5 rows × 21 columns
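Before moving on, it can help to confirm that every ID in `matches` actually occurs in `left` and `right`. The following is a minimal sketch on toy data and is not part of `neer-match`: the helper name `unknown_ids` is hypothetical, and only the column names are taken from the previews above.

```python
import pandas as pd


def unknown_ids(matches, left, right):
    """Return the match IDs that occur in neither source (hypothetical helper)."""
    missing_left = set(matches["company_id_left"]) - set(left["company_id"])
    missing_right = set(matches["company_id_right"]) - set(right["company_id"])
    return missing_left, missing_right


# Toy frames reusing IDs from the previews; "ffffffffff" is made up for illustration.
toy_left = pd.DataFrame({"company_id": ["1e87fc75b4", "810c9c3435"]})
toy_right = pd.DataFrame({"company_id": ["0008e07878", "8bf51ba8a0"]})
toy_matches = pd.DataFrame({
    "company_id_left": ["1e87fc75b4", "810c9c3435"],
    "company_id_right": ["0008e07878", "ffffffffff"],
})

missing_left, missing_right = unknown_ids(toy_matches, toy_left, toy_right)
print(missing_left, missing_right)
```

Two empty sets mean that every labelled pair can be resolved to a row in each dataset.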
## Defining Features and Similarity Concepts
The `similarity_map` defines which similarity concepts (values) to apply
to each feature pair (keys). Note that this example uses a minimal
similarity map for simplicity rather than optimal performance.
``` python
from neer_match.similarity_map import SimilarityMap
from neer_match_utilities.custom_similarities import CustomSimilarities
CustomSimilarities()  # Ensures similarity concepts are always scaled between 0 and 1.
# Define the similarity map
similarity_map = {
"company_name" : [
"levenshtein",
"jaro_winkler",
"partial_token_sort_ratio",
],
"city" : [
"levenshtein",
],
"industry" : [
"levenshtein",
"jaro_winkler",
"notmissing",
],
}
smap = SimilarityMap(similarity_map)
```
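To give an intuition for what a similarity concept such as `levenshtein` measures, here is a pure-Python sketch of an edit-distance similarity rescaled to the interval from 0 to 1. It is an illustration only: dividing by the length of the longer string is one common normalization and is an assumption here, so the values computed by the library may differ.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("BERLIN", "BERLIN"))
print(levenshtein_similarity("KÖLN", "KOELN"))
```

Identical strings score 1.0, while spelling variants such as `KÖLN` and `KOELN` score lower but remain well above zero, which is what makes such measures useful for noisy OCR data.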
## Harmonizing the Data
### Left and Right
Next, data formatting can be harmonized using the `Prepare` class. This
class offers flexible arguments for operations such as capitalizing
strings, converting values to numeric types, and filling missing values.
Additionally, a spaCy pipeline and custom stop words can be specified to
remove noise from string variables (see [additional
functionalities](additional_functionalities.md)). All operations are
applied consistently to both the *left* and *right* DataFrames.
``` python
from neer_match_utilities.prepare import Prepare
# Initialize the Prepare object
prepare = Prepare(
similarity_map=similarity_map,
df_left=left,
df_right=right,
id_left='company_id',
id_right='company_id',
)
# Get formatted and harmonized datasets
left, right = prepare.format(
fill_numeric_na=False,
to_numeric=['found_year'],
fill_string_na=True,
capitalize=True,
lower_case=False,
)
```
``` python
left.head()
```
| | company_id | company_name | city | industry |
|----|----|----|----|----|
| 0 | 1e87fc75b4 | GLÜCKAUF-, ACTIEN-GESELLSCHAFT FÜR BRAUNKOHLEN... | LICHTENAU | BERGWERKE, HÜTTEN- UND SALINENWESEN. |
| 1 | 810c9c3435 | DEUTSCH-OESTERREICHISCHE MANNESMANNRÖHREN-WERKE | BERLIN | BERGWERKE, HÜTTEN- UND SALINENWESEN. |
| 2 | 571dfb67e2 | HANDWERKERBANK SPAICHINGEN, AKT.-GES. IN SPAIC... | SPAICHINGEN | KREDIT-BANKEN UND ANDERE GELD-INSTITUTE. |
| 3 | d67d97da08 | VORSCHUSS-ANSTALT FÜR MALCHIN A.-G. | A | GELD-INSTITUTE ETC. |
| 4 | 22ac99ae20 | KAISERSTEINBRUCH-ACTIENGESELLSCHAFT IN LIQU. I... | KÖLN | INDUSTRIE DER STEINE UND ERDEN. |
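The effect of options such as `capitalize`, `fill_string_na`, and `to_numeric` can be approximated in plain pandas. The sketch below is not the implementation of `Prepare.format`; the function `harmonize` and the toy data are hypothetical and only illustrate the kind of transformation involved.

```python
import pandas as pd


def harmonize(df, string_cols, numeric_cols=()):
    """Upper-case string columns, fill their gaps, and coerce numeric columns."""
    out = df.copy()
    for col in string_cols:
        # Missing strings become empty strings, then everything is upper-cased.
        out[col] = out[col].fillna("").astype(str).str.upper()
    for col in numeric_cols:
        # Values that cannot be parsed as numbers become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out


toy = pd.DataFrame({
    "city": ["Köln", None],
    "found_year": ["1900", "n/a"],
})
clean = harmonize(toy, string_cols=["city"], numeric_cols=["found_year"])
print(clean)
```

Applying the same function to both datasets keeps their formatting consistent, which matters because the similarity measures compare the strings character by character.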
## Restructuring the `matches` DataFrame
`neer-match` requires that the *matches* DataFrame be structured with
the indices from the left and right datasets instead of their unique
IDs. To convert your *matches* DataFrame into the required format, you
can run:
``` python
from neer_match_utilities.training import Training
training = Training(
similarity_map=similarity_map,
df_left=left,
df_right=right,
id_left='company_id',
id_right='company_id',
)
matches = training.matches_reorder(
matches,
matches_id_left='company_id_left',
matches_id_right='company_id_right'
)
matches.head()
```
| | left | right |
|-----|------|-------|
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |
| 4 | 4 | 4 |
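Conceptually, this step replaces each ID with the position of the corresponding row. The following pandas sketch shows the idea under the assumption that both datasets use a default integer index and contain unique IDs. It is not the code behind `matches_reorder`, and the helper `ids_to_positions` is hypothetical.

```python
import pandas as pd


def ids_to_positions(matches, left, right, id_left, id_right, m_left, m_right):
    """Translate ID pairs into row positions of the left and right frames."""
    # Lookup tables: ID -> row position.
    left_pos = pd.Series(range(len(left)), index=left[id_left])
    right_pos = pd.Series(range(len(right)), index=right[id_right])
    return pd.DataFrame({
        "left": matches[m_left].map(left_pos),
        "right": matches[m_right].map(right_pos),
    })


toy_left = pd.DataFrame({"company_id": ["a", "b", "c"]})
toy_right = pd.DataFrame({"company_id": ["x", "y", "z"]})
toy_matches = pd.DataFrame({
    "company_id_left": ["c", "a"],
    "company_id_right": ["x", "z"],
})

positions = ids_to_positions(
    toy_matches, toy_left, toy_right,
    id_left="company_id", id_right="company_id",
    m_left="company_id_left", m_right="company_id_right",
)
print(positions)
```

In the tutorial data the result is `0, 1, 2, ...` in both columns, presumably because the example files list matching companies in the same row order.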
## Splitting Data
Next, we split the data into training and test sets, each consisting of
three DataFrames (`left`, `right`, and `matches`). The training ratio is given by
$\text{training_ratio} = 1 - (\text{test_ratio} + \text{validation_ratio})$.
Since validation is not implemented yet, you can set
$\text{validation_ratio} = 0$.
``` python
from neer_match_utilities.split import split_test_train
(
    left_train, right_train, matches_train,
    left_validation, right_validation, matches_validation,
    left_test, right_test, matches_test,
) = split_test_train(
    left=left,
    right=right,
    matches=matches,
    test_ratio=0.5,
    validation_ratio=0.0,
)
```
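The ratio formula above can be made concrete with a small standard-library sketch that shuffles a list of matched pairs and cuts it into three parts. The function `split_pairs`, its rounding, and its `seed` parameter are assumptions for illustration; `split_test_train` may partition the data differently, for example in how it treats unmatched rows.

```python
import random


def split_pairs(pairs, test_ratio, validation_ratio, seed=0):
    """Shuffle pairs and cut them into training, validation, and test parts."""
    if not 0 <= test_ratio + validation_ratio < 1:
        raise ValueError("test_ratio + validation_ratio must lie in [0, 1)")
    training_ratio = 1 - (test_ratio + validation_ratio)

    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * training_ratio)
    n_validation = round(len(shuffled) * validation_ratio)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


pairs = [(i, i) for i in range(10)]
train, validation, test = split_pairs(pairs, test_ratio=0.5, validation_ratio=0.0)
print(len(train), len(validation), len(test))
```

With `test_ratio = 0.5` and `validation_ratio = 0`, half of the pairs go to training and half to testing, and the validation part is empty.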
## Training and Exporting the Model
For this tutorial, we use a simple Logit model. Other models (ANN,
Probit, or GradientBoost) follow a similar syntax and are covered in
[alternative models](alternative_models.md).
``` python
from neer_match_utilities.baseline_training import BaselineTrainingPipe
training_pipeline = BaselineTrainingPipe(
model_name='demonstration_model',
similarity_map=smap,
training_data=(left_train, right_train, matches_train),
validation_data=(left_validation, right_validation, matches_validation),  # only needed when tune_threshold is enabled (e.g. for "gb")
testing_data=(left_test, right_test, matches_test),
id_left_col="company_id",
id_right_col="company_id",
# matches_id_left="left",
# matches_id_right="right",
model_kind="logit", # "logit" | "probit" | "gb"
mismatch_share_fit=1.0,
# tune_threshold=False, # recommended for "gb"
# tune_metric="mcc",
)
training_pipeline.execute()
```
Performance metrics saved to /Users/marli453/develop/py-neer-utilities/docs/source/_static/examples/demonstration_model/performance.csv
Similarity map saved to /Users/marli453/develop/py-neer-utilities/docs/source/_static/examples/demonstration_model/similarity_map.dill
Baseline model saved to /Users/marli453/develop/py-neer-utilities/docs/source/_static/examples/demonstration_model/model
LogitMatchingModel(result=, feature_cols=['col_city_city_levenshtein', 'col_company_name_company_name_jaro_winkler', 'col_company_name_company_name_levenshtein', 'col_company_name_company_name_partial_token_sort_ratio', 'col_industry_industry_jaro_winkler', 'col_industry_industry_levenshtein'])
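As a rough intuition for what a logit model does with the similarity features listed above, the sketch below applies the logistic function to a weighted sum of similarity scores. The coefficients, intercept, and feature values are invented for illustration and are not the fitted values of `demonstration_model`; the function `match_probability` is hypothetical as well.

```python
import math


def match_probability(similarities, coefficients, intercept):
    """Logistic function applied to a weighted sum of similarity scores."""
    score = intercept + sum(c * s for c, s in zip(coefficients, similarities))
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients, one per similarity feature (not fitted values).
coefficients = [2.0, 3.0, 1.0]
intercept = -4.0

similar_pair = [0.9, 0.95, 0.8]     # high similarity on every feature
dissimilar_pair = [0.1, 0.3, 0.2]   # low similarity on every feature

p_similar = match_probability(similar_pair, coefficients, intercept)
p_dissimilar = match_probability(dissimilar_pair, coefficients, intercept)
print(p_similar > 0.5, p_dissimilar > 0.5)
```

A pair is then classified as a match when its probability exceeds a chosen threshold, with 0.5 being a common default.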