Neural Networks for Entity Matching in Economic History
The project is funded by the DFG as part of the Infrastructure Priority Program New Data Spaces for the Social Sciences (SPP2431) under Grant 539465691.
| (index) | Model (alphanumeric) | Producer (alphanumeric) | Origin (alphanumeric) | Sales (in Mil.) (numeric) |
|---|---|---|---|---|
| 1 | Model T | Ford | USA | 16.5 |
| 2 | Model A | Ford | USA | 4.8 |
| 3 | Beetle | Volkswagen | Germany | 21.5 |

| (index) | Name (alphanumeric) | Firm (alphanumeric) | Country (alphanumeric) | Engine (in Lt.) (numeric) |
|---|---|---|---|---|
| 1 | T Model | Ford | United States | 2.9 |
| 2 | Corolla | Toyota | Japan | 1.8 |
| 3 | Beetle | Volkswagen | Germany | 1.6 |
| 4 | Mdl 124 | Fiat | Italy | 1.4 |
A record matching toy example with two sources.
Levenshtein (1965) similarity (Model T, Mdl 124) = 0.8

Hamming (1950) similarity (Model T, Mdl 124) = 0.75
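
The two character-based measures named above can be sketched in plain Python. This is a minimal illustration, not the implementation behind the quoted figures: the function names are hypothetical, and normalizing the edit distance by the longer string's length is an assumption. Other normalizations exist, so the scores printed here need not reproduce the values quoted above.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def hamming_similarity(a: str, b: str) -> float:
    """Share of positions with identical characters; needs equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming similarity requires strings of equal length")
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    for left, right in [("Model T", "Model A"), ("Model T", "Mdl 124")]:
        print(left, right,
              levenshtein_similarity(left, right),
              hamming_similarity(left, right))
```

Both measures compare characters position by position or edit by edit, so neither is aware of word boundaries or of what the strings denote.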
Levenshtein (1965) similarity (Model T, T Model) = 0.75

Token sort ratio (Model T, T Model) = 1
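
A token sort ratio can be sketched as follows: split both strings into whitespace-separated tokens, sort the tokens, rejoin them, and only then apply a character-based similarity. The sketch uses `difflib.SequenceMatcher` from the Python standard library as a stand-in base similarity, which is an assumption; the function names are hypothetical and the scores are on a 0 to 1 scale.

```python
from difflib import SequenceMatcher


def base_similarity(a: str, b: str) -> float:
    """Stand-in character-based similarity in [0, 1] (difflib's ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def token_sort_similarity(a: str, b: str) -> float:
    """Sort whitespace-separated tokens before comparing, so word order is ignored."""
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    return base_similarity(sorted_a, sorted_b)


if __name__ == "__main__":
    # Reordered tokens lower the plain score but not the token-sorted one.
    print(base_similarity("Model T", "T Model"))
    print(token_sort_similarity("Model T", "T Model"))
```

Because both inputs reduce to the same sorted string, the token-sorted score for "Model T" and "T Model" is exactly 1, in line with the value quoted above.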
| (1) Iteration | (2) TP | (3) FP | (4) TN | (5) FN | (6) Accuracy | (7) Precision | (8) Recall | (9) F-Score |
|---|---|---|---|---|---|---|---|---|
| 1 | 256 | 0 | 6430 | 2 | 99.97 | 100 | 99.22 | 99.61 |
| 2 | 253 | 0 | 6430 | 5 | 99.93 | 100 | 98.06 | 99.02 |
| 3 | 256 | 2 | 6428 | 2 | 99.94 | 99.22 | 99.22 | 99.22 |
| 4 | 257 | 0 | 6430 | 1 | 99.99 | 100 | 99.61 | 99.81 |
| 5 | 258 | 4 | 6426 | 0 | 99.94 | 98.47 | 100 | 99.23 |
| Average | 256 | 1.2 | 6428.8 | 2 | 99.95 | 99.53 | 99.22 | 99.38 |

| (1) Iteration | (2) TP | (3) FP | (4) TN | (5) FN | (6) Accuracy | (7) Precision | (8) Recall | (9) F-Score |
|---|---|---|---|---|---|---|---|---|
| 1 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 2 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 3 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 4 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| 5 | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
| Average | 54 | 0 | 266 | 0 | 100 | 100 | 100 | 100 |
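
Columns (6) to (9) follow from the confusion counts in columns (2) to (5) by the standard definitions. The sketch below recomputes them in percent, rounded to two decimals; the class and function names are hypothetical, and reading the F-Score as the harmonic mean of precision and recall (F1) is an assumption that agrees with the tabulated values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # true positives: matches predicted as matches
    fp: int  # false positives: non-matches predicted as matches
    tn: int  # true negatives: non-matches predicted as non-matches
    fn: int  # false negatives: matches predicted as non-matches


def summarize(counts: ConfusionCounts) -> dict:
    """Accuracy, precision, recall, and F1 in percent, rounded to two decimals."""
    total = counts.tp + counts.fp + counts.tn + counts.fn
    accuracy = (counts.tp + counts.tn) / total
    precision = counts.tp / (counts.tp + counts.fp)
    recall = counts.tp / (counts.tp + counts.fn)
    f_score = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round(100 * accuracy, 2),
        "precision": round(100 * precision, 2),
        "recall": round(100 * recall, 2),
        "f_score": round(100 * f_score, 2),
    }


if __name__ == "__main__":
    # Iteration 1 of the first table: TP=256, FP=0, TN=6430, FN=2.
    print(summarize(ConfusionCounts(tp=256, fp=0, tn=6430, fn=2)))
    # accuracy 99.97, precision 100.0, recall 99.22, f_score 99.61
```

The sketch does not guard against empty denominators, for example when a model predicts no matches at all.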
```python
import tensorflow

# Placeholders assumed to be defined or imported elsewhere: the `match` module,
# `my_custom_awesome_similarity`, `evaluation_metrics`, and `load_train_data`.

# Map each field (or "left~right" pair of differently named fields)
# to the similarities computed on it.
similarity_map = {
    "company_name": [
        "levenshtein",
        "partial",
        my_custom_awesome_similarity,
    ],
    "address~address1": ["partial"],
    "address~address2": ["partial"],
    "purpose": [
        "sort",
        lambda x, y: x * y + 0.42 - y * x,
    ],
    "foundation": [
        "discrete",
        "partial",
    ],
}

model = match.MatchingModel(similarity_map)
model.compile(
    loss="binary_crossentropy",
    optimizer=tensorflow.keras.optimizers.Adam(learning_rate=0.01),
    metrics=evaluation_metrics,
)

train_left, train_right, train_matches = load_train_data()
model.fit(train_left, train_right, train_matches, epochs=100)
model.evaluate(train_left, train_right, train_matches)
predictions = model.predict(train_left, train_right)
suggestions = model.suggest(train_left, train_right, 3)
```

```r
similarity_map <- list(
  company_name = c(
    "levenshtein",
    "partial",
    my_custom_awesome_similarity
  ),
  `address~address1` = c("partial"),
  `address~address2` = c("partial"),
  purpose = c(
    "sort",
    function(x, y) x * y + 0.42 - y * x
  ),
  foundation = c(
    "discrete",
    "partial"
  )
)

model <- matching_model(similarity_map)
model |> compile(
  loss = keras::loss_binary_crossentropy(),
  optimizer = keras::optimizer_adam(learning_rate = 1e-3),
  metrics = evaluation_metrics
)

# R has no tuple unpacking. load_train_data() is a placeholder assumed to return
# a list holding the left records, the right records, and the matches, in that order.
train_data <- load_train_data()
left_train <- train_data[[1]]
right_train <- train_data[[2]]
matches_train <- train_data[[3]]

model |> fit(left_train, right_train, matches_train, epochs = 100L)
# The test and prediction data sets are assumed to be loaded analogously.
model |> evaluate(left_test, right_test, matches_test)
predictions <- model |> predict(left, right)
suggestions <- model |> suggest(left, right, count = 3)
```

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| company name | company name | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company info 1 | company info 1 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company info 2 | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| found date | found date | discrete | |
| found year | found year | discrete | |
| register date | register date | discrete | |
| register year | register year | discrete | |
| concession date | concession date | discrete | |
| concession year | concession year | discrete | |
| statute change date | statute change date | discrete | |
| company name | company info 1 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company name | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| company info 1 | company info 2 | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |

| Left Field | Right Field | Similarities | Ratios |
|---|---|---|---|
| main info | main info | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| Vorstand | Vorstand | Levenshtein, Jaro-Winkler | |
| StVdAR | StVdAR | Levenshtein, Jaro-Winkler | |
| GeschF | GeschF | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| Leiter | Leiter | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| Beirat | Beirat | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| AR | AR | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| name | name | Levenshtein, Jaro-Winkler, discrete | |
| surname | surname | Levenshtein, Jaro-Winkler, discrete | |
| occupation | occupation | Levenshtein, Jaro-Winkler, discrete | |
| address | address | Levenshtein, Jaro-Winkler | partial, token sort, token set, partial token set |
| birth date | birth date | discrete | |
| raw text | raw text | | token set, partial token set |
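
Each row of the tables above pairs a left field with a right field and lists the similarities computed on that pair. One way to picture how such a specification turns into model inputs is sketched below. It is a simplified assumption, not the package's implementation: the `REGISTRY` dictionary, the `token_overlap` stand-in, and the function names are hypothetical, and "discrete" is assumed to mean exact equality.

```python
from typing import Callable, Dict, List, Tuple

Similarity = Callable[[str, str], float]


def discrete(a: str, b: str) -> float:
    """Assumed meaning of 'discrete': 1 for identical values, 0 otherwise."""
    return 1.0 if a == b else 0.0


def token_overlap(a: str, b: str) -> float:
    """Stand-in token-based score: Jaccard overlap of lower-cased tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return 1.0 if not union else len(tokens_a & tokens_b) / len(union)


# Hypothetical registry mapping similarity names to functions.
REGISTRY: Dict[str, Similarity] = {
    "discrete": discrete,
    "token_overlap": token_overlap,
}


def expand(similarity_map: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """Turn 'field' or 'left~right' keys into (left, right, similarity) triples."""
    associations = []
    for key, names in similarity_map.items():
        left, _, right = key.partition("~")
        right = right or left  # a bare key compares a field with itself
        for name in names:
            associations.append((left, right, name))
    return associations


def feature_vector(left_record: Dict[str, str], right_record: Dict[str, str],
                   similarity_map: Dict[str, List[str]]) -> List[float]:
    """One similarity score per (left field, right field, similarity) triple."""
    return [REGISTRY[name](left_record[left], right_record[right])
            for left, right, name in expand(similarity_map)]


if __name__ == "__main__":
    example_map = {
        "company name": ["discrete"],
        "company name~company info 1": ["token_overlap"],
    }
    left = {"company name": "Ford Motor"}
    right = {"company name": "Ford Motor", "company info 1": "Ford Motor Company"}
    print(expand(example_map))
    print(feature_vector(left, right, example_map))
```

Under this reading, cross-field rows such as company name against company info 1 simply contribute additional entries to the feature vector of each candidate record pair.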