Creating a Common Identifier (ID) Between Two Sources
1. Load Data & Model
Use the demonstration_model to generate a common ID for observations from the left and right DataFrames. To do so, we first have to load both the model and the data into memory.
from neer_match_utilities.prepare import Prepare
from neer_match_utilities.custom_similarities import CustomSimilarities
from neer_match_utilities.baseline_io import ModelBaseline
import pandas as pd
from pathlib import Path
# Load custom similarity functions
CustomSimilarities()
# Load model (and the similarity map used during training)
loaded_model = ModelBaseline.load(
'demonstration_model'
)
# Load files
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
2. Harmonize Format
After loading the model and data, ensure that the data formatting
remains consistent with the preprocessing used during training. The
Prepare class harmonizes the left and right DataFrames. Note that the
similarity_map is automatically loaded with the model, so there is no
need to redefine it.
from neer_match_utilities.prepare import Prepare
prepare = Prepare(
similarity_map=loaded_model.similarity_map,
df_left=left,
df_right=right,
id_left='company_id',
id_right='company_id',
)
# Get formatted and harmonized datasets
left, right = prepare.format(
fill_numeric_na=False,
to_numeric=['found_year'],
fill_string_na=True,
capitalize=True,
lower_case=False,
)
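To make the formatting options above more concrete, the following is a rough pandas analogue on toy data. It is only a sketch of what the options presumably do, judging by their names and by the upper-case company names in the output further down; the actual behavior of Prepare.format may differ, and the sample values are invented.

```python
import pandas as pd

# Toy input; the values are invented for illustration.
raw = pd.DataFrame({
    "company_name": ["Kaisersteinbruch AG", None],
    "found_year": ["1890", "unknown"],
})

formatted = raw.copy()

# Assumed analogue of to_numeric=['found_year']:
# coerce to numbers, unparseable entries become NaN.
formatted["found_year"] = pd.to_numeric(formatted["found_year"], errors="coerce")

# Assumed analogue of fill_string_na=True:
# replace missing strings with an empty string.
formatted["company_name"] = formatted["company_name"].fillna("")

# Assumed analogue of capitalize=True: upper-case string columns.
formatted["company_name"] = formatted["company_name"].str.upper()

# fill_numeric_na=False: numeric gaps are left as NaN.
print(formatted)
```

Whatever the exact implementation, the point is that both DataFrames should pass through the same transformations that were applied to the training data.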
3. Generate a Common ID
The GenerateID class creates a common identifier across multiple
repeated cross sections. Creating an ID for observations in the left
and right datasets can be seen as a special case with two periods.
Key parameters:
- relation: Specifies the relationship type between observations (1:1, 1:m, m:1, or m:m)
- panel_var: Name of the variable that stores the common identifiers
- time_var: Indicates the different cross sections (e.g., year for annual data)
- subgroups: Implements a blocking strategy by restricting comparisons to observations within each subgroup, which can significantly reduce computation time
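The effect of blocking can be illustrated with plain pandas by counting candidate pairs with and without a blocking variable. This is a generic sketch, not the internal logic of GenerateID; the toy data and the choice of city as a blocking column are assumptions made for illustration.

```python
import pandas as pd

# Toy data; IDs and cities are invented for illustration.
left = pd.DataFrame({
    "company_id": ["a1", "a2", "a3", "a4"],
    "city": ["KÖLN", "KÖLN", "BERLIN", "BERLIN"],
})
right = pd.DataFrame({
    "company_id": ["b1", "b2", "b3"],
    "city": ["KÖLN", "BERLIN", "BERLIN"],
})

# Without blocking, every left row is compared with every right row.
pairs_unblocked = len(left) * len(right)

# With blocking on "city", only rows that share a city are compared.
pairs_blocked = len(left.merge(right, on="city", suffixes=("_left", "_right")))

print(pairs_unblocked, pairs_blocked)
```

In this toy case blocking halves the number of comparisons. The trade-off is that true matches whose blocking value differs between sources (for example, a misspelled city) are never compared.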
To generate an ID for the left and right DataFrames, we first create a column (side) to distinguish the two sources, then stack the DataFrames vertically.
from neer_match_utilities.panel import GenerateID
left['side'] = 'left'
right['side'] = 'right'
df = pd.concat(
[
left,
right
],
axis=0,
ignore_index=True
)
# Create GenerateID instance
id_generator = GenerateID(
df_panel=df,
panel_var='panel_id',
time_var='side',
model=loaded_model,
prediction_threshold=0.5,
subgroups=[],
relation='m:m',
)
# Execute the ID generation
result = id_generator.execute()
result.head()
Processing periods left-right at 2026-02-04 11:44:27.668970
|   | index | panel_id |
|---|---|---|
| 0 | 0 | cca694f1-9a27-4781-9cec-b31c93a9a1c2 |
| 1 | 1 | 87369043-f148-4153-b76a-ccc686ae60a5 |
| 2 | 2 | fcc834bd-8ac3-4210-98cb-3b38555f47b6 |
| 3 | 3 | 0eb2a6ec-603e-4538-8900-258396898832 |
| 4 | 4 | 9e66fbc1-3c72-413b-a713-08fbe2f1f3d3 |
4. Merge Results
The result maps each row of the stacked DataFrame (column index) to its panel_id. Merging it back on the row index attaches the common ID to the original records.
df = pd.merge(
df,
result,
left_index=True,
right_on='index',
validate='1:1'
)
df = df.sort_values(['panel_id', 'side', 'company_id']).reset_index(drop=True)
# Prepare selection to be viewed
selected_ids = ['22ac99ae20', 'e9823a3073']
columns_to_show = [
'panel_id',
'company_id',
'side',
'company_name',
'city',
]
df_selection = df[df['company_id'].isin(selected_ids)][columns_to_show]
df_selection
|   | panel_id | company_id | side | company_name | city |
|---|---|---|---|---|---|
| 849 | 9e66fbc1-3c72-413b-a713-08fbe2f1f3d3 | 22ac99ae20 | left | KAISERSTEINBRUCH-ACTIENGESELLSCHAFT IN LIQU. I… | KÖLN |
| 850 | 9e66fbc1-3c72-413b-a713-08fbe2f1f3d3 | e9823a3073 | right | KAISERSTEINBRUCH-ACTIENGESELLSCHAFT IN KÖLN, | KÖLN |
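Beyond inspecting hand-picked records, one may want to list every panel_id that links observations from both sources. The following is a minimal sketch on a toy stand-in for the merged DataFrame; the IDs are invented, and only the column names panel_id, company_id, and side are taken from the example above.

```python
import pandas as pd

# Toy stand-in for the merged DataFrame; all values are invented.
df = pd.DataFrame({
    "panel_id": ["p1", "p1", "p2", "p3", "p3", "p3"],
    "company_id": ["l1", "r1", "l2", "l3", "r2", "r3"],
    "side": ["left", "right", "left", "left", "right", "right"],
})

# Number of distinct sources per panel_id; 2 means the group spans both.
n_sides = df.groupby("panel_id")["side"].transform("nunique")
matched = df[n_sides == 2]
unmatched = df[n_sides == 1]

# Records per side within each matched group, to tell 1:1 from 1:m links.
group_sizes = matched.groupby(["panel_id", "side"]).size().unstack(fill_value=0)

print(group_sizes)
```

Here p1 is a one-to-one link, p3 links one left record to two right records, and p2 has no counterpart in the other source.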
Repeated Cross-Sections (Panel ID)
The approach demonstrated above generalizes to multiple repeated cross sections, not just the two sources (left and right) shown here. The same logic applies regardless of the number of cross sections (i.e., periods), enabling consistent ID generation across your entire dataset.
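For more than two cross sections, the stacking step might look as follows. This sketch covers only the data preparation with pandas; the yearly DataFrames and their contents are hypothetical, and the call to GenerateID is indicated in a comment rather than executed.

```python
import pandas as pd

# Hypothetical yearly cross sections; in practice these would be loaded from files.
cross_sections = {
    1900: pd.DataFrame({"company_id": ["a", "b"], "company_name": ["ALPHA AG", "BETA AG"]}),
    1901: pd.DataFrame({"company_id": ["c"], "company_name": ["ALPHA AG"]}),
    1902: pd.DataFrame({"company_id": ["d", "e"], "company_name": ["ALPHA A.G.", "GAMMA AG"]}),
}

# Tag each cross section with its period, then stack them vertically.
df_panel = pd.concat(
    [frame.assign(year=year) for year, frame in cross_sections.items()],
    axis=0,
    ignore_index=True,
)

# df_panel would then be passed to GenerateID as df_panel, with time_var='year'
# taking the place of 'side' in the two-source example.
print(df_panel)
```

The year column plays the role that side played above: it tells the ID generator which cross section each observation belongs to.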