Creating a Common Identifier (ID) Between Two Sources

1. Load Data & Model

Use the demonstration_model to generate a common ID for observations from the left and right DataFrames. To do so, we first have to load both the model and the data into memory.

from neer_match_utilities.prepare import Prepare
from neer_match_utilities.custom_similarities import CustomSimilarities
from neer_match_utilities.baseline_io import ModelBaseline

import pandas as pd
from pathlib import Path

# Load custom similarity functions

CustomSimilarities()

# Load model (and the similarity map used during training)

loaded_model = ModelBaseline.load(
    'demonstration_model'
)

# Load files

left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')

2. Harmonize Format

After loading the model and data, ensure that the data formatting remains consistent with the preprocessing used during training. The Prepare class harmonizes the left and right DataFrames. Note that the similarity_map is automatically loaded with the model, so there is no need to redefine it.

from neer_match_utilities.prepare import Prepare

prepare = Prepare(
    similarity_map=loaded_model.similarity_map, 
    df_left=left, 
    df_right=right, 
    id_left='company_id', 
    id_right='company_id',
)

# Get formatted and harmonized datasets

left, right = prepare.format(
    fill_numeric_na=False,
    to_numeric=['found_year'],
    fill_string_na=True, 
    capitalize=True,
    lower_case=False,
)
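The effect of these formatting options can be sketched with plain pandas operations on a toy frame. This is an illustration only, not the Prepare implementation; upper-casing is assumed for capitalize, consistent with the upper-cased company names shown in the output later in this tutorial.

```python
import pandas as pd

toy = pd.DataFrame({
    'company_name': ['acme gmbh', None],
    'found_year': ['1901', 'n/a'],
})

# Mimic fill_string_na=True: replace missing strings with empty strings
toy['company_name'] = toy['company_name'].fillna('')

# Mimic capitalize=True (assumed to upper-case string columns)
toy['company_name'] = toy['company_name'].str.upper()

# Mimic to_numeric=['found_year']: coerce to numeric, invalid entries become NaN
toy['found_year'] = pd.to_numeric(toy['found_year'], errors='coerce')

print(toy)
```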

3. Generate a Common ID

The GenerateID class creates a common identifier across multiple repeated cross sections. Creating an ID for observations in the left and right datasets can be seen as a special case with two periods.

Key parameters:

  • relation: Specifies the relationship type between observations (1:1, 1:m, m:1, or m:m)

  • panel_var: Name of the variable that stores the common identifiers

  • time_var: Indicates the different cross sections (e.g., year for annual data)

  • subgroups: Implements a blocking strategy by restricting comparisons to observations within each subgroup, which can significantly reduce computation time
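To see why blocking pays off, compare the number of candidate pairs with and without a subgroup restriction. The sketch below uses plain pandas with a hypothetical city column as the blocking variable; it only counts comparisons and does not call GenerateID.

```python
import pandas as pd

left = pd.DataFrame({'id': range(6), 'city': ['KÖLN'] * 3 + ['BERLIN'] * 3})
right = pd.DataFrame({'id': range(8), 'city': ['KÖLN'] * 4 + ['BERLIN'] * 4})

# Without blocking: every left row is compared with every right row
pairs_full = len(left) * len(right)

# Blocking on 'city': only rows within the same subgroup are compared
pairs_blocked = sum(
    (left['city'] == c).sum() * (right['city'] == c).sum()
    for c in left['city'].unique()
)

print(pairs_full, pairs_blocked)  # 48 vs. 24 comparisons
```

With larger data and finer subgroups, the savings grow roughly with the number of blocks, which is why subgroups can significantly reduce computation time.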

To generate an ID for the left and right DataFrames, we first create a column (side) to distinguish the two sources, then stack the DataFrames vertically.

from neer_match_utilities.panel import GenerateID

left['side'] = 'left'
right['side'] = 'right'

df = pd.concat(
    [
        left,
        right
    ],
    axis=0,
    ignore_index=True
)

# Create GenerateID instance

id_generator = GenerateID(
    df_panel=df,
    panel_var='panel_id',
    time_var='side',
    model=loaded_model,
    prediction_threshold=0.5,
    subgroups=[],
    relation='m:m',
)

# Execute the ID generation

result = id_generator.execute()
result.head()
Processing periods left-right at 2026-02-04 11:44:27.668970

   index                              panel_id
0      0  cca694f1-9a27-4781-9cec-b31c93a9a1c2
1      1  87369043-f148-4153-b76a-ccc686ae60a5
2      2  fcc834bd-8ac3-4210-98cb-3b38555f47b6
3      3  0eb2a6ec-603e-4538-8900-258396898832
4      4  9e66fbc1-3c72-413b-a713-08fbe2f1f3d3

4. Merge Results

Finally, merge the generated identifiers back onto the stacked DataFrame via the row index.

df = pd.merge(
    df,
    result,
    left_index=True,
    right_on='index',
    validate='1:1'
)

df = df.sort_values(['panel_id', 'side', 'company_id']).reset_index(drop=True)

# Prepare selection to be viewed

selected_ids = ['22ac99ae20', 'e9823a3073']
columns_to_show = [
    'panel_id',
    'company_id',
    'side',
    'company_name',
    'city',
]

df_selection = df[df['company_id'].isin(selected_ids)][columns_to_show]

df_selection

                                 panel_id  company_id   side                                     company_name  city
849  9e66fbc1-3c72-413b-a713-08fbe2f1f3d3  22ac99ae20   left  KAISERSTEINBRUCH-ACTIENGESELLSCHAFT IN LIQU. I…  KÖLN
850  9e66fbc1-3c72-413b-a713-08fbe2f1f3d3  e9823a3073  right     KAISERSTEINBRUCH-ACTIENGESELLSCHAFT IN KÖLN,  KÖLN

Repeated Cross-Sections (Panel ID)

The approach demonstrated above generalizes to multiple repeated cross sections, not just the two sources (left and right) shown here. The same logic applies regardless of the number of cross sections (i.e., periods), enabling consistent ID generation across your entire dataset.
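For example, with annual cross sections the stacking step is identical to the two-source case, except that a year column plays the role of side. The sketch below uses hypothetical toy data and only shows the pandas stacking; the stacked frame would then be passed to GenerateID with time_var='year'.

```python
import pandas as pd

# Hypothetical annual cross sections (one DataFrame per year)
cs_1900 = pd.DataFrame({'company_name': ['ACME', 'BETA'], 'year': 1900})
cs_1901 = pd.DataFrame({'company_name': ['ACME', 'GAMMA'], 'year': 1901})
cs_1902 = pd.DataFrame({'company_name': ['BETA', 'GAMMA'], 'year': 1902})

# Stack the cross sections vertically, as 'left' and 'right' were stacked above
df = pd.concat([cs_1900, cs_1901, cs_1902], axis=0, ignore_index=True)

# df can now be passed to GenerateID with time_var='year' instead of 'side',
# e.g. GenerateID(df_panel=df, panel_var='panel_id', time_var='year', ...)
print(df)
```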