Similarity Features

class neer_match_utilities.similarity_features.SimilarityFeatures(similarity_map)[source]

__init__(similarity_map)

pairwise_similarity_dataframe(left, right, matches, left_id_col, right_id_col, match_col='match', matches_id_left='left', matches_id_right='right', matches_are_indices=True)[source]

Build the full cross-join of left × right, compute the similarity features specified in the SimilarityMap, and attach a binary match indicator.

Parameters:

left (DataFrame) – Left-hand side entity table.
right (DataFrame) – Right-hand side entity table.
matches (DataFrame) – DataFrame describing which pairs are true matches. If matches_are_indices=True, matches[matches_id_left] and matches[matches_id_right] are interpreted as row indices into left and right (0..n-1). If False, they are interpreted as IDs in the same space as left[left_id_col] / right[right_id_col].
left_id_col (str) – Name of the column in left that contains the entity IDs.
right_id_col (str) – Name of the column in right that contains the entity IDs.
match_col (str) – Name of the binary match indicator column in the output.
matches_id_left (str) – Name of the column in matches that identifies the left-side entity.
matches_id_right (str) – Name of the column in matches that identifies the right-side entity.
matches_are_indices (bool) – If True (default), treat matches_id_left / matches_id_right as row indices into left and right. If False, treat them as IDs.

Return type:

DataFrame
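The following is a minimal pandas-only sketch of the idea described above: it does not call neer_match_utilities. The toy tables, the helper column names (left_pos, right_pos), and the placeholder feature col_name_equal are all assumptions made for illustration; the library's actual feature names and output layout may differ. It shows a cross-join with a match indicator derived once from row indices and once from IDs.

```python
import pandas as pd

# Toy entity tables (hypothetical data, not from the library)
left = pd.DataFrame({"id": ["a", "b"], "name": ["Anna", "Bob"]})
right = pd.DataFrame({"id": ["x", "y", "z"], "name": ["Ana", "Bob", "Carl"]})

# Record each row's position so pairs can be traced back after the join
left_t = left.assign(left_pos=range(len(left)))
right_t = right.assign(right_pos=range(len(right)))

# Full cross-join: one output row per (left row, right row) pair
pairs = left_t.merge(right_t, how="cross", suffixes=("_left", "_right"))

# Placeholder similarity feature (assumed name; real features come from the SimilarityMap)
pairs["col_name_equal"] = (pairs["name_left"] == pairs["name_right"]).astype(float)

# Case 1: matches given as row indices (the matches_are_indices=True interpretation)
matches_idx = pd.DataFrame({"left": [0, 1], "right": [0, 1]})
idx_pairs = set(zip(matches_idx["left"], matches_idx["right"]))
pairs["match"] = [
    int(p in idx_pairs) for p in zip(pairs["left_pos"], pairs["right_pos"])
]

# Case 2: matches given as IDs (the matches_are_indices=False interpretation)
matches_ids = pd.DataFrame({"left": ["a", "b"], "right": ["x", "y"]})
id_pairs = set(zip(matches_ids["left"], matches_ids["right"]))
match_by_id = [
    int(p in id_pairs) for p in zip(pairs["id_left"], pairs["id_right"])
]
```

Both interpretations mark the same pairs here because row 0 of left has ID "a", row 0 of right has ID "x", and so on. Note that the cross-join grows as len(left) × len(right), so it can become large quickly.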

neer_match_utilities.similarity_features.subsample_non_matches(df, match_col='match', mismatch_share=1.0, random_state=None, shuffle=True)[source]

Return a subsample of df where all matches are kept and a fraction of non-matches is sampled.

Parameters:

df (DataFrame) – DataFrame with a binary match column.
match_col (str) – Name of the binary target column (1 = match, 0 = non-match).
mismatch_share (float) – Share of non-matches to keep. Must satisfy 0 < mismatch_share <= 1.0. For example, 1.0 keeps all non-matches and 0.1 keeps 10% of non-matches.
random_state (int | None) – Random seed for reproducible sampling.
shuffle (bool) – If True, shuffle the resulting DataFrame.

Returns:

Subsampled DataFrame with all matches and a subset of non-matches.

Return type:

DataFrame
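As a rough illustration of the sampling described above, the sketch below reproduces the behavior with plain pandas instead of calling the library. The toy data, the column col_a, and the choice to reset the index after shuffling are assumptions; the actual implementation may handle the index and rounding differently.

```python
import pandas as pd

# Toy pairwise frame: 2 matches and 8 non-matches (hypothetical data)
df = pd.DataFrame({
    "col_a": [i / 10 for i in range(10)],
    "match": [1, 1] + [0] * 8,
})

mismatch_share = 0.5  # keep half of the non-matches
random_state = 42

is_match = df["match"] == 1

# Keep every match, sample a fraction of the non-matches
kept_non_matches = df[~is_match].sample(frac=mismatch_share, random_state=random_state)
df_sub = pd.concat([df[is_match], kept_non_matches])

# Optional shuffle (the shuffle=True case); resetting the index is an assumption
df_sub = df_sub.sample(frac=1.0, random_state=random_state).reset_index(drop=True)
```

With 8 non-matches and a share of 0.5, four non-matches are kept alongside both matches, which raises the match rate from 20% to about 33%.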

neer_match_utilities.similarity_features.to_X_y(df, match_col='match')[source]

Extract (X, y) from a pairwise similarity DataFrame.

Parameters:

df (pd.DataFrame) – DataFrame produced by SimilarityFeatures.pairwise_similarity_dataframe().
match_col (str, default "match") – Name of the binary match indicator column.

Returns:

X (pd.DataFrame) – Feature matrix containing all similarity columns (col_*).
y (np.ndarray) – Target array (0/1).
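A small pandas-only sketch of this split is shown below. It assumes that feature columns are identified by the col_ prefix, as the description of X suggests; the toy column names and the extra left_pos column are made up for illustration, and the library's own selection logic may differ.

```python
import pandas as pd

# Toy pairwise similarity frame (hypothetical column names)
df = pd.DataFrame({
    "col_name": [1.0, 0.2],
    "col_city": [0.9, 0.1],
    "left_pos": [0, 0],   # non-feature bookkeeping column
    "match": [1, 0],
})

match_col = "match"

# Assumption: similarity features are exactly the columns prefixed with "col_"
feature_cols = [c for c in df.columns if c.startswith("col_")]

X = df[feature_cols]          # feature matrix as a DataFrame
y = df[match_col].to_numpy()  # 0/1 target as a NumPy array
```

The resulting X and y have the shapes most scikit-learn style estimators expect for fit(X, y).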