Similarity Features

class neer_match_utilities.similarity_features.SimilarityFeatures(similarity_map)
__init__(similarity_map)
pairwise_similarity_dataframe(left, right, matches, left_id_col, right_id_col, match_col='match', matches_id_left='left', matches_id_right='right', matches_are_indices=True)

Build the full cross-join of left × right, compute the similarity features specified in the SimilarityMap, and attach a binary match indicator.

Parameters:
  • left (DataFrame) – Left-hand side entity table.

  • right (DataFrame) – Right-hand side entity table.

  • matches (DataFrame) – DataFrame describing which pairs are true matches. If matches_are_indices=True, matches[matches_id_left] and matches[matches_id_right] are interpreted as row indices into left and right (0..n-1). If False, they are interpreted as IDs in the same space as left[left_id_col] / right[right_id_col].

  • left_id_col (str) – Name of the column in left that contains the entity IDs.

  • right_id_col (str) – Name of the column in right that contains the entity IDs.

  • match_col (str) – Name of the binary match indicator column in the output.

  • matches_id_left (str) – Name of the column in matches that identifies the left-side entity.

  • matches_id_right (str) – Name of the column in matches that identifies the right-side entity.

  • matches_are_indices (bool) – If True (default), treat matches_id_left / matches_id_right as row indices into left and right. If False, treat them as IDs.

Return type:

DataFrame
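
The two matches_are_indices modes can be illustrated with a plain pandas sketch. This is not the library's implementation: the helper name label_cross_join and the output column names left_id / right_id are hypothetical, and the similarity columns that would be computed from the SimilarityMap are omitted.

```python
import pandas as pd


def label_cross_join(left, right, matches, left_id_col, right_id_col,
                     match_col="match", matches_id_left="left",
                     matches_id_right="right", matches_are_indices=True):
    """Hypothetical helper: cross-join the ID columns and flag true matches."""
    left_ids = left[left_id_col].tolist()
    right_ids = right[right_id_col].tolist()
    m_left = matches[matches_id_left].tolist()
    m_right = matches[matches_id_right].tolist()

    if matches_are_indices:
        # Row positions (0..n-1) are translated into entity IDs.
        true_pairs = {(left_ids[i], right_ids[j]) for i, j in zip(m_left, m_right)}
    else:
        # Values already live in the ID space of left / right.
        true_pairs = set(zip(m_left, m_right))

    # Full cross-join: one row per (left, right) pair.
    pairs = pd.merge(
        pd.DataFrame({"left_id": left_ids}),
        pd.DataFrame({"right_id": right_ids}),
        how="cross",
    )
    pairs[match_col] = [
        int(pair in true_pairs)
        for pair in zip(pairs["left_id"].tolist(), pairs["right_id"].tolist())
    ]
    return pairs


left = pd.DataFrame({"id": [10, 20], "name": ["Acme", "Bolt"]})
right = pd.DataFrame({"id": [10, 30, 20], "name": ["ACME", "Cog", "Bolt Inc"]})

# Same true pairs, expressed once as row indices and once as IDs.
by_index = label_cross_join(
    left, right, pd.DataFrame({"left": [0, 1], "right": [0, 2]}), "id", "id"
)
by_id = label_cross_join(
    left, right, pd.DataFrame({"left": [10, 20], "right": [10, 20]}), "id", "id",
    matches_are_indices=False,
)
```

Both calls describe the same two true pairs, so by_index and by_id hold identical rows: six pairs in total, two of them flagged as matches.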

neer_match_utilities.similarity_features.subsample_non_matches(df, match_col='match', mismatch_share=1.0, random_state=None, shuffle=True)

Return a subsample of df where all matches are kept and a fraction of non-matches is sampled.

Parameters:
  • df (DataFrame) – DataFrame with a binary match column.

  • match_col (str) – Name of the binary target column (1 = match, 0 = non-match).

  • mismatch_share (float) – Share of non-matches to keep. Must satisfy 0 < mismatch_share <= 1.0. For example:
      - 1.0 → keep all non-matches
      - 0.1 → keep 10% of non-matches

  • random_state (int | None) – Random seed for reproducible sampling.

  • shuffle (bool) – If True, shuffle the resulting DataFrame.

Returns:

Subsampled DataFrame with all matches and a subset of non-matches.

Return type:

DataFrame
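
The behaviour described above (keep every match, sample a share of the non-matches, optionally shuffle) can be sketched with pandas alone. The function name subsample_sketch is hypothetical, and details such as how the seed is reused for shuffling are assumptions rather than the library's documented behaviour.

```python
import pandas as pd


def subsample_sketch(df, match_col="match", mismatch_share=1.0,
                     random_state=None, shuffle=True):
    """Hypothetical re-implementation of the documented behaviour."""
    if not 0 < mismatch_share <= 1.0:
        raise ValueError("mismatch_share must satisfy 0 < mismatch_share <= 1.0")

    matches = df[df[match_col] == 1]
    # Sample the requested fraction of non-matches without replacement.
    non_matches = df[df[match_col] == 0].sample(
        frac=mismatch_share, random_state=random_state
    )
    out = pd.concat([matches, non_matches])
    if shuffle:
        # Assumption: the same seed is reused for the final shuffle.
        out = out.sample(frac=1.0, random_state=random_state)
    return out


# Toy data: 4 matches and 20 non-matches.
toy = pd.DataFrame({"col_a": range(24), "match": [1] * 4 + [0] * 20})
sub = subsample_sketch(toy, mismatch_share=0.25, random_state=0)
n_matches = int((sub["match"] == 1).sum())
n_non_matches = int((sub["match"] == 0).sum())
```

With 20 non-matches and mismatch_share=0.25, the sketch keeps all 4 matches and 5 non-matches.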

neer_match_utilities.similarity_features.to_X_y(df, match_col='match')

Extract (X, y) from a pairwise similarity DataFrame.

Parameters:
  • df (pd.DataFrame) – DataFrame produced by SimilarityFeatures.pairwise_similarity_dataframe().

  • match_col (str, default "match") – Name of the binary match indicator column.

Returns:

  • X (pd.DataFrame) – Feature matrix containing all similarity columns (col_*).

  • y (np.ndarray) – Target array (0/1).
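
As an illustration of the split described above, the following sketch selects the feature columns by their col_ prefix and returns the target as a NumPy array. The function name to_X_y_sketch, the prefix argument, and the toy column names are hypothetical; selecting by prefix is one reading of "(col_*)" and may differ from what the library actually does.

```python
import numpy as np
import pandas as pd


def to_X_y_sketch(df, match_col="match", prefix="col_"):
    """Hypothetical helper: split a pairwise frame into features and target."""
    # Assumption: similarity features are exactly the columns starting with the prefix.
    feature_cols = [c for c in df.columns if c.startswith(prefix)]
    X = df[feature_cols]
    y = df[match_col].to_numpy()
    return X, y


# Toy pairwise frame; the column names are made up for this example.
pairs = pd.DataFrame({
    "left_id": [10, 10, 20],
    "right_id": [10, 30, 20],
    "col_name_sim": [0.95, 0.10, 0.80],
    "col_city_sim": [1.00, 0.00, 1.00],
    "match": [1, 0, 1],
})

X, y = to_X_y_sketch(pairs)
```

Here X contains only the two col_ columns and y is a 0/1 array of length 3, ready to pass to any estimator that accepts a feature matrix and a target vector.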