Similarity Features
- class neer_match_utilities.similarity_features.SimilarityFeatures(similarity_map)
- __init__(similarity_map)
- pairwise_similarity_dataframe(left, right, matches, left_id_col, right_id_col, match_col='match', matches_id_left='left', matches_id_right='right', matches_are_indices=True)
Build the full cross-join of left × right, compute the similarity features specified in the SimilarityMap, and attach a binary match indicator.
- Parameters:
left (DataFrame) – Left-hand side entity table.
right (DataFrame) – Right-hand side entity table.
matches (DataFrame) – DataFrame describing which pairs are true matches. If matches_are_indices=True, matches[matches_id_left] and matches[matches_id_right] are interpreted as row indices into left and right (0..n-1). If False, they are interpreted as IDs in the same space as left[left_id_col] / right[right_id_col].
left_id_col (str) – Name of the column in left that contains the entity IDs.
right_id_col (str) – Name of the column in right that contains the entity IDs.
match_col (str) – Name of the binary match indicator column in the output.
matches_id_left (str) – Name of the column in matches that identifies the left side of each pair.
matches_id_right (str) – Name of the column in matches that identifies the right side of each pair.
matches_are_indices (bool) – If True (default), treat matches_id_left / matches_id_right as row indices into left and right. If False, treat them as IDs.
- Return type:
DataFrame
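The two interpretations of matches are easy to confuse, so the following is a minimal pandas sketch of the labelling step only. It is not the library's implementation: label_pairs is a hypothetical helper, the toy tables and their id / name columns are made up for illustration, and the similarity features that the SimilarityMap would contribute are left out.

```python
import pandas as pd

# Toy entity tables (hypothetical data).
left = pd.DataFrame({"id": ["a", "b"], "name": ["Anna", "Bob"]})
right = pd.DataFrame({"id": ["x", "y", "z"], "name": ["Bob", "Anna", "Carl"]})

# The same two true pairs (Anna-Anna, Bob-Bob), expressed in both styles.
matches_by_index = pd.DataFrame({"left": [0, 1], "right": [1, 0]})       # row positions
matches_by_id = pd.DataFrame({"left": ["a", "b"], "right": ["y", "x"]})  # entity IDs


def label_pairs(left, right, matches, left_id_col, right_id_col,
                matches_are_indices=True):
    """Hypothetical helper: cross-join left x right and flag the true pairs."""
    pairs = left.add_suffix("_left").merge(right.add_suffix("_right"), how="cross")
    if matches_are_indices:
        # Translate 0..n-1 row positions into entity IDs.
        left_ids = left[left_id_col].to_numpy()[matches["left"].to_numpy()]
        right_ids = right[right_id_col].to_numpy()[matches["right"].to_numpy()]
    else:
        # Values already live in the ID space of the two tables.
        left_ids = matches["left"].to_numpy()
        right_ids = matches["right"].to_numpy()
    true_pairs = set(zip(left_ids, right_ids))
    keys = zip(pairs[f"{left_id_col}_left"], pairs[f"{right_id_col}_right"])
    pairs["match"] = [int(key in true_pairs) for key in keys]
    return pairs


by_index = label_pairs(left, right, matches_by_index, "id", "id", True)
by_id = label_pairs(left, right, matches_by_id, "id", "id", False)
print(by_index[["id_left", "id_right", "match"]])
```

Both calls label the same 2 of the 6 candidate pairs as matches. Passing ID values while leaving matches_are_indices=True (or the reverse) is the mistake this parameter is meant to guard against.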
- neer_match_utilities.similarity_features.subsample_non_matches(df, match_col='match', mismatch_share=1.0, random_state=None, shuffle=True)
Return a subsample of df where all matches are kept and a fraction of non-matches is sampled.
- Parameters:
df (DataFrame) – DataFrame with a binary match column.
match_col (str) – Name of the binary target column (1 = match, 0 = non-match).
mismatch_share (float) – Share of non-matches to keep. Must satisfy 0 < mismatch_share <= 1.0. For example, 1.0 keeps all non-matches and 0.1 keeps 10% of them.
random_state (int | None) – Random seed for reproducible sampling.
shuffle (bool) – If True, shuffle the resulting DataFrame.
- Returns:
Subsampled DataFrame with all matches and a subset of non-matches.
- Return type:
DataFrame
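To make the described behaviour concrete, here is a rough pandas sketch of the same idea: keep every match, sample a share of the non-matches, and optionally shuffle. subsample_sketch is a hypothetical stand-in written for illustration, not the library's code, and the col_name column and its values are invented.

```python
import pandas as pd

# Toy pairwise table: 2 matches and 10 non-matches (hypothetical values).
pairs = pd.DataFrame({
    "col_name": [0.95, 0.91] + [0.10] * 10,
    "match": [1, 1] + [0] * 10,
})


def subsample_sketch(df, match_col="match", mismatch_share=1.0,
                     random_state=None, shuffle=True):
    """Hypothetical stand-in: keep all matches, sample a share of non-matches."""
    if not 0 < mismatch_share <= 1.0:
        raise ValueError("mismatch_share must satisfy 0 < mismatch_share <= 1.0")
    matches = df[df[match_col] == 1]
    non_matches = df[df[match_col] == 0].sample(
        frac=mismatch_share, random_state=random_state
    )
    out = pd.concat([matches, non_matches])
    if shuffle:
        # Sampling the full frame permutes the row order.
        out = out.sample(frac=1.0, random_state=random_state)
    return out.reset_index(drop=True)


sub = subsample_sketch(pairs, mismatch_share=0.5, random_state=0)
print(sub["match"].value_counts())
```

With mismatch_share=0.5, the result keeps both matches and 5 of the 10 non-matches. Because the full cross-join of left × right is dominated by non-matches, this kind of downsampling is a common way to rebalance the training data.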
- neer_match_utilities.similarity_features.to_X_y(df, match_col='match')
Extract (X, y) from a pairwise similarity DataFrame.
- Parameters:
df (pd.DataFrame) – DataFrame produced by SimilarityFeatures.pairwise_similarity_dataframe().
match_col (str, default "match") – Name of the binary match indicator column.
- Returns:
X (pd.DataFrame) – Feature matrix containing all similarity columns (col_*).
y (np.ndarray) – Target array (0/1).
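The split described above can be sketched in a few lines of pandas. This assumes, as the Returns entry suggests, that similarity columns carry a col_ prefix; split_features_target and the column names col_name_jaro and col_year_diff are hypothetical, and the library's own function may differ in detail.

```python
import numpy as np
import pandas as pd

# Toy pairwise similarity table (hypothetical column names and values).
pairs = pd.DataFrame({
    "id_left": ["a", "a", "b"],
    "id_right": ["x", "y", "x"],
    "col_name_jaro": [0.9, 0.2, 0.4],
    "col_year_diff": [1.0, 0.0, 0.5],
    "match": [1, 0, 0],
})


def split_features_target(df, match_col="match"):
    """Hypothetical stand-in: col_* columns become X, the match column becomes y."""
    feature_cols = [c for c in df.columns if c.startswith("col_")]
    X = df[feature_cols]
    y = df[match_col].to_numpy()
    return X, y


X, y = split_features_target(pairs)
print(list(X.columns))
print(y)
```

ID columns and any other non-feature columns are left out of X, so the pair (X, y) can be passed to an estimator that expects a numeric feature matrix and a 0/1 target.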