Feature Selection
Feature selection utilities for similarity-based entity matching.
This module provides tools for selecting the most informative similarity features from a potentially large set of candidates. It implements a two-stage feature selection process:
1. Correlation-based filtering: removes highly correlated features, keeping the one most correlated with the target variable.
2. Elastic net regularization: uses penalized logistic regression to identify features that contribute unique predictive information.
The feature selector is designed to handle extreme class imbalance, which is common in entity matching tasks where true matches are rare.
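As a rough illustration of the Stage 1 idea only (a sketch, not the module's exact implementation), the correlation filter can be written as follows, assuming X is a DataFrame of pairwise similarity features and y is the binary match indicator:
>>> corr = X.corr().abs()
>>> target_corr = X.corrwith(y).abs()
>>> keep = list(X.columns)
>>> for i, a in enumerate(X.columns):
...     for b in X.columns[i + 1:]:
...         if a in keep and b in keep and corr.loc[a, b] > 0.95:
...             # drop whichever of the highly correlated pair is less
...             # correlated with the target
...             keep.remove(a if target_corr[a] < target_corr[b] else b)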
- class neer_match_utilities.feature_selection.FeatureSelectionResult(updated_similarity_map, selected_feature_columns, selected_pairs, coef_by_feature, meta)[source]
Result object returned by FeatureSelector.execute().
- updated_similarity_map
Reduced similarity map containing only selected features. Keys are variable names, values are lists of similarity concept names.
- Type:
Dict[str, List[str]]
- selected_feature_columns
Names of selected feature columns in the format ‘col_{var}_{var}_{similarity}’.
- Type:
List[str]
- selected_pairs
List of (variable, similarity_concept) tuples that were selected.
- Type:
List[Tuple[str, str]]
- coef_by_feature
Coefficients from the final elastic net model, sorted by absolute value. Useful for understanding feature importance.
- Type:
pd.Series
- meta
Metadata about the selection process, including:
- method: feature selection method used
- scoring: scoring metric used for cross-validation
- cv: number of cross-validation folds
- n_features_in: number of input features
- n_features_selected: number of features selected
- did_fallback: whether fallback to the original map occurred
- Type:
Dict[str, Any]
- __init__(updated_similarity_map, selected_feature_columns, selected_pairs, coef_by_feature, meta)
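For instance, assuming result is the object returned by FeatureSelector.execute(), the attributes above can be inspected as follows (a sketch; actual names and values depend on your similarity map):
>>> print(result.meta['n_features_in'], '->', result.meta['n_features_selected'])
>>> result.coef_by_feature.head()      # coefficients sorted by absolute value
>>> result.selected_pairs              # e.g. [('name', 'jaro_winkler'), ...]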
- class neer_match_utilities.feature_selection.FeatureSelector(similarity_map, training_data, *, id_left_col='id', id_right_col='id', matches_id_left='left', matches_id_right='right', match_col='match', matches_are_indices=True, method='elastic_net', scoring='average_precision', cv=5, l1_ratios=(0.8, 0.9, 1.0), Cs=20, class_weight='balanced', max_iter=5000, random_state=42, n_jobs=-1, min_coef_threshold=0.0, max_correlation=None, always_keep=None, preferred_separators=('__', '|', ':', '-', '_'))[source]
Supervised feature selector for similarity-based entity matching.
This class implements a two-stage feature selection process optimized for entity matching tasks with extreme class imbalance:
- Stage 1: Correlation-based filtering (optional)
Removes redundant features by identifying groups of highly correlated features and keeping only the one most correlated with the target.
- Stage 2: Elastic net regularization
Uses penalized logistic regression with L1/L2 penalties to identify features that contribute unique predictive information. Cross-validation is used to select optimal regularization parameters.
The selector returns an updated similarity map containing only the features that passed both selection stages.
- Parameters:
similarity_map (Dict[str, List[str]] or SimilarityMap) – Mapping from variable names to lists of similarity concepts. Example: {‘name’: [‘jaro_winkler’, ‘levenshtein’], ‘address’: [‘cosine’]}
training_data (Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]) – Three-tuple of (left_df, right_df, matches_df) used for feature selection.
id_left_col (str, default='id') – Column name containing entity IDs in the left dataframe.
id_right_col (str, default='id') – Column name containing entity IDs in the right dataframe.
matches_id_left (str, default='left') – Column name in matches_df identifying left entity IDs.
matches_id_right (str, default='right') – Column name in matches_df identifying right entity IDs.
match_col (str, default='match') – Name of the binary match indicator column (1=match, 0=non-match).
matches_are_indices (bool, default=True) – If True, treat match IDs as integer row indices. If False, treat as entity IDs.
method (str, default='elastic_net') – Feature selection method. Currently only ‘elastic_net’ is supported.
scoring (str, default='average_precision') – Scoring metric for cross-validation. Options: ‘f1’, ‘roc_auc’, ‘average_precision’, ‘neg_log_loss’. For imbalanced data, ‘average_precision’ is recommended.
cv (int, default=5) – Number of cross-validation folds.
l1_ratios (tuple, default=(0.8, 0.9, 1.0)) – L1 penalty ratios to try. 1.0 = pure Lasso, 0.0 = pure Ridge.
Cs (int or Iterable[float], default=20) – Regularization strengths to try. If an int, that many values are generated, logarithmically spaced between 0.01 and 1000. Higher C = less regularization.
class_weight (str or None, default='balanced') – Class weighting strategy. ‘balanced’ adjusts weights inversely proportional to class frequencies, recommended for imbalanced data.
max_iter (int, default=5000) – Maximum iterations for elastic net solver.
random_state (int, default=42) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs. -1 uses all available processors.
min_coef_threshold (float, default=0.0) – Minimum absolute coefficient value for feature retention. Features with abs(coef) < threshold are dropped. Set to 0.0 to keep all non-zero features.
max_correlation (float or None, default=None) – Correlation threshold for Stage 1 filtering. Features with pairwise correlation > threshold are candidates for removal. Example: 0.95. If None, Stage 1 is skipped.
always_keep (Dict[str, List[str]] or None, default=None) – Features to always retain regardless of selection results. Example: {‘surname’: [‘jaro_winkler’]} always keeps surname jaro_winkler.
preferred_separators (tuple, default=('__', '|', ':', '-', '_')) – Separators to try when parsing feature names (internal use).
- similarity_map
The input similarity map.
- Type:
Dict[str, List[str]]
- left_train, right_train, matches_train
Training datasets for feature selection.
- Type:
pd.DataFrame
Examples
>>> from neer_match_utilities import FeatureSelector
>>> selector = FeatureSelector(
...     similarity_map={'name': ['jaro_winkler', 'levenshtein', 'cosine'],
...                     'address': ['jaro_winkler', 'levenshtein']},
...     training_data=(left_df, right_df, matches_df),
...     max_correlation=0.95,
...     min_coef_threshold=0.01
... )
>>> result = selector.execute()
>>> print(result.updated_similarity_map)
{'name': ['jaro_winkler'], 'address': ['levenshtein']}
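Features listed in always_keep are retained even if selection would otherwise drop them. A sketch reusing the hypothetical left_df, right_df, and matches_df from above:
>>> selector = FeatureSelector(
...     similarity_map={'name': ['jaro_winkler', 'levenshtein'],
...                     'surname': ['jaro_winkler']},
...     training_data=(left_df, right_df, matches_df),
...     always_keep={'surname': ['jaro_winkler']},
... )
>>> result = selector.execute()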
- __init__(similarity_map, training_data, *, id_left_col='id', id_right_col='id', matches_id_left='left', matches_id_right='right', match_col='match', matches_are_indices=True, method='elastic_net', scoring='average_precision', cv=5, l1_ratios=(0.8, 0.9, 1.0), Cs=20, class_weight='balanced', max_iter=5000, random_state=42, n_jobs=-1, min_coef_threshold=0.0, max_correlation=None, always_keep=None, preferred_separators=('__', '|', ':', '-', '_'))[source]
- execute()[source]
Execute the two-stage feature selection process.
This method performs the following steps:
1. Builds pairwise similarity features from the training data
2. Cleans the data (removes constant columns, handles missing values)
3. Stage 1 (optional): correlation-based filtering
4. Scales features for regularization
5. Stage 2: elastic net cross-validation and feature selection
6. Applies coefficient thresholding (if configured)
7. Converts selected features back to similarity map format
- Returns:
Object containing the updated similarity map, selected features, coefficients, and metadata about the selection process.
- Return type:
FeatureSelectionResult
- Raises:
ValueError – If the method is not ‘elastic_net’, or if no usable features remain after cleaning or correlation filtering, or if there are too few positive examples for cross-validation.
Notes
The method prints detailed diagnostic information during execution:
- Top feature correlations
- Features dropped in Stage 1 (correlation filtering)
- Cross-validation progress and best parameters
- Features dropped in Stage 2 (elastic net)
- Top coefficients by absolute value
- Final summary statistics
Examples
>>> result = selector.execute()
>>> print(f"Selected {len(result.selected_feature_columns)} features")
>>> print(f"Updated similarity map: {result.updated_similarity_map}")
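For intuition, step 5 above (the elastic net stage) can be approximated directly with scikit-learn. The sketch below uses the class defaults and assumes X_scaled and y are the scaled feature matrix and binary match indicator; it mirrors, but is not necessarily identical to, the internal implementation:
>>> from sklearn.linear_model import LogisticRegressionCV
>>> clf = LogisticRegressionCV(
...     Cs=20, cv=5, penalty='elasticnet', solver='saga',
...     l1_ratios=[0.8, 0.9, 1.0], scoring='average_precision',
...     class_weight='balanced', max_iter=5000, random_state=42, n_jobs=-1,
... )
>>> clf.fit(X_scaled, y)
>>> # features surviving the L1 penalty (non-zero coefficients)
>>> kept = [f for f, c in zip(X_scaled.columns, clf.coef_[0]) if abs(c) > 0]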