Feature Selection
Feature selection utilities for similarity-based entity matching.
This module provides tools for selecting the most informative similarity features from a potentially large set of candidates. It implements a two-stage feature selection process:
1. Correlation-based filtering: removes highly correlated features, keeping the one most correlated with the target variable.
2. Elastic net regularization: uses penalized logistic regression to identify features that contribute unique predictive information.
The feature selector is designed to handle extreme class imbalance, which is common in entity matching tasks where true matches are rare.
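As a rough illustration of the Stage 1 idea only (a sketch, not the module's exact implementation), the correlation filter can be written as follows, assuming X is a DataFrame of pairwise similarity features and y is the binary match indicator:
>>> corr = X.corr().abs()
>>> target_corr = X.corrwith(y).abs()
>>> keep = list(X.columns)
>>> for i, a in enumerate(X.columns):
...     for b in X.columns[i + 1:]:
...         if a in keep and b in keep and corr.loc[a, b] > 0.95:
...             # drop whichever of the highly correlated pair is less
...             # correlated with the target
...             keep.remove(a if target_corr[a] < target_corr[b] else b)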
- class neer_match_utilities.feature_selection.FeatureSelectionResult(updated_similarity_map, selected_feature_columns, selected_pairs, coef_by_feature, meta)[source]
Result object returned by FeatureSelector.execute().
- updated_similarity_map
Reduced similarity map containing only selected features. Keys are variable names, values are lists of similarity concept names.
- Type:
Dict[str, List[str]]
- selected_feature_columns
Names of selected feature columns in the format ‘col_{var}_{var}_{similarity}’.
- Type:
List[str]
- selected_pairs
List of (variable, similarity_concept) tuples that were selected.
- Type:
List[Tuple[str, str]]
- coef_by_feature
Coefficients from the final elastic net model, sorted by absolute value. Useful for understanding feature importance.
- Type:
pd.Series
- meta
Metadata about the selection process, including:
- method: feature selection method used
- scoring: scoring metric used for cross-validation
- cv: number of cross-validation folds
- n_features_in: number of input features
- n_features_selected: number of features selected
- did_fallback: whether fallback to the original map occurred
- Type:
Dict[str, Any]
- __init__(updated_similarity_map, selected_feature_columns, selected_pairs, coef_by_feature, meta)
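For instance, assuming result is the object returned by FeatureSelector.execute(), the attributes above can be inspected as follows (a sketch; actual names and values depend on your similarity map):
>>> print(result.meta['n_features_in'], '->', result.meta['n_features_selected'])
>>> result.coef_by_feature.head()      # coefficients sorted by absolute value
>>> result.selected_pairs              # e.g. [('name', 'jaro_winkler'), ...]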
- class neer_match_utilities.feature_selection.FeatureSelector(similarity_map, training_data, *, id_left_col='id', id_right_col='id', matches_id_left='left', matches_id_right='right', match_col='match', matches_are_indices=True, method='elastic_net', scoring='average_precision', cv=5, l1_ratios=(0.8, 0.9, 1.0), Cs=20, class_weight='balanced', max_iter=5000, random_state=42, n_jobs=-1, min_coef_threshold=0.0, max_correlation=None, always_keep=None, preferred_separators=('__', '|', ':', '-', '_'))[source]
Supervised feature selector for similarity-based entity matching.
This class implements a two-stage feature selection process optimized for entity matching tasks with extreme class imbalance:
- Stage 1: Correlation-based filtering (optional)
Removes redundant features by identifying groups of highly correlated features and keeping only the one most correlated with the target.
- Stage 2: Elastic net regularization
Uses penalized logistic regression with L1/L2 penalties to identify features that contribute unique predictive information. Cross-validation is used to select optimal regularization parameters.
The selector returns an updated similarity map containing only the features that passed both selection stages.
- Parameters:
similarity_map (Dict[str, List[str]] or SimilarityMap) – Mapping from variable names to lists of similarity concepts. Example: {‘name’: [‘jaro_winkler’, ‘levenshtein’], ‘address’: [‘cosine’]}
training_data (Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]) – Three-tuple of (left_df, right_df, matches_df) used for feature selection.
id_left_col (str, default='id') – Column name containing entity IDs in the left dataframe.
id_right_col (str, default='id') – Column name containing entity IDs in the right dataframe.
matches_id_left (str, default='left') – Column name in matches_df identifying left entity IDs.
matches_id_right (str, default='right') – Column name in matches_df identifying right entity IDs.
match_col (str, default='match') – Name of the binary match indicator column (1=match, 0=non-match).
matches_are_indices (bool, default=True) – If True, treat match IDs as integer row indices. If False, treat as entity IDs.
method (str, default='elastic_net') – Feature selection method. Currently only ‘elastic_net’ is supported.
scoring (str, default='average_precision') – Scoring metric for cross-validation. Options: ‘f1’, ‘roc_auc’, ‘average_precision’, ‘neg_log_loss’. For imbalanced data, ‘average_precision’ is recommended.
cv (int, default=5) – Number of cross-validation folds.
l1_ratios (tuple, default=(0.8, 0.9, 1.0)) – L1 penalty ratios to try. 1.0 = pure Lasso, 0.0 = pure Ridge.
Cs (int or Iterable[float], default=20) – Regularization strengths to try. If an int, that many values are generated, logarithmically spaced between 0.01 and 1000. Higher C = less regularization.
class_weight (str or None, default='balanced') – Class weighting strategy. ‘balanced’ adjusts weights inversely proportional to class frequencies, recommended for imbalanced data.
max_iter (int, default=5000) – Maximum iterations for elastic net solver.
random_state (int, default=42) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs. -1 uses all available processors.
min_coef_threshold (float, default=0.0) – Minimum absolute coefficient value for feature retention. Features with abs(coef) < threshold are dropped. Set to 0.0 to keep all non-zero features.
max_correlation (float or None, default=None) – Correlation threshold for Stage 1 filtering. Features with pairwise correlation > threshold are candidates for removal. Example: 0.95. If None, Stage 1 is skipped.
always_keep (Dict[str, List[str]] or None, default=None) – Features to always retain regardless of selection results. Example: {‘surname’: [‘jaro_winkler’]} always keeps surname jaro_winkler.
preferred_separators (tuple, default=('__', '|', ':', '-', '_')) – Separators to try when parsing feature names (internal use).
- similarity_map
The input similarity map.
- Type:
Dict[str, List[str]]
- left_train, right_train, matches_train
Training datasets for feature selection.
- Type:
pd.DataFrame
Examples
>>> from neer_match_utilities import FeatureSelector
>>> selector = FeatureSelector(
...     similarity_map={'name': ['jaro_winkler', 'levenshtein', 'cosine'],
...                     'address': ['jaro_winkler', 'levenshtein']},
...     training_data=(left_df, right_df, matches_df),
...     max_correlation=0.95,
...     min_coef_threshold=0.01
... )
>>> result = selector.execute()
>>> print(result.updated_similarity_map)
{'name': ['jaro_winkler'], 'address': ['levenshtein']}
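Features listed in always_keep are retained even if selection would otherwise drop them. A sketch reusing the hypothetical left_df, right_df, and matches_df from above:
>>> selector = FeatureSelector(
...     similarity_map={'name': ['jaro_winkler', 'levenshtein'],
...                     'surname': ['jaro_winkler']},
...     training_data=(left_df, right_df, matches_df),
...     always_keep={'surname': ['jaro_winkler']},
... )
>>> result = selector.execute()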
- __init__(similarity_map, training_data, *, id_left_col='id', id_right_col='id', matches_id_left='left', matches_id_right='right', match_col='match', matches_are_indices=True, method='elastic_net', scoring='average_precision', cv=5, l1_ratios=(0.8, 0.9, 1.0), Cs=20, class_weight='balanced', max_iter=5000, random_state=42, n_jobs=-1, min_coef_threshold=0.0, max_correlation=None, always_keep=None, preferred_separators=('__', '|', ':', '-', '_'))[source]
- execute()[source]
Execute the two-stage feature selection process.
This method performs the following steps:
1. Builds pairwise similarity features from the training data
2. Cleans the data (removes constant columns, handles missing values)
3. Stage 1 (optional): correlation-based filtering
4. Scales features for regularization
5. Stage 2: elastic net cross-validation and feature selection
6. Applies coefficient thresholding (if configured)
7. Converts selected features back to similarity map format
- Returns:
Object containing the updated similarity map, selected features, coefficients, and metadata about the selection process.
- Return type:
FeatureSelectionResult
- Raises:
ValueError – If the method is not ‘elastic_net’, or if no usable features remain after cleaning or correlation filtering, or if there are too few positive examples for cross-validation.
Notes
The method prints detailed diagnostic information during execution:
- Top feature correlations
- Features dropped in Stage 1 (correlation filtering)
- Cross-validation progress and best parameters
- Features dropped in Stage 2 (elastic net)
- Top coefficients by absolute value
- Final summary statistics
Examples
>>> result = selector.execute()
>>> print(f"Selected {len(result.selected_feature_columns)} features")
>>> print(f"Updated similarity map: {result.updated_similarity_map}")
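For intuition, step 5 above (the elastic net stage) can be approximated directly with scikit-learn. The sketch below uses the class defaults and assumes X_scaled and y are the scaled feature matrix and binary match indicator; it mirrors, but is not necessarily identical to, the internal implementation:
>>> from sklearn.linear_model import LogisticRegressionCV
>>> clf = LogisticRegressionCV(
...     Cs=20, cv=5, penalty='elasticnet', solver='saga',
...     l1_ratios=[0.8, 0.9, 1.0], scoring='average_precision',
...     class_weight='balanced', max_iter=5000, random_state=42, n_jobs=-1,
... )
>>> clf.fit(X_scaled, y)
>>> # features surviving the L1 penalty (non-zero coefficients)
>>> kept = [f for f, c in zip(X_scaled.columns, clf.coef_[0]) if abs(c) > 0]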