Prepare
- class neer_match_utilities.prepare.Prepare(similarity_map, df_left, df_right, id_left, id_right, spacy_pipeline='', additional_stop_words=[])
A class for preparing and processing data based on similarity mappings.
The Prepare class inherits from SuperClass and provides functionality to clean, preprocess, and align two pandas DataFrames (df_left and df_right) based on a given similarity map. This is useful for data cleaning and ensuring data compatibility before comparison or matching operations.
Attributes:
- similarity_map : dict
A dictionary defining column mappings between the left and right DataFrames.
- df_left : pandas.DataFrame
The left DataFrame to be processed.
- df_right : pandas.DataFrame
The right DataFrame to be processed.
- id_left : str
Column name representing unique IDs in the left DataFrame.
- id_right : str
Column name representing unique IDs in the right DataFrame.
- spacy_pipeline : str
Name of the spaCy model loaded for NLP tasks (e.g., “en_core_web_sm”). If empty, no spaCy pipeline is used. See https://spacy.io/models for available models.
- additional_stop_words : list of str
Extra tokens to mark as stop words in the spaCy pipeline.
- __init__(similarity_map, df_left, df_right, id_left, id_right, spacy_pipeline='', additional_stop_words=[])
- do_remove_stop_words(text)
Removes stop words and non-alphabetic tokens from text.
- Return type:
str
Parameters:
- text : str
The input text to process.
Returns:
str
A space-separated string of unique lemmas after tokenization, lemmatization, and duplicate removal.
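The filtering and de-duplication steps described above can be sketched without spaCy. This is a simplified stand-in, not the library's implementation: it tokenizes on whitespace, lower-cases tokens instead of lemmatizing them, and uses a small hypothetical stop-word set (`STOP_WORDS`). The function name `remove_stop_words_sketch` is also hypothetical, and keeping tokens in first-occurrence order is an assumption the documentation does not confirm.

```python
# Hypothetical stop-word set; the real method relies on the spaCy pipeline's list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}


def remove_stop_words_sketch(text: str) -> str:
    """Drop stop words and non-alphabetic tokens, then remove duplicates.

    Simplification: tokens come from whitespace splitting and are
    lower-cased rather than lemmatized.
    """
    seen = set()
    kept = []
    for token in text.split():
        word = token.lower()
        if not word.isalpha():  # discards numbers and tokens with punctuation
            continue
        if word in STOP_WORDS or word in seen:
            continue
        seen.add(word)
        kept.append(word)  # first occurrence wins, order preserved (assumption)
    return " ".join(kept)


print(remove_stop_words_sketch("The Bank of England and the bank 1694"))
# prints: bank england
```

With a real spaCy pipeline, inflected forms such as "banks" and "bank" would presumably collapse to one lemma; this sketch only catches exact repeats.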
- format(fill_numeric_na=False, to_numeric=[], fill_string_na=False, capitalize=False, lower_case=False, remove_stop_words=False)
Cleans, processes, and aligns the columns of two DataFrames (df_left and df_right).
This method applies transformations based on column mappings defined in similarity_map. It handles numeric and string conversions, fills missing values, and ensures consistent data types between the columns of the two DataFrames.
- Parameters:
fill_numeric_na (bool, optional) – If True, fills missing numeric values with 0 before conversion to numeric dtype. Default is False.
to_numeric (list, optional) – A list of column names to be converted to numeric dtype. Default is an empty list.
fill_string_na (bool, optional) – If True, fills missing string values with empty strings. Default is False.
capitalize (bool, optional) – If True, capitalizes string values in non-numeric columns. Default is False.
lower_case (bool, optional) – If True, uses lower-case string values in non-numeric columns. Default is False.
remove_stop_words (bool, optional) – If True, applies stop-word removal and lemmatization to non-numeric columns using the do_remove_stop_words method. This only works if a spaCy pipeline was specified (via spacy_pipeline) when the Prepare object was initialized. Default is False.
- Returns:
A tuple containing the processed left (df_left_processed) and right (df_right_processed) DataFrames.
- Return type:
tuple[pandas.DataFrame, pandas.DataFrame]
Notes
- Columns are processed and aligned according to the similarity_map: if both columns are numeric, their types are aligned; if their types differ, both columns are converted to strings while preserving NaN.
- Missing values and type conversions are controlled through the fill_numeric_na, fill_string_na, and to_numeric parameters.
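The alignment rule in the notes can be illustrated on plain Python lists. This is a sketch of the rule as stated, not the pandas-based code the library uses: the helper names (`is_missing`, `is_numeric_column`, `align_pair_sketch`) are hypothetical, "numeric" is approximated as "every non-missing value is an int or float", and float is assumed as the common numeric type.

```python
import math


def is_missing(value) -> bool:
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_numeric_column(values) -> bool:
    """True if every non-missing value is an int or float (bools excluded)."""
    return all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in values
        if not is_missing(v)
    )


def align_pair_sketch(left, right):
    """Align two paired columns, each given as a list of values."""

    def to_float(v):
        return math.nan if is_missing(v) else float(v)

    def to_str(v):
        return v if is_missing(v) else str(v)  # missing values stay missing

    if is_numeric_column(left) and is_numeric_column(right):
        # Both numeric: bring both sides to one common type (float, by assumption).
        return [to_float(v) for v in left], [to_float(v) for v in right]
    # Otherwise: compare as strings, without turning missing values into text.
    return [to_str(v) for v in left], [to_str(v) for v in right]


print(align_pair_sketch([1, 2], [1.5, 2.5]))
print(align_pair_sketch([1999, None], ["1999", "2001"]))
```

The second call shows why missing values are kept as they are: converting None or NaN with str() would produce the text "None" or "nan", which would then be compared as if it were real data.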
- neer_match_utilities.prepare.similarity_map_to_dict(items)
Convert a list of similarity mappings into a dictionary representation.
The function accepts a list of tuples, where each tuple represents a mapping with the form (left, right, similarity). If the left and right column names are identical, the dictionary key is that column name; otherwise, the key is formed as left~right.
- Returns:
A dictionary where keys are column names (or left~right for differing columns) and values are lists of similarity functions associated with those columns.
- Return type:
dict
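The key rule described above (same name on both sides gives a plain key, different names give left~right) can be sketched in a few lines. This is an illustrative reimplementation based only on the description, not the library's source; the function name `similarity_map_to_dict_sketch` is hypothetical, and the similarity names in the example are placeholder strings.

```python
def similarity_map_to_dict_sketch(items):
    """Group similarity names by column key.

    items: iterable of (left, right, similarity) tuples.
    """
    result = {}
    for left, right, similarity in items:
        # Identical column names collapse to one key; otherwise join with "~".
        key = left if left == right else f"{left}~{right}"
        result.setdefault(key, []).append(similarity)
    return result


# Placeholder similarity names, used only for illustration.
mappings = [
    ("name", "name", "jaro_winkler"),
    ("name", "name", "levenshtein"),
    ("founded", "year", "discrete"),
]
print(similarity_map_to_dict_sketch(mappings))
# prints: {'name': ['jaro_winkler', 'levenshtein'], 'founded~year': ['discrete']}
```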
- neer_match_utilities.prepare.synth_mismatches(right, columns_fix, columns_change, str_metric, str_similarity_range, pct_diff_range, n_cols, n_mismatches=1, keep_missing=True, nan_share=0.0, empty_share=0.0, sample_share=1.0, id_right=None)
Generates synthetic mismatches for a share of observations in right. Returns:
- All original rows from right (unchanged).
- New synthetic mismatches (with new UUID4 IDs if id_right is specified) for a random subset of original rows of size floor(sample_share * len(right)).
Any synthetic row whose data portion duplicates an original right row or another synthetic row is dropped.
- String columns in columns_change:
Require str_similarity = normalized_similarity(orig, candidate) within [min_str_sim, max_str_sim].
If no candidate qualifies, perturb orig until the similarity is in that range.
- Numeric columns in columns_change:
Require the percentage difference |orig - candidate| / |orig| within [min_pct_diff, max_pct_diff]. (If orig == 0, any candidate ≠ 0 counts as pct_diff = 1.0.)
If no candidate qualifies, perturb orig until the percentage difference is in that range, by taking orig * (1 ± min_pct_diff) or orig * (1 ± max_pct_diff).
- keep_missing: if True, any NaN or “” in the original right row’s columns_change is preserved (no change).
- nan_share / empty_share: after generating all synthetics and deduplicating, inject NaN or “” into columns_change at the given probabilities.
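The numeric rule can be made concrete with a small sketch. It follows the description only and is not the library's code: the helper names (`pct_diff`, `in_pct_range`, `fallback_candidates`) are hypothetical, and returning 0.0 when both orig and candidate are zero is an assumption, since the documentation covers only the candidate ≠ 0 case.

```python
def pct_diff(orig: float, candidate: float) -> float:
    """Percentage difference |orig - candidate| / |orig|."""
    if orig == 0:
        # Documented: any non-zero candidate counts as 1.0.
        # Assumption: a zero candidate counts as no difference.
        return 1.0 if candidate != 0 else 0.0
    return abs(orig - candidate) / abs(orig)


def in_pct_range(orig: float, candidate: float, pct_diff_range) -> bool:
    """Check min_pct_diff <= pct_diff <= max_pct_diff."""
    min_pct_diff, max_pct_diff = pct_diff_range
    return min_pct_diff <= pct_diff(orig, candidate) <= max_pct_diff


def fallback_candidates(orig: float, pct_diff_range):
    """Perturbed values at the range bounds: orig * (1 ± min), orig * (1 ± max)."""
    min_pct_diff, max_pct_diff = pct_diff_range
    return [
        orig * (1 - min_pct_diff),
        orig * (1 + min_pct_diff),
        orig * (1 - max_pct_diff),
        orig * (1 + max_pct_diff),
    ]


print(in_pct_range(100, 120, (0.1, 0.3)))      # 20% difference, inside the range
print(in_pct_range(100, 150, (0.1, 0.3)))      # 50% difference, outside the range
print(fallback_candidates(100, (0.25, 0.5)))   # values at the two range bounds
```

A real implementation would likely need a tolerance when checking the bounds, because products such as orig * (1 + min_pct_diff) are not always exactly representable in floating point.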
- Parameters:
right (pd.DataFrame) – The DataFrame containing the “true” observations.
id_right (str or None) – Column name of the unique ID in right. If None, no ID column is created for synthetic rows.
columns_fix (list of str) – Columns whose values remain unchanged (copied directly from the original row).
columns_change (list of str) – Columns whose values are modified to create mismatches.
str_metric (str) – Name of the string‐similarity metric (key in available_similarities()).
str_similarity_range (tuple (min_str_sim, max_str_sim)) – Range of allowed normalized_similarity (0–1). Candidate strings must satisfy min_str_sim ≤ similarity(orig, candidate) ≤ max_str_sim.
pct_diff_range (tuple (min_pct_diff, max_pct_diff)) – For numeric columns, the percentage difference |orig - candidate| / |orig| must lie in [min_pct_diff, max_pct_diff]. Both bounds are between 0.0 and 1.0.
n_cols (int) – How many of the columns_change to modify per synthetic row. If n_cols < len(columns_change), pick that many at random; if n_cols > len(columns_change), warn and modify all.
n_mismatches (int) – How many synthetic mismatches to generate for each selected original right row. Default is 1.
keep_missing (bool) – If True, any NaN or “” in the original row’s columns_change is preserved (no change).
nan_share (float in [0,1]) – After deduplication, each synthetic cell in columns_change is set to NaN with probability nan_share. Default is 0.0.
empty_share (float in [0,1]) – After deduplication, each synthetic cell in columns_change is set to “” with probability empty_share. Applied after nan_share. Default is 0.0.
sample_share (float in [0,1], default=1.0) – Proportion of original right rows to select at random for synthetics. If 1.0, all rows are selected; if 0.5, floor(0.5 * n_rows) rows are chosen.
- Returns:
Expanded DataFrame with original + synthetic rows.
- Return type:
pd.DataFrame
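How sample_share and n_cols determine what gets modified can be sketched with the standard library alone. This is a hedged illustration of the documented rules, not the function's actual code: `pick_rows_and_columns` and its `seed` argument are hypothetical, rows are represented by integer positions, and the column choice shown is for a single synthetic row.

```python
import math
import random
import warnings


def pick_rows_and_columns(n_rows, columns_change, sample_share, n_cols, seed=None):
    """Select row positions and the columns to modify for one synthetic row."""
    rng = random.Random(seed)

    # Documented rule: floor(sample_share * len(right)) rows are selected.
    n_selected = math.floor(sample_share * n_rows)
    row_positions = sorted(rng.sample(range(n_rows), n_selected))

    # Documented rule: if n_cols is too large, warn and modify all columns.
    if n_cols > len(columns_change):
        warnings.warn("n_cols exceeds len(columns_change); modifying all columns.")
        n_cols = len(columns_change)
    columns = rng.sample(columns_change, n_cols)
    return row_positions, columns


rows, cols = pick_rows_and_columns(
    n_rows=7,
    columns_change=["name", "city", "revenue"],
    sample_share=0.5,
    n_cols=2,
    seed=0,
)
print(len(rows), len(cols))
# prints: 3 2
```

Before deduplication, the number of synthetic rows is at most the number of selected rows times n_mismatches, so 3 selected rows with n_mismatches=2 would yield at most 6 synthetic rows.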