Prepare

class neer_match_utilities.prepare.Prepare(similarity_map, df_left, df_right, id_left, id_right, spacy_pipeline='', additional_stop_words=[])

A class for preparing and processing data based on similarity mappings.

The Prepare class inherits from SuperClass and provides functionality to clean, preprocess, and align two pandas DataFrames (df_left and df_right) based on a given similarity map. This is useful for data cleaning and ensuring data compatibility before comparison or matching operations.

Attributes:

similarity_map : dict

A dictionary defining column mappings between the left and right DataFrames.

df_left : pandas.DataFrame

The left DataFrame to be processed.

df_right : pandas.DataFrame

The right DataFrame to be processed.

id_left : str

Column name representing unique IDs in the left DataFrame.

id_right : str

Column name representing unique IDs in the right DataFrame.

spacy_pipeline : str

Name of the spaCy model to load for NLP tasks (e.g., “en_core_web_sm”); see https://spacy.io/models for available models. If empty, no spaCy pipeline is used.

additional_stop_words : list of str

Extra tokens to mark as stop words in the spaCy pipeline.

__init__(similarity_map, df_left, df_right, id_left, id_right, spacy_pipeline='', additional_stop_words=[])

do_remove_stop_words(text)

Removes stop words and non-alphabetic tokens from text.

Return type:

str

Parameters:

text : str

The input text to process.

Returns:

str

A space-separated string of unique lemmas after tokenization, lemmatization, and duplicate removal.
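
The filtering and deduplication steps can be sketched without spaCy. The snippet below is a simplified stand-in, not the library's implementation: it splits on whitespace and lower-cases tokens instead of running a spaCy tokenizer and lemmatizer, and it uses a small hypothetical stop-word set. The name `remove_stop_words_sketch` and the order-preserving deduplication are assumptions made for illustration.

```python
# Hypothetical stop-word set; spaCy pipelines ship their own, larger lists.
STOP_WORDS = {"the", "a", "an", "of", "and", "in"}


def remove_stop_words_sketch(text, stop_words=STOP_WORDS):
    """Drop stop words and non-alphabetic tokens, then remove duplicates.

    Simplification: tokens are split on whitespace and lower-cased instead
    of being lemmatized by a spaCy pipeline.
    """
    seen = set()
    kept = []
    for token in text.split():
        word = token.lower()
        if not word.isalpha() or word in stop_words:
            continue  # skip non-alphabetic tokens and stop words
        if word not in seen:  # keep only the first occurrence
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


print(remove_stop_words_sketch("The Bank of England and the bank 1694"))
# bank england
```

With a real spaCy pipeline, the kept tokens would be lemmas rather than lower-cased surface forms, so the output can differ from this sketch.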

format(fill_numeric_na=False, to_numeric=[], fill_string_na=False, capitalize=False, lower_case=False, remove_stop_words=False)

Cleans, processes, and aligns the columns of two DataFrames (df_left and df_right).

This method applies transformations based on column mappings defined in similarity_map. It handles numeric and string conversions, fills missing values, and ensures consistent data types between the columns of the two DataFrames.

Parameters:
  • fill_numeric_na (bool, optional) – If True, fills missing numeric values with 0 before conversion to numeric dtype. Default is False.

  • to_numeric (list, optional) – A list of column names to be converted to numeric dtype. Default is an empty list.

  • fill_string_na (bool, optional) – If True, fills missing string values with empty strings. Default is False.

  • capitalize (bool, optional) – If True, capitalizes string values in non-numeric columns. Default is False.

  • lower_case (bool, optional) – If True, converts string values in non-numeric columns to lower case. Default is False.

  • remove_stop_words (bool, optional) – If True, applies stop-word removal and lemmatization to non-numeric columns using the do_remove_stop_words method. This requires a valid spaCy pipeline to have been specified when the Prepare object was initialized. Default is False.

Returns:

A tuple containing the processed left (df_left_processed) and right (df_right_processed) DataFrames.

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]

Notes

  • Columns are processed and aligned according to the similarity_map:
    • If both columns are numeric, their types are aligned.

    • If types differ, columns are converted to strings while preserving NaN.

  • Missing-value handling and type conversion are controlled by the fill_numeric_na, fill_string_na, and to_numeric parameters.

neer_match_utilities.prepare.similarity_map_to_dict(items)

Convert a list of similarity mappings into a dictionary representation.

The function accepts a list of tuples, where each tuple represents a mapping with the form (left, right, similarity). If the left and right column names are identical, the dictionary key is that column name; otherwise, the key is formed as left~right.

Returns:

A dictionary where keys are column names (or left~right for differing columns) and values are lists of similarity functions associated with those columns.

Return type:

dict
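
The key-building rule described above can be sketched as follows. This is an illustrative reimplementation under the stated rule, not the library's source; the function name `similarity_map_to_dict_sketch` and the similarity names in the example are placeholders.

```python
def similarity_map_to_dict_sketch(items):
    """Group similarity names by column key, following the rule above."""
    result = {}
    for left, right, similarity in items:
        # Identical names use the name itself; otherwise join with "~".
        key = left if left == right else f"{left}~{right}"
        result.setdefault(key, []).append(similarity)
    return result


# Placeholder column and similarity names, for illustration only.
items = [
    ("name", "name", "jaro_winkler"),
    ("name", "name", "levenshtein"),
    ("founded", "year", "discrete"),
]
print(similarity_map_to_dict_sketch(items))
# {'name': ['jaro_winkler', 'levenshtein'], 'founded~year': ['discrete']}
```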

neer_match_utilities.prepare.synth_mismatches(right, columns_fix, columns_change, str_metric, str_similarity_range, pct_diff_range, n_cols, n_mismatches=1, keep_missing=True, nan_share=0.0, empty_share=0.0, sample_share=1.0, id_right=None)

Generates synthetic mismatches for a share of the observations in right. The returned DataFrame contains:
  • all original rows from right (unchanged), and

  • new synthetic mismatches (with new UUID4 IDs if id_right is specified) for a random subset of floor(sample_share * len(right)) original rows.

Any synthetic row whose data portion duplicates an original right row or another synthetic row is dropped.

String columns in columns_change:
  • require str_similarity = normalized_similarity(orig, candidate) within [min_str_sim, max_str_sim].

  • if no candidate qualifies, perturb orig until similarity is in that range.
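
The range check on string candidates can be sketched as below. Note the assumptions: `difflib.SequenceMatcher.ratio()` from the standard library stands in for the metric selected by str_metric, so the scores will generally differ from the library's; `qualifying_candidates` is a hypothetical helper name, and the perturbation fallback is not shown.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in metric in [0, 1]; the library uses the metric named by str_metric.
    return SequenceMatcher(None, a, b).ratio()


def qualifying_candidates(orig, candidates, min_str_sim, max_str_sim):
    """Return candidates whose similarity to orig lies in the closed range."""
    return [
        c for c in candidates
        if min_str_sim <= similarity(orig, c) <= max_str_sim
    ]


print(qualifying_candidates("abcd", ["abcd", "abcx", "wxyz"], 0.5, 0.9))
# ['abcx']
```

Here "abcd" is rejected because it is identical to the original (similarity 1.0), and "wxyz" because it shares no characters with it (similarity 0.0).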

Numeric columns in columns_change:
  • require percentage difference |orig - candidate| / |orig| within [min_pct_diff, max_pct_diff]. (If orig == 0, any candidate ≠ 0 counts as pct_diff = 1.0.)

  • if no candidate qualifies, perturb orig until percentage difference is in that range, by taking orig * (1 ± min_pct_diff) or (1 ± max_pct_diff).
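
The numeric rule can be sketched in a few lines. This is an illustration of the documented formula, not the library's code: the helper names are hypothetical, and returning 0.0 when both orig and candidate are zero is an assumption, since the text only covers non-zero candidates.

```python
def pct_diff(orig, candidate):
    """Percentage difference |orig - candidate| / |orig|."""
    if orig == 0:
        # Documented special case: any non-zero candidate counts as 1.0.
        # Returning 0.0 for candidate == 0 is an assumption of this sketch.
        return 1.0 if candidate != 0 else 0.0
    return abs(orig - candidate) / abs(orig)


def in_range(orig, candidate, min_pct_diff, max_pct_diff):
    """True if the percentage difference lies in the closed range."""
    return min_pct_diff <= pct_diff(orig, candidate) <= max_pct_diff


def fallback_values(orig, min_pct_diff, max_pct_diff):
    """The four perturbed values orig * (1 ± bound) for both bounds."""
    return [
        orig * (1 - min_pct_diff),
        orig * (1 + min_pct_diff),
        orig * (1 - max_pct_diff),
        orig * (1 + max_pct_diff),
    ]


print(pct_diff(200, 150))                # 0.25
print(pct_diff(0, 5))                    # 1.0
print(in_range(200, 190, 0.25, 0.5))     # False
print(fallback_values(200, 0.25, 0.5))   # [150.0, 250.0, 100.0, 300.0]
```

Each fallback value sits exactly on a boundary of the allowed range, so it always satisfies the range check for a non-zero orig (up to floating-point rounding).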

keep_missing: if True, any NaN or “” in the original right row’s columns_change is preserved (no change).

nan_share/empty_share: after generating all synthetics and deduplicating, inject NaN or “” into columns_change at the given probabilities.
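
A minimal sketch of the injection step, operating on a plain list of cell values rather than a DataFrame. The helper name `inject_missing` and the seed argument are hypothetical. Running the empty-string pass second, so that it may overwrite cells already set to NaN, is an assumption based on the statement that empty_share is applied after nan_share.

```python
import math
import random


def inject_missing(values, nan_share, empty_share, seed=0):
    """Blank out cells independently: NaN pass first, then empty-string pass."""
    rng = random.Random(seed)
    out = list(values)
    for i in range(len(out)):
        if rng.random() < nan_share:
            out[i] = float("nan")
    for i in range(len(out)):
        # Assumption: this pass may overwrite cells the NaN pass already hit.
        if rng.random() < empty_share:
            out[i] = ""
    return out


print(inject_missing(["a", "b", "c"], nan_share=0.0, empty_share=0.0))
# ['a', 'b', 'c']
print(inject_missing(["a", "b", "c"], nan_share=0.0, empty_share=1.0))
# ['', '', '']
```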

Parameters:
  • right (pd.DataFrame) – The DataFrame containing the “true” observations.

  • id_right (str or None) – Column name of the unique ID in right. If None, no ID column is created for synthetic rows.

  • columns_fix (list of str) – Columns whose values remain unchanged (copied directly from the original row).

  • columns_change (list of str) – Columns whose values are modified to create mismatches.

  • str_metric (str) – Name of the string‐similarity metric (key in available_similarities()).

  • str_similarity_range (tuple (min_str_sim, max_str_sim)) – Range of allowed normalized_similarity (0–1). Candidate strings must satisfy min_str_sim ≤ similarity(orig, candidate) ≤ max_str_sim.

  • pct_diff_range (tuple (min_pct_diff, max_pct_diff)) – For numeric columns: percentage difference |orig - candidate| / |orig| must lie in [min_pct_diff, max_pct_diff]. (pct_diff_range values are between 0.0 and 1.0.)

  • n_cols (int) – How many of the columns_change to modify per synthetic row. If n_cols < len(columns_change), pick that many at random; if n_cols > len(columns_change), warn and modify all.

  • n_mismatches (int) – How many synthetic mismatches to generate for each selected original right row.

  • keep_missing (bool) – If True, any NaN or “” in the original row’s columns_change is preserved (no change).

  • nan_share (float in [0,1]) – After deduplication, each synthetic cell in columns_change is set to NaN with probability nan_share.

  • empty_share (float in [0,1]) – After deduplication, each synthetic cell in columns_change is set to “” with probability empty_share. (Applied after nan_share.)

  • sample_share (float in [0,1], default=1.0) – Proportion of original right rows to select at random for synthetics. If 1.0, all rows are used; if 0.5, floor(0.5 * n_rows) rows are chosen.

Returns:

Expanded DataFrame with original + synthetic rows.

Return type:

pd.DataFrame
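
The row and column selection arithmetic described for sample_share and n_cols can be sketched as follows. The helper names `select_rows` and `select_columns`, the warning text, and the example column names are hypothetical; the sketch works on positions and names only and does not touch a DataFrame.

```python
import math
import random
import warnings


def select_rows(n_rows, sample_share, rng):
    """Pick floor(sample_share * n_rows) distinct row positions at random."""
    k = math.floor(sample_share * n_rows)
    return rng.sample(range(n_rows), k)


def select_columns(columns_change, n_cols, rng):
    """Pick n_cols columns at random; warn and use all if n_cols is too large."""
    if n_cols > len(columns_change):
        warnings.warn("n_cols exceeds len(columns_change); modifying all columns")
        return list(columns_change)
    return rng.sample(list(columns_change), n_cols)


rng = random.Random(42)
rows = select_rows(n_rows=7, sample_share=0.5, rng=rng)
cols = select_columns(["name", "city", "revenue"], n_cols=2, rng=rng)
print(len(rows), len(cols))  # 3 2
```

With 7 rows and sample_share=0.5, floor(3.5) = 3 rows are selected, so the share is rounded down rather than to the nearest integer.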