Prepare

class neer_match_utilities.prepare.Prepare(similarity_map, df_left=pandas.DataFrame(), df_right=pandas.DataFrame(), id_left='id', id_right='id')

A class for preparing and processing data based on similarity mappings.

The Prepare class inherits from SuperClass and provides functionality to clean, preprocess, and align two pandas DataFrames (df_left and df_right) based on a given similarity map. This is useful for data cleaning and ensuring data compatibility before comparison or matching operations.

Attributes:

similarity_map (dict) – A dictionary defining column mappings between the left and right DataFrames.
df_left (pandas.DataFrame) – The left DataFrame to be processed.
df_right (pandas.DataFrame) – The right DataFrame to be processed.
id_left (str) – Column name representing unique IDs in the left DataFrame.
id_right (str) – Column name representing unique IDs in the right DataFrame.

format(fill_numeric_na=False, to_numeric=[], fill_string_na=False, capitalize=False)

Cleans, processes, and aligns the columns of two DataFrames (df_left and df_right).

This method applies transformations based on column mappings defined in similarity_map. It handles numeric and string conversions, fills missing values, and ensures consistent data types between the columns of the two DataFrames.

Parameters:

fill_numeric_na (bool, optional) – If True, fills missing numeric values with 0 before conversion to numeric dtype. Default is False.
to_numeric (list, optional) – A list of column names to be converted to numeric dtype. Default is an empty list.
fill_string_na (bool, optional) – If True, fills missing string values with empty strings. Default is False.
capitalize (bool, optional) – If True, capitalizes string values in non-numeric columns. Default is False.
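
The effect of these four options on a single column can be approximated in plain pandas. The following is a minimal sketch, not the library's source: the helper clean_column_sketch and its numeric flag are hypothetical, and "capitalize" is interpreted here as upper-casing, which is an assumption about the actual behavior.

```python
import pandas as pd


def clean_column_sketch(s, *, numeric=False, fill_numeric_na=False,
                        fill_string_na=False, capitalize=False):
    """Hypothetical helper mirroring the documented options for one column."""
    if numeric:
        # Mirrors a column listed in `to_numeric`.
        if fill_numeric_na:
            # Fill missing values with 0 *before* the numeric conversion.
            s = s.fillna(0)
        return pd.to_numeric(s, errors="coerce")
    if fill_string_na:
        # Replace missing strings with empty strings.
        s = s.fillna("")
    if capitalize:
        # Assumption: "capitalize" means upper-casing the whole string.
        s = s.str.upper()
    return s


# Object dtype is set explicitly so the example behaves the same across pandas versions.
raw_numbers = pd.Series(["3", None, "12"], dtype=object)
raw_names = pd.Series(["acme", None], dtype=object)

numbers = clean_column_sketch(raw_numbers, numeric=True, fill_numeric_na=True)
names = clean_column_sketch(raw_names, fill_string_na=True, capitalize=True)
```

With both numeric options enabled, the missing entry in raw_numbers becomes 0 and the column ends up with a numeric dtype; with both string options enabled, the missing entry in raw_names becomes an empty string.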

Returns:

A tuple containing the processed left (df_left_processed) and right (df_right_processed) DataFrames.

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]

Notes

Columns are processed and aligned according to the similarity_map:
- If both columns are numeric, their types are aligned.
- If types differ, columns are converted to strings while preserving NaN.
Missing-value handling and type conversion are controlled by the fill_numeric_na, to_numeric, fill_string_na, and capitalize parameters.
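
The two alignment rules in these notes can be illustrated for one pair of mapped columns. This is a rough sketch under stated assumptions, not the library's implementation: align_pair_sketch is a hypothetical helper, and casting mismatched numeric columns to float64 is one possible way to "align" them.

```python
import pandas as pd


def align_pair_sketch(left, right):
    """Hypothetical helper: align one left/right column pair."""
    both_numeric = (pd.api.types.is_numeric_dtype(left)
                    and pd.api.types.is_numeric_dtype(right))
    if both_numeric:
        if left.dtype == right.dtype:
            return left, right
        # Assumption: mismatched numeric dtypes are aligned by casting to float64.
        return left.astype("float64"), right.astype("float64")

    def to_str_keep_nan(s):
        # Convert to strings, then restore NaN where the original was missing.
        return s.astype(str).where(s.notna())

    return to_str_keep_nan(left), to_str_keep_nan(right)


df_left = pd.DataFrame({"revenue": [10, 20], "zip": [1010, 8001]})
df_right = pd.DataFrame({"revenue": [10.5, None], "zip": ["1010", None]})

l_rev, r_rev = align_pair_sketch(df_left["revenue"], df_right["revenue"])
l_zip, r_zip = align_pair_sketch(df_left["zip"], df_right["zip"])
```

Here the revenue columns are both numeric, so they end up sharing one dtype. The zip columns differ in type (integers on the left, strings on the right), so both become strings and the missing value on the right stays NaN rather than turning into the text "nan" or "None".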

neer_match_utilities.prepare.similarity_map_to_dict(items)

Convert a list of similarity mappings into a dictionary representation.

The function accepts a list of tuples, where each tuple represents a mapping with the form (left, right, similarity). If the left and right column names are identical, the dictionary key is that column name; otherwise, the key is formed as left~right.

Returns:

A dictionary where keys are column names (or left~right for differing columns) and values are lists of similarity functions associated with those columns.

Return type:

dict
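
The key-building rule described above can be reproduced in a few lines. The sketch below is an illustrative re-implementation based only on this description, not the library's code; the function name similarity_map_to_dict_sketch, the column names, and the similarity labels are placeholders.

```python
def similarity_map_to_dict_sketch(items):
    """Illustrative re-implementation of the documented conversion."""
    result = {}
    for left, right, similarity in items:
        # Identical column names use the name itself; otherwise join with "~".
        key = left if left == right else f"{left}~{right}"
        # Collect every similarity listed for the same column pair.
        result.setdefault(key, []).append(similarity)
    return result


# Placeholder column names and similarity labels, chosen for illustration only.
items = [
    ("name", "name", "jaro_winkler"),
    ("name", "name", "levenshtein"),
    ("year_founded", "founding_year", "discrete"),
]
mapping = similarity_map_to_dict_sketch(items)
```

In this example the two name entries are grouped under the key name, while the differing column pair is stored under year_founded~founding_year.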