preprocess package

add_drop_column module

app_streamlit.load_data.preprocess.add_drop_column.add_columns(df_target, df_source, key_target, key_source, columns_to_add)

Adds specified columns from a source DataFrame to a target DataFrame using specified keys, without including the key column from the source DataFrame in the final DataFrame and avoiding unwanted columns.

Args:

df_target (pd.DataFrame): Target DataFrame where the columns will be added.
df_source (pd.DataFrame): Source DataFrame containing the columns to be added.
key_target (str): Name of the key column in the target DataFrame.
key_source (str): Name of the key column in the source DataFrame.
columns_to_add (list): List of columns to add from the source DataFrame.

Returns: pd.DataFrame: The target DataFrame with the new columns added.
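
The behaviour described for add_columns can be sketched with a plain pandas merge. This is a minimal illustration, not the package's actual implementation: the function name add_columns_sketch, the left join, and the sample data are assumptions.

```python
import pandas as pd


def add_columns_sketch(df_target, df_source, key_target, key_source, columns_to_add):
    """Hypothetical re-implementation of the documented behaviour."""
    # Keep only the source key and the requested columns so no unwanted column leaks in.
    subset = df_source[[key_source] + list(columns_to_add)]
    # Assumption: a left join, so every row of the target is preserved.
    merged = df_target.merge(subset, how="left", left_on=key_target, right_on=key_source)
    # Drop the source key when it is a separate column from the target key.
    if key_source != key_target:
        merged = merged.drop(columns=[key_source])
    return merged


recipes = pd.DataFrame({"recipe_id": [1, 2], "name": ["soup", "cake"]})
details = pd.DataFrame({"id": [1, 2], "minutes": [30, 60], "notes": ["a", "b"]})
result = add_columns_sketch(recipes, details, "recipe_id", "id", ["minutes"])
```

Here result keeps the target's columns and gains only minutes; neither the source key id nor the unrequested notes column appears.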

app_streamlit.load_data.preprocess.add_drop_column.drop_columns(df, columns_to_drop)

Drops one or more columns from a DataFrame.

Args:

df (pd.DataFrame): DataFrame to drop the columns from.
columns_to_drop (list or str): Name(s) of the column(s) to drop.

Returns:

df (pd.DataFrame): New DataFrame without the dropped columns.
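
A possible shape for such a helper, accepting either a single name or a list, is shown below. The name drop_columns_sketch and the sample data are assumptions for illustration only.

```python
import pandas as pd


def drop_columns_sketch(df, columns_to_drop):
    """Hypothetical helper: drop one column name or a list of names."""
    # Normalise a single string to a one-element list.
    if isinstance(columns_to_drop, str):
        columns_to_drop = [columns_to_drop]
    # DataFrame.drop returns a new DataFrame; the input is left untouched.
    return df.drop(columns=columns_to_drop)


raw = pd.DataFrame({"recipe_id": [1], "name": ["soup"], "notes": ["x"]})
single = drop_columns_sketch(raw, "notes")
several = drop_columns_sketch(raw, ["name", "notes"])
```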

clean_dataframe module

app_streamlit.load_data.preprocess.clean_dataframe.prepare_final_dataframe(raw_interaction, raw_recipes, pp_recipes)

Prepares the clean DataFrame used for the analysis by combining the other preprocessing functions.

Args:

raw_interaction (DataFrame): Raw DataFrame of user interactions.
raw_recipes (DataFrame): Raw DataFrame with recipe information.
pp_recipes (DataFrame): Preprocessed recipes DataFrame.

Returns:

df_merged (DataFrame): Final DataFrame.

cleaning_data module

app_streamlit.load_data.preprocess.cleaning_data.add_season(df)

Adds a season column to the dataset.

app_streamlit.load_data.preprocess.cleaning_data.date_separated(col_name, dataframe)

This function takes a column with a date in the string format YYYY-MM-DD and returns the dataframe with 3 new columns for the day, month, and year.

Args:

col_name (string): Name of the column with the date in the dataframe.
dataframe (pandas.DataFrame): DataFrame containing the date column.

Returns:

dataframe (pandas.DataFrame): DataFrame with additional columns for day, month, and year.
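
One way to perform this split is with pandas datetime accessors. In the sketch below, the function name date_separated_sketch and the output column names day, month, and year are assumptions; the source does not state how the new columns are named.

```python
import pandas as pd


def date_separated_sketch(col_name, dataframe):
    """Hypothetical helper: split a YYYY-MM-DD string column into day, month, year."""
    # Parse the strings with an explicit format so malformed dates raise early.
    parsed = pd.to_datetime(dataframe[col_name], format="%Y-%m-%d")
    out = dataframe.copy()
    out["day"] = parsed.dt.day
    out["month"] = parsed.dt.month
    out["year"] = parsed.dt.year
    return out


reviews = pd.DataFrame({"date": ["2018-03-15", "2019-12-01"]})
result = date_separated_sketch("date", reviews)
```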

app_streamlit.load_data.preprocess.cleaning_data.outliers_df(dataframe, column, treshold_sup=None, treshold_inf=None, get_info=False)

Returns the outliers of a column, selected by an upper and/or lower threshold.

Args:

dataframe (pandas.DataFrame): DataFrame to search.
column (string): Name of the column.
treshold_sup (int, float, optional): Upper threshold; values above it are outliers. Defaults to None.
treshold_inf (int, float, optional): Lower threshold; values below it are outliers. Defaults to None.
get_info (bool, optional): If True, returns a DataFrame with all outlier rows; otherwise returns only a list of the outlier values. Defaults to False.

Returns:

outliers: DataFrame or list of outliers, depending on get_info.
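
The threshold logic can be illustrated with a boolean mask. This is a sketch under assumptions: the name outliers_sketch is hypothetical, and strict comparisons (> and <) are assumed because the source does not say whether the thresholds are inclusive.

```python
import pandas as pd


def outliers_sketch(dataframe, column, treshold_sup=None, treshold_inf=None, get_info=False):
    """Hypothetical helper mirroring the documented signature."""
    # Start with no row flagged, then flag rows beyond each threshold that was given.
    mask = pd.Series(False, index=dataframe.index)
    if treshold_sup is not None:
        mask |= dataframe[column] > treshold_sup  # assumption: strict comparison
    if treshold_inf is not None:
        mask |= dataframe[column] < treshold_inf  # assumption: strict comparison
    outliers = dataframe[mask]
    # Full rows when get_info is True, otherwise just the outlier values.
    return outliers if get_info else outliers[column].tolist()


recipes = pd.DataFrame({"recipe_id": [1, 2, 3, 4], "minutes": [30, 45, 100000, -5]})
values = outliers_sketch(recipes, "minutes", treshold_sup=1440, treshold_inf=0)
rows = outliers_sketch(recipes, "minutes", treshold_sup=1440, get_info=True)
```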

app_streamlit.load_data.preprocess.cleaning_data.remove_outliers_iqr(df, column)
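
The source gives no description for remove_outliers_iqr. Judging by the name alone, it presumably filters rows with the interquartile range (IQR) rule. The sketch below shows the conventional form of that rule; the function name, the 1.5 multiplier, and the inclusive bounds are all assumptions, not the package's confirmed behaviour.

```python
import pandas as pd


def remove_outliers_iqr_sketch(df, column, k=1.5):
    """Hypothetical helper: keep rows within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Series.between is inclusive on both ends by default.
    return df[df[column].between(lower, upper)]


data = pd.DataFrame({"minutes": [10, 20, 30, 40, 1000]})
cleaned = remove_outliers_iqr_sketch(data, "minutes")
```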

merging module

app_streamlit.load_data.preprocess.merging.dataframe_concat(df, key, join='left')

Merges two DataFrames on one or more columns (left join by default).

Args:

df (list): List with the 2 DataFrames to merge.
key (list): Name(s) of the column(s) to join on.
join (string): Type of join (left, right, outer, inner).

Returns:

df_merged: New DataFrame merged on one or more columns with the specified join.
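
Despite the name, the description is that of a key-based merge and not a row-wise concatenation. A minimal sketch, assuming it wraps pandas.merge (the name dataframe_concat_sketch and the sample data are hypothetical):

```python
import pandas as pd


def dataframe_concat_sketch(df, key, join="left"):
    """Hypothetical helper: merge the two DataFrames in `df` on `key`."""
    left, right = df  # the list is expected to hold exactly two DataFrames
    return pd.merge(left, right, on=key, how=join)


interactions = pd.DataFrame({"recipe_id": [1, 2, 3], "rating": [5, 4, 3]})
recipes = pd.DataFrame({"recipe_id": [1, 2], "name": ["soup", "cake"]})
left_joined = dataframe_concat_sketch([interactions, recipes], ["recipe_id"])
inner_joined = dataframe_concat_sketch([interactions, recipes], ["recipe_id"], join="inner")
```

With the default left join, the interaction for recipe 3 is kept with a missing name; with an inner join that row is dropped.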

normalisation module

app_streamlit.load_data.preprocess.normalisation.normalisation(df, column_name)

Normalizes a numeric column in the DataFrame using MinMaxScaler and adds a new column with the normalized values.

Args:

df (pd.DataFrame): The DataFrame containing the column to normalize.
column_name (str): The name of the column to normalize.

Returns:

df (pd.DataFrame): DataFrame with an additional column containing the normalized values.
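
With its default feature range of (0, 1), MinMaxScaler applies the transformation (x - min) / (max - min). The sketch below writes that formula out in pandas without calling scikit-learn, to keep the example dependency-light. The function name and the "_normalized" suffix for the new column are assumptions; the source does not state the new column's name.

```python
import pandas as pd


def normalisation_sketch(df, column_name):
    """Hypothetical helper: min-max scale one column into [0, 1]."""
    col = df[column_name].astype(float)
    span = col.max() - col.min()
    out = df.copy()
    # Same formula MinMaxScaler uses by default; a constant column (span == 0)
    # would yield NaN here, whereas MinMaxScaler handles that case itself.
    out[f"{column_name}_normalized"] = (col - col.min()) / span
    return out


recipes = pd.DataFrame({"minutes": [10, 20, 30]})
scaled = normalisation_sketch(recipes, "minutes")
```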

df_aggregate module

app_streamlit.load_data.preprocess.df_aggregate.df_aggregate(df)

Aggregates data to have one row per recipe_id, with the original columns (excluding ‘user_id’) plus:

- num_comments: Number of unique users who commented on the recipe.
- avg_reviews_per_user: Total number of reviews for the recipe.

Args:

df (pd.DataFrame): DataFrame containing recipe data, including ‘recipe_id’, ‘user_id’, and ‘review’.

Returns:

pd.DataFrame: Aggregated DataFrame with one row per recipe_id, the original columns (excluding ‘user_id’), and the additional metrics.
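
The aggregation can be sketched with a pandas groupby. The sketch follows the metric definitions as the text states them, so avg_reviews_per_user is computed as a total review count even though its name suggests an average. The function name, keeping the first row per recipe for the original columns, and the sample data are assumptions.

```python
import pandas as pd


def df_aggregate_sketch(df):
    """Hypothetical helper: one row per recipe_id plus two metrics."""
    metrics = (
        df.groupby("recipe_id")
        .agg(
            num_comments=("user_id", "nunique"),       # unique commenting users
            avg_reviews_per_user=("review", "count"),  # total reviews, as documented
        )
        .reset_index()
    )
    # Assumption: the first row per recipe supplies the original column values.
    originals = df.drop(columns=["user_id"]).drop_duplicates(subset="recipe_id")
    return originals.merge(metrics, on="recipe_id", how="left")


reviews = pd.DataFrame(
    {
        "recipe_id": [1, 1, 2],
        "user_id": [10, 11, 10],
        "review": ["good", "ok", "bad"],
        "name": ["soup", "soup", "cake"],
    }
)
aggregated = df_aggregate_sketch(reviews)
```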