preprocess package
add_drop_column module
- app_streamlit.load_data.preprocess.add_drop_column.add_columns(df_target, df_source, key_target, key_source, columns_to_add)
Adds the specified columns from a source DataFrame to a target DataFrame by matching rows on the given keys. The source key column and any source columns not listed in columns_to_add are left out of the result.
Args:
df_target (pd.DataFrame): Target DataFrame where the columns will be added.
df_source (pd.DataFrame): Source DataFrame containing the columns to be added.
key_target (str): Name of the key column in the target DataFrame.
key_source (str): Name of the key column in the source DataFrame.
columns_to_add (list): List of columns to add from the source DataFrame.
Returns: pd.DataFrame: The target DataFrame with the new columns added.
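The behaviour described for add_columns can be sketched with a plain pandas merge. This is a minimal illustration, not the package's actual implementation: the function name add_columns_sketch, the left join, and the sample data are assumptions.

```python
import pandas as pd

def add_columns_sketch(df_target, df_source, key_target, key_source, columns_to_add):
    # Hypothetical re-implementation; a left join is assumed here.
    # Keep only the source key and the requested columns so nothing else leaks in.
    subset = df_source[[key_source] + list(columns_to_add)]
    merged = df_target.merge(subset, how="left", left_on=key_target, right_on=key_source)
    # When the key names differ, the source key appears as an extra column: drop it.
    if key_source != key_target:
        merged = merged.drop(columns=[key_source])
    return merged

# Made-up sample data for illustration only.
recipes = pd.DataFrame({"recipe_id": [1, 2, 3], "name": ["soup", "cake", "salad"]})
details = pd.DataFrame({"id": [1, 2, 3], "minutes": [30, 60, 10], "tags": ["a", "b", "c"]})

result = add_columns_sketch(recipes, details, "recipe_id", "id", ["minutes"])
print(result.columns.tolist())
```

Only minutes is carried over; the source key id and the unrequested tags column do not appear in the result.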
- app_streamlit.load_data.preprocess.add_drop_column.drop_columns(df, columns_to_drop)
Drops one or more columns from a DataFrame.
- Args:
df (DataFrame): input DataFrame.
columns_to_drop (list or string): name(s) of the column(s) to drop.
- Returns:
df: new DataFrame without the dropped columns.
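A minimal sketch of how a function like drop_columns could accept either a single column name or a list of names. The name drop_columns_sketch and the sample data are assumptions, and the real function may handle missing columns differently.

```python
import pandas as pd

def drop_columns_sketch(df, columns_to_drop):
    # Hypothetical re-implementation: accept a single name or a list of names.
    if isinstance(columns_to_drop, str):
        columns_to_drop = [columns_to_drop]
    # DataFrame.drop returns a new DataFrame; the input is left untouched.
    return df.drop(columns=columns_to_drop)

# Made-up sample data for illustration only.
sample = pd.DataFrame({"recipe_id": [1, 2], "name": ["soup", "cake"], "tags": ["a", "b"]})

without_tags = drop_columns_sketch(sample, "tags")
only_id = drop_columns_sketch(sample, ["name", "tags"])
print(without_tags.columns.tolist(), only_id.columns.tolist())
```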
clean_dataframe module
- app_streamlit.load_data.preprocess.clean_dataframe.prepare_final_dataframe(raw_interaction, raw_recipes, pp_recipes)
Prepares the clean DataFrame used for the analysis by calling other functions.
- Args:
raw_interaction (DataFrame): raw DataFrame of user interactions.
raw_recipes (DataFrame): raw DataFrame with recipe information.
pp_recipes (DataFrame): preprocessed recipes DataFrame.
- Returns:
df_merged (DataFrame): final DataFrame.
cleaning_data module
- app_streamlit.load_data.preprocess.cleaning_data.add_season(df)
Adds a season column to the dataset.
- app_streamlit.load_data.preprocess.cleaning_data.date_separated(col_name, dataframe)
Takes a column containing dates as strings in YYYY-MM-DD format and returns the DataFrame with three new columns for the day, month, and year.
- Args:
col_name (string): Name of the column with the date in the DataFrame.
dataframe (pandas.DataFrame): DataFrame containing the date column.
- Returns:
dataframe : DataFrame with additional columns for day, month, and year.
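The date split described above can be sketched with pandas datetime accessors. The function name date_separated_sketch and the output column names day, month, and year are assumptions; the actual function may name the new columns differently.

```python
import pandas as pd

def date_separated_sketch(col_name, dataframe):
    # Hypothetical re-implementation: parse YYYY-MM-DD strings into datetimes.
    dates = pd.to_datetime(dataframe[col_name], format="%Y-%m-%d")
    out = dataframe.copy()
    # Assumed column names for the three new columns.
    out["day"] = dates.dt.day
    out["month"] = dates.dt.month
    out["year"] = dates.dt.year
    return out

# Made-up sample data for illustration only.
interactions = pd.DataFrame({"date": ["2018-03-15", "2019-12-01"]})
split = date_separated_sketch("date", interactions)
print(split)
```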
- app_streamlit.load_data.preprocess.cleaning_data.outliers_df(dataframe, column, treshold_sup=None, treshold_inf=None, get_info=False)
Returns the outliers in a column, based on an upper and/or lower threshold. The result is a list of values by default, or a DataFrame when get_info is True.
- Args:
dataframe (pandas.DataFrame): DataFrame to inspect.
column (string): name of the column.
treshold_sup (int, float, optional): upper threshold; values above it are outliers. Defaults to None.
treshold_inf (int, float, optional): lower threshold; values below it are outliers. Defaults to None.
get_info (bool, optional): If True, returns a DataFrame with all outliers; otherwise returns just a list of outliers. Defaults to False.
- Returns:
outliers: DataFrame or list of outliers based on get_info.
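A minimal sketch of threshold-based outlier selection along the lines described for outliers_df. The name outliers_sketch is hypothetical, and the strict comparisons (greater than, less than) are an assumption; the real function may treat boundary values differently.

```python
import pandas as pd

def outliers_sketch(dataframe, column, treshold_sup=None, treshold_inf=None, get_info=False):
    # Hypothetical re-implementation: start with no row flagged.
    mask = pd.Series(False, index=dataframe.index)
    if treshold_sup is not None:
        # Assumed strict comparison: values above the upper threshold.
        mask |= dataframe[column] > treshold_sup
    if treshold_inf is not None:
        # Assumed strict comparison: values below the lower threshold.
        mask |= dataframe[column] < treshold_inf
    flagged = dataframe[mask]
    # Full rows when get_info is True, otherwise just the outlier values.
    return flagged if get_info else flagged[column].tolist()

# Made-up sample data for illustration only.
recipes_df = pd.DataFrame({"recipe_id": [1, 2, 3, 4], "minutes": [5, 30, 45, 10000]})

values = outliers_sketch(recipes_df, "minutes", treshold_sup=1000, treshold_inf=10)
rows = outliers_sketch(recipes_df, "minutes", treshold_sup=1000, get_info=True)
print(values)
print(rows)
```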
- app_streamlit.load_data.preprocess.cleaning_data.remove_outliers_iqr(df, column)
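No description is given for remove_outliers_iqr. Judging by its name alone, it presumably filters rows with an interquartile-range rule. The sketch below shows the common 1.5 × IQR rule as an illustration; the multiplier k, the inclusive bounds, and the function name are assumptions and may not match the actual implementation.

```python
import pandas as pd

def remove_outliers_iqr_sketch(df, column, k=1.5):
    # Hypothetical illustration of the usual IQR rule; k=1.5 is an assumption.
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep only the rows whose value lies within [lower, upper].
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Made-up sample data for illustration only.
times = pd.DataFrame({"minutes": [10, 12, 14, 16, 18, 500]})
filtered = remove_outliers_iqr_sketch(times, "minutes")
print(filtered["minutes"].tolist())
```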
merging module
- app_streamlit.load_data.preprocess.merging.dataframe_concat(df, key, join='left')
Function that merges two DataFrames on one or more columns (left join by default).
- Args:
df (list): list of the two DataFrames to merge.
key (list): name(s) of the column(s) to join on.
join (string): type of join (left, right, outer, inner). Defaults to 'left'.
- Returns:
df_merged: new DataFrame merged on one or more columns with the specified join.
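The merge described for dataframe_concat maps naturally onto pandas.merge. The sketch below is an assumption about how the list of two DataFrames could be unpacked and joined; the name dataframe_concat_sketch and the sample data are hypothetical.

```python
import pandas as pd

def dataframe_concat_sketch(df, key, join="left"):
    # Hypothetical re-implementation: df is a list holding exactly two DataFrames.
    left, right = df
    return pd.merge(left, right, on=key, how=join)

# Made-up sample data for illustration only.
ratings = pd.DataFrame({"recipe_id": [1, 1, 2, 4], "rating": [5, 4, 3, 2]})
names = pd.DataFrame({"recipe_id": [1, 2, 3], "name": ["soup", "cake", "salad"]})

left_joined = dataframe_concat_sketch([ratings, names], ["recipe_id"])
inner_joined = dataframe_concat_sketch([ratings, names], ["recipe_id"], join="inner")
print(len(left_joined), len(inner_joined))
```

With the default left join, every row of the first DataFrame is kept and unmatched rows get missing values in the added columns; an inner join keeps only the rows whose key is present in both.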
normalisation module
- app_streamlit.load_data.preprocess.normalisation.normalisation(df, column_name)
Normalizes a numeric column in the DataFrame using MinMaxScaler and adds a new column with the normalized values.
- Args:
df (pd.DataFrame): The DataFrame containing the column to normalize.
column_name (str): The name of the column to normalize.
- Returns:
df (pd.DataFrame): DataFrame with an additional column containing the normalized values.
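With its default feature range of 0 to 1, MinMaxScaler rescales each value as (x - min) / (max - min). The sketch below reproduces that arithmetic with pandas alone so that it runs without scikit-learn; the function name and the "_normalized" suffix for the new column are assumptions.

```python
import pandas as pd

def normalisation_sketch(df, column_name):
    # Hypothetical re-implementation of min-max scaling to the range [0, 1].
    col = df[column_name]
    col_min, col_max = col.min(), col.max()
    out = df.copy()
    # Assumed name for the new column. A constant column would yield NaN here;
    # that case is not handled in this sketch.
    out[column_name + "_normalized"] = (col - col_min) / (col_max - col_min)
    return out

# Made-up sample data for illustration only.
durations = pd.DataFrame({"minutes": [10, 20, 30]})
scaled = normalisation_sketch(durations, "minutes")
print(scaled)
```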
df_aggregate module
- app_streamlit.load_data.preprocess.df_aggregate.df_aggregate(df)
Aggregates data to have one row per recipe_id, with the original columns (excluding ‘user_id’) plus:
- num_comments: Number of unique users who commented on the recipe.
- avg_reviews_per_user: Total number of reviews for the recipe.
- Args:
df (pd.DataFrame): DataFrame containing recipe data, including ‘recipe_id’, ‘user_id’, and ‘review’.
- Returns:
pd.DataFrame: Aggregated DataFrame with one row per recipe_id, original columns (excluding ‘user_id’), and additional metrics.
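The aggregation can be sketched with a groupby followed by a merge. This is an illustration under stated assumptions, not the package's implementation: the function name is hypothetical, the metrics follow the descriptions given (unique users, and a count of reviews), and for the remaining columns the first row of each recipe is kept.

```python
import pandas as pd

def df_aggregate_sketch(df):
    # Hypothetical re-implementation: per-recipe metrics as described in the docstring.
    metrics = (
        df.groupby("recipe_id")
        .agg(
            num_comments=("user_id", "nunique"),      # unique commenting users
            avg_reviews_per_user=("review", "count"),  # number of non-null reviews
        )
        .reset_index()
    )
    # Assumption: keep the first row per recipe for the other original columns.
    base = df.drop(columns=["user_id"]).drop_duplicates(subset="recipe_id")
    return base.merge(metrics, on="recipe_id", how="left")

# Made-up sample data for illustration only.
reviews = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "user_id": [10, 11, 10],
    "review": ["great", "ok", "too sweet"],
    "name": ["soup", "soup", "cake"],
})
aggregated = df_aggregate_sketch(reviews)
print(aggregated)
```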