utils.py API
Utility functions for the data visualization application.
Provides data analysis, cleaning, interaction with Gemini and Claude, and plot generation functionalities.
- utils.analyze_data(df: DataFrame) → dict
Analyzes the DataFrame for data quality issues (missing values, duplicates, data types).
- Parameters:
df – The input pandas DataFrame.
- Returns:
A dictionary containing analysis results.
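The three checks named above can be sketched with plain pandas calls. This is a minimal illustration, not the module's actual implementation: the function name `analyze_data_sketch` and the dictionary keys are assumptions, since the reference does not list the keys of the returned dictionary.

```python
import pandas as pd


def analyze_data_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for utils.analyze_data; key names are assumed."""
    return {
        # Number of missing cells per column
        "missing_values": {col: int(n) for col, n in df.isna().sum().items()},
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Column dtypes as plain strings so the result is JSON-friendly
        "data_types": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }


df = pd.DataFrame({"a": [1.0, None, 1.0], "b": ["x", "y", "x"]})
report = analyze_data_sketch(df)
print(report)
```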
- utils.apply_fixes_to_data(df: DataFrame) → tuple[DataFrame, str]
Applies basic data cleaning (removes duplicates, fills numerical NaNs with mean).
- Parameters:
df – The input pandas DataFrame.
- Returns:
A tuple: (cleaned DataFrame, JSON string summary of fixes).
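A rough sketch of the two fixes described above, assuming duplicates are dropped before the means are computed (the reference does not state the order). The name `apply_fixes_sketch` and the keys of the JSON summary are hypothetical.

```python
import json

import pandas as pd


def apply_fixes_sketch(df: pd.DataFrame) -> tuple[pd.DataFrame, str]:
    """Hypothetical stand-in for utils.apply_fixes_to_data."""
    # Step 1: remove exact duplicate rows (assumed to happen first)
    cleaned = df.drop_duplicates().copy()
    duplicates_removed = len(df) - len(cleaned)

    # Step 2: fill NaNs in numeric columns with that column's mean
    filled = {}
    for col in cleaned.select_dtypes(include="number").columns:
        n_missing = int(cleaned[col].isna().sum())
        if n_missing:
            cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
            filled[col] = n_missing

    # Summary keys are illustrative, not taken from the real module
    summary = json.dumps(
        {"duplicates_removed": duplicates_removed, "nans_filled_with_mean": filled}
    )
    return cleaned, summary


df = pd.DataFrame({"x": [1.0, 1.0, None, 3.0], "y": ["a", "a", "b", "c"]})
cleaned, summary = apply_fixes_sketch(df)
print(cleaned)
print(summary)
```

Non-numeric columns are left untouched here, which matches the "fills numerical NaNs" wording.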
- utils.create_dataset_summary(df: DataFrame) → str
Generates a textual summary of the dataset.
- Parameters:
df – The input pandas DataFrame.
- Returns:
A string containing the dataset summary.
- utils.exec_code_to_generate_plot(code_str, df)
Executes Python code (provided as a string) to generate a plot.
This function takes code generated by an LLM and tries to run it. It includes workarounds for common errors in LLM-generated plotting code. The generated plot is returned as a base64-encoded PNG image.
- Parameters:
code_str – str, the Python code to execute (base64 encoded).
df – pandas.DataFrame, the DataFrame to use for plotting.
- Returns:
str, a base64-encoded string representing the generated plot image, or None if an error occurred during code execution.
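The execute-then-encode flow can be sketched as below. This is a simplified illustration under stated assumptions: `exec_plot_sketch` is a hypothetical name, it takes plain Python source rather than base64-encoded code, it exposes only `df`, `pd`, and `plt` to the executed code, and it omits the workarounds for common LLM errors that the real function is said to include.

```python
import base64
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, no display required
import matplotlib.pyplot as plt
import pandas as pd


def exec_plot_sketch(code_str: str, df: pd.DataFrame):
    """Hypothetical stand-in: run plotting code, return a base64 PNG or None."""
    # Names made available to the executed code (an assumption)
    namespace = {"df": df, "pd": pd, "plt": plt}
    try:
        # Caution: exec runs arbitrary code; real use needs sandboxing
        exec(code_str, namespace)
        buffer = io.BytesIO()
        plt.savefig(buffer, format="png")
        return base64.b64encode(buffer.getvalue()).decode("ascii")
    except Exception:
        # Any failure in the generated code yields None
        return None
    finally:
        plt.close("all")  # avoid leaking figures between calls


df = pd.DataFrame({"x": [1, 2, 3], "y": [2, 4, 8]})
image_b64 = exec_plot_sketch("plt.plot(df['x'], df['y'])", df)
failed = exec_plot_sketch("raise ValueError('bad code')", df)
```

The returned string can be embedded directly in an HTML `img` tag as a `data:image/png;base64,` URI.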
- utils.filter_data_by_top_variables(df: DataFrame, column_name: str, top_n_variables: list) → DataFrame
Filters the DataFrame to include only rows with top N values in a column.
- Parameters:
df – The input DataFrame.
column_name – The name of the column to filter on.
top_n_variables – The list of top values to keep.
- Returns:
The filtered DataFrame.
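One plausible way to express this filter is a boolean mask built with `Series.isin`. The function name `filter_by_values_sketch` is hypothetical, and the real implementation may differ (for example in how it treats the index).

```python
import pandas as pd


def filter_by_values_sketch(
    df: pd.DataFrame, column_name: str, top_n_variables: list
) -> pd.DataFrame:
    """Hypothetical stand-in: keep rows whose column value is in the list."""
    mask = df[column_name].isin(top_n_variables)
    return df[mask]


df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Oslo", "Lima"], "sales": [1, 2, 3, 4]}
)
filtered = filter_by_values_sketch(df, "city", ["Oslo", "Lima"])
print(filtered)
```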
- utils.fix_json(json_str: str) → str
Attempts to fix common JSON formatting errors (trailing commas).
- Parameters:
json_str – The potentially malformed JSON string.
- Returns:
The fixed JSON string, or the original if no fixes were applied.
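A trailing-comma repair can be sketched with a single regular expression. This is an assumption about the approach, not the module's code: `fix_json_sketch` is a hypothetical name, and the naive pattern would also alter a comma followed by a closing bracket inside a string value.

```python
import json
import re


def fix_json_sketch(json_str: str) -> str:
    """Hypothetical stand-in: strip commas that directly precede } or ]."""
    # ",\s*" matches the stray comma plus whitespace; the bracket is kept
    return re.sub(r",\s*([}\]])", r"\1", json_str)


broken = '{"plots": ["bar", "line",], "n": 2,}'
fixed = fix_json_sketch(broken)
data = json.loads(fixed)  # parses once the trailing commas are gone
print(fixed)
```

Input without trailing commas passes through unchanged, which is consistent with the "or the original if no fixes were applied" behavior described above.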
- utils.generate_graph_interpretation_claude(suggestion_text: str, dataset_summary: str, api_key: str) → str
Generates a graph interpretation using the Claude language model.
- utils.generate_graph_interpretation_gemini(suggestion_text: str, dataset_summary: str, api_key: str) → str
Generates a graph interpretation using the Gemini language model.
- utils.get_plot_suggestion_from_claude(df: DataFrame, api_key: str) → list | None
Gets plot suggestions and Python code from Claude, with retries.
- utils.get_plot_suggestion_from_gemini(df: DataFrame, api_key: str) → list | None
Gets plot suggestions and Python code from Gemini, with retries.
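The "with retries" behavior and the `list | None` return type suggest a loop of roughly the following shape. This sketch makes no real API calls: `call_with_retries`, `flaky_request`, the attempt count, and the delay are all illustrative assumptions, and the model request is represented by an arbitrary callable.

```python
import time
from typing import Callable, Optional


def call_with_retries(
    request: Callable[[], list],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[list]:
    """Hypothetical helper: try a request several times, None if all fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except Exception:
            # Wait before the next attempt, but not after the last one
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    return None


calls = {"count": 0}


def flaky_request() -> list:
    """Stand-in for a model call that fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return [{"suggestion": "bar chart of sales by city"}]


def always_fails() -> list:
    raise RuntimeError("permanent failure")


result = call_with_retries(flaky_request)
exhausted = call_with_retries(always_fails)
```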
- utils.get_top_n_variables(df: DataFrame, column_name: str, n: int = 10) → list
Gets the top N most frequent values in a column.
- Parameters:
df – The input DataFrame.
column_name – The name of the column.
n – The number of top values to retrieve.
- Returns:
A list of the top N most frequent values.
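Frequency ranking of this kind is commonly done with `Series.value_counts`, which sorts by count in descending order. The sketch below assumes that approach; `top_n_values_sketch` is a hypothetical name, and the order of tied values is not guaranteed.

```python
import pandas as pd


def top_n_values_sketch(df: pd.DataFrame, column_name: str, n: int = 10) -> list:
    """Hypothetical stand-in: the n most frequent values in a column."""
    # value_counts sorts by frequency (descending); head(n) keeps the top n
    return df[column_name].value_counts().head(n).index.tolist()


df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo", "Lima", "Oslo", "Rome"]})
top_two = top_n_values_sketch(df, "city", n=2)
print(top_two)
```

A list produced this way has the shape expected by the `top_n_variables` argument of `filter_data_by_top_variables`.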