utils.py API

Utility functions for the data visualization application.

Provides data analysis, data cleaning, interaction with the Gemini and Claude models, and plot generation.

utils.analyze_data(df: DataFrame) dict[source]

Analyzes the DataFrame for data quality issues (missing values, duplicates, data types).

Parameters:

df – The input pandas DataFrame.

Returns:

A dictionary containing analysis results.
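The exact shape of the returned dictionary is not documented here. As a rough sketch only, a check covering the three issues named above (missing values, duplicates, data types) could look like the following; `analyze_data_sketch` and its key names are hypothetical and not taken from utils.py.

```python
import pandas as pd


def analyze_data_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for utils.analyze_data; the key names are assumed."""
    return {
        # Number of missing values in each column.
        "missing_values": {col: int(n) for col, n in df.isna().sum().items()},
        # Number of rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Column dtypes rendered as strings so the result stays JSON-friendly.
        "dtypes": {col: str(dtype) for col, dtype in df.dtypes.items()},
    }


df = pd.DataFrame({"a": [1.0, None, 1.0], "b": ["x", "y", "x"]})
report = analyze_data_sketch(df)
```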

utils.apply_fixes_to_data(df: DataFrame) tuple[DataFrame, str][source]

Applies basic data cleaning (removes duplicates, fills numerical NaNs with mean).

Parameters:

df – The input pandas DataFrame.

Returns:

A tuple (cleaned DataFrame, JSON string summary of fixes).
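The two cleaning steps described above can be sketched as follows. This is an illustration under assumptions, not the module's code: `apply_fixes_sketch`, the summary keys, and the choice to compute each mean after duplicates are removed are all hypothetical.

```python
import json

import pandas as pd


def apply_fixes_sketch(df: pd.DataFrame) -> tuple[pd.DataFrame, str]:
    """Hypothetical stand-in for utils.apply_fixes_to_data."""
    # Step 1: drop rows that exactly repeat an earlier row.
    cleaned = df.drop_duplicates().copy()
    duplicates_removed = len(df) - len(cleaned)

    # Step 2: fill NaNs in numeric columns with that column's mean.
    filled = {}
    for col in cleaned.select_dtypes(include="number").columns:
        n_missing = int(cleaned[col].isna().sum())
        if n_missing:
            cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
            filled[col] = n_missing

    # Summary keys are assumed; the real JSON layout is not documented.
    summary = json.dumps(
        {"duplicates_removed": duplicates_removed, "nans_filled": filled}
    )
    return cleaned, summary


df = pd.DataFrame({"x": [1.0, 1.0, None, 3.0], "label": ["a", "a", "b", "c"]})
cleaned, summary = apply_fixes_sketch(df)
```

Non-numeric columns are left untouched in this sketch, matching the description's restriction to numerical NaNs.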

utils.create_dataset_summary(df: DataFrame) str[source]

Generates a textual summary of the dataset.

Parameters:

df – The input pandas DataFrame.

Returns:

A string containing the dataset summary.

utils.exec_code_to_generate_plot(code_str, df)[source]

Executes Python code (provided as a string) to generate a plot.

This function takes code generated by an LLM and tries to run it. It includes workarounds for common errors in LLM-generated plotting code. The generated plot is returned as a base64-encoded PNG image.

Parameters:
  • code_str (str) – The Python code to execute (base64 encoded).

  • df (pandas.DataFrame) – The DataFrame to use for plotting.

Returns:

A base64-encoded string (str) representing the generated plot image, or None if an error occurred during code execution.
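A minimal sketch of the execute-then-encode flow, assuming matplotlib is the plotting library and that the code arrives as plain Python source (if it arrives base64 encoded, it would need decoding first). `exec_plot_sketch` is a hypothetical name, and the workarounds for common LLM errors mentioned above are not reproduced.

```python
import base64
import io

import matplotlib

matplotlib.use("Agg")  # headless backend, so no display is required
import matplotlib.pyplot as plt
import pandas as pd


def exec_plot_sketch(code_str, df):
    """Hypothetical stand-in for utils.exec_code_to_generate_plot."""
    # Names made available to the executed code (an assumption).
    namespace = {"df": df, "pd": pd, "plt": plt}
    try:
        exec(code_str, namespace)
        # Render the current figure into an in-memory PNG.
        buffer = io.BytesIO()
        plt.savefig(buffer, format="png")
        return base64.b64encode(buffer.getvalue()).decode("ascii")
    except Exception:
        # Any failure in the generated code yields None.
        return None
    finally:
        plt.close("all")


df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 8]})
png_b64 = exec_plot_sketch("plt.plot(df['a'], df['b'])", df)
failed = exec_plot_sketch("raise ValueError('boom')", df)
```

Because `exec` runs arbitrary code, a sketch like this is only appropriate for code from a trusted source or inside a sandbox.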

utils.filter_data_by_top_variables(df: DataFrame, column_name: str, top_n_variables: list) DataFrame[source]

Filters the DataFrame to keep only the rows whose value in the given column is one of the top N values.

Parameters:
  • df – The input pandas DataFrame.

  • column_name – The name of the column to filter on.

  • top_n_variables – The list of top values to keep.

Returns:

The filtered DataFrame.
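One plausible way to express this filter in pandas is a membership test with `Series.isin`. The sketch below assumes that reading of the description; `filter_by_values_sketch` is a hypothetical name, not the function in utils.py.

```python
import pandas as pd


def filter_by_values_sketch(df, column_name, top_n_variables):
    """Hypothetical stand-in: keep rows whose column value is in the list."""
    return df[df[column_name].isin(top_n_variables)]


df = pd.DataFrame(
    {"city": ["Oslo", "Rome", "Oslo", "Lima"], "sales": [1, 2, 3, 4]}
)
filtered = filter_by_values_sketch(df, "city", ["Oslo", "Lima"])
```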

utils.fix_json(json_str: str) str[source]

Attempts to fix common JSON formatting errors (trailing commas).

Parameters:

json_str – The potentially malformed JSON string.

Returns:

The fixed JSON string, or the original if no fixes were applied.
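As an illustration of the trailing-comma case, a regular expression can drop any comma that is followed only by whitespace and a closing brace or bracket. `fix_json_sketch` is a hypothetical name, and this approach is an assumption about how such a fix might work, not the module's implementation.

```python
import json
import re


def fix_json_sketch(json_str: str) -> str:
    """Hypothetical stand-in: remove commas that directly precede } or ]."""
    return re.sub(r",\s*([}\]])", r"\1", json_str)


broken = '{"plots": ["bar", "line",], "count": 2,}'
fixed = fix_json_sketch(broken)
parsed = json.loads(fixed)
unchanged = fix_json_sketch('{"a": 1}')
```

A pattern this simple would also rewrite a matching sequence inside a string literal, so it suits machine-generated JSON better than arbitrary text.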

utils.generate_graph_interpretation_claude(suggestion_text: str, dataset_summary: str, api_key: str) str[source]

Generates a graph interpretation using the Claude language model.

utils.generate_graph_interpretation_gemini(suggestion_text: str, dataset_summary: str, api_key: str) str[source]

Generates a graph interpretation using the Gemini language model.

utils.get_plot_suggestion_from_claude(df: DataFrame, api_key: str) list | None[source]

Gets plot suggestions and Python code from Claude, with retries.

utils.get_plot_suggestion_from_gemini(df: DataFrame, api_key: str) list | None[source]

Gets plot suggestions and Python code from Gemini, with retries.
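The retry behaviour of the two suggestion functions is not specified further. A generic sketch of the pattern, with a hypothetical helper `call_with_retries` and assumed defaults for the attempt count and delay, might look like this; `flaky_request` stands in for a model API call.

```python
import time


def call_with_retries(request_fn, max_attempts=3, delay_seconds=1.0):
    """Hypothetical helper: call request_fn until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    # Mirrors the documented `list | None` return type.
    return None


attempts = []


def flaky_request():
    """Simulated API call that fails twice before succeeding."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return [{"suggestion": "bar chart", "code": "df.plot.bar()"}]


suggestions = call_with_retries(flaky_request, max_attempts=3, delay_seconds=0)
```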

utils.get_top_n_variables(df: DataFrame, column_name: str, n: int = 10) list[source]

Gets the top N most frequent values in a column.

Parameters:
  • df – The input DataFrame.

  • column_name – The name of the column.

  • n – The number of top values to retrieve.

Returns:

A list of the top N most frequent values.
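In pandas, the most frequent values of a column are commonly obtained with `value_counts`, which sorts by descending count. The sketch below assumes that approach; `top_n_sketch` is a hypothetical name.

```python
import pandas as pd


def top_n_sketch(df, column_name, n=10):
    """Hypothetical stand-in: the n most frequent values, most frequent first."""
    return df[column_name].value_counts().head(n).index.tolist()


df = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo", "Lima", "Oslo", "Rome"]})
top_two = top_n_sketch(df, "city", n=2)
```

The resulting list has the shape that `filter_data_by_top_variables` accepts as its `top_n_variables` argument.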

utils.handle_graph_communication_claude(image_data: bytes, dataset_str: str, user_message: str, api_key: str) str[source]

Handles user-model communication about a graph image using Claude.

utils.handle_graph_communication_gemini(image_data: bytes, dataset_str: str, user_message: str, api_key: str) str[source]

Handles user-model communication about a graph image using Gemini.