Exploratory Data Analysis with Claude
In this project, we’ll discuss how you can enhance your exploratory data analysis (EDA) experience by integrating a Large Language Model (such as Claude) with a Pandas DataFrame via LangChain’s agent toolkit. Let’s dive in!

Getting Started: Installing Necessary Libraries
To kick things off, we need to set up our environment by downloading and installing some essential libraries and packages. Here’s what you’ll need:
- LangChain
- Anthropic
- Matplotlib (for plotting graphs)
- Pandas
- openpyxl
- tabulate
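Assuming a pip-based environment, everything on the list above can be installed in one command (the exact package names for the LangChain pieces are assumptions based on current PyPI naming):

```shell
pip install langchain langchain-experimental langchain-anthropic anthropic matplotlib pandas openpyxl tabulate
```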

Once you have these libraries installed, we can proceed to the next step: setting up our model and agent.
Importing Libraries and Setting Up the Environment
Start by importing the necessary libraries to perform our operations. You’ll want to import the following:
from langchain_anthropic import ChatAnthropic
from langchain_experimental.agents import create_pandas_dataframe_agent
import pandas as pd
Next, you’ll need to set up your Anthropic API key. If you’re using an IDE, create a .env file to store your API key securely.
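ChatAnthropic reads the key from the ANTHROPIC_API_KEY environment variable, so one way to wire it up looks like this (the placeholder key and the python-dotenv suggestion are illustrative assumptions, not part of the original):

```python
import os

# ChatAnthropic picks up the key from the ANTHROPIC_API_KEY environment
# variable. If you keep the key in a .env file, load it first, e.g. with
# python-dotenv:  from dotenv import load_dotenv; load_dotenv()
os.environ.setdefault("ANTHROPIC_API_KEY", "sk-ant-your-key-here")  # placeholder value
```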
Using the Dataset
For our analysis, we’ll use the Data Science Salaries 2023 dataset, which contains 11 columns, each with a detailed description. Upload the dataset to your Jupyter Notebook environment as ds_salaries.csv.
To load the dataset into a DataFrame, provide the path to your CSV file:
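With pandas imported, this is a single call (the path assumes the CSV sits next to the notebook; adjust it if the file lives elsewhere):

```python
import pandas as pd

# Read the uploaded CSV into a DataFrame.
df = pd.read_csv("ds_salaries.csv")
```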
To verify that the dataset has been loaded correctly, run:
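For example:

```python
df.head()  # shows the first five rows of the DataFrame
```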
This command will display the first five rows of your dataset, confirming a successful load.
Initializing the Agent
Now, it’s time to initialize our agent. Set the temperature parameter to zero to ensure that the responses are straightforward and not overly creative. We’ll be using the Claude 3.5 Sonnet model for our analysis.
Here’s how to initialize the agent:
llm = ChatAnthropic(temperature=0, model_name="claude-3-5-sonnet-20240620")
agent = create_pandas_dataframe_agent(llm, df, verbose=True, allow_dangerous_code=True)
With the agent initialized, we can now run various queries on our dataset.
Running Queries
To interact with the dataset, we’ll use the agent.run method. Let’s start with a basic query to find out how many rows and columns are in our dataset:
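A query along these lines does the job (the exact wording of the prompt is up to you; any phrasing of the question works):

```python
# Ask the agent a natural-language question about the DataFrame.
agent.run("How many rows and columns are in the dataset?")
```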
Upon executing this command, the agent will provide the answer: 3755 rows and 11 columns.

Data Cleaning
Before diving deeper, it’s crucial to ensure our data is clean. To check for any missing values, run:
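For example (prompt wording assumed):

```python
agent.run("Are there any missing values in the dataset?")
```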

The output will indicate that there are no missing values, confirming that all cells in the dataset are populated.
Exploring the Dataset
Next, let’s list all the columns present in the dataset:
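One possible prompt:

```python
agent.run("List all the columns in the dataset.")
```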

This command will return a list of all the columns. To understand the diversity of our data, we can check how many unique values are present in each column:
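Again, the phrasing is flexible; something like this works:

```python
agent.run("How many unique values are there in each column?")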

The agent will use Pandas’ nunique (or unique) method to count the distinct values in each column.
Advanced Queries
Now that we have a clean dataset and understand its structure, we can ask more complex questions that require multiple steps to analyze.
Top Five Jobs by Median Salary
For instance, to find the top five jobs with the highest median salary, we’ll run:
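A prompt along these lines (wording assumed) triggers the multi-step analysis:

```python
agent.run("What are the top five job titles by median salary?")
```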

The response will detail the process: grouping by job title, calculating median salaries, sorting the results, and selecting the top five jobs. Expect to see results like:
- Data Science Tech Lead
- Cloud Data Architect
- Data Lead
- Data Analytics Lead
- Head of Data
Analyzing Employment Types
Next, let’s find the percentage of data scientists working full-time:
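For example (prompt wording assumed):

```python
agent.run("What percentage of data scientists work full-time?")
```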

The agent will break down the calculation into steps, providing a clear percentage based on the dataset.
Visualizing Data
Finally, visualization can significantly enhance our understanding of the data. To visualize the median salaries of senior-level data scientists by company size, we can create a bar plot:
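A prompt like the following (wording assumed) will have the agent generate the plot with Matplotlib behind the scenes:

```python
agent.run(
    "Create a bar plot of the median salary of senior-level data scientists "
    "by company size."
)
```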
