
Exploratory Data Analysis with Claude

In this project, we’ll show how to speed up exploratory data analysis (EDA) by connecting a Large Language Model (such as Claude) to a Pandas DataFrame through LangChain’s agent toolkit, so that you can query the data in plain English. Let’s dive in!

Data Analysis Project with LLM and GenAI For Beginners

Getting Started: Installing Necessary Libraries

To kick things off, we need to set up our environment by downloading and installing some essential libraries and packages. Here’s what you’ll need:

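  • LangChain (including the langchain-anthropic and langchain-experimental packages, which provide the imports used below)
  • Anthropic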
  • Matplotlib (for plotting graphs)
  • Pandas
  • openpyxl
  • tabulate
Install Dependencies

Once you have these libraries installed, we can proceed to the next step: setting up our model and agent.

Importing Libraries and Setting Up the Environment

Start by importing the necessary libraries to perform our operations. You’ll want to import the following:

import os
from langchain_anthropic import ChatAnthropic
from langchain_experimental.agents import create_pandas_dataframe_agent

import pandas as pd

Next, you’ll need to set up your Anthropic API key. If you’re using an IDE, create a .env file to store your API key securely.
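As a minimal sketch of the .env approach, the snippet below reads KEY=VALUE pairs from a file and copies them into the process environment using only the standard library. The helper name load_env_file is hypothetical, and the sketch assumes the key is stored under ANTHROPIC_API_KEY, the environment variable that ChatAnthropic reads by default. A dedicated package such as python-dotenv would do the same job.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Copy KEY=VALUE pairs from a .env-style file into os.environ.

    Hypothetical helper for illustration; not part of LangChain or Anthropic.
    """
    env_path = Path(path)
    if not env_path.exists():
        return
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key.strip(), value.strip().strip("'\""))


load_env_file()
if "ANTHROPIC_API_KEY" not in os.environ:
    print("ANTHROPIC_API_KEY is not set; add it to .env or export it.")
```

Keep the .env file out of version control so the key never ends up in a shared repository.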

Using the Dataset

For our analysis, we’ll use the Data Science Salaries 2023 dataset, which has 11 columns. Upload the dataset to your notebook environment and name the file ds_salaries.csv. The /content/ path used in this walkthrough is the default working directory in Google Colab, so adjust it if you are running Jupyter locally.

To load the dataset into a DataFrame, provide the path to your CSV file:

df = pd.read_csv('/content/ds_salaries.csv')

To verify that the dataset has been loaded correctly, run:

df.head()

This command will display the first five rows of your dataset, confirming a successful load.

Initializing the Agent

Now, it’s time to initialize our agent. Set the temperature parameter to zero so that the responses are as deterministic and repeatable as possible rather than creative. We’ll be using the Claude 3.5 Sonnet model for our analysis.

Here’s how to initialize the agent:

llm = ChatAnthropic(temperature=0, model_name="claude-3-5-sonnet-20240620")

agent = create_pandas_dataframe_agent(llm, df, verbose=True, allow_dangerous_code=True)

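Note that allow_dangerous_code=True is an explicit opt-in: the agent answers questions by executing model-generated Python code in your environment, so run it only in a sandbox or a disposable notebook. With the agent initialized, we can now run various queries on our dataset.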

Running Queries

To interact with the dataset, we’ll use the agent.run method (recent LangChain releases deprecate run in favor of agent.invoke, so you may see a warning). Let’s start with a basic query to find out how many rows and columns are in our dataset:

agent.run('how many rows and columns are there in the dataset?')

Upon executing this command, the agent will provide the answer: 3755 rows and 11 columns.
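It is worth cross-checking the agent’s answer against plain Pandas. The sketch below shows the direct equivalent on a hypothetical three-row stand-in frame (not the real dataset); on the loaded data you would read df.shape instead.

```python
import pandas as pd

# Hypothetical stand-in for ds_salaries.csv; the values are made up.
sample = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Engineer", "Data Analyst"],
    "salary_in_usd": [150000, 130000, 90000],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = sample.shape
print(f"{n_rows} rows, {n_cols} columns")
```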

Query1 Result

Data Cleaning

Before diving deeper, it’s crucial to ensure our data is clean. To check for any missing values, run:

agent.run('are there any missing values?')
Query2 Result

The output will indicate that there are no missing values, confirming that all cells in the dataset are populated.
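For comparison, here is a sketch of how the same check might be done by hand. The sample frame is hypothetical and deliberately contains one missing value so the counts are non-zero; on the real data you would call the same methods on df.

```python
import pandas as pd

# Hypothetical sample with one deliberately missing job title.
sample = pd.DataFrame({
    "job_title": ["Data Scientist", None, "Data Analyst"],
    "salary_in_usd": [150000, 130000, 90000],
})

# isnull() flags each empty cell; sum() counts the flags per column.
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print(f"Total missing cells: {total_missing}")
```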

Exploring the Dataset

Next, let’s list all the columns present in the dataset:

agent.run('what are the columns?')
Query3 Result

This command will return a list of all the columns. To understand the diversity of our data, we can check how many unique values are present in each column:

agent.run('how many categories are in each column?')
Query4 Result

The agent will utilize Pandas’ unique method to provide counts of unique values for each column.
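One direct way to get these counts yourself is DataFrame.nunique, sketched below on a hypothetical sample. The agent may choose a different route (for example, calling unique on each column and taking the length), and the column names here are assumed from the public dataset.

```python
import pandas as pd

# Hypothetical sample; column names assumed from the salaries dataset.
sample = pd.DataFrame({
    "experience_level": ["SE", "MI", "SE", "EN"],
    "company_size": ["M", "M", "L", "M"],
})

# nunique() returns the number of distinct values in each column.
unique_counts = sample.nunique()
print(unique_counts)
```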

Advanced Queries

Now that we have a clean dataset and understand its structure, we can ask more complex questions that require multiple steps to analyze.

Top Five Jobs by Median Salary

For instance, to find the top five jobs with the highest median salary, we’ll run:

agent.run('which are the top 5 jobs that have the highest median salary?')
Query5 Result

The response will detail the process: grouping by job title, calculating median salaries, sorting the results, and selecting the top five jobs. Expect to see results like:

  • Data Science Tech Lead
  • Cloud Data Architect
  • Data Lead
  • Data Analytics Lead
  • Head of Data
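The group, aggregate, sort, and select steps described above can be sketched in plain Pandas as follows. This is an illustration on made-up numbers, not the agent’s actual generated code: the helper name top_jobs_by_median is hypothetical, and using salary_in_usd as the salary column is an assumption.

```python
import pandas as pd


def top_jobs_by_median(df, n=5, salary_col="salary_in_usd"):
    """Return the n job titles with the highest median salary."""
    return (
        df.groupby("job_title")[salary_col]
        .median()                      # one median per job title
        .sort_values(ascending=False)  # highest median first
        .head(n)
    )


# Hypothetical sample; the salaries are invented for illustration.
sample = pd.DataFrame({
    "job_title": [
        "Data Scientist", "Data Scientist",
        "Data Engineer", "Data Engineer", "Data Engineer",
        "Data Analyst",
        "Head of Data", "Head of Data",
    ],
    "salary_in_usd": [
        100000, 200000,
        120000, 140000, 160000,
        80000,
        210000, 190000,
    ],
})

top = top_jobs_by_median(sample, n=2)
print(top)
```

On the real data, top_jobs_by_median(df) gives a result you can compare against the agent’s list.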

Analyzing Employment Types

Next, let’s find the percentage of data scientists working full-time:

agent.run('what is the percentage of data scientists who are working full time?')
Query6 Result

The agent will break down the calculation into steps, providing a clear percentage based on the dataset.
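A manual version of this calculation might look like the sketch below. It assumes that full-time rows are coded as "FT" in an employment_type column and that “data scientists” means rows whose job_title is exactly "Data Scientist"; both are assumptions about the dataset, and the sample rows are made up.

```python
import pandas as pd

# Hypothetical sample; the "FT"/"PT" codes are assumed from the dataset.
sample = pd.DataFrame({
    "job_title": ["Data Scientist"] * 4 + ["Data Engineer"],
    "employment_type": ["FT", "FT", "PT", "FT", "FT"],
})

# Step 1: keep only the data scientist rows.
data_scientists = sample[sample["job_title"] == "Data Scientist"]

# Step 2: the mean of a boolean column is the fraction of True values.
full_time_pct = (data_scientists["employment_type"] == "FT").mean() * 100
print(f"{full_time_pct:.1f}% of data scientists work full time")
```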

Visualizing Data

Finally, visualization can significantly enhance our understanding of the data. To visualize the median salaries of senior-level data scientists by company size, we can create a bar plot:

agent.run('get median salaries of senior-level data scientists for each company size and plot them in a bar plot.')

 

Query7 Result: Bar chart
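To reproduce a chart like this without the agent, something along the following lines should work. It is a sketch on a hypothetical sample: the "SE" code for senior level, the S/M/L company sizes, and the salary_in_usd column are all assumed from the public dataset rather than confirmed by the agent’s output.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample; codes and salaries are invented for illustration.
sample = pd.DataFrame({
    "job_title": ["Data Scientist"] * 5 + ["Data Engineer"],
    "experience_level": ["SE", "SE", "SE", "SE", "MI", "SE"],
    "company_size": ["S", "M", "M", "L", "L", "L"],
    "salary_in_usd": [90000, 140000, 160000, 170000, 100000, 150000],
})

# Keep senior-level data scientists only.
senior_ds = sample[
    (sample["job_title"] == "Data Scientist")
    & (sample["experience_level"] == "SE")
]

# Median salary per company size, ordered small to large.
medians = (
    senior_ds.groupby("company_size")["salary_in_usd"]
    .median()
    .reindex(["S", "M", "L"])
)

fig, ax = plt.subplots()
ax.bar(medians.index, medians.values)
ax.set_xlabel("Company size")
ax.set_ylabel("Median salary (USD)")
ax.set_title("Median salary of senior data scientists by company size")
# In a notebook, call plt.show() here to display the chart.
```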