Pandas AI

October 24, 2024

The Basics

Before we dive into pandas, I'd recommend checking out the official pandas library documentation. It's a thorough resource that covers the library in depth.

Installing and Importing Pandas

To install the pandas library, use one of these commands depending on your package manager:

conda install pandas

or

pip install pandas

Pandas Fundamentals

This is an in-depth introduction to pandas fundamentals, perfect for those starting with machine learning. Many online machine learning courses don't require knowledge of these topics, but understanding pandas can make following those courses much easier.

By the end of this blog, you'll have a solid grasp of pandas fundamentals, enabling you to supercharge your machine learning journey. 

Alright. Let's get started with two questions that any machine learning beginner might have.

What is Pandas and why should I learn it?

Pandas is an open-source library that offers high-performance, easy-to-use data structures and data analysis tools for the Python programming language. This is why pandas is the top choice for data preprocessing among the machine learning community.

The first step in any machine learning workflow is preparing your dataset for processing. 

A significant portion of your work as a machine learning engineer involves preparing data (loading, cleansing, transforming) before applying any machine learning algorithms.

Pandas is one of the most popular tools for this task, which is why it's heavily used in machine learning courses.

Why Pandas?

  • Simple to use.
  • Out-of-the-box integration with many other data science and ML Python tools (built on top of NumPy).
  • Helps you get your data ready for machine learning.

Given the extensive capabilities of pandas, we'll cover the following concepts within the library:

  • Pandas data structures
  • Importing and exporting data
  • Describing data
  • Viewing and selecting data
  • Manipulating data

Pandas Data Structures

There are three types of data structures in pandas:

  • Series: A one-dimensional array-like structure containing uniform data.
  • DataFrame: A two-dimensional array-like structure with diverse data, organized in a tabular format with rows and columns.
  • Panel: A three-dimensional data structure with varied data, removed in recent versions of pandas.

Let's explore these data structures in a Jupyter notebook.

Series Data Structure

The code snippet below shows how the Series data structure works in pandas:

import pandas as pd
# Series Data Structure
series_ds = pd.Series(['BMW', 'Toyota', 'Honda', 'Acura', 'Audi'])
series_ds[5] = 'Nissan'
colors_ds = pd.Series(['White', 'Black', 'Blue', 'Red', 'Green'])
colors_ds
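To make the behavior concrete, here's a small self-contained sketch of the same idea (the variable names here are illustrative, not from the original notebook):

```python
import pandas as pd

# A Series pairs a one-dimensional array of values with an index of labels.
cars = pd.Series(['BMW', 'Toyota', 'Honda', 'Acura', 'Audi'])

# Assigning to a label that does not exist yet enlarges the Series;
# .loc makes it explicit that 5 is an index label, not a position.
cars.loc[5] = 'Nissan'

print(len(cars))      # 6
print(cars.iloc[-1])  # Nissan
```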

DataFrame Data Structure

The code snippet below shows the DataFrame data structure. Notice that we pass the data to pd.DataFrame as a dictionary, where each key becomes a column name:

car_data_ds = pd.DataFrame({'Car make': series_ds, 'Colors': colors_ds})
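As a quick self-contained sketch of how the dictionary keys become columns (with shortened, hypothetical data):

```python
import pandas as pd

makes = pd.Series(['BMW', 'Toyota', 'Honda'])
colors = pd.Series(['White', 'Black', 'Blue'])

# Each dictionary key becomes a column name; each Series becomes a column.
car_df = pd.DataFrame({'Car make': makes, 'Colors': colors})

print(car_df.shape)          # (3, 2)
print(list(car_df.columns))  # ['Car make', 'Colors']
```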

Importing Data

You don't need to hand-type all the data for analysis. You can import CSV files easily. Let's see how:

  1. Upload the CSV file to your notebook.
  2. Use the following code snippet to read the CSV file:

# Import Data from csv
car_sales_2022 = pd.read_csv('2022-car-sales.csv')  # auto-complete works for the filename: hit Tab after typing '202'

Printing the result shows clearly how a DataFrame is represented in pandas.

Exporting Data

We can export data using the command below. Note that it is not limited to CSV format; we can export to Excel and other file types as well.

Exporting a DataFrame:

car_sales_2022.to_csv('exported-car-sales.csv')  # This exports the data into a file within the project directory.

After running this command, you will see a new file at the project level containing the exported data.

You can verify that the new file is accessible in your notebook, for example by using tab completion on the filename.

However, note that an additional unnamed column (the DataFrame's index) was included in the export. To avoid this, specify that the index should not be written:

car_sales_2022.to_csv('exported-car-sales.csv', index=False)

After re-exporting, the unnamed index column is no longer present.

In addition to loading files to the project, you can read them from external sources. The example below shows how to import data from a CSV file stored on GitHub:

car_sales_2022_url = pd.read_csv('https://raw.githubusercontent.com/CheyPenmetsa/MLSamples/master/2022-car-sales.csv')

Note: When providing the file path from GitHub, ensure you use the raw file path. Similarly, this approach applies to files from Gists as well.

Describing Data

Let's explore how we can obtain more information about the data we imported. There are several attributes and functions available to get detailed insights. Here are some examples:

car_sales_2022.dtypes # Prints the column types

car_sales_2022.columns # Prints the columns
car_sales_2022.index # Prints row information
car_sales_2022.describe() # Generates descriptive statistics
car_sales_2022.info() # Combination of dtypes and index

However, you'll notice that describe() only worked on the "Odometer" and "Doors" columns, but not on the "Price" column. This is because the "Price" column was imported as an object data type; we'll fix that later, in the plotting section. For now, here are a few more useful aggregations:

car_sales_2022.sum(numeric_only=True)  # Sums all numeric columns
car_sales_2022['Odometer (KM)'].sum()  # Sums the values in the "Odometer (KM)" column
len(car_sales_2022)  # Returns the number of rows in the DataFrame
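Since the CSV itself isn't included here, a tiny hypothetical stand-in shows what these describing helpers return:

```python
import pandas as pd

# A hypothetical miniature version of the car-sales data.
df = pd.DataFrame({
    'Make': ['Toyota', 'Honda', 'BMW'],
    'Odometer (KM)': [35000, 62000, 11000],
})

print(df.dtypes)                  # Make: object, Odometer (KM): int64
print(df['Odometer (KM)'].sum())  # 108000
print(len(df))                    # 3

# describe() reports count, mean, std, min, quartiles, and max
# for the numeric columns.
stats = df.describe()
print(stats.loc['mean', 'Odometer (KM)'])  # 36000.0
```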

Viewing and Selecting Data

In this section, we’ll explore how to slice and view data from a DataFrame using various functions like head(), tail(), iloc, loc, and more. Here are some examples:

car_sales_2022.head()  # Returns the top 5 rows
car_sales_2022.tail(7)  # Returns the bottom 7 rows

Using head and tail, you can select the top or bottom rows from the DataFrame.

.loc and .iloc for Slicing Data

Below code snippet and screenshot shows how to slice data using .loc and .iloc:

langs = pd.Series(['python', 'js', 'C#', 'html', 'sql', 'css'], index=[0, 3, 5, 7, 9, 3])
# loc refers to index in the data
langs.loc[3]
# iloc refers to position in the data
langs.iloc[3]

With loc and iloc, we can select data by index label and by position, respectively. Also notice that when creating a Series or DataFrame, you can provide custom indices. To show that standard Python concepts still apply, let's select specific items using [start:stop:stepover] slicing.

langs.iloc[:3]  # Similar to Python slicing [start:stop:stepover]
langs.iloc[:3:2]  # Step over value of 2 excludes 'js'
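The same distinction applies to DataFrames. A small sketch with made-up labels makes the label-versus-position difference explicit:

```python
import pandas as pd

# Custom integer labels that differ from the positions.
df = pd.DataFrame({'Make': ['Toyota', 'Honda', 'BMW', 'Audi']},
                  index=[10, 20, 30, 40])

print(df.loc[20, 'Make'])  # Honda  (index label 20)
print(df.iloc[1]['Make'])  # Honda  (position 1, same row)

# Slicing by position also supports a step-over value.
print(df.iloc[0:3:2]['Make'].tolist())  # ['Toyota', 'BMW']
```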

Selecting Specific Columns

You can select a specific column in two ways:

  1. Using dot notation (note this won’t work if there are spaces in column names).
  2. Specifying the column name inside [].

car_sales_2022['Make']
car_sales_2022.Make

Filtering Data Based on Conditions

# Filtering results
car_sales_2022[car_sales_2022['Make'] == 'Acura'] 
# Filters only Acura-related rows
car_sales_2022[(car_sales_2022['Make'] == 'Acura') & (car_sales_2022['Odometer (KM)'] > 1000)]
car_sales_2022[car_sales_2022['Odometer (KM)'].apply(lambda x: x > 100000)]

Using the & (AND) and | (OR) operators, with each condition wrapped in parentheses, allows for complex data filtering.
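A self-contained sketch of combining conditions (with tiny hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({
    'Make': ['Acura', 'Toyota', 'BMW', 'Acura'],
    'Odometer (KM)': [500, 90000, 12000, 150000],
})

# Each condition must be wrapped in parentheses before combining,
# because & and | bind tighter than comparison operators in Python.
mask = (df['Make'] == 'Acura') | (df['Odometer (KM)'] > 50000)
filtered = df[mask]

print(len(filtered))              # 3
print(filtered['Make'].tolist())  # ['Acura', 'Toyota', 'Acura']
```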

Reshaping and Displaying Data

You can reshape and display data using functions like crosstab() and groupby():

Pandas crosstab

pd.crosstab(car_sales_2022['Make'], car_sales_2022['Type'])

Pandas crosstab is used to compute a simple cross-tabulation of two (or more) factors.
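For instance, with a small hypothetical dataset, crosstab counts how often each Make/Type combination occurs:

```python
import pandas as pd

df = pd.DataFrame({
    'Make': ['Toyota', 'Toyota', 'BMW'],
    'Type': ['SUV', 'Sedan', 'Sedan'],
})

# Rows are Make values, columns are Type values, cells are counts.
table = pd.crosstab(df['Make'], df['Type'])

print(table.loc['Toyota', 'SUV'])  # 1
print(table.loc['BMW', 'SUV'])     # 0
```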

Pandas groupby

Pandas groupby is used for grouping the data according to categories and applying functions to them. It helps aggregate data efficiently.

make_grp = car_sales_2022.groupby(['Make'])
make_grp.first()
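Beyond first(), any aggregate can be applied per group. A minimal sketch with made-up prices:

```python
import pandas as pd

df = pd.DataFrame({
    'Make': ['Toyota', 'Toyota', 'BMW'],
    'Price': [20000, 30000, 60000],
})

# groupby splits the rows by category; mean() is then computed per group.
avg_price = df.groupby('Make')['Price'].mean()

print(avg_price['Toyota'])  # 25000.0
print(avg_price['BMW'])     # 60000.0
```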

Basic Plotting

Now let's see how we can do basic plotting.
If you already imported matplotlib at the start of your notebook, you don't need the first two commands below.

%matplotlib inline  # This command tells Matplotlib to plot inside the notebook
import matplotlib.pyplot as plt

car_sales_2022['Odometer (KM)'].plot()


car_sales_2022['Type'].hist()

If you notice, our prices were imported as objects, since the values contain a $ sign along with commas and trailing spaces. If we try to plot that column we get errors, so let's see how to fix it.

car_sales_2022['Price'] = car_sales_2022['Price'].str.replace(r'[\$,]', '', regex=True).str.strip().astype(float)
# Alternatively, the same cleanup with apply:
car_sales_2022['Price'] = car_sales_2022['Price'].apply(lambda x: str(x).replace('$', '').replace(',', '').strip()).astype(float)

You can see the difference before and after, now you can plot using the price. Now let’s go further into data manipulation using pandas.
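The same cleanup can be verified on a small hypothetical Series of price strings:

```python
import pandas as pd

prices = pd.Series(['$4,000 ', '$7,500', '$22,000 '])

# Strip '$' and ',' with a regex, trim whitespace, then cast to float
# so numeric operations and plotting work.
clean = prices.str.replace(r'[\$,]', '', regex=True).str.strip().astype(float)

print(clean.tolist())  # [4000.0, 7500.0, 22000.0]
print(clean.sum())     # 33500.0
```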

Data Manipulation

Pandas offers a wide range of methods for data manipulation. For example, string methods from Python can be used on pandas objects. Let's see some examples:

  1. String Methods on DataFrame Columns

The str accessor in pandas allows us to apply string methods to a Series. Printing the accessor itself shows that it returns a StringMethods object:

car_sales_2022['Make'].str
# Output: <pandas.core.strings.accessor.StringMethods at 0x11093863dd0>

Using a string method through the str accessor, such as lower(), returns a modified Series but does not update the original collection.

car_sales_2022['Make'].str.lower()
# Output: (modified Series, original DataFrame unchanged)

To update the DataFrame, we need to assign the result back to the original column.

car_sales_2022['Make'] = car_sales_2022['Make'].str.lower()
# This updates the 'Make' column in the original DataFrame.

  2. Handling Missing Data

Let's import a CSV file with missing data, which is common in real-world scenarios. Pandas will represent missing values as NaN.

To fill missing values in the 'Odometer (KM)' column with 0.0:

car_sales_2022_missing['Odometer (KM)'] = car_sales_2022_missing['Odometer (KM)'].fillna(0.0)

Alternatively, to fill missing values in place (note that inplace=True on a selected column is discouraged in newer pandas versions, which favor the assignment form above):

car_sales_2022_missing['Odometer (KM)'].fillna(0.0, inplace=True)

Other methods for handling missing data include dropna(), which removes rows with NaN values.
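A compact sketch of both approaches on hypothetical data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Odometer (KM)': [35000.0, np.nan, 62000.0]})

# fillna returns a new Series; assign it back to keep the change.
filled = df['Odometer (KM)'].fillna(0.0)
print(filled.isna().sum())  # 0

# dropna instead removes the rows that contain NaN.
dropped = df.dropna()
print(len(dropped))         # 2
```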

  3. Adding New Columns

Until now, we have been able to update existing rows and columns in a DataFrame. Now, let's see how to add new columns. Here's an example:

# Adding a column from a Series
# If we don't have data for all rows, they will be populated as NaN
air_bag_count_list = [1, 2, 2, 2, 2, 2, 4, 2, 2, 4, 4, 4]
airbag_column = pd.Series(air_bag_count_list)
car_sales_2022['AirBags'] = airbag_column
car_sales_2022

You might wonder why we don't directly assign a list to a new column, and instead create a Series first. The next example shows why:

# Adding a column from a list
seats_list = [4, 5, 5, 7, 5, 5, 5, 5, 5, 5, 5, 5]
car_sales_2022['Seats'] = seats_list
car_sales_2022

This demonstrates the difference: a Series can be shorter than the DataFrame (the missing rows become NaN), whereas a list must match the row count exactly.
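A minimal sketch of that difference, using hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'Make': ['Toyota', 'Honda', 'BMW']})

# A Series shorter than the DataFrame aligns by index;
# the rows without a value become NaN.
df['AirBags'] = pd.Series([2, 4])
print(df['AirBags'].isna().sum())  # 1

# A plain list must match the row count exactly,
# otherwise pandas raises a ValueError.
df['Seats'] = [5, 5, 7]
print(df['Seats'].tolist())  # [5, 5, 7]
```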

# Adding a column based on values from other columns
car_sales_2022['PremiumCar'] = (
    car_sales_2022['Make'].str.startswith('bmw') | car_sales_2022['Make'].str.startswith('audi')
)
# Adding a scalar value to all rows
car_sales_2022['Number of wheels'] = 4
# Dropping a column
car_sales_2022 = car_sales_2022.drop('Number of wheels', axis=1) 
# axis=1 drops a column; axis=0 would drop a row

By following these steps, you can effectively add new columns to your DataFrame and manage data more flexibly.

To improve the performance of machine learning models, the data must be randomized, so we need to shuffle the dataset. Pandas provides methods for randomizing data.

# Randomizing the data so it becomes more effective for ML models
car_sales_2022_shuffled = car_sales_2022.sample(frac=1)  # frac=1 shuffles the entire dataset; 0.5 would sample half
# reset_index with drop=True drops the old shuffled index instead of inserting it as a column
car_sales_2022_shuffled.reset_index(drop=True, inplace=True)

Lambdas in Python are small anonymous functions defined using the lambda keyword. They can have any number of arguments but only one expression. They are useful when you need a simple function for a short period, especially as an argument to functions like apply.

Here’s an example using a lambda function:

# 'apply' will take anonymous functions or lambdas as input
car_sales_2022['Odometer (Miles)'] = car_sales_2022['Odometer (KM)'].apply(lambda x: x / 1.6)

In this example, the lambda function takes each value x from the 'Odometer (KM)' column and converts it to miles by dividing it by 1.6.
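The conversion is easy to check on a couple of hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({'Odometer (KM)': [16000, 80000]})

# apply calls the lambda once per value in the column.
df['Odometer (Miles)'] = df['Odometer (KM)'].apply(lambda x: x / 1.6)

print(df['Odometer (Miles)'].tolist())  # [10000.0, 50000.0]
```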

Introduction to PandasAI

Currently, we spend a significant amount of time editing, cleaning, and analyzing data using various methods.

Pandas is a widely-used Python library that helps us manipulate data effectively.

Data manipulation with Python involves organizing data into a structured format called a "DataFrame." DataFrames allow us to modify, clean, or analyze data. For example, we can create bar graphs, add new rows or columns, or replace missing data.

However, data manipulation can be time-consuming.

To address this, the PandasAI library, an extension of Pandas that adds generative AI capabilities, offers a more efficient way to analyze and manipulate data.

Note that it is designed to be used in conjunction with Pandas, not as a replacement for it.

Advantages of PandasAI:

  • It is a valuable addition for users of the Pandas library.
  • It offers impressive features, such as executing language prompts similar to SQL queries and generating visualizations directly from a DataFrame.
  • It boosts productivity by automating various processes.
  • While PandasAI is a powerful tool, the Pandas library is still necessary for certain tasks, like adding missing data to a DataFrame.
  • PandasAI enhances the functionality of the Pandas library, making data handling in Python more efficient and convenient.

How does PandasAI work?

PandasAI employs a generative AI model to comprehend and interpret natural language queries. It then translates these queries into Python code and SQL commands. Using this code, it interacts with the data and provides the results back to the user.

Who Should Use PandasAI?

PandasAI is ideal for data scientists, analysts, and engineers seeking a more intuitive way to interact with their data.
It's especially useful for people who aren't familiar with SQL or Python and want to save time and effort when working with data.
Even those experienced in SQL and Python can benefit from PandasAI, as it allows them to ask questions of their data without needing to write complex code.

Official GitHub Repository — https://github.com/sinaptik-ai/pandas-ai

Code: https://colab.research.google.com/drive/1ZnO-njhL7TBOYPZaqvMvGtsjckZKrv2E?usp=sharing

Official Documentation: https://docs.pandas-ai.com/library

In this article, I'll give you an exploratory overview of some of the key features of this library that you can utilize for your data analysis tasks.

Setup

Step 1: Install PandasAI

To get started, ensure you have the latest version of PandasAI installed. You can do this by running the following command:

pip install pandasai

Note: 

  1. In Google Colab, you add an exclamation mark (!) before a pip command to tell Colab that it's a shell command. This lets you run system commands directly, making it easy to install packages or perform other tasks that go beyond Python programming.
  2. Please be aware that after installing the Pandasai module, you need to restart the kernel.

Restarting the kernel gives you a fresh start and makes sure the new modules you installed work correctly. This helps keep your coding environment stable and avoids any problems.

Creating a Smart Dataframe

What is a SmartDataframe?

A SmartDataframe is an enhanced version of a pandas (or polars) dataframe. It inherits all the properties and methods of a standard pd.DataFrame, but with added conversational capabilities, making your data more interactive.

Step 2: Import Necessary Libraries

Before we create our SmartDataframe, we need to import the required libraries and set up our data.

import pandas as pd
from pandasai import SmartDataframe

Step 3: Create a Pandas DataFrame

Let's create a sample pandas dataframe to work with.

df = pd.DataFrame({
   "country": [
       "United States",
       "United Kingdom",
       "France",
       "Germany",
       "Italy",
       "Spain",
       "Canada",
       "Australia",
       "Japan",
       "China",
   ],
   "gdp": [
       19294482071552,
       2891615567872,
       2411255037952,
       3435817336832,
       1745433788416,
       1181205135360,
       1607402389504,
       1490967855104,
       4380756541440,
       14631844184064,
   ],
   "happiness_index": [
       6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12
   ],
})


Step 4: Set Up Your API Key

PandasAI is powered by a Large Language Model (LLM). By default, it uses BambooLLM, but you can choose your preferred LLM by setting up an API key.

Sign up at https://pandabi.ai/ to get your free API key, then set it in your environment.

import os
os.environ['PANDASAI_API_KEY'] = "YOUR_API_KEY"

Step 5: Instantiate the SmartDataframe

Now, let's create our SmartDataframe using the pandas dataframe we created.

sdf = SmartDataframe(df)


Interacting with Your Data

With your SmartDataframe ready, you can now query it using natural language.

Example Queries

  1. Top 5 Countries by GDP

sdf.chat("Return the top 5 countries by GDP")

Output:

  2. Sum of GDP of the 2 Unhappiest Countries

sdf.chat("What's the sum of the GDP of the 2 unhappiest countries?")

Output:

  3. To see the last generated code from your queries, use:

print(sdf.last_code_generated)

Output:


Visualizing Data

PandasAI also allows you to create visualizations effortlessly.

Example: Plotting a Chart

  1. Plot GDP by Country

sdf.chat("Plot a chart of the GDP by country")

Output:

  2. Customized Histogram

Want to add some customization to your plots? Simply ask PandasAI!

sdf.chat("Plot a histogram of the GDP by country, using a different color for each bar")

Output:

Conclusion: Summing It All Up

In closing, you’ve now delved into the powerful tools of Pandas and PandasAI, learning how to manipulate data effortlessly and enhance your data science & AI projects. Pandas acts as your reliable assistant for data handling, while PandasAI elevates your projects with advanced AI capabilities.

Remember, mastering Pandas and PandasAI is a journey of exploration and refinement. Keep experimenting with your data, keep learning new techniques, and most importantly, enjoy the process!

Armed with these skills, you’re well-equipped to excel in the world of data science. So, go ahead and unleash your creativity! You never know – your next breakthrough could redefine how we use data in the tech world.

Thanks for joining me on this learning adventure. Happy coding, and may your data always tell insightful stories!

Author

This article was written by Karan Shah, and edited by our writers team.
