A Python platform for conversing with databases and data lakes using natural language, leveraging LLM and RAG technologies to make data analysis conversational.
PandasAI Project: A Detailed Overview
Project Overview
PandasAI is a Python platform that lets users query databases and data lakes in natural language. Developed by the Sinaptik-AI team, the project aims to make data analysis more intuitive and accessible, regardless of the user's technical background.
GitHub Address: https://github.com/Sinaptik-AI/pandas-ai
Core Features
1. Natural Language Data Querying
- Supports asking data-related questions using natural language
- No need to write complex SQL queries or Python code
- Suitable for both non-technical and technical users
2. Support for Multiple Data Sources
- Databases: SQL databases
- File Formats: CSV, Parquet files
- DataFrames: Pandas DataFrames
- Others: NoSQL stores such as MongoDB
3. Integration of LLM and RAG Technologies
- Utilizes Large Language Models (LLM) to understand natural language queries
- Employs Retrieval-Augmented Generation (RAG) technology to improve query accuracy
- Defaults to BambooLLM, but also supports other LLMs
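The retrieval step behind RAG can be illustrated with a toy example. This is a minimal sketch, not PandasAI's implementation: it assumes a simple token-overlap score in place of a real embedding search, and the helper names (`tokenize`, `retrieve`, `build_prompt`) and table descriptions are hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents that share the most tokens with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(q_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, documents: List[str], k: int = 1) -> str:
    """Prepend the retrieved context to the question before it goes to an LLM."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical table descriptions standing in for a real knowledge base.
docs = [
    "table sales: columns country, revenue",
    "table employees: columns EmployeeID, Name, Department",
    "table salaries: columns EmployeeID, Salary",
]

prompt = build_prompt("Which country has the highest revenue?", docs)
print(prompt)
```

Only the best-matching description reaches the prompt, which is the point of retrieval: the model sees the schema relevant to the question rather than everything available.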
4. Data Visualization
- Automatically generates various charts
- Supports multiple chart types, including histograms and bar charts
- Customizable chart styles and colors
Technical Features
System Requirements
- Python 3.8 or later, but below 3.12 (3.8 <= version < 3.12)
- Supports Jupyter notebooks and Streamlit applications
- Offers a client-server architecture
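The Python version constraint above can be checked before installing. The following is a small sketch using only the standard library; the function name `is_supported` and the two constants are illustrative and not part of PandasAI.

```python
import sys
from typing import Tuple

MIN_VERSION = (3, 8)   # inclusive lower bound
MAX_VERSION = (3, 12)  # exclusive upper bound


def is_supported(version: Tuple[int, ...]) -> bool:
    """Return True if the (major, minor) pair satisfies 3.8 <= version < 3.12."""
    return MIN_VERSION <= tuple(version[:2]) < MAX_VERSION


if not is_supported(sys.version_info[:2]):
    # Warn rather than exit, so the check can be embedded in larger scripts.
    print(
        "Warning: Python %d.%d is outside the supported range 3.8 <= version < 3.12"
        % sys.version_info[:2]
    )
```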
Installation
Using pip:
pip install "pandasai>=3.0.0b2"
Using poetry:
poetry add "pandasai>=3.0.0b2"
Basic Usage Examples
Single DataFrame Query
import pandasai as pai
# Create a sample DataFrame
df = pai.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
    "revenue": [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000]
})
# Set API key
pai.api_key.set("your-pai-api-key")
# Perform natural language query
df.chat('Which are the top 5 countries by sales?')
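For comparison, here is roughly what the same question costs when written by hand. This is a plain-Python sketch that assumes "sales" refers to the revenue column; it does not show the code PandasAI actually generates.

```python
countries = ["United States", "United Kingdom", "France", "Germany", "Italy",
             "Spain", "Canada", "Australia", "Japan", "China"]
revenue = [5000, 3200, 2900, 4100, 2300, 2100, 2500, 2600, 4500, 7000]

# Pair each country with its revenue, sort descending by revenue, keep five.
top5 = sorted(zip(countries, revenue), key=lambda pair: pair[1], reverse=True)[:5]

for country, value in top5:
    print(f"{country}: {value}")
```

The natural language query replaces the sort, the slice, and the decision about which column "sales" maps to.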
Multi-DataFrame Join Query
import pandasai as pai
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance']
}
salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}
employees_df = pai.DataFrame(employees_data)
salaries_df = pai.DataFrame(salaries_data)
pai.api_key.set("your-pai-api-key")
pai.chat("Who gets paid the most?", employees_df, salaries_df)
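Answering this question requires joining the two tables on EmployeeID and then taking the maximum salary. The sketch below does that by hand in plain Python on the same data, to show the logic the query stands in for; it is not the code PandasAI produces.

```python
employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Name": ["John", "Emma", "Liam", "Olivia", "William"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
}
salaries_data = {
    "EmployeeID": [1, 2, 3, 4, 5],
    "Salary": [5000, 6000, 4500, 7000, 5500],
}

# Index salaries by employee ID so they can be looked up during the join.
salary_by_id = dict(zip(salaries_data["EmployeeID"], salaries_data["Salary"]))

# Inner join on EmployeeID: keep only employees that have a salary record.
joined = [
    (name, salary_by_id[emp_id])
    for emp_id, name in zip(employees_data["EmployeeID"], employees_data["Name"])
    if emp_id in salary_by_id
]

top_name, top_salary = max(joined, key=lambda row: row[1])
print(f"{top_name} earns the most: {top_salary}")
```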
Generating Charts
# Reuses the df from the single DataFrame example
df.chat(
    "Plot the histogram of countries showing for each one the revenue. Use different colors for each bar",
)
Advanced Features
Data Platform Integration
PandasAI integrates with a data platform, so a dataset can be uploaded and shared from code:
import pandasai as pai
pai.api_key.set("your-pai-api-key")
file = pai.read_csv("./filepath.csv")
dataset = pai.create(
    path="your-organization/dataset-name",
    df=file,
    name="dataset-name",
    description="dataset-description",
)
dataset.push()
Docker Sandbox Environment
To provide a secure code execution environment, PandasAI supports a Docker sandbox:
Install the Docker extension:
pip install "pandasai-docker"
Run queries inside the sandbox:
import pandasai as pai
from pandasai_docker import DockerSandbox
# Initialize the sandbox
sandbox = DockerSandbox()
sandbox.start()
# Execute queries in the sandbox (employees_df and salaries_df come from the multi-DataFrame example)
pai.chat("Who gets paid the most?", employees_df, salaries_df, sandbox=sandbox)
# Stop the sandbox
sandbox.stop()
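If a query raises an exception between start() and stop(), the sandbox in the example above is left running. One way to guard against that is a small context manager. The sketch below relies only on the start() and stop() methods shown above; the `running` helper and the `FakeSandbox` stand-in are hypothetical and exist so the example runs without Docker.

```python
from contextlib import contextmanager


@contextmanager
def running(sandbox):
    """Start the sandbox, yield it, and always stop it afterwards."""
    sandbox.start()
    try:
        yield sandbox
    finally:
        sandbox.stop()


class FakeSandbox:
    """Stand-in with the same start/stop interface, used for demonstration only."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


fake = FakeSandbox()
try:
    with running(fake):
        raise RuntimeError("query failed")  # simulate a failing query
except RuntimeError:
    pass

print(fake.events)
```

With a real sandbox the call would presumably look like `with running(DockerSandbox()) as sandbox:` followed by the `pai.chat(..., sandbox=sandbox)` call from the example above.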
Use Cases
Target Audience
- Non-technical users: Analyze data without learning SQL or Python
- Data Analysts: Quickly explore and analyze data
- Developers: Integrate into existing applications
- Enterprise Users: Build internal data analysis tools
Typical Applications
- Business Intelligence Analysis
- Data Exploration and Visualization
- Report Generation
- Education and Training
- Prototyping
Technical Architecture
Core Components
- Natural Language Processing: Understanding user query intent
- Code Generation: Converting natural language into executable code
- Secure Execution: Safely executing code in a sandbox environment
- Result Presentation: Formatting and displaying query results
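The four components above can be pictured as a small pipeline. The sketch below is purely illustrative and assumes nothing about PandasAI's internals: the code generator is a hard-coded stub rather than an LLM call, and all function names are hypothetical. The restricted exec() namespace is not a real security boundary, which is the gap an isolated environment such as the Docker sandbox is meant to fill.

```python
from typing import Any, Callable, Dict


def fake_code_generator(question: str) -> str:
    """Stand-in for the LLM: always returns the same snippet, whatever the question."""
    return "result = max(rows, key=lambda r: r['revenue'])['country']"


def execute(code: str, data: Dict[str, Any]) -> Any:
    """Run generated code with a reduced set of builtins.

    Illustration only: stripping builtins does NOT make exec() safe.
    """
    namespace: Dict[str, Any] = {
        "__builtins__": {"max": max, "min": min, "sum": sum, "len": len, "sorted": sorted}
    }
    namespace.update(data)
    exec(code, namespace)
    return namespace.get("result")


def answer(
    question: str,
    data: Dict[str, Any],
    generate: Callable[[str], str] = fake_code_generator,
) -> str:
    """Question -> generated code -> execution -> formatted result."""
    code = generate(question)
    result = execute(code, data)
    return f"{question} -> {result}"


rows = [
    {"country": "China", "revenue": 7000},
    {"country": "Japan", "revenue": 4500},
]
response = answer("Which country has the highest revenue?", {"rows": rows})
print(response)
```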
Extensibility
- Supports multiple LLM backends
- Customizable data connectors
- Pluggable architecture for easy extension
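A pluggable design of this kind is often built around a registry that maps backend names to implementations. The following is a generic sketch of that pattern, not PandasAI's actual extension API: `register_backend`, `get_backend`, and the two toy backends are hypothetical names.

```python
from typing import Callable, Dict

# A backend is modelled here as any callable from prompt text to response text.
LLMBackend = Callable[[str], str]

_BACKENDS: Dict[str, LLMBackend] = {}


def register_backend(name: str) -> Callable[[LLMBackend], LLMBackend]:
    """Decorator that registers a backend under the given name."""
    def decorator(fn: LLMBackend) -> LLMBackend:
        _BACKENDS[name] = fn
        return fn
    return decorator


def get_backend(name: str) -> LLMBackend:
    """Look up a backend by name, failing with a readable error."""
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"unknown backend {name!r}; available: {sorted(_BACKENDS)}"
        ) from None


@register_backend("echo")
def echo_backend(prompt: str) -> str:
    return f"echo: {prompt}"


@register_backend("upper")
def upper_backend(prompt: str) -> str:
    return prompt.upper()


print(get_backend("echo")("hello"))
```

Adding a backend is then a matter of defining one function and decorating it; the calling code never changes.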
Conclusion
PandasAI lowers the technical barrier to data analysis by letting users ask questions in natural language and having an LLM generate the code that answers them. It suits individual users exploring data as well as enterprises building internal data analysis platforms. As LLMs improve, tools of this kind are likely to play a larger role in data-driven decision-making.