A compiler that combines SQL with AI, extending SQL syntax to support machine learning model training, prediction, and evaluation.

License: Apache-2.0 | Language: Go | Stars: 5.2k | Organization: sql-machine-learning | Last Updated: 2024-04-18

SQLFlow: A Detailed Introduction

Project Overview

SQLFlow is a compiler that compiles SQL programs into workflows running on Kubernetes. It extends SQL syntax to support AI tasks, including training, prediction, model evaluation, model explanation, custom jobs, and mathematical programming.

Project Address: https://github.com/sql-machine-learning/sqlflow

Core Features

1. Combining SQL and AI

SQLFlow addresses several pain points in traditional machine learning development:

  • Building an ML application typically requires multiple roles, such as data engineers, data scientists, and business analysts.
  • Developers must master multiple programming languages, such as Python, SQL, SAS, Julia, and R.
  • Fragmented tools and development environments make engineering difficult.

SQLFlow enables engineers with SQL skills to develop advanced ML applications.

2. Broad Compatibility

Supported Database Systems:

  • MySQL
  • MariaDB
  • TiDB
  • Apache Hive
  • Alibaba MaxCompute

Supported Machine Learning Frameworks:

  • TensorFlow
  • Keras
  • XGBoost

3. Extended SQL Syntax

SQLFlow extends standard SQL syntax with machine learning keywords and statements, so users can perform the following tasks directly in SQL:

  • Model Training (TO TRAIN)
  • Model Prediction (TO PREDICT)
  • Model Evaluation
  • Feature Engineering
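
To make the idea of an extended statement concrete, here is a minimal Python sketch that splits a statement into its standard SELECT part and its extension clause. It is not SQLFlow's parser: the function name `split_extended_sql` is hypothetical, and the sketch recognizes only the two keywords listed above.

```python
import re
from typing import Tuple

# Only the two extension keywords named in the list above are handled here.
# This is an illustrative assumption, not SQLFlow's actual grammar.
_EXTENSION = re.compile(r"\bTO\s+(TRAIN|PREDICT)\b", re.IGNORECASE)


def split_extended_sql(statement: str) -> Tuple[str, str]:
    """Return (standard_select, extension_clause) for one statement.

    The extension clause is an empty string when the statement is plain SQL.
    """
    text = statement.strip().rstrip(";").strip()
    match = _EXTENSION.search(text)
    if match is None:
        return text, ""
    return text[: match.start()].strip(), text[match.start():].strip()
```

A regular expression like this would also match the keywords inside a string literal, so a real compiler needs a proper lexer. The sketch only shows that the part before the extension keyword is ordinary SQL that the database engine can run unchanged.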

Usage Examples

Model Training Example

SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

Model Prediction Example

SELECT *
FROM iris.test
TO PREDICT iris.predict.class
USING sqlflow_models.my_dnn_model;
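
When statements like the two above are generated from application code, it can help to assemble them from their parts. The following Python sketch does that with plain string formatting. The helper names `train_statement` and `predict_statement` are hypothetical and are not part of any SQLFlow client library.

```python
from typing import Dict, List


def train_statement(select: str, estimator: str, attrs: Dict[str, object],
                    columns: List[str], label: str, model: str) -> str:
    """Compose a TO TRAIN statement in the shape of the training example."""
    with_clause = ", ".join(f"{key} = {value}" for key, value in attrs.items())
    return (
        f"{select}\n"
        f"TO TRAIN {estimator}\n"
        f"WITH {with_clause}\n"
        f"COLUMN {', '.join(columns)}\n"
        f"LABEL {label}\n"
        f"INTO {model};"
    )


def predict_statement(select: str, result_column: str, model: str) -> str:
    """Compose a TO PREDICT statement in the shape of the prediction example."""
    return f"{select}\nTO PREDICT {result_column}\nUSING {model};"


if __name__ == "__main__":
    print(train_statement(
        "SELECT *\nFROM iris.train",
        "DNNClassifier",
        {"model.n_classes": 3, "model.hidden_units": [10, 20]},
        ["sepal_length", "sepal_width", "petal_length", "petal_width"],
        "class",
        "sqlflow_models.my_dnn_model",
    ))
```

The helpers perform no quoting or validation, so they are only suitable for trusted, programmatically chosen identifiers and values.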

Architecture Design

The overall architecture of SQLFlow has the following characteristics:

1. Scalability

  • Supports multiple SQL engines, rather than being specific to a particular SQL engine.
  • Does not build syntax extensions based on User Defined Functions (UDFs).
  • Supports complex machine learning models and toolkits.

2. Flexibility and Ease of Use

  • Flexible enough to configure and run cutting-edge algorithms.
  • Supports advanced features such as feature crosses.
  • Easy to learn, reducing the barrier to entry.

3. Distributed Execution

SQLFlow's output is an Argo workflow that runs in a distributed fashion on a Kubernetes cluster, which provides:

  • High availability
  • Horizontal scalability
  • Enterprise-grade deployment support
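
The mapping from a SQL program to a workflow can be pictured with a small Python sketch that turns each statement into one sequential step of an Argo Workflow manifest. This illustrates the general idea only and is not SQLFlow's code generator. The function name, the container image `example/sql-step:latest`, and the `run-sql` command are all hypothetical placeholders.

```python
def sql_program_to_workflow(program: str,
                            image: str = "example/sql-step:latest") -> dict:
    """Build an Argo Workflow manifest with one sequential step per statement."""
    # Naive split: a semicolon inside a string literal would break this.
    statements = [s.strip() for s in program.split(";") if s.strip()]
    step_templates = []
    steps = []
    for index, statement in enumerate(statements):
        name = f"step-{index}"
        step_templates.append({
            "name": name,
            # Hypothetical container that would execute a single statement.
            "container": {
                "image": image,
                "command": ["run-sql"],
                "args": [statement + ";"],
            },
        })
        # In Argo's `steps` syntax, each inner list is one sequential stage.
        steps.append([{"name": name, "template": name}])
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "sql-program-"},
        "spec": {
            "entrypoint": "main",
            "templates": [{"name": "main", "steps": steps}] + step_templates,
        },
    }
```

Serializing the returned dictionary to YAML or JSON yields a manifest in the shape Argo expects. What each step actually runs (plain SQL, training, or prediction) is left open here.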

Technical Advantages

1. Differences from Existing Solutions

Microsoft SQL Server: Provides machine learning services, but requires R or Python as external scripts.

Teradata SQL for DL: Provides RESTful services that can be called from extended SQL SELECT syntax.

Google BigQuery: Enables machine learning in SQL through the CREATE MODEL statement.

SQLFlow's advantages are:

  • Fully extensible solution
  • Compatible with multiple SQL engines
  • Supports complex machine learning models
  • No need to embed Python or R code in SQL statements

2. Workflow Integration

  • Compiles SQL programs into Kubernetes workflows
  • Supports Argo workflow orchestration
  • Cloud-native architecture design

Community and Ecosystem

Academic Support

Open Source Ecosystem

SQLFlow's open-source ecosystem includes:

  • Main project: sql-machine-learning/sqlflow
  • Python client: sql-machine-learning/pysqlflow
  • Zeppelin integration: sql-machine-learning/zeppelin-sqlflow
  • Official website: sql-machine-learning.github.io

Applicable Scenarios

SQLFlow is particularly suitable for the following scenarios:

  • Enterprise data analysis teams that want to lower the barrier to ML development.
  • Business scenarios that require machine learning directly in SQL queries.
  • Organizations that want to unify data processing and machine learning workflows.
  • Enterprises that need scalable, cloud-native ML solutions.

Summary

SQLFlow gives data professionals an easy-to-use machine learning platform by combining SQL and AI. It lowers the barrier to entry for machine learning and relies on a cloud-native architecture, built on Kubernetes and Argo workflows, for scalability and reliability. It is a good fit for organizations that want to integrate machine learning capabilities into their data workflows.