
SQL is a standard language for storing, manipulating and retrieving data in relational databases.

Question Yone Moreno Jiménez · Jun 23, 2025

Hello, good morning, thank you so much for reading this question. ☺️🙂👍

We are developing code to get information about our Production's items: services, processes, and operations.

We know we can get various configurations of a given item: Category, Port, Enabled...

But we wonder how we could get the date and time of the last (most recent) message received by an item.

To give a code snippet, a small section of the code we have developed (and tested) looks like:

Question Eduard Lebedyuk · May 24, 2017

How do I write a DDL script for collection properties?

For example I want to create the following class:

Class SQLUser.Person {

Property Name As %String;

Property FavoriteColors As list Of %String;

}

My DDL script looks like this:

CREATE TABLE Person (Name varchar(50), FavoriteColors ???)
Article Guillaume Rongier · Jul 8, 2024 6m read


Description

This is a template for a Flask application that can be deployed in IRIS as a native Web Application.

Installation

  1. Clone the repository
  2. Create a virtual environment
  3. Install the requirements
  4. Run the docker-compose file
git clone
cd iris-flask-template
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
docker-compose up

Usage

The base URL is http://localhost:53795/flask/.

Endpoints

  • /iris - Returns a JSON object with the top 10 classes present in the IRISAPP namespace.
  • /interop - A ping endpoint to test the interoperability framework of IRIS.
  • /posts - A simple CRUD endpoint for a Post object.
  • /comments - A simple CRUD endpoint for a Comment object.

How to develop from this template

See WSGI introduction article: wsgi-introduction.

TL;DR: You can toggle the DEBUG flag in the Security portal so that changes are reflected in the application as you develop.

Code presentation

app.py

This is the main file of the application. It contains the Flask application and the endpoints.

from flask import Flask, jsonify, request
from models import Comment, Post, init_db

from grongier.pex import Director

import iris

app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'iris+emb://IRISAPP'

db = init_db(app)
  • from flask import Flask, jsonify, request: Import the Flask library.
  • from models import Comment, Post, init_db: Import the models and the database initialization function.
  • from grongier.pex import Director: Import the Director class to bind the flask app to the IRIS interoperability framework.
  • import iris: Import the IRIS library.
  • app = Flask(__name__): Create a Flask application.
  • app.config['SQLALCHEMY_DATABASE_URI'] = 'iris+emb://IRISAPP': Set the database URI to the IRISAPP namespace.
    • The iris+emb URI scheme is used to connect to IRIS as an embedded connection (no need for a separate IRIS instance); a network-style alternative is sketched just after this list.
  • db = init_db(app): Initialize the database with the Flask application.
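
As a side note (not part of the template), the sqlalchemy-iris dialect also accepts a network-style URI, so the same models could point at a remote IRIS instance instead of the embedded connection. The host, port, and credentials below are placeholders:

# Hypothetical alternative to the embedded connection above: a regular
# network connection through the sqlalchemy-iris dialect (placeholder
# credentials, host, and port).
app.config['SQLALCHEMY_DATABASE_URI'] = 'iris://_SYSTEM:SYS@localhost:1972/IRISAPP'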

models.py

This file contains the SQLAlchemy models for the application.

from dataclasses import dataclass
from typing import List
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

@dataclass
class Comment(db.Model):
    id:int = db.Column(db.Integer, primary_key=True)
    content:str = db.Column(db.Text)
    post_id:int = db.Column(db.Integer, db.ForeignKey('post.id'))

@dataclass
class Post(db.Model):
    __allow_unmapped__ = True
    id:int = db.Column(db.Integer, primary_key=True)
    title:str = db.Column(db.String(100))
    content:str = db.Column(db.Text)
    comments:List[Comment] = db.relationship('Comment', backref='post')

Not much to say here: the models are defined as dataclasses and are subclasses of the db.Model class.

The use of the __allow_unmapped__ attribute is necessary to allow the creation of the Post object without the comments attribute.

dataclasses are used to help with the serialization of the objects to JSON.
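
As a minimal illustration of that point (independent of IRIS, and assuming Flask 2.2+ where the default JSON provider knows how to serialize dataclasses), jsonify can turn a dataclass instance straight into JSON:

from dataclasses import dataclass
from flask import Flask, jsonify

demo_app = Flask(__name__)

@dataclass
class Demo:
    id: int
    title: str

with demo_app.app_context():
    # Prints {'id': 1, 'title': 'hello'}: the dataclass fields become JSON keys
    print(jsonify(Demo(id=1, title='hello')).get_json())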

The init_db function initializes the database with the Flask application.

def init_db(app):
    db.init_app(app)

    with app.app_context():
        db.drop_all()
        db.create_all()
        # Create fake data
        post1 = Post(title='Post The First', content='Content for the first post')
        ...
        db.session.add(post1)
        ...
        db.session.commit()
    return db
  • db.init_app(app): Initialize the database with the Flask application.
  • with app.app_context(): Create a context for the application.
  • db.drop_all(): Drop all the tables in the database.
  • db.create_all(): Create all the tables in the database.
  • Create fake data for the application.
  • return the database object.

/iris endpoint

######################
# IRIS Query example #
######################

@app.route('/iris', methods=['GET'])
def iris_query():
    query = "SELECT top 10 * FROM %Dictionary.ClassDefinition"
    rs = iris.sql.exec(query)
    # Convert the result to a list of dictionaries
    result = []
    for row in rs:
        result.append(row)
    return jsonify(result)

This endpoint executes a query on the IRIS database and returns the top 10 classes present in the IRISAPP namespace.

/interop endpoint

########################
# IRIS interop example #
########################
bs = Director.create_python_business_service('BS')

@app.route('/interop', methods=['GET', 'POST', 'PUT', 'DELETE'])
def interop():
    
    rsp = bs.on_process_input(request)

    return jsonify(rsp)

This endpoint is used to test the interoperability framework of IRIS. It creates a Business Service object and binds it to the Flask application.

NB : The bs object must be outside of the scope of the request to keep it alive.

  • bs = Director.create_python_business_service('BS'): Create a Business Service object named 'BS'.
  • rsp = bs.on_process_input(request): Call the on_process_input method of the Business Service object with the request object as an argument.

/posts endpoint

############################
# CRUD operations posts    #
############################

@app.route('/posts', methods=['GET'])
def get_posts():
    posts = Post.query.all()
    return jsonify(posts)

@app.route('/posts', methods=['POST'])
def create_post():
    data = request.get_json()
    post = Post(title=data['title'], content=data['content'])
    db.session.add(post)
    db.session.commit()
    return jsonify(post)

@app.route('/posts/<int:id>', methods=['GET'])
def get_post(id):
    ...

This endpoint is used to perform CRUD operations on the Post object.

Thanks to the dataclasses module, the Post object can be easily serialized to JSON.

Here we use the SQLAlchemy query method to get all the posts, and the add and commit methods to create a new post.
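
A quick way to exercise these endpoints from Python (a sketch assuming the docker-compose stack above is running, the base URL is http://localhost:53795/flask, and the requests package is installed):

import requests

BASE = 'http://localhost:53795/flask'

# Create a post, then list all posts; the new post should appear in the list.
created = requests.post(f'{BASE}/posts', json={'title': 'Hello', 'content': 'World'}).json()
print(created)
print(requests.get(f'{BASE}/posts').json())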

/comments endpoint

############################
# CRUD operations comments #
############################

@app.route('/comments', methods=['GET'])
def get_comments():
    comments = Comment.query.all()
    return jsonify(comments)

@app.route('/comments', methods=['POST'])
def create_comment():
    data = request.get_json()
    comment = Comment(content=data['content'], post_id=data['post_id'])
    db.session.add(comment)
    db.session.commit()
    return jsonify(comment)

@app.route('/comments/<int:id>', methods=['GET'])
def get_comment(id):
    ...

This endpoint is used to perform CRUD operations on the Comment object.

The Comment object is linked to the Post object by a foreign key.

Troubleshooting

How to run the Flask application in a standalone mode

You can always run a standalone Flask application with the following command:

python3 /irisdev/app/community/app.py

NB : You must be inside of the container to run this command.

docker exec -it iris-flask-template-iris-1 bash

Restart the application in IRIS

With DEBUG mode enabled, make multiple calls to the application, and the changes will be reflected.

How to access the IRIS Management Portal

You can access the IRIS Management Portal by going to http://localhost:53795/csp/sys/UtilHome.csp.

Run this template locally

For this you need to have IRIS installed on your machine.

Next you need to create a namespace named IRISAPP.

Install the requirements.

Install IoP :

#init iop
iop --init

# load production
iop -m /irisdev/app/community/interop/settings.py

# start production
iop --start Python.Production

Configure the application in the Security portal.

Article Harry Tong · Jun 6, 2025 2m read

If you're migrating from Oracle to InterSystems IRIS—like many of my customers—you may run into Oracle-specific SQL patterns that need translation.

Take this example:

SELECT (TO_DATE('2023-05-12','YYYY-MM-DD') - LEVEL + 1) AS gap_date
FROM dual
CONNECT BY LEVEL <= (TO_DATE('2023-05-12','YYYY-MM-DD') - TO_DATE('2023-05-02','YYYY-MM-DD') + 1);

In Oracle:

Question Andrew McCrevan · Jun 5, 2025

How would you go about creating an SQL Stored Procedure that would result in the same output as "IN LIKE"?
For example...
CustomersTable

RecordId CustomerName
123 Mark Stevens
456 Betty Johnson
789 John Stevens
321 Brian Smith
654 John Markson
987 Tom Obrian
select *
from CustomersTable
where inLike('%Mark%', '%John%')

Would return:

123 Mark Stevens
456 Betty Johnson
789 John Stevens
654 John Markson
Question Scott Roth · Jun 4, 2025

Does anyone have a query that I could run to show a vendor the time difference between when a message was sent out of a BO and when we received the HL7 ACK associated with that message?

I am trying to prove to this vendor the delay we are seeing in getting the ACK back because of a timeout.

I know how to pull Ens.MessageHeader, and EnsLib.HL7.Message but not sure how to match up the Message with the HL7 Acknowledgement received.

Question Kurro Lopez · May 28, 2025

Hi community.

I was wondering if it was possible to use something like EnsLib.SQL.InboundAdapter with tables in IRIS.

This library monitors when a record has been inserted into a table in an external database, so it requires a DSN to connect to that database.

My goal is to make a call to an external API that takes a long time; it could take nearly an hour (or more) to complete its processes, but I don't want to block the main process.

Question Fabio Care · May 27, 2025

I am testing vector search. While doing so, I am trying to paginate my result set for a "next page" function that gives me the first, second, and third 15 entries within a table.

For this I have two embedding classes: one with an HNSW index (vectornomicembedtextlatest), and one without (vectornomicembedtexttest).

Calling SELECT ID,PRIMKEY FROM SQLUser.vectornomicembedtexttest LIMIT 5 OFFSET 1 works fine with the first entry having the rowID of 486448. (I deleted old entries in the beginning and reused the table)

Announcement Andreas Schneider · May 25, 2025


I'm excited to announce a major update to SQL Data Lens – a powerful database client and metadata explorer – that opens up new, free possibilities for the InterSystems community.

SQL Data Lens is now completely FREE to use with InterSystems IRIS Community Edition!

No more “localhost only” restrictions
No more limits on the number of connections

No license? No problem.
You can now connect to InterSystems IRIS Community Edition—completely license-free—using the fully functional Free Edition of SQL DATA LENS. Explore all the features, no strings attached.
 

Article Benjamin De Boe · Nov 9, 2023 3m read

With the release of InterSystems IRIS Cloud SQL, we're getting more frequent questions about how to establish secure connections over JDBC and other driver technologies. While we have nice summary and detailed documentation on the driver technologies themselves, our documentation does not go as far as describing individual client tools, such as our personal favourite, DBeaver. In this article, we'll describe the steps to create a secure connection from DBeaver to your Cloud SQL deployment.

Question Martin Nielsen · May 2, 2025

Hi, I am unsure how to remove this restriction; when I am performing dynamic SQL using ##class(%SQL.Statement).%ExecDirectNoPriv(, .query, args...)

It works fine, but the moment I add specific properties from the persistent class I am selecting from into the WHERE clause, I get: ERROR #5540: SQLCODE: -99 Message: User UnknownUser is not privileged for the operation, despite using %ExecDirectNoPriv. I've tried with a prepared statement as well; exact same situation.

Question Ashok Kumar T · May 8, 2025

Hello Community,

I executed the below query in the %ZLANGC00.mac routine. It returns results when using dynamic SQL, but not with embedded SQL (it returns 0).

 &SQL(SELECT count(*) INTO :Cnt FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE='BASE TABLE')
 Write "SQLTablecount : "_Cnt,!
 Set tResultSet = ##class(%SQL.Statement).%ExecDirect(,"SELECT count(*) FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE='BASE TABLE'")
 Do tResultSet.%Display()

Thanks!

Article sween · Mar 4, 2024 8m read

If you are a customer of the new InterSystems IRIS® Cloud SQL and InterSystems IRIS® Cloud IntegratedML® cloud offerings and want access to the metrics of your deployments and send them to your own Observability platform, here is a quick and dirty way to get it done by sending the metrics to Google Cloud Platform Monitoring (formerly StackDriver).

Question MARK PONGONIS · Jul 17, 2018

Would like to know if there is an alternative or better way to paginate through a dataset using dynamic SQL than what I am using below. The problem is that as the potential pool of data gets larger, this code slows down to the point of not being usable. In analyzing each line of code below, it appears the slowdown is related to the initial rset.%Next() iteration. Is there anything available which does not require a subquery/%VID, such as a simple LIMIT/OFFSET?

My code is similar to :

s sql=##class(%SQL.Statement).%New()

Article Irene Mykhailova · Apr 21, 2025 3m read

Migrating from Oracle, MSSQL, or other purely relational database systems to a multimodel InterSystems IRIS is a strategic decision that requires careful planning and execution. While this transition offers significant benefits, including enhanced performance, scalability, and support for modern architectures, it also comes with challenges. In this article I will highlight some of the considerations connected to coding to ensure a successful migration. I will leave everything connected to an actual migration of structures and data outside the scope of this article.


First, when you're considering migrating to a different database system, you need to understand your business logic, whether it's on the side of the application (application server) or the database server. Basically, where do you have the SQL statements that you will potentially need to rewrite?

Question Scott Roth · Apr 28, 2025

I am struggling with returning a SQL Query Result that may have multiple rows that can be searched.

set query = "SELECT Facility FROM osuwmc_EnterpriseDirDB.RelationshipMedCtrID WHERE OSUmedcenterID = ?"
 SET rset = ##class(%SQL.Statement).%New()
 SET qStatus = rset.%Prepare(query)
 SET rset = rset.%Execute($Get(ID))
 do rset.%Display()

I need to take the values that are returned and, say, search them for a single value. How would I go about returning the EnsLib.SQL.Snapshot as an Array or List so it is searchable?

Article Irene Mykhailova · Apr 25, 2025 4m read

When working with InterSystems IRIS, database developers and architects often face a critical decision: whether to use Dynamic SQL or Embedded SQL for querying and updating data. Both methods have their unique strengths and use cases, but understanding their performance implications is essential to making the right choice. Response time, a key metric in evaluating application performance, can vary significantly depending on the SQL approach used. Dynamic SQL offers flexibility, as queries can be constructed and executed at runtime, making it ideal for scenarios with unpredictable or highly variable query needs. Conversely, Embedded SQL emphasizes stability and efficiency by integrating SQL code directly into application logic, offering optimized response times for predefined query patterns.

In this article, I will explore the response times when using these two types of SQL and how they depend on different class structures and usage of parameters. So to do this, I'm going to use the following classes from the diagram:

InterSystems Official Daniel Palevski · Apr 22, 2025

The 2024.1.4 and 2023.1.6 maintenance releases of InterSystems IRIS® data platform, InterSystems IRIS® for Health™, and HealthShare® Health Connect are now Generally Available (GA). These releases include the fixes for the recently issued alert: Alert: SQL Queries Returning Wrong Results | InterSystems. Please share your feedback through the Developer Community so we can build a better product together.

Documentation

You can find the detailed change lists & upgrade checklists on these pages:

Question Scott Roth · Apr 22, 2025

I was wondering if someone could help me. In the past I have been able to call external Stored Procedures through a SQL Outbound Connection and have them return me the EnsLib.SQL.Snapshot to use within a BPL to extract data.

But this time instead of using a SQL Outbound BO to make the Stored Procedure call, I decided to create a Linked Stored Procedure through the %JDBC_Server to point to the Stored Procedure out on MS SQL.

However, I am struggling to get the code just right to return the Column value from the Linked Stored Procedure.

Article José Pereira · Aug 2, 2024 28m read

An experiment on how to use the LangChain framework, IRIS Vector Search, and LLMs to generate IRIS-compatible SQL from user prompts.

This article was based on this notebook. You can run it in a ready-to-use environment with this application on OpenExchange.

Setup

First, we need to install the necessary libraries:

!pip install --upgrade --quiet langchain langchain-openai langchain-iris pandas

Next, we import the required modules and set up the environment:

import os
import datetime
import hashlib
from copy import deepcopy
from sqlalchemy import create_engine
import getpass
import pandas as pd
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.docstore.document import Document
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain.globals import set_llm_cache
from langchain.cache import SQLiteCache
from langchain_iris import IRISVector

We will use SQLiteCache to cache the LLM calls:

# Cache for LLM calls
set_llm_cache(SQLiteCache(database_path=".langchain.db"))

Set the IRIS database connection parameters:

# IRIS database connection parameters
os.environ["ISC_LOCAL_SQL_HOSTNAME"] = "localhost"
os.environ["ISC_LOCAL_SQL_PORT"] = "1972"
os.environ["ISC_LOCAL_SQL_NAMESPACE"] = "IRISAPP"
os.environ["ISC_LOCAL_SQL_USER"] = "_system"
os.environ["ISC_LOCAL_SQL_PWD"] = "SYS"

If the OpenAI API key is not already set in the environment, prompt the user to enter it:

if not "OPENAI_API_KEY" in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass()

Create the connection string for the IRIS database:

# IRIS database connection string
args = {
    'hostname': os.getenv("ISC_LOCAL_SQL_HOSTNAME"), 
    'port': os.getenv("ISC_LOCAL_SQL_PORT"), 
    'namespace': os.getenv("ISC_LOCAL_SQL_NAMESPACE"), 
    'username': os.getenv("ISC_LOCAL_SQL_USER"), 
    'password': os.getenv("ISC_LOCAL_SQL_PWD")
}
iris_conn_str = f"iris://{args['username']}:{args['password']}@{args['hostname']}:{args['port']}/{args['namespace']}"

Establish the connection to the IRIS database:

# Connection to IRIS database
engine = create_engine(iris_conn_str)
cnx = engine.connect().connection

Prepare a dictionary to hold context information for the system prompt:

# Dict for context information for system prompt
context = {}
context["top_k"] = 3

Prompt Creation

To transform user inputs into SQL queries compatible with the IRIS database, we need to create an effective prompt for the language model. We start with an initial prompt that provides basic instructions for generating SQL queries. This template is derived from LangChain's default prompts for MSSQL and customized for the IRIS database.

# Basic prompt template with IRIS database SQL instructions
iris_sql_template = """
You are an InterSystems IRIS expert. Given an input question, first create a syntactically correct InterSystems IRIS query to run and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the TOP clause as per InterSystems IRIS. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in single quotes ('') to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use CAST(CURRENT_DATE as date) function to get the current date, if the question involves "today".
Use double quotes to delimit columns identifiers.
Return just plain SQL; don't apply any kind of formatting.
"""

This basic prompt configures the language model (LLM) to function as an SQL expert with specific guidance for the IRIS database. Next, we provide an auxiliary prompt with information about the database schema to avoid hallucinations.

# SQL template extension for including tables context information
tables_prompt_template = """
Only use the following tables:
{table_info}
"""

To enhance the accuracy of the LLM's responses, we use a technique called few-shot prompting. This involves presenting some examples to the LLM.

# SQL template extension for including few shots
prompt_sql_few_shots_template = """
Below are a number of examples of questions and their corresponding SQL queries.

{examples_value}
"""

We define the template for few-shot examples:

# Few shots prompt template
example_prompt_template = "User input: {input}\nSQL query: {query}"
example_prompt = PromptTemplate.from_template(example_prompt_template)

We build the user prompt using the few-shot template:

# User prompt template
user_prompt = "\n" + example_prompt.invoke({"input": "{input}", "query": ""}).to_string()

Finally, we compose all prompts to create the final one:

# Complete prompt template
prompt = (
    ChatPromptTemplate.from_messages([("system", iris_sql_template)])
    + ChatPromptTemplate.from_messages([("system", tables_prompt_template)])
    + ChatPromptTemplate.from_messages([("system", prompt_sql_few_shots_template)])
    + ChatPromptTemplate.from_messages([("human", user_prompt)])
)
prompt

This prompt expects the variables examples_value, input, table_info, and top_k.

Here is how the prompt is structured:

ChatPromptTemplate(
    input_variables=['examples_value', 'input', 'table_info', 'top_k'], 
    messages=[
        SystemMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['top_k'], 
                template=iris_sql_template
            )
        ), 
        SystemMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['table_info'], 
                template=tables_prompt_template
            )
        ), 
        SystemMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['examples_value'], 
                template=prompt_sql_few_shots_template
            )
        ), 
        HumanMessagePromptTemplate(
            prompt=PromptTemplate(
                input_variables=['input'], 
                template=user_prompt
            )
        )
    ]
)

To visualize how the prompt will be sent to the LLM, we can use placeholder values for the required variables:

prompt_value = prompt.invoke({
    "top_k": "<top_k>",
    "table_info": "<table_info>",
    "examples_value": "<examples_value>",
    "input": "<input>"
})
print(prompt_value.to_string())
System: 
You are an InterSystems IRIS expert. Given an input question, first create a syntactically correct InterSystems IRIS query to run and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most <top_k> results using the TOP clause as per InterSystems IRIS. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in single quotes ('') to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use CAST(CURRENT_DATE as date) function to get the current date, if the question involves "today".
Use double quotes to delimit columns identifiers.
Return just plain SQL; don't apply any kind of formatting.

System: 
Only use the following tables:
<table_info>

System: 
Below are a number of examples of questions and their corresponding SQL queries.

<examples_value>

Human: 
User input: <input>
SQL query: 

Now, we are ready to send this prompt to the LLM by providing the necessary variables. Let's move on to the next step when you're ready.

Providing Table Information

To create accurate SQL queries, we need to provide the language model (LLM) with detailed information about the database tables. Without this information, the LLM might generate queries that seem plausible but are incorrect due to hallucinations. Therefore, our first step is to create a function that retrieves table definitions from the IRIS database.

Function to Retrieve Table Definitions

The following function queries the INFORMATION_SCHEMA to get the table definitions for a specified schema. If a specific table is provided, it retrieves the definition for that table; otherwise, it retrieves definitions for all tables in the schema.

def get_table_definitions_array(cnx, schema, table=None):
    cursor = cnx.cursor()

    # Base query to get columns information
    query = """
    SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE, COLUMN_DEFAULT, PRIMARY_KEY, null EXTRA
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_SCHEMA = %s
    """
    
    # Parameters for the query
    params = [schema]

    # Adding optional filters
    if table:
        query += " AND TABLE_NAME = %s"
        params.append(table)
    
    # Execute the query
    cursor.execute(query, params)

    # Fetch the results
    rows = cursor.fetchall()
    
    # Process the results to generate the table definition(s)
    table_definitions = {}
    for row in rows:
        table_schema, table_name, column_name, column_type, is_nullable, column_default, column_key, extra = row
        if table_name not in table_definitions:
            table_definitions[table_name] = []
        table_definitions[table_name].append({
            "column_name": column_name,
            "column_type": column_type,
            "is_nullable": is_nullable,
            "column_default": column_default,
            "column_key": column_key,
            "extra": extra
        })

    primary_keys = {}
    
    # Build the output string
    result = []
    for table_name, columns in table_definitions.items():
        table_def = f"CREATE TABLE {schema}.{table_name} (\n"
        column_definitions = []
        for column in columns:
            column_def = f"  {column['column_name']} {column['column_type']}"
            if column['is_nullable'] == "NO":
                column_def += " NOT NULL"
            if column['column_default'] is not None:
                column_def += f" DEFAULT {column['column_default']}"
            if column['extra']:
                column_def += f" {column['extra']}"
            column_definitions.append(column_def)
        if table_name in primary_keys:
            pk_def = f"  PRIMARY KEY ({', '.join(primary_keys[table_name])})"
            column_definitions.append(pk_def)
        table_def += ",\n".join(column_definitions)
        table_def += "\n);"
        result.append(table_def)

    return result

Retrieve Table Definitions for a Schema

For this example, we use the Aviation schema, which is available here.

# Retrieve table definitions for the Aviation schema
tables = get_table_definitions_array(cnx, "Aviation")
print(tables)

This function returns the CREATE TABLE statements for all tables in the Aviation schema:

[
    'CREATE TABLE Aviation.Aircraft (\n  Event bigint NOT NULL,\n  ID varchar NOT NULL,\n  AccidentExplosion varchar,\n  AccidentFire varchar,\n  AirFrameHours varchar,\n  AirFrameHoursSince varchar,\n  AirFrameHoursSinceLastInspection varchar,\n  AircraftCategory varchar,\n  AircraftCertMaxGrossWeight integer,\n  AircraftHomeBuilt varchar,\n  AircraftKey integer NOT NULL,\n  AircraftManufacturer varchar,\n  AircraftModel varchar,\n  AircraftRegistrationClass varchar,\n  AircraftSerialNo varchar,\n  AircraftSeries varchar,\n  Damage varchar,\n  DepartureAirportId varchar,\n  DepartureCity varchar,\n  DepartureCountry varchar,\n  DepartureSameAsEvent varchar,\n  DepartureState varchar,\n  DepartureTime integer,\n  DepartureTimeZone varchar,\n  DestinationAirportId varchar,\n  DestinationCity varchar,\n  DestinationCountry varchar,\n  DestinationSameAsLocal varchar,\n  DestinationState varchar,\n  EngineCount integer,\n  EvacuationOccurred varchar,\n  EventId varchar NOT NULL,\n  FlightMedical varchar,\n  FlightMedicalType varchar,\n  FlightPhase integer,\n  FlightPlan varchar,\n  FlightPlanActivated varchar,\n  FlightSiteSeeing varchar,\n  FlightType varchar,\n  GearType varchar,\n  LastInspectionDate timestamp,\n  LastInspectionType varchar,\n  Missing varchar,\n  OperationDomestic varchar,\n  OperationScheduled varchar,\n  OperationType varchar,\n  OperatorCertificate varchar,\n  OperatorCertificateNum varchar,\n  OperatorCode varchar,\n  OperatorCountry varchar,\n  OperatorIndividual varchar,\n  OperatorName varchar,\n  OperatorState varchar,\n  Owner varchar,\n  OwnerCertified varchar,\n  OwnerCountry varchar,\n  OwnerState varchar,\n  RegistrationNumber varchar,\n  ReportedToICAO varchar,\n  SeatsCabinCrew integer,\n  SeatsFlightCrew integer,\n  SeatsPassengers integer,\n  SeatsTotal integer,\n  SecondPilot varchar,\n  childsub bigint NOT NULL DEFAULT $i(^Aviation.EventC("Aircraft"))\n);',
    'CREATE TABLE Aviation.Crew (\n  Aircraft varchar NOT NULL,\n  ID varchar NOT NULL,\n  Age integer,\n  AircraftKey integer NOT NULL,\n  Category varchar,\n  CrewNumber integer NOT NULL,\n  EventId varchar NOT NULL,\n  Injury varchar,\n  MedicalCertification varchar,\n  MedicalCertificationDate timestamp,\n  MedicalCertificationValid varchar,\n  Seat varchar,\n  SeatbeltUsed varchar,\n  Sex varchar,\n  ShoulderHarnessUsed varchar,\n  ToxicologyTestPerformed varchar,\n  childsub bigint NOT NULL DEFAULT $i(^Aviation.AircraftC("Crew"))\n);',
    'CREATE TABLE Aviation.Event (\n  ID bigint NOT NULL DEFAULT $i(^Aviation.EventD),\n  AirportDirection integer,\n  AirportDistance varchar,\n  AirportElevation integer,\n  AirportLocation varchar,\n  AirportName varchar,\n  Altimeter varchar,\n  EventDate timestamp,\n  EventId varchar NOT NULL,\n  EventTime integer,\n  FAADistrictOffice varchar,\n  InjuriesGroundFatal integer,\n  InjuriesGroundMinor integer,\n  InjuriesGroundSerious integer,\n  InjuriesHighest varchar,\n  InjuriesTotal integer,\n  InjuriesTotalFatal integer,\n  InjuriesTotalMinor integer,\n  InjuriesTotalNone integer,\n  InjuriesTotalSerious integer,\n  InvestigatingAgency varchar,\n  LightConditions varchar,\n  LocationCity varchar,\n  LocationCoordsLatitude double,\n  LocationCoordsLongitude double,\n  LocationCountry varchar,\n  LocationSiteZipCode varchar,\n  LocationState varchar,\n  MidAir varchar,\n  NTSBId varchar,\n  NarrativeCause varchar,\n  NarrativeFull varchar,\n  NarrativeSummary varchar,\n  OnGroundCollision varchar,\n  SkyConditionCeiling varchar,\n  SkyConditionCeilingHeight integer,\n  SkyConditionNonCeiling varchar,\n  SkyConditionNonCeilingHeight integer,\n  TimeZone varchar,\n  Type varchar,\n  Visibility varchar,\n  WeatherAirTemperature integer,\n  WeatherPrecipitation varchar,\n  WindDirection integer,\n  WindDirectionIndicator varchar,\n  WindGust integer,\n  WindGustIndicator varchar,\n  WindVelocity integer,\n  WindVelocityIndicator varchar\n);'
]

With these table definitions, we can proceed to the next step, which is to integrate them into our prompt for the LLM. This ensures that the LLM has accurate and comprehensive information about the database schema when generating SQL queries.

Selecting the Most Relevant Tables

When working with databases, especially larger ones, sending the Data Definition Language (DDL) for all tables in a prompt can be impractical. While this approach might work for small databases, real-world databases often contain hundreds or thousands of tables, making it inefficient to process all of them.

Moreover, it’s unlikely that a language model needs to be aware of every table in the database to generate SQL queries effectively. To address this challenge, we can leverage semantic search capabilities to select only the most relevant tables based on the user's query.

Approach

We achieve this by using semantic search with IRIS Vector Search. Note that this method is most effective if your SQL element identifiers (such as tables, fields, and keys) have meaningful names. If your identifiers are arbitrary codes, consider using a data dictionary instead.

Steps

  1. Retrieve Table Information

First, extract the table definitions into a pandas DataFrame:

# Retrieve table definitions into a pandas DataFrame
table_def = get_table_definitions_array(cnx=cnx, schema='Aviation')
table_df = pd.DataFrame(data=table_def, columns=["col_def"])
table_df["id"] = table_df.index + 1
table_df

The DataFrame (table_df) will look something like this:

   col_def                                               id
0  CREATE TABLE Aviation.Aircraft (\n  Event bigi...    1
1  CREATE TABLE Aviation.Crew (\n  Aircraft varch...    2
2  CREATE TABLE Aviation.Event (\n  ID bigint NOT...    3
  2. Split Definitions into Documents

Next, split the table definitions into Langchain Documents. This step is crucial for handling large chunks of text and extracting text embeddings:

loader = DataFrameLoader(table_df, page_content_column="col_def")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20, separator="\n")
tables_docs = text_splitter.split_documents(documents)
tables_docs

The resulting tables_docs list contains split documents with metadata, like so:

[Document(metadata={'id': 1}, page_content='CREATE TABLE Aviation.Aircraft (\n  Event bigint NOT NULL,\n  ID varchar NOT NULL,\n  ...'),
 Document(metadata={'id': 2}, page_content='CREATE TABLE Aviation.Crew (\n  Aircraft varchar NOT NULL,\n  ID varchar NOT NULL,\n  ...'),
 Document(metadata={'id': 3}, page_content='CREATE TABLE Aviation.Event (\n  ID bigint NOT NULL DEFAULT $i(^Aviation.EventD),\n  ...')]
  3. Extract Embeddings and Store in IRIS

Now, use the IRISVector class from langchain-iris to extract embedding vectors and store them:

tables_vector_store = IRISVector.from_documents(
    embedding=OpenAIEmbeddings(), 
    documents=tables_docs,
    connection_string=iris_conn_str,
    collection_name="sql_tables",
    pre_delete_collection=True
)

Note: The pre_delete_collection flag is set to True for demonstration purposes to ensure a fresh collection in each test run. In a production environment, this flag should generally be set to False.

  4. Find Relevant Documents

With the table embeddings stored, you can now query for relevant tables based on user input:

input_query = "List the first 2 manufacturers"
relevant_tables_docs = tables_vector_store.similarity_search(input_query, k=3)
relevant_tables_docs

For example, querying for manufacturers might return:

[Document(metadata={'id': 1}, page_content='GearType varchar,\n  LastInspectionDate timestamp,\n  ...'),
 Document(metadata={'id': 1}, page_content='AircraftModel varchar,\n  AircraftRegistrationClass varchar,\n  ...'),
 Document(metadata={'id': 3}, page_content='LocationSiteZipCode varchar,\n  LocationState varchar,\n  ...')]

From the metadata, you can see that only table ID 1 (Aviation.Aircraft) is relevant, which aligns with the query.

  5. Handling Edge Cases

While this approach is generally effective, it may not always be perfect. For instance, querying for crash sites might also return less relevant tables:

input_query = "List the top 10 most crash sites"
relevant_tables_docs = tables_vector_store.similarity_search(input_query, k=3)
relevant_tables_docs

Results might include:

[Document(metadata={'id': 3}, page_content='LocationSiteZipCode varchar,\n  LocationState varchar,\n  ...'),
 Document(metadata={'id': 3}, page_content='InjuriesGroundSerious integer,\n  InjuriesHighest varchar,\n  ...'),
 Document(metadata={'id': 1}, page_content='CREATE TABLE Aviation.Aircraft (\n  Event bigint NOT NULL,\n  ID varchar NOT NULL,\n  ...')]

Despite retrieving the correct Aviation.Event table twice, the Aviation.Aircraft table may also appear, which could be improved with additional filtering or thresholding. This is beyond the scope of this example and will be left for future implementations.

  6. Define a Function to Retrieve Relevant Tables

To automate this process, define a function to filter and return the relevant tables based on user input:

def get_relevant_tables(user_input, tables_vector_store, table_df):
    relevant_tables_docs = tables_vector_store.similarity_search(user_input)
    relevant_tables_docs_indices = [x.metadata["id"] for x in relevant_tables_docs]
    indices = table_df["id"].isin(relevant_tables_docs_indices)
    relevant_tables_array = [x for x in table_df[indices]["col_def"]]
    return relevant_tables_array

This function will help in efficiently retrieving only the relevant tables to send to the LLM, reducing the prompt length and improving overall query performance.
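
If the edge case discussed above becomes a problem, one possible refinement is to filter on the similarity score. The following is only a sketch: it assumes IRISVector exposes LangChain's standard similarity_search_with_score method returning (Document, score) pairs where the score is a distance (lower means more relevant), and the max_distance cutoff is a hypothetical value to be tuned.

def get_relevant_tables_filtered(user_input, tables_vector_store, table_df, k=4, max_distance=0.5):
    # Keep only tables whose best-matching chunk is close enough to the query
    docs_and_scores = tables_vector_store.similarity_search_with_score(user_input, k=k)
    kept_ids = {doc.metadata["id"] for doc, score in docs_and_scores if score <= max_distance}
    indices = table_df["id"].isin(kept_ids)
    return [x for x in table_df[indices]["col_def"]]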

Selecting the Most Relevant Examples (Few-Shot Prompting)

When working with language models (LLMs), providing them with relevant examples helps ensure accurate and contextually appropriate responses. These examples, referred to as "few-shot" examples, guide the LLM in understanding the structure and context of the queries it should handle.

In our case, we need to populate the examples_value variable with a diverse set of SQL queries that cover a broad spectrum of IRIS SQL syntax and the tables available in the database. This helps prevent the LLM from generating incorrect or irrelevant queries.

Defining Example Queries

Below is a list of example queries designed to illustrate various SQL operations:

examples = [
    {"input": "List all aircrafts.", "query": "SELECT * FROM Aviation.Aircraft"},
    {"input": "Find all incidents for the aircraft with ID 'N12345'.", "query": "SELECT * FROM Aviation.Event WHERE EventId IN (SELECT EventId FROM Aviation.Aircraft WHERE ID = 'N12345')"},
    {"input": "List all incidents in the 'Commercial' operation type.", "query": "SELECT * FROM Aviation.Event WHERE EventId IN (SELECT EventId FROM Aviation.Aircraft WHERE OperationType = 'Commercial')"},
    {"input": "Find the total number of incidents.", "query": "SELECT COUNT(*) FROM Aviation.Event"},
    {"input": "List all incidents that occurred in 'Canada'.", "query": "SELECT * FROM Aviation.Event WHERE LocationCountry = 'Canada'"},
    {"input": "How many incidents are associated with the aircraft with AircraftKey 5?", "query": "SELECT COUNT(*) FROM Aviation.Aircraft WHERE AircraftKey = 5"},
    {"input": "Find the total number of distinct aircrafts involved in incidents.", "query": "SELECT COUNT(DISTINCT AircraftKey) FROM Aviation.Aircraft"},
    {"input": "List all incidents that occurred after 5 PM.", "query": "SELECT * FROM Aviation.Event WHERE EventTime > 1700"},
    {"input": "Who are the top 5 operators by the number of incidents?", "query": "SELECT TOP 5 OperatorName, COUNT(*) AS IncidentCount FROM Aviation.Aircraft GROUP BY OperatorName ORDER BY IncidentCount DESC"},
    {"input": "Which incidents occurred in the year 2020?", "query": "SELECT * FROM Aviation.Event WHERE YEAR(EventDate) = '2020'"},
    {"input": "What was the month with most events in the year 2020?", "query": "SELECT TOP 1 MONTH(EventDate) EventMonth, COUNT(*) EventCount FROM Aviation.Event WHERE YEAR(EventDate) = '2020' GROUP BY MONTH(EventDate) ORDER BY EventCount DESC"},
    {"input": "How many crew members were involved in incidents?", "query": "SELECT COUNT(*) FROM Aviation.Crew"},
    {"input": "List all incidents with detailed aircraft information for incidents that occurred in the year 2012.", "query": "SELECT e.EventId, e.EventDate, a.AircraftManufacturer, a.AircraftModel, a.AircraftCategory FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2012"},
    {"input": "Find all incidents where there were more than 5 injuries and include the aircraft manufacturer and model.", "query": "SELECT e.EventId, e.InjuriesTotal, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE e.InjuriesTotal > 5"},
    {"input": "List all crew members involved in incidents with serious injuries, along with the incident date and location.", "query": "SELECT c.CrewNumber AS 'Crew Number', c.Age, c.Sex AS Gender, e.EventDate AS 'Event Date', e.LocationCity AS 'Location City', e.LocationState AS 'Location State' FROM Aviation.Crew c JOIN Aviation.Event e ON c.EventId = e.EventId WHERE c.Injury = 'Serious'"}
]

Selecting Relevant Examples

Given the ever-expanding list of examples, it’s impractical to provide the LLM with all of them. Instead, we use IRIS Vector Search along with the SemanticSimilarityExampleSelector class to identify the most relevant examples based on user prompts.

Define the Example Selector:

example_selector = SemanticSimilarityExampleSelector.from_examples(
    examples,
    OpenAIEmbeddings(),
    IRISVector,
    k=5,
    input_keys=["input"],
    connection_string=iris_conn_str,
    collection_name="sql_samples",
    pre_delete_collection=True
)

Note: The pre_delete_collection flag is used here for demonstration purposes to ensure a fresh collection in each test run. In a production environment, this flag should be set to False to avoid unnecessary deletions.

Query the Selector:

To find the most relevant examples for a given input, use the selector as follows:

input_query = "Find all events in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model."
relevant_examples = example_selector.select_examples({"input": input_query})

The results might look like this:

[{'input': 'List all incidents with detailed aircraft information for incidents that occurred in the year 2012.', 'query': 'SELECT e.EventId, e.EventDate, a.AircraftManufacturer, a.AircraftModel, a.AircraftCategory FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2012'},
 {'input': "Find all incidents for the aircraft with ID 'N12345'.", 'query': "SELECT * FROM Aviation.Event WHERE EventId IN (SELECT EventId FROM Aviation.Aircraft WHERE ID = 'N12345')"},
 {'input': 'Find all incidents where there were more than 5 injuries and include the aircraft manufacturer and model.', 'query': 'SELECT e.EventId, e.InjuriesTotal, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE e.InjuriesTotal > 5'},
 {'input': 'List all aircrafts.', 'query': 'SELECT * FROM Aviation.Aircraft'},
 {'input': 'Find the total number of distinct aircrafts involved in incidents.', 'query': 'SELECT COUNT(DISTINCT AircraftKey) FROM Aviation.Aircraft'}]

If you specifically need examples related to quantities, you can query the selector accordingly:

input_query = "What is the number of incidents involving Boeing aircraft."
quantity_examples = example_selector.select_examples({"input": input_query})

The output may be:

[{'input': 'How many incidents are associated with the aircraft with AircraftKey 5?', 'query': 'SELECT COUNT(*) FROM Aviation.Aircraft WHERE AircraftKey = 5'},
 {'input': 'Find the total number of distinct aircrafts involved in incidents.', 'query': 'SELECT COUNT(DISTINCT AircraftKey) FROM Aviation.Aircraft'},
 {'input': 'How many crew members were involved in incidents?', 'query': 'SELECT COUNT(*) FROM Aviation.Crew'},
 {'input': 'Find all incidents where there were more than 5 injuries and include the aircraft manufacturer and model.', 'query': 'SELECT e.EventId, e.InjuriesTotal, a.AircraftManufacturer, a.AircraftModel FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE e.InjuriesTotal > 5'},
 {'input': 'List all incidents with detailed aircraft information for incidents that occurred in the year 2012.', 'query': 'SELECT e.EventId, e.EventDate, a.AircraftManufacturer, a.AircraftModel, a.AircraftCategory FROM Aviation.Event e JOIN Aviation.Aircraft a ON e.EventId = a.EventId WHERE Year(e.EventDate) = 2012'}]

This output includes examples that specifically address counting and quantities.

Future Considerations

While the SemanticSimilarityExampleSelector is powerful, it’s important to note that not all selected examples may be perfect. Future improvements may involve adding filters or thresholds to exclude less relevant results, ensuring that only the most appropriate examples are provided to the LLM.
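
As a rough sketch of what such a threshold could look like (assumptions: the selector keeps its vector store in the vectorstore attribute, that store implements similarity_search_with_score returning distances where lower is better, and the 0.4 cutoff is a hypothetical value):

def select_examples_with_cutoff(example_selector, user_input, k=5, max_distance=0.4):
    # Query the underlying store directly so the scores are visible,
    # then keep only the examples that are close enough to the input.
    docs_and_scores = example_selector.vectorstore.similarity_search_with_score(user_input, k=k)
    return [dict(doc.metadata) for doc, score in docs_and_scores if score <= max_distance]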

Accuracy Test

To assess the performance of the prompt and SQL query generation, we need to set up and run a series of tests. The goal is to evaluate how well the LLM generates SQL queries based on user inputs, with and without the use of example-based few shots.

Function to Generate SQL Queries

We start by defining a function that uses the LLM to generate SQL queries based on the provided context, prompt, user input, and other parameters:

def get_sql_from_text(context, prompt, user_input, use_few_shots, tables_vector_store, table_df, example_selector=None, example_prompt=None):
    relevant_tables = get_relevant_tables(user_input, tables_vector_store, table_df)
    context["table_info"] = "\n\n".join(relevant_tables)

    examples = example_selector.select_examples({"input": user_input}) if example_selector else []
    context["examples_value"] = "\n\n".join([
        example_prompt.invoke(x).to_string() for x in examples
    ])
    
    model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
    output_parser = StrOutputParser()
    chain_model = prompt | model | output_parser
    
    response = chain_model.invoke({
        "top_k": context["top_k"],
        "table_info": context["table_info"],
        "examples_value": context["examples_value"],
        "input": user_input
    })
    return response

Execute the Prompt

Test the prompt with and without examples:

# Prompt execution **with** few shots
input = "Find all events in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model."
response_with_few_shots = get_sql_from_text(
    context, 
    prompt, 
    user_input=input, 
    use_few_shots=True, 
    tables_vector_store=tables_vector_store, 
    table_df=table_df,
    example_selector=example_selector, 
    example_prompt=example_prompt,
)
print(response_with_few_shots)
SELECT e.EventId, e.EventDate, e.LocationCity, e.LocationState, a.AircraftManufacturer, a.AircraftModel
FROM Aviation.Event e
JOIN Aviation.Aircraft a ON e.EventId = a.EventId
WHERE Year(e.EventDate) = 2010
# Prompt execution **without** few shots
input = "Find all events in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model."
response_with_no_few_shots = get_sql_from_text(
    context, 
    prompt, 
    user_input=input, 
    use_few_shots=False, 
    tables_vector_store=tables_vector_store, 
    table_df=table_df,
)
print(response_with_no_few_shots)
SELECT TOP 3 "EventId", "EventDate", "LocationCity", "LocationState", "AircraftManufacturer", "AircraftModel"
FROM Aviation.Event e
JOIN Aviation.Aircraft a ON e.ID = a.Event
WHERE e.EventDate >= '2010-01-01' AND e.EventDate < '2011-01-01'

Utility Functions for Testing

To test the generated SQL queries, we define some utility functions:

def execute_sql_query(cnx, query):
    try:
        cursor = cnx.cursor()
        cursor.execute(query)
        rows = cursor.fetchall()
        return rows
    except:
        print('Error running query:')
        print(query)
        print('-'*80)
    return None

def sql_result_equals(cnx, query, expected):
    rows = execute_sql_query(cnx, query)
    result = [set(row._asdict().values()) for row in rows or []]
    if result != expected and rows is not None:
        print('Result not as expected for query:')
        print(query)
        print('-'*80)
    return result == expected
# SQL test for prompt **with** few shots
print("SQL is OK" if not execute_sql_query(cnx, response_with_few_shots) is None else "SQL is not OK")
    SQL is OK
# SQL test for prompt **without** few shots
print("SQL is OK" if not execute_sql_query(cnx, response_with_no_few_shots) is None else "SQL is not OK")
    error on running query: 
    SELECT TOP 3 "EventId", "EventDate", "LocationCity", "LocationState", "AircraftManufacturer", "AircraftModel"
    FROM Aviation.Event e
    JOIN Aviation.Aircraft a ON e.ID = a.Event
    WHERE e.EventDate >= '2010-01-01' AND e.EventDate < '2011-01-01'
    --------------------------------------------------------------------------------
    SQL is not OK

Define and Execute Tests

Define a set of test cases and run them:

tests = [{
    "input": "What were the top 3 years with the most recorded events?",
    "expected": [{128, 2003}, {122, 2007}, {117, 2005}]
},{
    "input": "How many incidents involving Boeing aircraft.",
    "expected": [{5}]
},{
    "input": "How many incidents that resulted in fatalities.",
    "expected": [{237}]
},{
    "input": "List event Id and date and, crew number, age and gender for incidents that occurred in 2013.",
    "expected": [{1, datetime.datetime(2013, 3, 4, 11, 6), '20130305X71252', 59, 'M'},
                 {1, datetime.datetime(2013, 1, 1, 15, 0), '20130101X94035', 32, 'M'},
                 {2, datetime.datetime(2013, 1, 1, 15, 0), '20130101X94035', 35, 'M'},
                 {1, datetime.datetime(2013, 1, 12, 15, 0), '20130113X42535', 25, 'M'},
                 {2, datetime.datetime(2013, 1, 12, 15, 0), '20130113X42535', 34, 'M'},
                 {1, datetime.datetime(2013, 2, 1, 15, 0), '20130203X53401', 29, 'M'},
                 {1, datetime.datetime(2013, 2, 15, 15, 0), '20130218X70747', 27, 'M'},
                 {1, datetime.datetime(2013, 3, 2, 15, 0), '20130303X21011', 49, 'M'},
                 {1, datetime.datetime(2013, 3, 23, 13, 52), '20130326X85150', 'M', None}]
},{
    "input": "Find the total number of incidents that occurred in the United States.",
    "expected": [{1178}]
},{
    "input": "List all incidents latitude and longitude coordinates with more than 5 injuries that occurred in 2010.",
    "expected": [{-78.76833333333333, 43.25277777777778}]
},{
    "input": "Find all incidents in 2010 informing the Event Id and date, location city and state, aircraft manufacturer and model.",
    "expected": [
        {datetime.datetime(2010, 5, 20, 13, 43), '20100520X60222', 'CIRRUS DESIGN CORP', 'Farmingdale', 'New York', 'SR22'},
        {datetime.datetime(2010, 4, 11, 15, 0), '20100411X73253', 'CZECH AIRCRAFT WORKS SPOL SRO', 'Millbrook', 'New York', 'SPORTCRUISER'},
        {'108', datetime.datetime(2010, 1, 9, 12, 55), '20100111X41106', 'Bayport', 'New York', 'STINSON'},
        {datetime.datetime(2010, 8, 1, 14, 20), '20100801X85218', 'A185F', 'CESSNA', 'New York', 'Newfane'}
    ]
}]

Accuracy Evaluation

Run the tests and calculate the accuracy:

def execute_tests(cnx, context, prompt, use_few_shots, tables_vector_store, table_df, example_selector, example_prompt):
    tests_generated_sql = [(x, get_sql_from_text(
            context, 
            prompt, 
            user_input=x['input'], 
            use_few_shots=use_few_shots, 
            tables_vector_store=tables_vector_store, 
            table_df=table_df,
            example_selector=example_selector if use_few_shots else None, 
            example_prompt=example_prompt if use_few_shots else None,
        )) for x in deepcopy(tests)]
    
    tests_sql_executions = [(x[0], sql_result_equals(cnx, x[1], x[0]['expected'])) 
                            for x in tests_generated_sql]
    
    accuracy = sum(1 for i in tests_sql_executions if i[1] == True) / len(tests_sql_executions)
    print(f'Accuracy: {accuracy}')
    print('-'*80)

Results

# Accuracy tests for prompts executed **without** few shots
use_few_shots = False
execute_tests(
    cnx,
    context, 
    prompt, 
    use_few_shots, 
    tables_vector_store, 
    table_df, 
    example_selector, 
    example_prompt
)
    error on running query: 
    SELECT "EventDate", COUNT("EventId") as "TotalEvents"
    FROM Aviation.Event
    GROUP BY "EventDate"
    ORDER BY "TotalEvents" DESC
    TOP 3;
    --------------------------------------------------------------------------------
    error on running query: 
    SELECT "EventId", "EventDate", "C"."CrewNumber", "C"."Age", "C"."Sex"
    FROM "Aviation.Event" AS "E"
    JOIN "Aviation.Crew" AS "C" ON "E"."ID" = "C"."EventId"
    WHERE "E"."EventDate" >= '2013-01-01' AND "E"."EventDate" < '2014-01-01'
    --------------------------------------------------------------------------------
    result not expected for query: 
    SELECT TOP 3 "e"."EventId", "e"."EventDate", "e"."LocationCity", "e"."LocationState", "a"."AircraftManufacturer", "a"."AircraftModel"
    FROM "Aviation"."Event" AS "e"
    JOIN "Aviation"."Aircraft" AS "a" ON "e"."ID" = "a"."Event"
    WHERE "e"."EventDate" >= '2010-01-01' AND "e"."EventDate" < '2011-01-01'
    --------------------------------------------------------------------------------
    accuracy: 0.5714285714285714
    --------------------------------------------------------------------------------
# Accuracy tests for prompts executed **with** few shots
use_few_shots = True
execute_tests(
    cnx,
    context, 
    prompt, 
    use_few_shots, 
    tables_vector_store, 
    table_df, 
    example_selector, 
    example_prompt
)
    error on running query: 
    SELECT e.EventId, e.EventDate, e.LocationCity, e.LocationState, a.AircraftManufacturer, a.AircraftModel
    FROM Aviation.Event e
    JOIN Aviation.Aircraft a ON e.EventId = a.EventId
    WHERE Year(e.EventDate) = 2010 TOP 3
    --------------------------------------------------------------------------------
    accuracy: 0.8571428571428571
    --------------------------------------------------------------------------------

Conclusion

The accuracy of SQL queries generated with examples (few shots) is approximately 49% higher compared to those generated without examples (85% vs. 57%).
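
For reference, the test suite above contains seven cases, so these figures correspond to 4/7 ≈ 0.571 correct queries without few-shot examples and 6/7 ≈ 0.857 with them.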

References
