Xerini: Minimally orchestrates `SQL` scripts¶

Toy Example¶

Suppose you have the following script in a file called simple.sql

drop table if exists my_data;
create table my_data as
    select name, age, gender
    from demographics;

You can use sciurus to write the following python script:

In[1]:
from xerini.script import Script

In[2]: code = Script.from_file("simple.sql")

In[3]:
for stmt in code.statements:
    ...: print(f"the set {stmt.source_tables} sends data to {stmt.affected_table}.")
    ...:
the
set
set()
sends
data
to
my_data.
the
set
{'demographics'}
sends
data
to
my_data.

For a given SQL script, sciurus will split it into statements, and for each statement identify the sources tables and the affected tables.
Running

In [4]: dot_file = code.write_dot("simple.dot")

In [5]: svg_file = code.write_svg(dot_file)

will produce

`simple.dot.svg`

Another Example¶

We can do the same thing using a slightly more elaborated script generated by GoogleGemini:

-- Large CTAS (CREATE TABLE AS SELECT) SQL Script Example: Data Warehouse Transformation

-- This script simulates a complex data transformation and aggregation for a data warehouse.
-- It involves multiple joins, aggregations, window functions, and data type conversions.

-- Assuming we have source tables:
--   - sales_transactions (transaction_id, customer_id, product_id, sale_date, sale_amount, quantity)
--   - customers (customer_id, customer_name, region, signup_date)
--   - products (product_id, product_name, category, price)
--   - date_dim (date_id, full_date, year, month, day, day_of_week)

CREATE TABLE aggregated_sales_summary AS
SELECT
    dd.year,
    dd.month,
    c.region,
    p.category,
    SUM(st.sale_amount) AS total_sales,
    AVG(st.sale_amount) AS average_sale,
    COUNT(DISTINCT st.customer_id) AS unique_customers,
    COUNT(st.transaction_id) AS total_transactions,
    SUM(st.quantity) AS total_quantity,
    MAX(st.sale_date) AS latest_sale_date,
    MIN(st.sale_date) AS earliest_sale_date,
    -- Window functions for ranking and running totals
    RANK() OVER (PARTITION BY dd.year, dd.month, c.region ORDER BY SUM(st.sale_amount) DESC) AS sales_rank,
    SUM(SUM(st.sale_amount)) OVER (PARTITION BY dd.year, c.region ORDER BY dd.month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total_sales,
    -- Calculated columns
    (SUM(st.sale_amount) / COUNT(DISTINCT st.customer_id)) AS average_sales_per_customer,
    CAST(AVG(p.price) AS DECIMAL(10, 2)) AS average_product_price,
    COUNT(DISTINCT c.customer_id) FILTER (WHERE c.signup_date >= DATE_TRUNC('year', dd.full_date)) AS new_customers_this_year,
    --conditional aggregation
    SUM(case when p.category = 'Electronics' then st.sale_amount else 0 end) as electronics_sales,
    SUM(case when p.category = 'Clothing' then st.sale_amount else 0 end) as clothing_sales
FROM
    sales_transactions st
JOIN
    customers c ON st.customer_id = c.customer_id
JOIN
    products p ON st.product_id = p.product_id
JOIN
    date_dim dd ON st.sale_date = dd.full_date
WHERE
    dd.year >= EXTRACT(YEAR FROM CURRENT_DATE) - 3 -- Example: Analyze last 3 years
    AND st.sale_date BETWEEN '2020-01-01' and '2024-12-31'
GROUP BY
    dd.year, dd.month, c.region, p.category
ORDER BY
    dd.year, dd.month, c.region, p.category;

-- Optionally, add indexes or constraints to the newly created table:
-- CREATE INDEX idx_sales_summary_year_month ON aggregated_sales_summary (year, month);
-- ALTER TABLE aggregated_sales_summary ADD CONSTRAINT pk_sales_summary PRIMARY KEY (year, month, region, category);

-- Example of a second CTAS table, building on the first.
CREATE TABLE monthly_region_sales_summary AS
SELECT
    year,
    month,
    region,
    SUM(total_sales) as total_regional_sales,
    AVG(average_sale) as avg_regional_sale,
    COUNT(category) as number_of_categories
FROM aggregated_sales_summary
GROUP BY year, month, region
ORDER BY year, month, region;

-- Example of a third CTAS table, creating a customer level summary.
CREATE TABLE customer_sales_summary AS
SELECT
    c.customer_id,
    c.customer_name,
    c.region,
    min(st.sale_date) as first_purchase,
    max(st.sale_date) as last_purchase,
    sum(st.sale_amount) as total_customer_sales,
    count(st.transaction_id) as total_customer_transactions
FROM sales_transactions st
JOIN customers c on st.customer_id = c.customer_id
group by c.customer_id, c.customer_name, c.region;

In[1]:
from xerini.script import Script

In[2]: code = Script.from_file("gemini_example.sql")

In[3]: dot_file = code.write_dot("gemini_example.dot")

In[4]: svg_file = code.write_svg(dot_file)

In[5]: code.keys()
Out[5]:
{'aggregated_sales_summary',
 'customer_sales_summary',
 'monthly_region_sales_summary'}

In[6]:
for stmt in code.statements:
    ...: print(f"the set {stmt.source_tables} sends data to {stmt.affected_table}.")
    ...:
the
set
{'sales_transactions', 'customers', 'date_dim', 'products'}
sends
data
to
aggregated_sales_summary.
the
set
{'aggregated_sales_summary'}
sends
data
to
monthly_region_sales_summary.
the
set
{'sales_transactions', 'customers'}
sends
data
to
customer_sales_summary

`gemini_example.dot.svg`

The Script object quacks like a python dictionary in which thekeys are the tables produced by the statements in the script,

In [7]: code.keys()
Out[7]:
{'aggregated_sales_summary',
 'customer_sales_summary',
 'monthly_region_sales_summary'}

and the values are the lists of statements affecting the key-table.

In [8]: print(code['customer_sales_summary'][0])
-- Example of a third CTAS table, creating a customer level summary.
CREATE TABLE customer_sales_summary AS
SELECT
    c.customer_id,
    c.customer_name,
    c.region,
    min(st.sale_date) as first_purchase,
    max(st.sale_date) as last_purchase,
    sum(st.sale_amount) as total_customer_sales,
    count(st.transaction_id) as total_customer_transactions
FROM sales_transactions st
JOIN customers c on st.customer_id = c.customer_id
group by c.customer_id, c.customer_name, c.region;

Finally, the Script object has a method called stage_decomposition that allows you to break the SQL script into stages, i.e. subsets of the SQL script that can be executed concurrently.

In [9]: stages = code.stage_decomposition()

In [10]: print(stages)
[{'customer_sales_summary', 'aggregated_sales_summary'}, {'monthly_region_sales_summary'}]

Code Structure¶

Logo	dependency

Below is some details on the nature of the involved objects:

Class Diagrams

Sciurus is a python library for dealing with large sets of SQL scripts.

Xerini: Minimally orchestrates SQL scripts¶

Toy Example¶

Another Example¶

Code Structure¶

Xerini: Minimally orchestrates `SQL` scripts¶