DuckDB: an efficient SQL development in Python

Posted on January 19, 2025


After a few weeks using DuckDb to develop product, particularly to handle complex data processing logic (e.g. multiple table join, condition, group, expand, cross-database operation, ...), I like it more and more. Before that I use psycopg2 to connect database and download to Pandas, then use Pandas to do data processing, and finally write processed data to database table. For simple data processing logic, it is easy. But when need doing complex processing logic and join tables from different database and database servers, it is becoming tedious and inefficient using Pandas. I would like a tool that can reduce the pain. That is DuckDB I find it.

DuckDB is quite similar to my previous Database dev environment DataWork of AliCloud. DuckDB is Python friendly. So I use it to develop my user profiling product, which is heavily data logic processing such as join, pivot, expand, group on many tables and cross database server. Besides sufficient inner functions, you can also self-define Python function, register, and use it in DuckDB SQL. It is quite amazing. Here I will show a few tips on connect to Postgres server, create table, and insert table examples.

Install

pip install duckdb

Connect to Postgres database

```con_local``` is the variable what you use to access hereafter

# install postgres adaptor
duckdb.sql("INSTALL postgres; LOAD postgres;")

# Set your Database secrets, create for each database connector using different secret_name. You set can ready only or write/read allowed

_secret_template = """
CREATE SECRET IF NOT EXISTS {secret_name} (
            TYPE POSTGRES,
            HOST '{host}',
            PORT {port},
            DATABASE {database},
            USER '{username}',
            PASSWORD '{password}'
        );
 ATTACH IF NOT EXISTS '' AS {conn_alias} (TYPE POSTGRES, SECRET {secret_name}, SCHEMA 'public', READ_ONLY);
        """

 _secret_template_wr = """CREATE SECRET IF NOT EXISTS {secret_name} (
            TYPE POSTGRES,
            HOST '{host}',
            PORT {port},
            DATABASE {database},
            USER '{username}',
            PASSWORD '{password}'
        );
 ATTACH IF NOT EXISTS '' AS {conn_alias} (TYPE POSTGRES, SECRET {secret_name}, SCHEMA 'public');
        """
 duckdb.sql(self._secret_template_wr.format(
            username = "postgres",
            password = "postgres",
            host = "localhost",
            port = 5432,
            database = "postgres",
            secret_name = "secret_local",
            conn_alias = "con_local",
            )
        )

Access the table in database

# Read table
# e.f. if a table, named user_accounts, in postgres database. Columns e.g. user_id, name, location

sql_str = """ SELECT user_id, name, location FROM con_local.user_accounts;"""
data = duckdb.sql(sql_str)

#write table
sql_str = """ INSERT INTO con_local.user_accounts
SELECT user_id, name, location FROM (VALUES (1,'a','a1'))
"""
data = duckdb.sql(sql_str)

Register a function

The following shows an example case of how to define custom function and call it in SQL to process user content.

def word_split(msg) -> str:
        words = set(map(lambda x: x.strip("-*[]\n"), msg.split(" ")))

        return ",".join(words)

duckdb.create_function("my_word_split", word_split, [VARCHAR], VARCHAR)

data = duckdb.sql("""
SELECT user_id,user_message
       my_word_split(user_message) AS word_list,
       created_at,
       current_date as updated_at
FROM con_local.user_content
""")

More detailed tutorial refer DuckDB official document.