What happens?
Note: This was tested using DuckDB with several extensions loaded (cache_httpfs, parquet, DuckLake, observefs).
After a change in which we started using prepared statements (our queries include user input and we wanted to protect our data against SQL injection attacks), we noticed a severe slowdown in our system. Tracing the issue, we found that our DuckDB SQL statements were about 10 times slower after the patch. After testing the same queries with and without params, our conclusion is that a query run with params is roughly ten times slower than the same query run as plain SQL.
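For reference, the two call shapes being compared look roughly like the sketch below. The S3 path and column name are placeholders (the real query cannot be shared), but the structure is the same: the slow variant binds the user-supplied values through a params dictionary with named $ parameters.
# illustrative only; the parquet path and column name are placeholders
# 1) plain SQL with the value already interpolated
duckdbConn.sql("SELECT * FROM read_parquet('s3://my-bucket/data.parquet') WHERE customer_id = 42").fetchnumpy()
# 2) the same query as a prepared statement with a named parameter (the slow variant)
duckdbConn.sql("SELECT * FROM read_parquet('s3://my-bucket/data.parquet') WHERE customer_id = $customer_id",
               params={"customer_id": 42}).fetchnumpy()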
To Reproduce
import os
import duckdb

# DuckDB connection creation ("settings" below refers to our application configuration)
self.connections[thread_id] = duckdb.connect(database)
self.connections[thread_id].install_extension("parquet")
self.connections[thread_id].install_extension("observefs", repository="community")
self.connections[thread_id].install_extension("cache_httpfs", repository="community")
self.connections[thread_id].install_extension("ducklake")
self.connections[thread_id].install_extension("postgres")
self.connections[thread_id].load_extension("ducklake")
self.connections[thread_id].load_extension("cache_httpfs")
self.connections[thread_id].load_extension("observefs")
self.connections[thread_id].load_extension("postgres")
self.connections[thread_id].execute("SET cache_httpfs_type='on_disk'")
self.connections[thread_id].execute("SET cache_httpfs_cache_directory='/tmp/general_cache_httpfs/'")
self.connections[thread_id].execute("SET cache_httpfs_disk_cache_reader_enable_memory_cache=true")
self.connections[thread_id].execute("SET enable_external_file_cache=true")
self.connections[thread_id].execute("SET enable_http_metadata_cache=true")
self.connections[thread_id].load_extension("parquet")
self.connections[thread_id].execute("CREATE OR REPLACE SECRET secret (" \
+"TYPE s3," \
+"PROVIDER config," \
+"KEY_ID '" + os.environ.get("AWS_S3_ACCESS_KEY_ID", "") + "'," \
+"SECRET '" + os.environ.get("AWS_S3_SECRET_ACCESS_KEY", "") + "'," \
+"REGION '" + os.environ.get("AWS_REGION", "us-east-1") + "'" \
+")"
)
self.connections[thread_id].execute("CREATE OR REPLACE SECRET ducklake_postgres (" \
+"TYPE postgres," \
+"HOST '" + settings.DATABASES['default']['HOST'] + "'," \
+"DATABASE '" + settings.DATABASES['default']['NAME'] + "'," \
+"USER '" + settings.DATABASES['default']['USER'] + "'," \
+"PASSWORD '" + settings.DATABASES['default']['PASSWORD'] + "'," \
+"PORT " + settings.DATABASES['default']['PORT'] \
+")"
)
self.connections[thread_id].execute("CREATE OR REPLACE SECRET columns_ducklake (" \
+"TYPE ducklake," \
+f"DATA_PATH 's3://" + settings.AWS_STORAGE_BUCKET_NAME + f"/{os.environ.get("DUCKLAKE_FOLDER", "ducklake")}/'," \
+"METADATA_PATH ''," \
+"METADATA_PARAMETERS MAP { "\
+"'TYPE': 'postgres',"\
+"'SECRET': 'ducklake_postgres',"\
+"}" \
+")"
)
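# Attach the DuckLake catalog, make it the default database, and configure retention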
self.connections[thread_id].execute(f"ATTACH 'ducklake:columns_ducklake' AS datalake")
self.connections[thread_id].execute("USE datalake")
self.connections[thread_id].execute("CALL datalake.set_option('delete_older_than', '1d')")
self.connections[thread_id].execute("CALL datalake.set_option('expire_older_than', '1d')")
# duckdbConn is the connection created above; sql_query is a simple filter over a large (200 MB) parquet
# file on AWS S3, or over a DuckLake table backed by S3 and partitioned in chunks (the same parquet file,
# but uploaded through DuckLake)
filtered_numpy_dict = duckdbConn.sql(sql_query).fetchnumpy()
# This takes 0.9 seconds
# The same SQL query, but now as a prepared statement with a dictionary of values (mostly user input)
filtered_numpy_dict = duckdbConn.sql(sql_query, params=values_dict).fetchnumpy()
# This takes 10 seconds
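A minimal way to compare the two variants without S3 or any extension involved would be something like the following sketch (the local parquet path and column name are hypothetical stand-ins for our data):
import time
import duckdb

con = duckdb.connect()
q_literal = "SELECT count(*) FROM 'local_copy.parquet' WHERE some_column = 42"
q_param = "SELECT count(*) FROM 'local_copy.parquet' WHERE some_column = $v"

t0 = time.perf_counter()
con.sql(q_literal).fetchnumpy()
t1 = time.perf_counter()
con.sql(q_param, params={"v": 42}).fetchnumpy()
t2 = time.perf_counter()
print(f"literal: {t1 - t0:.3f}s  params: {t2 - t1:.3f}s")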
OS:
Linux/AMD64 Debian Trixie
DuckDB Version:
1.4.4
DuckDB Client:
Python
Hardware:
Docker running on AWS ECS, Mac M4
Full Name:
Fabián Alexis Bravo Abarca
Affiliation:
AlicantoLabs
Did you include all relevant configuration (e.g., CPU architecture, Linux distribution) to reproduce the issue?
- Yes, I have
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant data sets for reproducing the issue?
- No, I cannot share the data sets because they are confidential