Feature/snowflake s3 stage operations #18
Conversation
- Add `query_pandas_from_snowflake_via_s3_stage()` for efficient large query results (>10M rows); see the sketch after this list
- Add `publish_pandas_via_s3_stage()` for efficient large DataFrame writes (>10M rows)
- Add `make_batch_predictions_from_snowflake_via_s3_stage()` for batch ML predictions
- Support dev/prod environment switching via `current.is_production`
- Add helper functions for S3 operations and SQL generation
- Add `metaflow_s3/utils.py` with S3 utility functions
- Add comprehensive functional tests
- Integrate with existing Metaflow card system and cost tracking
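As a usage illustration only, here is a minimal sketch of how these helpers might be called from a Metaflow step. The function names come from the list above; the import path, argument names, and return types are assumptions, not the package's confirmed API.

```python
from metaflow import FlowSpec, step


class LargeQueryFlow(FlowSpec):
    @step
    def start(self):
        # Assumed import location; only the function names appear in this PR's description.
        from ds_platform_utils.metaflow import (
            publish_pandas_via_s3_stage,
            query_pandas_from_snowflake_via_s3_stage,
        )

        # Large (>10M row) result sets are unloaded to S3 as parquet and read
        # back with pandas instead of being fetched through the connector.
        df = query_pandas_from_snowflake_via_s3_stage(
            "SELECT * FROM PATTERN_DB.{schema}.SOURCE_TABLE"  # hypothetical query argument
        )

        # Large DataFrames are written to S3 as parquet and loaded with COPY INTO.
        publish_pandas_via_s3_stage(df, table="TARGET_TABLE")  # hypothetical signature
        self.next(self.end)

    @step
    def end(self):
        pass
```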
Pull request overview
Adds an S3-stage based path for moving data between Pandas and Snowflake in Metaflow flows (via Snowflake COPY INTO + S3 parquet), and updates CI to run tests in parallel.
Changes:
- Add S3/Snowflake stage utilities (`copy_snowflake_to_s3`, `copy_s3_to_snowflake`) and S3 dataframe read/write helpers.
- Extend `publish_pandas`/`query_pandas_from_snowflake` with a `use_s3_stage` option (see the sketch below).
- Add a functional Metaflow test for the S3-stage flow and enable pytest-xdist in CI.
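For the flag-based entry points, a round trip might look roughly like the following. The function names and the `use_s3_stage` flag are taken from this review; the remaining arguments and the import path are assumptions.

```python
import pandas as pd

# Assumed import path; only the function and flag names are confirmed by this review.
from ds_platform_utils.metaflow import publish_pandas, query_pandas_from_snowflake

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

# Write through the S3 stage (parquet + COPY INTO) instead of a direct write_pandas call.
publish_pandas(df, table="MY_TABLE", use_s3_stage=True)  # remaining arguments assumed

# Read back the same way; use_s3_stage=False would use the regular connector fetch.
result = query_pandas_from_snowflake(
    "SELECT * FROM PATTERN_DB.{schema}.MY_TABLE",
    use_s3_stage=True,
)
```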
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| tests/functional_tests/metaflow/test__pandas_s3.py | New functional flow test exercising publish_pandas/query_pandas_from_snowflake via S3-stage. |
| src/ds_platform_utils/metaflow/s3_stage.py | New Snowflake↔S3 stage COPY helpers and schema inference. |
| src/ds_platform_utils/metaflow/s3.py | New Metaflow S3 client helpers for parquet IO and folder chunking. |
| src/ds_platform_utils/metaflow/pandas.py | Adds use_s3_stage option for publish/query and updates schema selection to DEV_SCHEMA. |
| src/ds_platform_utils/metaflow/batch_inference.py | New batch inference pipeline leveraging S3 as an intermediate store. |
| src/ds_platform_utils/metaflow/_consts.py | Replaces NON_PROD_SCHEMA with DEV_SCHEMA; adds S3 stage constants. |
| src/ds_platform_utils/_snowflake/write_audit_publish.py | Updates to new DEV_SCHEMA constant. |
| pyproject.toml | Version bump to 0.4.0; adds pytest-xdist. |
| .github/workflows/ci-cd-ds-platform-utils.yaml | Runs pytest with xdist (-n auto). |
Comments suppressed due to low confidence (1)
src/ds_platform_utils/metaflow/pandas.py:208
- Docstring says the non-prod schema will be `NON_PROD_SCHEMA`, but that constant no longer exists and the code uses `DEV_SCHEMA`. Update this note to match the new constant naming.
**NOTE:** If the query contains `{schema}` placeholders, they will be replaced with the appropriate schema name.
The schema name will be determined based on the current environment:
- If in production, it will be set to `PROD_SCHEMA`.
- If not in production, it will be set to `NON_PROD_SCHEMA`.
- If the query does not contain `{schema}` placeholders, the schema name will not be modified.
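To make the placeholder behaviour concrete, here is a small sketch of the substitution the note describes. The import path and the exact rendering logic are assumptions; per the comment above, the non-production constant is now `DEV_SCHEMA`.

```python
from metaflow import current

# Assumed import path; _consts.py in this PR defines DEV_SCHEMA, and PROD_SCHEMA
# is referenced by the docstring above.
from ds_platform_utils.metaflow._consts import DEV_SCHEMA, PROD_SCHEMA

# A query using the {schema} placeholder described in the note.
query = "SELECT * FROM PATTERN_DB.{schema}.CUSTOMER_FEATURES"

# Roughly the substitution the helper performs: per this review, DEV_SCHEMA has
# replaced the old NON_PROD_SCHEMA constant for the non-production case.
schema = PROD_SCHEMA if current.is_production else DEV_SCHEMA
rendered_query = query.format(schema=schema)
```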
```python
import pandas as pd
from metaflow import S3, current


def _get_metaflow_s3_client():
    return S3(role="arn:aws:iam::209479263910:role/outerbounds_iam_role")
```
The S3 client role ARN is hardcoded inside the library. This makes the package environment/account-specific and hard to reuse or test. Consider moving it to configuration (env var / Metaflow config) or at least into _consts.py so it can be overridden per environment.
Suggested change:

```diff
-import pandas as pd
-from metaflow import S3, current
-def _get_metaflow_s3_client():
-    return S3(role="arn:aws:iam::209479263910:role/outerbounds_iam_role")
+import os
+import pandas as pd
+from metaflow import S3, current
+def _get_metaflow_s3_client():
+    """
+    Return a Metaflow S3 client.
+    The role ARN can be configured via the METAFLOW_S3_ROLE_ARN environment variable.
+    If not set, the default Metaflow credential resolution is used.
+    """
+    role_arn = os.getenv("METAFLOW_S3_ROLE_ARN")
+    if role_arn:
+        return S3(role=role_arn)
+    return S3()
```
The parameter name `table_defination` is misspelled. Since this is part of the public API for S3-stage loading, consider renaming it consistently across `copy_s3_to_snowflake`, `_generate_s3_to_snowflake_copy_query`, and `publish_pandas` (e.g., `table_definition` or `table_schema`) to avoid propagating the typo.
Suggested change:

```diff
-table_defination: Optional[List[Tuple[str, str]]] = None,
+table_definition: Optional[List[Tuple[str, str]]] = None,
```
```python
conn: SnowflakeConnection = get_snowflake_connection(use_utc)

# set warehouse
if warehouse is not None:
    _execute_sql(conn, f"USE WAREHOUSE {warehouse};")
_execute_sql(conn, f"USE SCHEMA PATTERN_DB.{schema};")

if use_s3_stage:
    s3_bucket, _ = _get_s3_config(current.is_production)
    data_folder = "publish_" + str(pd.Timestamp.now().strftime("%Y%m%d_%H%M%S_%f"))
    s3_path = f"{s3_bucket}/{S3_DATA_FOLDER}/{data_folder}"
```
When `use_s3_stage=True`, a Snowflake connection is opened and `USE WAREHOUSE`/`USE SCHEMA` is executed, but the connection is not used for the actual load (which opens its own connection in `copy_s3_to_snowflake`). Consider creating the connection only in the `write_pandas` branch (or reusing the same connection in `copy_s3_to_snowflake`) to avoid an unnecessary connection open/close.
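One way to implement the reuse the comment suggests, as a sketch only: the optional `conn` parameter is hypothetical, and the helper names (`get_snowflake_connection`, `_execute_sql`, `_generate_s3_to_snowflake_copy_query`) are taken from elsewhere in this diff with assumed signatures.

```python
from typing import Optional


def copy_s3_to_snowflake(
    s3_path: str,
    table_name: str,
    conn: Optional["SnowflakeConnection"] = None,  # hypothetical parameter
) -> None:
    """Load staged parquet files into Snowflake, reusing a caller-supplied connection if given."""
    owns_conn = conn is None
    if owns_conn:
        conn = get_snowflake_connection()  # signature assumed
    try:
        # COPY INTO generation helper lives in s3_stage.py; its arguments are assumed here.
        _execute_sql(conn, _generate_s3_to_snowflake_copy_query(s3_path, table_name))
    finally:
        # Only close connections this function opened itself.
        if owns_conn:
            conn.close()
```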
The public API parameter is named `table_defination` (misspelled) but the docstring below refers to `table_schema`, and callers/tests are using `table_schema`. Rename this parameter consistently (e.g., `table_schema` or `table_definition`) and update the call into `copy_s3_to_snowflake` accordingly.
Suggested change:

```diff
-table_defination: Optional[List[Tuple[str, str]]] = None,
+table_schema: Optional[List[Tuple[str, str]]] = None,
```
```python
with tempfile.NamedTemporaryFile(
    prefix=str(Path(current.tempdir).absolute()) + "/",  # type: ignore
    suffix=".parquet",
) as tmp_file:
```
`NamedTemporaryFile(prefix=str(Path(current.tempdir).absolute()) + "/")` uses `prefix` to try to control the directory, but `prefix` is part of the filename and including path separators can cause invalid paths at runtime. Use the `dir=` parameter to place the temp file in `current.tempdir` and keep `prefix` as a simple name prefix.
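A minimal sketch of the `dir=` fix, assuming `current.tempdir` is a writable directory as in the snippet above; the prefix value is illustrative.

```python
import tempfile
from pathlib import Path

from metaflow import current

# Use dir= to place the file in current.tempdir; keep prefix as a plain name prefix.
with tempfile.NamedTemporaryFile(
    dir=str(Path(current.tempdir).absolute()),  # type: ignore
    prefix="publish_",
    suffix=".parquet",
) as tmp_file:
    ...  # existing parquet write / upload logic unchanged
```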
`TRUNCATE TABLE IF EXISTS` is not valid Snowflake syntax. If you need conditional behavior, check existence first; otherwise use `TRUNCATE TABLE <name>;` (or `DELETE FROM <name>;` if you need transactional semantics).
| print(f"Generated TRUNCATE TABLE query:\nTRUNCATE TABLE IF EXISTS {table_name};") | |
| sql_statements.append(f"TRUNCATE TABLE IF EXISTS {table_name};") | |
| print(f"Generated TRUNCATE TABLE query:\nTRUNCATE TABLE {table_name};") | |
| sql_statements.append(f"TRUNCATE TABLE {table_name};") |
```sql
FILE_FORMAT = (TYPE = 'parquet')
MAX_FILE_SIZE = {max_file_size}
SINGLE = {single}
HEADER = TRUE
```
`HEADER = TRUE` is a CSV-specific unload option and is not applicable to Parquet exports. Remove it from the `COPY INTO` statement to avoid Snowflake rejecting the query or silently ignoring an irrelevant setting.
Suggested change:

```diff
-HEADER = TRUE
```
`_generate_snowflake_to_s3_copy_query` is called with `batch_size_in_mb=...`, but the current helper signature (in `s3_stage.py`) only accepts `query` and `snowflake_stage_path`. Either update the helper to support batch sizing, or remove this argument and adjust file sizing via `MAX_FILE_SIZE` logic in the helper.
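A hedged sketch of the first option, wiring `batch_size_in_mb` into Snowflake's `MAX_FILE_SIZE`. The default value, docstring, and the exact COPY INTO template are assumptions built from the fragment shown above, not the current helper.

```python
def _generate_snowflake_to_s3_copy_query(
    query: str,
    snowflake_stage_path: str,
    batch_size_in_mb: int = 128,  # hypothetical default; callers pass batch_size_in_mb=...
) -> str:
    """Generate a COPY INTO statement that unloads `query` results to the S3 stage as parquet."""
    # Snowflake's MAX_FILE_SIZE is expressed in bytes, so translate the MB batch size.
    max_file_size = batch_size_in_mb * 1024 * 1024
    return f"""
    COPY INTO {snowflake_stage_path}
    FROM ({query})
    FILE_FORMAT = (TYPE = 'parquet')
    MAX_FILE_SIZE = {max_file_size}
    SINGLE = FALSE
    """
```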