Introduction
Overview
The Automated Ingestion & Schema Mapping cookbook provides a complete solution for automating data ingestion when new CSV files arrive in your data lake. The workflow eliminates manual schema mapping and hand-written dbt model creation, reducing time-to-insight and improving data engineering productivity.
Architecture
```mermaid
graph TD
    A[CSV File Lands in S3] --> B[Airflow S3 Sensor]
    B --> C[DAG Extracts Schema]
    C --> D[Trigger GitHub Action]
    D --> E[Chicory Agent Schema Mapping]
    E --> F[Generate mapping.json]
    F --> G[Create PR with Mapping]
    G --> H[PR Merged]
    H --> I[Second GitHub Action]
    I --> J[Chicory Agent dbt Generation]
    J --> K[Create dbt Model + YAML]
    K --> L[Final PR with Artifacts]
    L --> M[Ready to Run Pipeline]
```
Key Components
1. S3 Bucket & Monitoring
Configured S3 bucket for CSV file drops
IAM roles and permissions for Airflow access
File organization and naming conventions
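As a reference point, the sketch below shows one way to inspect CSV drops in the landing area with boto3. The bucket name, the landing/ prefix, and the file-naming convention are illustrative assumptions, not requirements of this cookbook.
```python
# Minimal sketch: list CSV files sitting under an assumed landing prefix.
# Bucket name, prefix, and naming convention are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

def list_csv_drops(bucket="my-data-lake", prefix="landing/"):
    """Return the keys of CSV files currently in the landing prefix."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                keys.append(obj["Key"])
    return keys

if __name__ == "__main__":
    # Example keys might follow a <source>_<table>_<YYYYMMDD>.csv convention
    print(list_csv_drops())
```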
2. Airflow Orchestration
S3 sensor for file detection
Schema extraction DAG
GitHub Actions triggering
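A minimal sketch of what this DAG could look like follows, assuming the Amazon provider's S3KeySensor and a repository_dispatch call to the GitHub API. The bucket, key pattern, connection ID, repository name, and event type are placeholders; your DAG will differ in how it resolves the matched file.
```python
# Sketch of the orchestration DAG: wait for a CSV, extract its header schema,
# then fire a GitHub repository_dispatch event to start the mapping workflow.
# Bucket, connection IDs, repo, and event names are illustrative placeholders.
import csv
import io
import os
from datetime import datetime

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

BUCKET = "my-data-lake"

def extract_schema(**context):
    """Read only the header row of a landed CSV and return it for XCom."""
    s3 = boto3.client("s3")
    # For brevity this assumes a single known key; a real DAG would resolve
    # the matched key from the sensor or from an external trigger payload.
    key = "landing/orders_20240101.csv"
    head = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read(64_000)
    columns = next(csv.reader(io.StringIO(head.decode("utf-8"))))
    return {"key": key, "columns": columns}

def trigger_github_action(**context):
    """Send a repository_dispatch event carrying the extracted schema."""
    schema = context["ti"].xcom_pull(task_ids="extract_schema")
    resp = requests.post(
        "https://api.github.com/repos/my-org/my-dbt-repo/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"event_type": "new_csv_schema", "client_payload": schema},
        timeout=30,
    )
    resp.raise_for_status()

with DAG(
    dag_id="csv_ingestion_schema_mapping",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # run on demand; the sensor waits for the next file
    catchup=False,
) as dag:
    wait_for_csv = S3KeySensor(
        task_id="wait_for_csv",
        bucket_name=BUCKET,
        bucket_key="landing/*.csv",
        wildcard_match=True,
        aws_conn_id="aws_default",
        poke_interval=60,
    )
    get_schema = PythonOperator(task_id="extract_schema", python_callable=extract_schema)
    notify_github = PythonOperator(task_id="trigger_github_action", python_callable=trigger_github_action)

    wait_for_csv >> get_schema >> notify_github
```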
3. Chicory AI Agents
Schema Mapping Agent: Maps source CSV schema to target data model
dbt Generation Agent: Creates dbt models and YAML documentation
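To make the hand-off between the two agents concrete, here is a hypothetical shape for the mapping.json artifact the Schema Mapping Agent produces. The field names and structure are assumptions for illustration, not the agent's documented output contract.
```python
# Hypothetical mapping.json content, written here as a Python dict.
# Field names are illustrative assumptions only.
import json

mapping = {
    "source_file": "landing/orders_20240101.csv",
    "target_model": "stg_orders",
    "columns": [
        {"source": "order_id", "target": "order_id", "type": "integer"},
        {"source": "cust_id", "target": "customer_id", "type": "integer"},
        {"source": "amt", "target": "order_amount_usd", "type": "numeric"},
        {"source": "created", "target": "created_at", "type": "timestamp"},
    ],
}

with open("mapping.json", "w") as f:
    json.dump(mapping, f, indent=2)
```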
4. GitHub Actions Workflows
Automated PR creation for schema mappings
dbt artifact generation and deployment
Integration with version control
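The PR-creation step inside these workflows can be implemented several ways (the gh CLI, a marketplace action, or a small script). Below is a hedged Python sketch using PyGithub; the repository name, branch names, and file path are placeholders for illustration.
```python
# Sketch: open a PR that adds the generated mapping.json, using PyGithub.
# Repo name, branch names, and file path are illustrative placeholders.
import json
from github import Github

def open_mapping_pr(token, mapping, base="main"):
    gh = Github(token)
    repo = gh.get_repo("my-org/my-dbt-repo")

    # Branch off the tip of the base branch.
    base_sha = repo.get_branch(base).commit.sha
    branch = "auto/schema-mapping-orders"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base_sha)

    # Commit the mapping artifact to the new branch.
    repo.create_file(
        path="mappings/orders/mapping.json",
        message="Add auto-generated schema mapping for orders",
        content=json.dumps(mapping, indent=2),
        branch=branch,
    )

    # Open the PR for human review before the dbt-generation workflow runs.
    pr = repo.create_pull(
        title="Auto-generated schema mapping: orders",
        body="Generated by the schema mapping workflow. Please review.",
        base=base,
        head=branch,
    )
    return pr.html_url
```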
Prerequisites
Before starting this cookbook, ensure you have:
AWS Account with S3 bucket access
Airflow deployment (Cloud Composer, Astronomer, or self-hosted)
GitHub repository for your dbt project
Chicory AI account and API access
dbt project structure in place
Python 3.8+ for local development
Benefits
Zero Manual Intervention: End-to-end automation from file landing to dbt model, with PR review as the only human checkpoint
Consistent Schema Mapping: AI-powered mapping reduces errors and improves consistency
Version Control Integration: All changes tracked through GitHub PRs
Scalable Architecture: Handles multiple files and schema variations
Audit Trail: Complete lineage and change history
Fast Time-to-Value: New data sources available within minutes
Use Cases
This cookbook is ideal for:
Data Lakes with frequent CSV ingestion
Multi-source Data Integration projects
Agile Analytics environments requiring fast iteration
Self-service Analytics platforms
Compliance-heavy industries requiring audit trails
Next: S3 Bucket Setup