Introduction
Overview
The Automated Ingestion & Schema Mapping cookbook provides a complete solution for automating data ingestion when new CSV files arrive in your data lake. The workflow eliminates manual schema mapping and hand-written dbt model creation, reducing time-to-insight and improving data engineering productivity.
Architecture
```mermaid
graph TD
    A[CSV File Lands in S3] --> B[Airflow S3 Sensor]
    B --> C[DAG Extracts Schema]
    C --> D[Trigger GitHub Action]
    D --> E[Chicory Agent Schema Mapping]
    E --> F[Generate mapping.json]
    F --> G[Create PR with Mapping]
    G --> H[PR Merged]
    H --> I[Second GitHub Action]
    I --> J[Chicory Agent dbt Generation]
    J --> K[Create dbt Model + YAML]
    K --> L[Final PR with Artifacts]
    L --> M[Ready to Run Pipeline]
```
Key Components
1. S3 Bucket & Monitoring
Configured S3 bucket for CSV file drops
IAM roles and permissions for Airflow access
File organization and naming conventions
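As a reference point, the sketch below shows one way to inspect CSV drops in the landing area with boto3. The bucket name, the landing/ prefix, and the file-naming convention are illustrative assumptions, not requirements of this cookbook.
```python
# Minimal sketch: list CSV files sitting under an assumed landing prefix.
# Bucket name, prefix, and naming convention are illustrative placeholders.
import boto3

s3 = boto3.client("s3")

def list_csv_drops(bucket="my-data-lake", prefix="landing/"):
    """Return the keys of CSV files currently in the landing prefix."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(".csv"):
                keys.append(obj["Key"])
    return keys

if __name__ == "__main__":
    # Example keys might follow a <source>_<table>_<YYYYMMDD>.csv convention
    print(list_csv_drops())
```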
2. Airflow Orchestration
S3 sensor for file detection
Schema extraction DAG
GitHub Actions triggering
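A minimal sketch of what this DAG could look like follows, assuming the Amazon provider's S3KeySensor and a repository_dispatch call to the GitHub API. The bucket, key pattern, connection ID, repository name, and event type are placeholders; your DAG will differ in how it resolves the matched file.
```python
# Sketch of the orchestration DAG: wait for a CSV, extract its header schema,
# then fire a GitHub repository_dispatch event to start the mapping workflow.
# Bucket, connection IDs, repo, and event names are illustrative placeholders.
import csv
import io
import os
from datetime import datetime

import boto3
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

BUCKET = "my-data-lake"

def extract_schema(**context):
    """Read only the header row of a landed CSV and return it for XCom."""
    s3 = boto3.client("s3")
    # For brevity this assumes a single known key; a real DAG would resolve
    # the matched key from the sensor or from an external trigger payload.
    key = "landing/orders_20240101.csv"
    head = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read(64_000)
    columns = next(csv.reader(io.StringIO(head.decode("utf-8"))))
    return {"key": key, "columns": columns}

def trigger_github_action(**context):
    """Send a repository_dispatch event carrying the extracted schema."""
    schema = context["ti"].xcom_pull(task_ids="extract_schema")
    resp = requests.post(
        "https://api.github.com/repos/my-org/my-dbt-repo/dispatches",
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        json={"event_type": "new_csv_schema", "client_payload": schema},
        timeout=30,
    )
    resp.raise_for_status()

with DAG(
    dag_id="csv_ingestion_schema_mapping",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # run on demand; the sensor waits for the next file
    catchup=False,
) as dag:
    wait_for_csv = S3KeySensor(
        task_id="wait_for_csv",
        bucket_name=BUCKET,
        bucket_key="landing/*.csv",
        wildcard_match=True,
        aws_conn_id="aws_default",
        poke_interval=60,
    )
    get_schema = PythonOperator(task_id="extract_schema", python_callable=extract_schema)
    notify_github = PythonOperator(task_id="trigger_github_action", python_callable=trigger_github_action)

    wait_for_csv >> get_schema >> notify_github
```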
3. Chicory AI Agents
Schema Mapping Agent: Maps source CSV schema to target data model
dbt Generation Agent: Creates dbt models and YAML documentation
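To make the hand-off between the two agents concrete, here is a hypothetical shape for the mapping.json artifact the Schema Mapping Agent produces. The field names and structure are assumptions for illustration, not the agent's documented output contract.
```python
# Hypothetical mapping.json content, written here as a Python dict.
# Field names are illustrative assumptions only.
import json

mapping = {
    "source_file": "landing/orders_20240101.csv",
    "target_model": "stg_orders",
    "columns": [
        {"source": "order_id", "target": "order_id", "type": "integer"},
        {"source": "cust_id", "target": "customer_id", "type": "integer"},
        {"source": "amt", "target": "order_amount_usd", "type": "numeric"},
        {"source": "created", "target": "created_at", "type": "timestamp"},
    ],
}

with open("mapping.json", "w") as f:
    json.dump(mapping, f, indent=2)
```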
4. GitHub Actions Workflows
Automated PR creation for schema mappings
dbt artifact generation and deployment
Integration with version control
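The PR-creation step inside these workflows can be implemented several ways (the gh CLI, a marketplace action, or a small script). Below is a hedged Python sketch using PyGithub; the repository name, branch names, and file path are placeholders for illustration.
```python
# Sketch: open a PR that adds the generated mapping.json, using PyGithub.
# Repo name, branch names, and file path are illustrative placeholders.
import json
from github import Github

def open_mapping_pr(token, mapping, base="main"):
    gh = Github(token)
    repo = gh.get_repo("my-org/my-dbt-repo")

    # Branch off the tip of the base branch.
    base_sha = repo.get_branch(base).commit.sha
    branch = "auto/schema-mapping-orders"
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base_sha)

    # Commit the mapping artifact to the new branch.
    repo.create_file(
        path="mappings/orders/mapping.json",
        message="Add auto-generated schema mapping for orders",
        content=json.dumps(mapping, indent=2),
        branch=branch,
    )

    # Open the PR for human review before the dbt-generation workflow runs.
    pr = repo.create_pull(
        title="Auto-generated schema mapping: orders",
        body="Generated by the schema mapping workflow. Please review.",
        base=base,
        head=branch,
    )
    return pr.html_url
```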
Prerequisites
Before starting this cookbook, ensure you have:
AWS Account with S3 bucket access
Airflow deployment (Cloud Composer, Astronomer, or self-hosted)
GitHub repository for your dbt project
Chicory AI account and API access
dbt project structure in place
Python 3.8+ for local development
Benefits
Zero Manual Intervention: End-to-end automation from file landing to dbt model, with PR review as the only human checkpoint
Consistent Schema Mapping: AI-powered mapping reduces errors and improves consistency
Version Control Integration: All changes tracked through GitHub PRs
Scalable Architecture: Handles multiple files and schema variations
Audit Trail: Complete lineage and change history
Fast Time-to-Value: New data sources available within minutes
Use Cases
This cookbook is ideal for:
Data Lakes with frequent CSV ingestion
Multi-source Data Integration projects
Agile Analytics environments requiring fast iteration
Self-service Analytics platforms
Compliance-heavy industries requiring audit trails
Next: S3 Bucket Setup