Business Need
A large NBFC wanted to modernize its data platform to support rapid growth, stronger governance, and reliable reporting. The client needed a unified system that could securely ingest data from multiple sources, automate daily processing, and enable faster analytics for business and compliance teams.
Business Challenge
The client’s existing data processes relied heavily on manual scripts, inconsistent workflows, and limited visibility into pipeline health. As the number of data sources grew (PostgreSQL, Salesforce, APIs), so did the complexity, resulting in delayed reporting, data quality issues, and operational inefficiencies.
The absence of centralized audit tracking and monitoring further increased risk and made compliance difficult.
Business Solution
Our team implemented a fully automated, cloud-native data platform using AWS Glue, Amazon S3, Amazon RDS, and Snowflake, tailored to the client’s operational and compliance needs.
Key elements of the solution included:
- Highly Available Data Architecture: Raw and clean data layers in S3, multi-AZ RDS for configuration and audit control, and Snowflake for the publish/serve layer.
- Automated ETL Pipelines: PySpark-based Glue jobs extract data from PostgreSQL, Salesforce, and APIs, transform it, and publish it to Snowflake. All flows are fully automated and scheduled daily (see the first sketch after this list).
- Centralized Audit and Governance: Metadata tables in RDS track run status, record counts, timestamps, and SLA adherence (see the second sketch after this list). End-to-end encryption is enforced using AWS KMS.
- Advanced Monitoring and Alerting: CloudWatch dashboards track job duration, failures, and DPU usage, and SNS alerts notify teams proactively about pipeline issues (see the third sketch after this list).
- Cost-Optimized Processing: Glue G.1X workers, Parquet with Snappy compression, and S3 partitioning deliver high performance at low cost.
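The sketch below illustrates the shape of one such Glue job: a PySpark script that reads a source table from PostgreSQL over JDBC, stamps a load-date partition column, and writes Snappy-compressed Parquet to the clean S3 layer. The job parameter names, bucket path, and the omission of the Salesforce/API sources and the Snowflake publish step are simplifications for illustration, not the client's actual job code.

```python
import sys
from datetime import date

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

# Job parameters (illustrative names), passed in by the Glue trigger.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "jdbc_url", "db_table", "target_bucket"])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Extract: read the source table from PostgreSQL over JDBC.
source_df = (
    spark.read.format("jdbc")
    .option("url", args["jdbc_url"])
    .option("dbtable", args["db_table"])
    .option("driver", "org.postgresql.Driver")
    .load()
)

# Transform: stamp the partition column used throughout the clean layer.
clean_df = source_df.withColumn("load_date", lit(date.today().isoformat()))

# Load: write Snappy-compressed Parquet, partitioned by load_date, to the clean S3 layer.
(
    clean_df.write.mode("overwrite")
    .partitionBy("load_date")
    .parquet(f"s3://{args['target_bucket']}/clean/{args['db_table']}/", compression="snappy")
)
```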
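For the audit layer, each run ends by writing one row to a metadata table in RDS. The sketch below assumes a hypothetical pipeline_audit table and the psycopg2 driver being available to the job; the real schema and connection handling are not taken from the client's implementation.

```python
from datetime import datetime, timezone

import psycopg2  # assumed to be packaged with the Python Shell job


def record_run(conn_params, job_name, status, record_count, started_at):
    """Insert one audit row per pipeline run into the RDS metadata store."""
    sql = """
        INSERT INTO pipeline_audit (job_name, status, record_count, started_at, finished_at)
        VALUES (%s, %s, %s, %s, %s)
    """
    # The connection context manager commits the transaction on successful exit.
    with psycopg2.connect(**conn_params) as conn:
        with conn.cursor() as cur:
            cur.execute(
                sql,
                (job_name, status, record_count, started_at, datetime.now(timezone.utc)),
            )
```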
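Proactive alerting can be as simple as publishing to the operations SNS topic whenever a job fails. The topic ARN below is a placeholder; in practice it would be supplied through job arguments or configuration rather than hard-coded.

```python
import boto3

# Placeholder ARN, supplied via job arguments or configuration in practice.
ALERT_TOPIC_ARN = "arn:aws:sns:ap-south-1:123456789012:data-platform-alerts"


def notify_failure(job_name: str, error: str) -> None:
    """Publish a pipeline-failure message to the operations SNS topic."""
    sns = boto3.client("sns")
    sns.publish(
        TopicArn=ALERT_TOPIC_ARN,
        Subject=f"Glue job failed: {job_name}"[:100],  # SNS subjects are capped at 100 chars
        Message=f"Job {job_name} failed with error:\n{error}",
    )
```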
Tech Stack
- AWS Services: Glue, S3, RDS (Multi-AZ), IAM, KMS, CloudWatch, CloudTrail, SNS
- Data Warehouse: Snowflake
- Orchestration: Glue Triggers and Python Shell jobs
- Security: KMS encryption, private VPC, IAM least-privilege
- Formats / Tools: Parquet, PySpark, Python, REST APIs, JDBC
Business Impact
- Faster Data Availability: ETL time reduced from 4 hours to under 1 hour, improving reporting timeliness.
- Reduced Manual Effort: Full automation saves 40–50 hours of manual processing every month.
- Improved Governance and Compliance: End-to-end audit logs, encryption, and monitoring strengthened the compliance posture.
- Reliable and Scalable Foundation: The platform consistently handles more than 1 GB of daily load and is ready for future data expansion.
- Cost Efficiency: Total cost of ownership optimized to roughly $675 per month through serverless components and compression.
