# Event-Driven Data Pipeline

- Overview
- Architecture
- Services Used
- Project Folder Structure
- Setup Instructions
- Execution Flow
- Monitoring & Alerts
- Data Quality Checks
- Unit Testing
- Future Improvements
- Resources & References
## Overview

This project is an event-driven data pipeline built on AWS. It automatically collects, processes, and stores user activity data using S3, Lambda, Step Functions, and DynamoDB. The pipeline is serverless, scalable, and monitored with CloudWatch, making it easy to handle data in near real time with minimal manual effort.
## Services Used

- S3: Stores raw user activity events.
- Lambda: Serverless compute for collection & processing.
- Step Functions: Orchestrates workflow with error handling.
- DynamoDB: Stores processed user activity records.
- CloudWatch: Monitors logs, metrics, and alerts.
- IAM: Secure access control for all services.
- Terraform: Infrastructure as Code for full automation.
- Python: Transaction generator & Lambda logic.
## Project Folder Structure

```
event-driven-pipeline/
├── lambdas/
│   ├── data_processor.py
│   └── orchestrator.py
│
├── terraform/
│   ├── main.tf
│   ├── lambda.tf
│   ├── s3.tf
│   ├── dynamodb.tf
│   ├── cloudwatch.tf
│   ├── iam.tf
│   ├── ssm.tf
│   ├── variables.tf
│   ├── sfn.tf
│   └── outputs.tf
│
├── sdk-scripts/
│   ├── upload_test_file.py
│   └── upload_bulk_data.py
│
└── README.md
```
## Setup Instructions

Prerequisites:

- AWS account with permissions for S3, Lambda, Step Functions, DynamoDB, IAM, and CloudWatch.
- Terraform installed.
- Python 3.9+ installed (for the simulator).
Clone the repository, configure AWS credentials, and deploy the infrastructure:

```bash
git clone <repo-url>
cd event-driven-pipeline
aws configure
terraform init
terraform plan
terraform apply
```

Upload a single test file:

```bash
cd ./sdk-scripts/
export REGION="us-east-2"
export BUCKET="user-activity-bucket-demo-1234"
python upload_test_file.py
```

Upload bulk data:

```bash
export S3_BUCKET=$(terraform output -raw s3_bucket_name)
export ROWS_PER_FILE="10000"
export TOTAL_FILES="20"
export WORKERS="8"
cd ./sdk-scripts/
python upload_bulk_data.py
```
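For reference, a minimal sketch of what `sdk-scripts/upload_test_file.py` might do, assuming the `BUCKET` and `REGION` environment variables from the commands above; the key layout and sample payload here are illustrative assumptions, not the script's exact contents:

```python
import datetime
import json
import os

def build_ingest_key(filename: str, now: datetime.datetime) -> str:
    """Build a dated object key under the ingest/YYYY/MM/dd/ layout the pipeline expects."""
    return f"ingest/{now:%Y/%m/%d}/{filename}"

def main():
    import boto3  # imported lazily so build_ingest_key stays testable without AWS

    bucket = os.environ["BUCKET"]
    key = build_ingest_key("test.jsonl", datetime.datetime.now(datetime.timezone.utc))
    # One JSON Lines record with the fields the processor validates
    body = json.dumps({"user_id": "u1", "timestamp": "2024-01-01T00:00:00Z"}) + "\n"
    s3 = boto3.client("s3", region_name=os.environ.get("REGION", "us-east-2"))
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

if __name__ == "__main__":
    main()
```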
## Execution Flow

1. Data Ingestion
   - Applications export user-activity events as JSON/JSONL (optionally gzipped).
   - Files are uploaded to Amazon S3 under the `ingest/` prefix (e.g., `ingest/YYYY/MM/dd/file.jsonl[.gz]`).
2. Orchestration
   - An S3 `ObjectCreated` event triggers the Orchestrator Lambda.
   - The Orchestrator reads the State Machine ARN from SSM Parameter Store and starts an AWS Step Functions execution with `{bucket, key}`.
3. Processing
   - Step Functions invokes the Processor Lambda with the S3 object details.
   - The Processor Lambda downloads the file from S3, decompresses it if `.gz`, and parses JSON Lines.
   - It validates required fields (e.g., `user_id`, `timestamp`/`id`) and normalizes types (timestamps, decimals).
   - Valid records are written to Amazon DynamoDB using batch writes for throughput.
4. Monitoring
   - Amazon CloudWatch captures logs from both Lambdas and the Step Functions execution history.
   - CloudWatch Metrics & Alarms track Step Functions failures, Lambda errors/duration, and DynamoDB throttling, alerting the team when thresholds are exceeded.
5. Error Handling
   - Malformed rows are counted and logged; the run can still succeed if errors are non-fatal (the run summary includes `lines`, `written`, `errors`).
   - Unhandled exceptions cause the state machine to transition to `Fail`, which is surfaced via CloudWatch alarms.
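The decompress-parse-validate step above can be sketched in plain Python with no AWS dependencies. This is an illustrative sketch, not the actual code in `lambdas/data_processor.py`, which also downloads from S3 and batch-writes to DynamoDB via boto3:

```python
import gzip
import json

def parse_jsonl(data: bytes, key: str):
    """Decompress if the key ends in .gz, then parse JSON Lines.

    Returns (records, errors): the valid rows as dicts plus a count of
    malformed rows, mirroring the non-fatal error handling described above.
    """
    if key.endswith(".gz"):
        data = gzip.decompress(data)
    records, errors = [], 0
    for line in data.decode("utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            errors += 1  # malformed row: count and continue
            continue
        # Required fields: user_id plus timestamp or id
        if "user_id" in row and ("timestamp" in row or "id" in row):
            records.append(row)
        else:
            errors += 1
    return records, errors
```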
## Monitoring & Alerts

Monitoring ensures the pipeline runs smoothly and alerts stakeholders if something goes wrong.

- CloudWatch Logs
  - Each Lambda function writes detailed logs (inputs, errors, processing statistics).
  - Step Functions logs execution history and state transitions.
- CloudWatch Alarms
  - Alerts notify developers when thresholds are crossed:
    - High Lambda error rates.
    - DynamoDB throttling or latency spikes.
    - Step Functions executions ending in failure.
## Data Quality Checks

Data Quality Checks (DQCs) are automated rules or validations applied to data to make sure it is clean, complete, and trustworthy before it is used or stored.

Think of it like checking groceries before putting them in the fridge. You want to:

- Remove spoiled items
- Make sure you didn't miss anything important
- Ensure everything is labeled and stored correctly
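A check like `validate_item()` in `lambdas/data_quality.py` might look roughly like the sketch below; the required field names and types here are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical required schema: field name -> expected Python type
REQUIRED_FIELDS = {"user_id": str, "timestamp": str, "event_type": str}

def validate_item(item: dict) -> list:
    """Return a list of problems found; an empty list means the record is clean."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in item:
            errors.append(f"missing field: {field}")
        elif not isinstance(item[field], expected_type):
            errors.append(f"wrong type for {field}: {type(item[field]).__name__}")
    return errors
```

Returning a list of problems (rather than raising on the first one) lets the caller log every issue with a record and decide whether the failure is fatal.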
## Unit Testing

Unit testing means checking that small pieces of the code (called "units") work as expected in isolation, without depending on the whole system.

Think of it like checking each Lambda function or helper separately: you test the S3 ingestion logic, the data processor, and the orchestrator independently before running the full pipeline. In this project, a "unit" is typically a function in your Lambda code or a utility in `lambdas/data_quality.py`.

Before running tests, install dependencies:

```bash
pip install -r requirements.txt
```

Then run the unit tests and data quality checks:

```bash
# Run all unit tests in the tests directory
python -m unittest discover tests/

# Run data quality tests specifically (if using pytest)
pytest tests/test_data_quality.py
```

Examples of unit tests in this project:

- `validate_item()`: checks that a data record from S3 contains all required fields and correct types (see `lambdas/data_quality.py`).
- `validate_json_line()`: ensures each line in a gzipped JSONL file is valid and meets schema requirements.
- `lambda_handler()` in `data_processor.py`: can be tested with mock S3 events to verify correct DynamoDB writes and error handling.

These tests help ensure the pipeline only processes valid, well-formed data and that each Lambda function works as expected in isolation.
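Testing a handler with a mock S3 event might look like the sketch below. The inline `lambda_handler` is a simplified stand-in for the real one in `data_processor.py` (which also writes to DynamoDB); the point is the pattern of injecting a mocked S3 client so no AWS calls are made:

```python
import unittest
from unittest.mock import MagicMock

def lambda_handler(event, context, s3_client=None):
    """Simplified stand-in: read the object named in the S3 event and count its lines."""
    record = event["Records"][0]["s3"]
    body = s3_client.get_object(
        Bucket=record["bucket"]["name"], Key=record["object"]["key"]
    )["Body"].read()
    return {"lines": len(body.decode("utf-8").splitlines())}

class TestLambdaHandler(unittest.TestCase):
    def test_counts_lines_from_mock_s3(self):
        # Mock the S3 client so the handler never touches AWS
        mock_s3 = MagicMock()
        mock_s3.get_object.return_value = {
            "Body": MagicMock(read=lambda: b'{"user_id": "u1"}\n{"user_id": "u2"}\n')
        }
        event = {"Records": [{"s3": {"bucket": {"name": "b"}, "object": {"key": "k"}}}]}
        result = lambda_handler(event, None, s3_client=mock_s3)
        self.assertEqual(result["lines"], 2)
        mock_s3.get_object.assert_called_once_with(Bucket="b", Key="k")

if __name__ == "__main__":
    unittest.main()
```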
## Future Improvements

While the pipeline works as designed, future enhancements can improve scalability, functionality, and usability.

- Real-Time Processing
  - Replace file-based ingestion with Kinesis Data Streams for true real-time analytics.
- Data Lake Integration
  - Store all events in an S3 data lake with partitioning for analytics using Athena or Glue.
- Error Management
  - Implement a dedicated dead-letter queue (DLQ) to reprocess failed events automatically.
- Security Enhancements
  - Encrypt data at rest in S3 and DynamoDB with KMS.
  - Enforce fine-grained IAM roles for least-privilege access.
- Visualization
  - Build a dashboard (QuickSight or a custom web app) to visualize user activity trends in near real time.
- CI/CD Integration
  - Automate deployment of Lambda functions and Terraform changes using AWS CodePipeline or GitHub Actions.
