Post-ETL Sync Processing
Custom data transformations after the standard ETL process is completed
The Post-ETL Sync Processing feature enables custom data transformations after the standard ETL process is completed. This functionality allows organizations to create derived tables, perform complex data aggregations, and implement custom business logic on the ETL data.
Key Features
- Configurable Transformations: Define custom SQL transformations through JSON configuration files
- Incremental Processing: Only process data that has changed since the last sync
- Ordered Execution: Execute DDL and DML operations in a specified order
- Schema-specific Configurations: Support for organization-specific transformations
- Transaction Management: Ensures data integrity during the transformation process
Repository and Setup
This feature is part of the avni-etl repository. For complete setup instructions and the main ETL service documentation, refer to:
- GitHub Repository: https://github.com/avniproject/avni-etl
- ETL Service Documentation: See the README.md in the avni-etl repository for basic ETL setup and configuration
- Refer to the Developer Avni Setup documentation for more information
Prerequisites
Before setting up Post-ETL Sync Processing, ensure you have:
- A working ETL service setup (as documented in the avni-etl repository)
- Organization-specific ETL schema configured
- Appropriate database permissions for the organization user
Quick Setup Reference
For detailed setup instructions, refer to the avni-etl repository README.md, but the key steps include:
- Clone the avni-etl repository
- Set up the database and ETL schema for your organization
- Configure the Post-ETL Sync processing as described in this document
Implementation Details
Key Components
- PostETLConfig: Domain model for the configuration structure
- PostETLSyncService: Service that orchestrates the execution of post-ETL sync scripts
- PostETLSyncStatusRepository: Manages the sync state and cutoff timestamps
- EtlService: Integration point to trigger post-ETL sync after the main ETL process
Execution Flow
- The ETL process completes
- PostETLSyncService loads the organization-specific configuration
- DDL scripts are executed in order (if tables don't exist)
- DML scripts are executed in order, with inserts followed by updates
- The cutoff timestamp is updated for the next run
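The cutoff timestamps are what make this flow incremental: each run only considers rows whose timestamps fall between the previous and the new cutoff. A minimal sketch of that window, with illustrative table and column names rather than anything taken from the actual service:
-- Illustrative only: the window of rows a single post-ETL run picks up
SELECT s.*
FROM apfodisha.source_table s
WHERE s.last_modified_date_time > :previousCutoffDateTime
  AND s.last_modified_date_time <= :newCutoffDateTime;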
SQL Scripts
SQL scripts are stored in the organization-specific directory and referenced in the configuration file:
- DDL Scripts: Create tables, indexes, and other database objects. These run first and only if the table doesn't already exist.
- DML Scripts: Insert and update data in the tables:
- Insert SQL: Adds new records to the target table
- Update SQLs: Updates existing records in the target table
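For illustration, a DDL script referenced from the configuration might look like the sketch below. The table and column names are assumptions made up for this example; only the schema-qualified name and the role setting follow the conventions described under Best Practices.
-- Hypothetical DDL script (create-table-script.sql); names are illustrative
set role apfodisha;
CREATE TABLE apfodisha.custom_report_table (
    id                      BIGINT    NOT NULL,
    field1                  TEXT      NULL,
    field2                  NUMERIC   NULL,
    last_modified_date_time TIMESTAMP NOT NULL
);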
Parameter Substitution
The system automatically replaces the following parameters in SQL scripts:
- :previousCutoffDateTime: Timestamp of the last successful sync
- :newCutoffDateTime: Current timestamp for this sync
- :param1, :param2, etc.: Custom parameters specified in the sql-params array
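As a sketch of how these placeholders appear inside a script, assume a DML entry supplies one value in its sql-params array; the query and names below are purely illustrative:
-- Illustrative insert script using the cutoff parameters and a custom parameter
INSERT INTO apfodisha.custom_report_table (field1, field2)
SELECT s.field1, s.field2
FROM apfodisha.source_table s
WHERE s.field1 = :param1
  AND (s.last_modified_date_time > :previousCutoffDateTime
       AND s.last_modified_date_time <= :newCutoffDateTime);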
Configuration Structure
The Post-ETL Sync process is driven by a JSON configuration file named post-etl-sync-processing-config.json located in the organization-specific directory under src/main/resources/post-etl/{organization}/.
{
"ddl": [
{
"order": 1,
"table": "table_name",
"sql": "create-table-script.sql",
"exists_check": "optional_custom_check_query"
}
],
"dml": [
{
"order": 1,
"table": "target_table_name",
"sqls": [
{
"order": 1,
"sourceTableName": "source_table",
"insert-sql": "insert-script.sql",
"update-sqls": [
"update-script-1.sql",
"update-script-2.sql"
],
"sql-params": [
"optional_param1",
"optional_param2"
]
}
]
}
]
}
Configuration Requirements
DDL Configuration
- order: Numeric execution order (starts with 1)
- table: Target table name
- sql: SQL script filename for table creation
- Optional: exists_check for custom table existence verification
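The exact contract of a custom exists_check query is defined in the avni-etl code, but conceptually it is a query the service runs to decide whether the DDL still needs to be executed. A hedged sketch, assuming a check against the information schema:
-- Hypothetical existence check; confirm the expected result shape against avni-etl
SELECT EXISTS (
    SELECT 1
    FROM information_schema.tables
    WHERE table_schema = 'apfodisha'
      AND table_name = 'custom_report_table'
);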
DML Configuration
- order: Numeric execution order
- table: Target table name
- sqls: Array of source table operations with:
  - order: Execution order within the DML operation
  - sourceTableName: Source table for data
  - insert-sql: Script filename for insert operations (can be empty if only updates)
  - update-sqls: Array of update script filenames (can be empty if only inserts)
- Optional: sql-params for additional parameters
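For completeness, an update script referenced from update-sqls might look like the sketch below; the join columns and field names are assumptions for illustration only:
-- Hypothetical update script (update-script-1.sql); names are illustrative
UPDATE apfodisha.custom_report_table t
SET field2 = s.field2
FROM apfodisha.source_table s
WHERE t.id = s.id
  AND (s.last_modified_date_time > :previousCutoffDateTime
       AND s.last_modified_date_time <= :newCutoffDateTime);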
Best Practices
Schema Name Usage
- Always qualify table names with the schema name in all SQL scripts:
  apfodisha.individual_child_growth_monitoring_report
- Begin DDL scripts with role setting to ensure proper permissions:
  set role apfodisha;
- Use consistent schema names throughout all related SQL scripts, matching the directory name under post-etl/
Timestamp Filtering
- Always use both timestamp parameters in SQL scripts that modify data:
  WHERE (column_datetime > :previousCutoffDateTime AND column_datetime <= :newCutoffDateTime)
- Apply filters to multiple timestamp columns when applicable:
  WHERE (created_date_time > :previousCutoffDateTime AND created_date_time <= :newCutoffDateTime) OR (last_modified_date_time > :previousCutoffDateTime AND last_modified_date_time <= :newCutoffDateTime)
- Include timestamp filters in all subqueries and CTEs:
  AND follow_up.last_modified_date_time > :previousCutoffDateTime AND follow_up.last_modified_date_time <= :newCutoffDateTime
SQL Script Practices
- Use descriptive prefixes related to the target table (e.g., icgmr- for individual_child_growth_monitoring_report)
- Use CTEs for complex updates and include explicit JOINs with proper conditions (see the sketch after this list)
- Always include is_voided = false checks when applicable
- Specify data types and nullability explicitly in CREATE TABLE statements
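A minimal sketch of a CTE-based update that follows these practices; the table and column names are invented for the example and are not taken from the actual organization scripts:
-- Hypothetical CTE-based update with explicit JOIN conditions, is_voided and cutoff filters
WITH latest_follow_up AS (
    SELECT f.individual_id,
           MAX(f.encounter_date_time) AS last_follow_up_date
    FROM apfodisha.follow_up f
    WHERE f.is_voided = false
      AND f.last_modified_date_time > :previousCutoffDateTime
      AND f.last_modified_date_time <= :newCutoffDateTime
    GROUP BY f.individual_id
)
UPDATE apfodisha.custom_report_table t
SET last_follow_up_date = l.last_follow_up_date
FROM latest_follow_up l
WHERE t.individual_id = l.individual_id;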
Example SQL Script
-- Insert script
INSERT INTO schemaname.custom_report_table (field1, field2, field3)
SELECT
s.field1,
s.field2,
s.field3
FROM schemaname.source_table s
WHERE (s.created_date_time > :previousCutoffDateTime AND s.created_date_time <= :newCutoffDateTime)
OR (s.last_modified_date_time > :previousCutoffDateTime AND s.last_modified_date_time <= :newCutoffDateTime);
Adding New Transformations
To add new post-ETL transformations:
- Create SQL scripts for table creation and data manipulation
- Update the organization's post-etl-sync-processing-config.json file
- Place all files in the organization-specific directory under src/main/resources/post-etl/{organization}/
Current Implementations
The system currently implements transformations for:
- APF Odisha: Child growth monitoring reports with derived fields for nutrition status
- RWB: Custom reporting tables
Troubleshooting
To reset and re-run a specific transformation, delete the organization's previous entry from the post_etl_sync_status table and drop the derived table it produced, for example:
DELETE FROM public.post_etl_sync_status WHERE db_user = 'apfodisha';
DROP TABLE apfodisha.individual_child_growth_monitoring_report;
Common Issues
- Check application logs for detailed execution information
- Verify SQL scripts use the correct schema and table names
- Ensure parameter placeholders match the expected format
- Confirm the configuration file follows the correct JSON structure
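When debugging, it can also help to inspect the stored sync state directly. The table and db_user column appear in the reset example above; the other columns are not documented here, so select everything rather than guessing names:
-- Inspect the sync state recorded for one organization
SELECT *
FROM public.post_etl_sync_status
WHERE db_user = 'apfodisha';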
Security Considerations
- SQL scripts run with the organization's database role
- Each organization's transformations are isolated to their own schema
- The search path is reset after execution to prevent cross-schema access
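As a hedged illustration of that isolation (the exact statements issued by the service live in the avni-etl code, so treat this only as a conceptual sketch):
-- Conceptual pattern: run an organization's scripts under its role and schema,
-- then reset so later work cannot touch that schema implicitly
set role apfodisha;
set search_path to apfodisha;
-- ... organization-specific DDL/DML scripts run here ...
reset search_path;
reset role;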