Post-ETL Sync Processing

Custom data transformations after the standard ETL process is completed

The Post-ETL Sync Processing feature enables custom data transformations after the standard ETL process is completed. This functionality allows organizations to create derived tables, perform complex data aggregations, and implement custom business logic on the ETL data.

Key Features

  • Configurable Transformations: Define custom SQL transformations through JSON configuration files
  • Incremental Processing: Only process data that has changed since the last sync
  • Ordered Execution: Execute DDL and DML operations in a specified order
  • Schema-specific Configurations: Support for organization-specific transformations
  • Transaction Management: Ensures data integrity during the transformation process

Repository and Setup

This feature is part of the avni-etl repository. For complete setup instructions and the main ETL service documentation, refer to the README.md in the avni-etl repository.

Prerequisites

Before setting up Post-ETL Sync Processing, ensure you have:

  1. A working ETL service setup (as documented in the avni-etl repository)
  2. Organization-specific ETL schema configured
  3. Appropriate database permissions for the organization user

Quick Setup Reference

For detailed setup instructions, refer to the README.md in the avni-etl repository; the key steps are:

  1. Clone the avni-etl repository
  2. Set up the database and ETL schema for your organization
  3. Configure the Post-ETL Sync processing as described in this document

Implementation Details

Key Components

  1. PostETLConfig: Domain model for the configuration structure
  2. PostETLSyncService: Service that orchestrates the execution of post-ETL sync scripts
  3. PostETLSyncStatusRepository: Manages the sync state and cutoff timestamps
  4. EtlService: Integration point to trigger post-ETL sync after the main ETL process

Execution Flow

  1. The ETL process completes
  2. PostETLSyncService loads the organization-specific configuration
  3. DDL scripts are executed in order, creating each table only if it does not already exist
  4. DML scripts are executed in order, with inserts followed by updates
  5. The cutoff timestamp is updated for the next run

SQL Scripts

SQL scripts are stored in the organization-specific directory and referenced in the configuration file:

  1. DDL Scripts: Create tables, indexes, and other database objects. These run first and only if the table doesn't already exist; see the sketch after this list.

  2. DML Scripts: Insert and update data in the tables:

    • Insert SQL: Adds new records to the target table
    • Update SQLs: Updates existing records in the target table
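
A minimal DDL sketch is shown below. The schema name (schemaname), table, and columns are placeholders, not actual Avni objects; they should match your organization's schema and follow the best practices described later in this document (role setting, schema-qualified names, explicit data types and nullability).

-- DDL script (sketch): create the derived report table
set role schemaname;

CREATE TABLE schemaname.custom_report_table (
    field1  TEXT    NOT NULL,
    field2  NUMERIC NULL,
    field3  TEXT    NULL
);

CREATE INDEX custom_report_table_field1_idx
    ON schemaname.custom_report_table (field1);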

Parameter Substitution

The system automatically replaces the following parameters in SQL scripts:

  • :previousCutoffDateTime: Timestamp of the last successful sync
  • :newCutoffDateTime: Current timestamp for this sync
  • :param1, :param2, etc.: Custom parameters specified in the sql-params array
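
As an illustration, the snippet below combines the cutoff timestamps with a custom parameter supplied through the sql-params array. The table, columns, and the catchment_name filter are hypothetical and only show how :param1 is substituted.

-- DML sketch: cutoff timestamps plus a custom sql-params value (:param1)
INSERT INTO schemaname.custom_report_table (field1, field2, field3)
SELECT s.field1, s.field2, s.field3
FROM schemaname.source_table s
WHERE s.last_modified_date_time > :previousCutoffDateTime
  AND s.last_modified_date_time <= :newCutoffDateTime
  AND s.catchment_name = :param1;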

Configuration Structure

The Post-ETL Sync process is driven by a JSON configuration file named post-etl-sync-processing-config.json located in the organization-specific directory under src/main/resources/post-etl/{organization}/.

{
  "ddl": [
    {
      "order": 1,
      "table": "table_name",
      "sql": "create-table-script.sql",
      "exists_check": "optional_custom_check_query"
    }
  ],
  "dml": [
    {
      "order": 1,
      "table": "target_table_name",
      "sqls": [
        {
          "order": 1,
          "sourceTableName": "source_table",
          "insert-sql": "insert-script.sql",
          "update-sqls": [
            "update-script-1.sql",
            "update-script-2.sql"
          ],
          "sql-params": [
            "optional_param1",
            "optional_param2"
          ]
        }
      ]
    }
  ]
}

Configuration Requirements

DDL Configuration

  • order: Numeric execution order (starts with 1)
  • table: Target table name
  • sql: SQL script filename for table creation
  • Optional: exists_check for custom table existence verification
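
When exists_check is provided, it is used for the table-existence verification instead of the default check. The exact query contract is defined by the avni-etl service, so confirm it in the repository; a hypothetical check against information_schema could look like this:

-- Hypothetical exists_check query: yields a row only if the table already exists
SELECT 1
FROM information_schema.tables
WHERE table_schema = 'schemaname'
  AND table_name = 'table_name';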

DML Configuration

  • order: Numeric execution order
  • table: Target table name
  • sqls: Array of source table operations with:
    • order: Execution order within the DML operation
    • sourceTableName: Source table for data
    • insert-sql: Script filename for insert operations (can be empty if only updates)
    • update-sqls: Array of update script filenames (can be empty if only inserts)
    • Optional: sql-params for additional parameters

Best Practices

Schema Name Usage

  1. Always qualify table names with the schema name in all SQL scripts:

    apfodisha.individual_child_growth_monitoring_report
  2. Begin DDL scripts with role setting to ensure proper permissions:

    set role apfodisha;
  3. Use consistent schema names throughout all related SQL scripts, matching the directory name under post-etl/

Timestamp Filtering

  1. Always use both timestamp parameters in SQL scripts that modify data:

    WHERE (column_datetime > :previousCutoffDateTime AND column_datetime <= :newCutoffDateTime)
  2. Apply filters to multiple timestamp columns when applicable:

    WHERE (created_date_time > :previousCutoffDateTime AND created_date_time <= :newCutoffDateTime)
       OR (last_modified_date_time > :previousCutoffDateTime AND last_modified_date_time <= :newCutoffDateTime)
  3. Include timestamp filters in all subqueries and CTEs:

    AND follow_up.last_modified_date_time > :previousCutoffDateTime 
    AND follow_up.last_modified_date_time <= :newCutoffDateTime

SQL Script Practices

  1. Use descriptive script filename prefixes derived from the target table name (e.g., icgmr- for individual_child_growth_monitoring_report)

  2. Use CTEs for complex updates and include explicit JOINs with proper conditions

  3. Always include is_voided = false checks when applicable

  4. Specify data types and nullability explicitly in CREATE TABLE statements

Example SQL Script

-- Insert script
INSERT INTO schemaname.custom_report_table (field1, field2, field3)
SELECT 
    s.field1,
    s.field2,
    s.field3
FROM schemaname.source_table s
WHERE (s.created_date_time > :previousCutoffDateTime AND s.created_date_time <= :newCutoffDateTime)
   OR (s.last_modified_date_time > :previousCutoffDateTime AND s.last_modified_date_time <= :newCutoffDateTime);
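
An accompanying update script, following the CTE practice above, might look like the sketch below. It assumes field1 identifies a report row and that the source table has an is_voided flag; adjust the join and columns to your actual tables.

-- Update script (sketch): refresh rows whose source records changed in this sync window
WITH changed AS (
    SELECT s.field1, s.field2, s.field3
    FROM schemaname.source_table s
    WHERE s.is_voided = false
      AND s.last_modified_date_time > :previousCutoffDateTime
      AND s.last_modified_date_time <= :newCutoffDateTime
)
UPDATE schemaname.custom_report_table t
SET field2 = c.field2,
    field3 = c.field3
FROM changed c
WHERE t.field1 = c.field1;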

Adding New Transformations

To add new post-ETL transformations:

  1. Create SQL scripts for table creation and data manipulation
  2. Update the organization's post-etl-sync-processing-config.json file
  3. Place all files in the organization-specific directory under src/main/resources/post-etl/{organization}/

Current Implementations

The system currently implements transformations for:

  • APF Odisha: Child growth monitoring reports with derived fields for nutrition status
  • RWB: Custom reporting tables

Troubleshooting

To reset and re-run an organization's post-ETL processing, delete its row from the post_etl_sync_status table and drop the previously created derived tables, for example:

DELETE FROM public.post_etl_sync_status WHERE db_user = 'apfodisha';
DROP TABLE apfodisha.individual_child_growth_monitoring_report;

Common Issues

  • Check application logs for detailed execution information
  • Verify SQL scripts use the correct schema and table names
  • Ensure parameter placeholders match the expected format
  • Confirm the configuration file follows the correct JSON structure

Security Considerations

  • SQL scripts run with the organization's database role
  • Each organization's transformations are isolated to their own schema
  • The search path is reset after execution to prevent cross-schema access