Autoencoders for Anomaly Detection

Monitoring updates to automated datasets with TensorFlow

In an increasingly SaaS world, datasets are often published through an ETL pipeline, on some specified cadence, from their system of record to a separate location. When update failures happen, they are typically due to schema mismatches (the incoming data doesn’t match the existing dataset’s schema) or connectivity issues with the source system. These errors tend to be obvious, and are usually caught by established error-handling infrastructure.

There are, however, cases where an update is faulty in terms of the content being published (fewer rows than expected, erroneous values in a column, and so on), and these errors are harder to screen for. The update can run successfully and the dataset can publish as expected, yet the content of the data is inaccurate, so downstream metrics, vizzes, and reports are inaccurate as a result. These types of ETL errors are often referred to as “silent failures”.

Can we use deep learning to build a service that addresses this issue by automatically checking a dataset each time an update is published? The service would assess the dataset for completeness, obvious anomalies, and other inconsistencies. If the health checks don’t pass, the service can alert the user that their dataset may not have updated properly so they can take action.
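
To make the idea concrete, here is a minimal sketch of such a health check: train an autoencoder on summary statistics computed from historically healthy updates, then flag any new update whose reconstruction error is unusually high. The feature layout, network sizes, and 99th-percentile threshold below are illustrative assumptions, not a finished design.

```python
# Minimal sketch: an autoencoder trained only on "healthy" update statistics,
# used to flag new updates that it reconstructs poorly.
import numpy as np
import tensorflow as tf

def build_autoencoder(n_features: int) -> tf.keras.Model:
    """Small dense autoencoder: compress update stats, then reconstruct them."""
    inputs = tf.keras.Input(shape=(n_features,))
    encoded = tf.keras.layers.Dense(8, activation="relu")(inputs)
    bottleneck = tf.keras.layers.Dense(3, activation="relu")(encoded)
    decoded = tf.keras.layers.Dense(8, activation="relu")(bottleneck)
    outputs = tf.keras.layers.Dense(n_features, activation=None)(decoded)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Hypothetical training data: one row per past healthy update, with columns
# like [row_count, null_fraction, mean(col_a), std(col_a), ...], already
# standardized to zero mean / unit variance. Random stand-in for the demo.
rng = np.random.default_rng(0)
healthy_updates = rng.normal(size=(500, 6)).astype("float32")

model = build_autoencoder(n_features=6)
model.fit(healthy_updates, healthy_updates, epochs=20, batch_size=32, verbose=0)

# Threshold: the 99th percentile of reconstruction error on healthy updates.
train_errors = np.mean(
    (model.predict(healthy_updates, verbose=0) - healthy_updates) ** 2, axis=1
)
threshold = np.percentile(train_errors, 99)

def update_looks_healthy(update_stats: np.ndarray) -> bool:
    """Health check: reconstruction error at or below the learned threshold."""
    recon = model.predict(update_stats[None, :], verbose=0)[0]
    return float(np.mean((recon - update_stats) ** 2)) <= threshold
```

The appeal of the autoencoder here is that it is trained only on normal updates: it learns to reconstruct the profile of a healthy update well, so a silently broken update (dropped rows, a column full of bad values) tends to reconstruct poorly and trip the threshold without anyone having to hand-write a rule for that failure mode.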