Carlos Sánchez


Anomaly Detection Machine Learning Model

Overview

The Anomaly Detection Model is a neural network designed to predict the future Total Payment Volume (TPV) behavior of a group of merchants at two-hour intervals. The script leverages these predictions to estimate a minimum and maximum TPV for each time point. If the actual TPV falls outside this range, a visual alarm is triggered.

Link to the repo.

Technologies Used

The project is implemented with:

  • Python (pandas, NumPy)
  • TensorFlow (LSTM neural network)
  • AWS Athena, queried through PyAthena
  • Power BI

Final Product

The culmination of this project is a Power BI dashboard that automatically refreshes every two hours. The dashboard provides a comprehensive view of the predicted and actual TPV, highlighting anomalies that require attention.

Usage

  1. Data Preparation:
    • Ensure your dataset is formatted correctly.
    • Preprocess the data using the provided preprocessing scripts.
  2. Model Training:
    • Execute the training script to train the LSTM TensorFlow Neural Network.
  3. Prediction and Alarm:
    • Run the prediction script to obtain future TPV predictions.
    • The script will raise a visual alarm if the actual TPV is outside the predicted range.
  4. Power BI Dashboard:
    • Access the Power BI dashboard for a real-time visualization of TPV anomalies.
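Steps 2–3 above boil down to a band check. A minimal sketch of that logic follows; the function names and the 20% tolerance are hypothetical placeholders, since the repo derives the minimum and maximum from the model's own predictions:

```python
def predicted_band(prediction, tolerance=0.2):
    """Derive a (min, max) TPV band from a point prediction.

    The 20% tolerance is a hypothetical value, not the repo's;
    the actual band comes from the trained model."""
    return prediction * (1 - tolerance), prediction * (1 + tolerance)

def tpv_alarm(actual, prediction, tolerance=0.2):
    """Return True when the actual TPV falls outside the predicted band."""
    low, high = predicted_band(prediction, tolerance)
    return not (low <= actual <= high)
```

In the real pipeline this check runs for every two-hour time point, and the dashboard surfaces the points where the alarm fires.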

Dependencies

To make sure everything works properly, install all the dependencies listed in the requirements file in the repo:

pip install -r requirements.txt

Step-by-step technical explanation.

Data wrangling

First things first: we need historical data to train the model. To get it, I run AWS Athena queries iteratively. I could have used MySQL instead, but given the large number of rows needed, AWS Athena ended up being the best-performing option. The data wrangling is done in the data_get.py script.

Libraries and environment variables.

We'll need pandas and numpy as the framework for manipulating data, as well as datetime to generate the dates used in the script. pyathena is also needed to create the connection to AWS Athena, together with the corresponding credentials:

import pandas as pd
import numpy as np
from datetime import datetime
from pyathena import connect

# Load environment variables from the .env file
# Read the environment variables
aws_access_key_id = "company-access-key"
aws_secret_access_key = "company-secret-access-key"
aws_session_token = "company-session-token"
s3_staging_dir = "s3-address"
region_name = "region"

Data generation.

I used three functions to get the data needed for the project: athena_query, queries_string_generator, and data_get.
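The actual implementations live in the repo; as a rough illustration, queries_string_generator might look something like this, producing one query per month between two dates (the table and column names here are hypothetical placeholders, not the repo's):

```python
from datetime import datetime

def queries_string_generator(start, end):
    """Generate one Athena query string per month between start and end.

    Table and column names are hypothetical placeholders."""
    queries = []
    current = start
    while current < end:
        # advance to the first day of the next month
        nxt = datetime(current.year + (current.month == 12),
                       current.month % 12 + 1, 1)
        queries.append(
            "SELECT client_name, created_at, tpv "
            "FROM payments "
            f"WHERE created_at >= timestamp '{current:%Y-%m-%d}' "
            f"AND created_at < timestamp '{min(nxt, end):%Y-%m-%d}'"
        )
        current = nxt
    return queries
```

Querying month by month keeps each Athena scan small, which is what makes the iterative approach fast for a large number of rows.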

Finally, the three functions are put to work:

start = datetime(2020, 1, 1)
end = datetime(2023, 10, 1)
queries_test = queries_string_generator(start, end)
print(f"{len(queries_test)} queries were stored")

data_get(queries_test)

Data transform

As I mentioned, the script is designed to make predictions at two-hour intervals, so the next step is to transform the raw data into time intervals. Essentially, this is just a groupby on the interval and client_name. All the data transformation is in the data_transform.py script.
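A minimal version of that grouping, assuming a timestamp column named created_at and an amount column named tpv (both column names are my assumption, not necessarily the repo's):

```python
import pandas as pd

def to_two_hour_intervals(df):
    """Aggregate raw transactions into 2-hour TPV buckets per client.

    Column names (created_at, client_name, tpv) are assumptions."""
    df = df.copy()
    df["created_at"] = pd.to_datetime(df["created_at"])
    return (
        df.groupby([pd.Grouper(key="created_at", freq="2h"), "client_name"])
          ["tpv"].sum()
          .reset_index()
    )
```

pd.Grouper with freq="2h" is the idiomatic way to bucket a timestamp column into fixed intervals inside a groupby.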

Data consolidation.

Then, with the data_consolid.py script, I took all the .csv files generated for the two-hour intervals and concatenated them into a single file called full_2_hourly.csv.
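The consolidation step is essentially a glob-and-concat; the file pattern below is an assumption, not the repo's actual layout:

```python
import glob
import pandas as pd

def consolidate(pattern, out_path="full_2_hourly.csv"):
    """Concatenate every CSV matching `pattern` into a single file.

    `pattern` is a glob pattern, e.g. "data/2_hourly/*.csv" (assumed layout)."""
    files = sorted(glob.glob(pattern))
    full = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
    full.to_csv(out_path, index=False)
    return full
```

Sorting the file list keeps the rows in chronological order when the per-interval files are named by date.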
