The Anomalies Detection Model is a neural network designed to predict the future Total Payment Volume (TPV) behavior of a group of merchants at two-hour intervals. The script leverages these predictions to estimate a minimum and maximum TPV for each time point. If the actual TPV falls outside this range, a visual alarm is triggered.
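The range check itself is straightforward. A minimal sketch of that logic (column names and sample values here are hypothetical, not taken from the actual script):

```python
import pandas as pd

def flag_anomalies(df: pd.DataFrame) -> pd.DataFrame:
    """Flag rows where actual TPV falls outside the predicted [tpv_min, tpv_max] band."""
    out = df.copy()
    out["anomaly"] = (out["tpv"] < out["tpv_min"]) | (out["tpv"] > out["tpv_max"])
    return out

# Hypothetical two-hour interval data for one merchant group
df = pd.DataFrame({
    "interval": pd.date_range("2023-10-01", periods=3, freq="2h"),
    "tpv":     [100.0, 40.0, 250.0],
    "tpv_min": [80.0, 60.0, 90.0],
    "tpv_max": [120.0, 110.0, 200.0],
})
print(flag_anomalies(df)["anomaly"].tolist())  # [False, True, True]
```

Rows flagged `True` are the ones that would light up the visual alarm on the dashboard.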
Link to the repo. The project is implemented using a variety of technologies, including Python (pandas, NumPy, PyAthena), AWS Athena, and Power BI.
The culmination of this project is a Power BI dashboard that automatically refreshes every two hours. The dashboard provides a comprehensive view of the predicted and actual TPV, highlighting anomalies that require attention.
To ensure everything works properly, install all the dependencies listed in the requirements file included in the repo:
!pip install -r requirements.txt
First things first: we need historical data to train the model. To get it, I run AWS Athena queries iteratively. I could have used MySQL instead, but given the large number of rows required, AWS Athena ended up being the better-performing option. The data_get.py script handles this data wrangling.
We'll need pandas and NumPy as the framework to manipulate data, as well as datetime to generate the dates used in the script. We also need pyathena to create the connection to AWS Athena, along with the corresponding credentials:
import pandas as pd
import numpy as np
from datetime import datetime
from pyathena import connect
# AWS credentials and Athena settings (placeholders; in the real script
# these are loaded from environment variables / a .env file)
aws_access_key_id = "company-access-key"
aws_secret_access_key = "company-secret-access-key"
aws_session_token = "company-session-token"
s3_staging_dir = "s3-address"
region_name = "region"
I used three functions to get the data needed for the project: athena_query, queries_string_generator, and data_get.
start = datetime(2020, 1, 1)
end = datetime(2023, 10, 1)
queries_test = queries_string_generator(start, end)
print(f"{len(queries_test)} queries were stored")
data_get(queries_test)
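The article doesn't show the bodies of these helpers, but queries_string_generator presumably splits the date range into chunks so each Athena query stays small. A hypothetical reconstruction (table and column names are assumptions):

```python
from datetime import datetime, timedelta

def queries_string_generator(start: datetime, end: datetime, days: int = 30):
    """Split [start, end) into fixed-size chunks and build one query string per chunk,
    so each Athena query scans a bounded slice of the data."""
    queries = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + timedelta(days=days), end)
        queries.append(
            "SELECT * FROM transactions "  # hypothetical table name
            f"WHERE created_at >= timestamp '{chunk_start:%Y-%m-%d %H:%M:%S}' "
            f"AND created_at < timestamp '{chunk_end:%Y-%m-%d %H:%M:%S}'"
        )
        chunk_start = chunk_end
    return queries

queries_test = queries_string_generator(datetime(2020, 1, 1), datetime(2023, 10, 1))
print(f"{len(queries_test)} queries were stored")
```

data_get would then iterate over this list, run each query through the pyathena connection, and save the results to disk.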
As I mentioned, the script is designed to make predictions at two-hour intervals, so the next step is to transform the raw data into time intervals. Essentially, it's just a groupby on the interval and client_name. All the data transformation is in the data_transform.py script.
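The grouping step can be sketched in pandas like this (the raw column names and sample values are assumptions, not taken from data_transform.py):

```python
import pandas as pd

# Hypothetical raw transaction-level data
raw = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2023-10-01 00:15", "2023-10-01 01:45",
        "2023-10-01 02:30", "2023-10-01 03:10",
    ]),
    "client_name": ["acme", "acme", "acme", "globex"],
    "amount": [10.0, 20.0, 5.0, 7.0],
})

# Bucket each transaction into a two-hour interval and sum TPV per client
two_hourly = (
    raw.groupby([pd.Grouper(key="created_at", freq="2h"), "client_name"])["amount"]
       .sum()
       .reset_index()
       .rename(columns={"created_at": "interval", "amount": "tpv"})
)
print(two_hourly)
```

`pd.Grouper(freq="2h")` does the interval bucketing, so the first two transactions land in the 00:00 bucket and the last two in the 02:00 bucket, split by client.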
Then, with the data_consolid.py script, I took all the .csv files generated for the two-hour intervals and concatenated them into a single file called full_2_hourly.csv.
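The consolidation step boils down to globbing the chunk files and stacking them. A self-contained sketch (file names and columns are hypothetical; the sample files are created inline just so the snippet runs):

```python
import glob
import pandas as pd

# Create two small sample chunk files so the sketch is self-runnable
pd.DataFrame({"interval": ["2023-10-01 00:00"], "client_name": ["acme"],
              "tpv": [30.0]}).to_csv("2_hourly_part1.csv", index=False)
pd.DataFrame({"interval": ["2023-10-01 02:00"], "client_name": ["acme"],
              "tpv": [5.0]}).to_csv("2_hourly_part2.csv", index=False)

# Concatenate every per-interval CSV into one file
files = sorted(glob.glob("2_hourly_part*.csv"))
full = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
full.to_csv("full_2_hourly.csv", index=False)
print(len(full))
```

`ignore_index=True` rebuilds a clean index for the consolidated frame instead of carrying over each chunk's own row numbers.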
What can I say? This project was genuinely fun to build, and seeing the dashboard catch real anomalies made it all worth it.