Deep Anomaly Detection for large scale enterprise data

7 Feb 2020 | Team Wavelabs

In generic terms, anomaly detection helps distinguish events that are rare or that deviate from the norm. This matters a great deal in finance: in consumer banking, for example, an anomaly can be something critical such as credit card fraud. In other cases, an anomaly might be something a company actively looks for in order to benefit from it. Other applications include intrusion detection in communication networks, fake news and misinformation, healthcare analysis, industrial damage detection, manufacturing, and security and surveillance.

The use case shown in this article comes from the SAP domain, specifically finance. The business goal is to find anomalous behavior in financial transactions.

A typical financial transaction in an Accounting Information System would look like this.

Most such entries are regular transactions, but a small number show malicious behavior and turn out to be anomalies. Fraud detection is the most common use case across financial domains, and anomaly detection methods can help substantially where finding fraud would otherwise take a great deal of manual effort.

In this article, I will talk about a cutting-edge anomaly detection method based on deep learning: the Autoencoder Neural Network (AENN).

Well, about the dataset

The dataset used for this use case can be found in the GitHub link provided. It is a synthetic dataset of financial data, modified to resemble the real-world data one usually observes in SAP ERP systems, especially in the Finance and Cost Controlling module.

The dataset contains 7 categorical and 2 numerical attributes available in the FICO BKPF table (containing the posted journal entry headers) and the BSEG table (containing the posted journal entry segments).

Another attribute, “label”, is also present in the data and states the true nature of each transaction: regular or anomalous (local or global). It is provided to validate the model and is not used during training.

Classification of anomalies:

In industry, anomalies are usually classified in different ways depending on the use case. When conducting a detailed examination of real-world journal entries, usually recorded in large-scale AIS or ERP systems, two prevalent characteristics can be observed: (1) specific transaction attributes exhibit a high variety of distinct attribute values, e.g. customer information, posted sub-ledgers, amount information, and (2) the transactions exhibit strong dependencies between specific attribute values, e.g. between customer information and type of payment, or between posting type and general ledgers.

Derived from this observation, two classes of anomalous journal entries can be distinguished, namely “global” and “local” anomalies.

Global accounting anomalies are journal entries that exhibit unusual or rare individual attribute values. Standard audit tests are typically designed to capture this type of anomaly. However, such tests often result in a high volume of false-positive alerts due to events such as reverse postings, provisions and year-end adjustments, which are usually associated with a low fraud risk. Furthermore, when consulting with auditors and forensic accountants, “global” anomalies often refer to “error” rather than “fraud”.

Local accounting anomalies are journal entries that exhibit an unusual or rare combination of attribute values while their individual attribute values occur quite frequently, e.g. unusual accounting records, irregular combinations of general ledger accounts, or user accounts used by several accounting departments. This type of anomaly is significantly more difficult to detect since perpetrators intend to disguise their activities by imitating a regular activity pattern. As a result, such anomalies usually pose a high fraud risk since they correspond to processes and activities that might not be conducted in compliance with organizational standards.

Prerequisites: Readers are expected to be familiar with the basics of how neurons and neural networks work in deep learning. Here is an excellent tutorial to give you a precise understanding of neural networks.

Anomaly Detection using Autoencoder Neural Networks — Theory

Autoencoders have been widely used in computer vision and speech processing. But it is a little known fact that they can also be used for anomaly detection. In this section, we introduce the main elements of autoencoder neural networks.

A typical autoencoder consists of two non-linear mapping functions, the encoder f and the decoder g, each implemented as a neural network. The encoder usually follows a funnel-like paradigm with a decreasing number of neurons per layer, and the decoder is typically the symmetric mirror of the encoder. Between them sits a hidden central layer of lower dimension, referred to as the latent layer, which holds a compressed representation of the input that is rich enough to reconstruct it with minimal reconstruction error.

The idea behind using this algorithmic paradigm for anomaly detection consists of two main steps: learning the normal behavior of the system (based on past data) and detecting anomalous behavior in real-time (by processing real-time data).

Because the dataset is heavily skewed towards regular transactions, the network learns to reconstruct a regular transaction well and fails to do so for an anomaly. A high reconstruction error therefore indicates that a transaction is likely to be an anomaly. Here our loss function is the reconstruction error itself.


Loss function (reconstruction error) = arg min || x − g(f(x)) ||

In this use case, we used the binary cross-entropy loss, given by:


−(x · log(x’) + (1 − x) · log(1 − x’))

where x is the input data and x’ = g(f(x)) is its reconstruction. This measures how close the reconstruction is to the input: the lower the loss, the more similar the input and its reconstruction are.
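
To make the formula concrete, here is a minimal sketch in plain Python that evaluates the binary cross-entropy for a close and a poor reconstruction of the same input. It is not part of the original notebook: the helper name binary_cross_entropy, the eps clipping and the toy vectors are assumptions made for illustration only.

```python
import math


def binary_cross_entropy(x, x_rec, eps=1e-12):
    """Mean binary cross-entropy between an input x and its reconstruction x_rec.

    Both arguments are equal-length sequences with values in [0, 1].
    eps clips the reconstruction away from 0 and 1 so that log() stays finite.
    """
    total = 0.0
    for xi, ri in zip(x, x_rec):
        ri = min(max(ri, eps), 1.0 - eps)
        total += -(xi * math.log(ri) + (1.0 - xi) * math.log(1.0 - ri))
    return total / len(x)


x = [1.0, 0.0, 0.0, 1.0]          # toy one-hot style input
good_rec = [0.9, 0.1, 0.1, 0.9]   # close reconstruction
bad_rec = [0.1, 0.9, 0.9, 0.1]    # poor reconstruction

loss_good = binary_cross_entropy(x, good_rec)
loss_bad = binary_cross_entropy(x, bad_rec)
print(loss_good, loss_bad)
```

The close reconstruction gives a loss of roughly 0.105 and the poor one roughly 2.303. That gap is what the anomaly detector relies on.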

Implementation

Note: This is where it gets a bit technical, so I advise non-technical readers to skip this section. You can go through it, but don’t be intimidated by it 🙂

Import the necessary libraries and set some parameters.

# importing utilities
import os
import sys
from datetime import datetime
# importing data science libraries
import pandas as pd
import random as rd
import numpy as np
# importing pytorch libraries
import torch
from torch import nn
from torch import autograd
from torch.utils.data import DataLoader
# import visualization libraries
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
from IPython.display import Image, display
sns.set_style('darkgrid')
# ignore potential warnings
import warnings
warnings.filterwarnings("ignore")

Set random seed and use GPU if available.


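# set USE_CUDA before it is referenced in the checks below
USE_CUDA = True
rseed = 1234
rd.seed(rseed)
np.random.seed(rseed)
torch.manual_seed(rseed)
# seed the GPU random number generator as well if cudnn is available
if (torch.backends.cudnn.version() != None and USE_CUDA == True):
    torch.cuda.manual_seed(rseed)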

Import the data into a pandas data frame.


ad_dataset = pd.read_csv('./data/fraud_dataset_v2.csv')
ad_dataset.head()

Look at shape and label value_counts.


ad_dataset.shape
Out[#]: (533009, 10)
ad_dataset.label.value_counts()
Out[#]: regular    532909
         global         70
         local          30
         Name: label, dtype: int64

As you can see, it’s a highly imbalanced dataset, which is true for most real-world data: anomalies make up less than 0.02% of all transactions. Most standard supervised classifiers struggle with such extreme imbalance. The approach shown in this article instead leverages autoencoders to find the anomalies.
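
The imbalance is easy to quantify from the counts above. The short sketch below, which is not part of the original notebook, also shows why plain accuracy is a misleading yardstick here: a hypothetical baseline that labels every transaction as regular would still score above 99.9% accuracy while finding no anomalies at all.

```python
# label counts taken from the value_counts() output above
counts = {"regular": 532909, "global": 70, "local": 30}

total = sum(counts.values())
anomalies = counts["global"] + counts["local"]

# share of anomalous transactions in the whole dataset
anomaly_share = anomalies / total

# accuracy of a trivial baseline that always predicts "regular"
baseline_accuracy = counts["regular"] / total

print(f"total transactions: {total}")
print(f"anomaly share:      {anomaly_share:.4%}")
print(f"baseline accuracy:  {baseline_accuracy:.4%}")
```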

Let’s remove the label before further processing, since the autoencoder is trained in an unsupervised fashion.


label = ad_dataset.pop('label')

Now let’s split categorical and numerical attributes. Add one-hot encodings to the categorical attributes to vectorize them. Apply log scaling and min-max scaling to the numerical variables.


categorical_attr = ['KTOSL', 'PRCTR', 'BSCHL', 'HKONT', 'WAERS', 'BUKRS']
ad_dataset_categ_transformed = pd.get_dummies(ad_dataset[categorical_attr])
numeric_attr_names = ['DMBTR', 'WRBTR']
# add a small epsilon to eliminate zero values from data for log scaling
numeric_attr = ad_dataset[numeric_attr_names] + 1e-7
numeric_attr = numeric_attr.apply(np.log)
ad_dataset_numeric_attr = (numeric_attr - numeric_attr.min()) / (numeric_attr.max() - numeric_attr.min())
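
To see what these three transformations do, here is a small dependency-free sketch on three invented journal entries. It mirrors the steps above (one-hot encoding, log scaling with an epsilon, min-max scaling) in plain Python instead of pandas; the toy values and variable names are assumptions for illustration.

```python
import math

# toy journal entries; the values are invented for illustration
rows = [
    {"WAERS": "USD", "DMBTR": 100.0},
    {"WAERS": "EUR", "DMBTR": 10000.0},
    {"WAERS": "USD", "DMBTR": 0.0},
]

# one-hot encoding: one 0/1 column per distinct category value
categories = sorted({row["WAERS"] for row in rows})
one_hot = [[1.0 if row["WAERS"] == c else 0.0 for c in categories] for row in rows]

# log scaling with a small epsilon so that zero amounts stay finite
eps = 1e-7
logged = [math.log(row["DMBTR"] + eps) for row in rows]

# min-max scaling maps the logged amounts into [0, 1]
lo, hi = min(logged), max(logged)
scaled = [(v - lo) / (hi - lo) for v in logged]

# final feature vectors: one-hot columns followed by the scaled amount
features = [oh + [s] for oh, s in zip(one_hot, scaled)]
for f in features:
    print(f)
```

Every resulting feature lies in [0, 1], which is what a binary cross-entropy reconstruction loss expects from its targets.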

Concatenate both numerical and categorical attributes.


ad_subset_transformed = pd.concat([ad_dataset_categ_transformed, ad_dataset_numeric_attr], axis = 1)
ad_subset_transformed.shape
Out[#]: (533009, 618)

Now let’s implement the encoder network (618–512–256–128–64–32–16–8–4–3).


# implementation of the encoder network
class encoder(nn.Module):
    def __init__(self):
        super(encoder, self).__init__()
        # specify layer 1 - in 618, out 512
        self.encoder_L1 = nn.Linear(in_features=ad_subset_transformed.shape[1], out_features=512, bias=True) # add linearity
        nn.init.xavier_uniform_(self.encoder_L1.weight) # init weights according to [9]
        self.encoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True) # add non-linearity according to [10]
# specify layer 2 - in 512, out 256
        self.encoder_L2 = nn.Linear(512, 256, bias=True)
        nn.init.xavier_uniform_(self.encoder_L2.weight)
        self.encoder_R2 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 3 - in 256, out 128
        self.encoder_L3 = nn.Linear(256, 128, bias=True)
        nn.init.xavier_uniform_(self.encoder_L3.weight)
        self.encoder_R3 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 4 - in 128, out 64
        self.encoder_L4 = nn.Linear(128, 64, bias=True)
        nn.init.xavier_uniform_(self.encoder_L4.weight)
        self.encoder_R4 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 5 - in 64, out 32
        self.encoder_L5 = nn.Linear(64, 32, bias=True)
        nn.init.xavier_uniform_(self.encoder_L5.weight)
        self.encoder_R5 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 6 - in 32, out 16
        self.encoder_L6 = nn.Linear(32, 16, bias=True)
        nn.init.xavier_uniform_(self.encoder_L6.weight)
        self.encoder_R6 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 7 - in 16, out 8
        self.encoder_L7 = nn.Linear(16, 8, bias=True)
        nn.init.xavier_uniform_(self.encoder_L7.weight)
        self.encoder_R7 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 8 - in 8, out 4
        self.encoder_L8 = nn.Linear(8, 4, bias=True)
        nn.init.xavier_uniform_(self.encoder_L8.weight)
        self.encoder_R8 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 9 - in 4, out 3
        self.encoder_L9 = nn.Linear(4, 3, bias=True)
        nn.init.xavier_uniform_(self.encoder_L9.weight)
        self.encoder_R9 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# init dropout layer with probability p
        self.dropout = nn.Dropout(p=0.0, inplace=True)
        
    def forward(self, x):
# define forward pass through the network
        x = self.encoder_R1(self.dropout(self.encoder_L1(x)))
        x = self.encoder_R2(self.dropout(self.encoder_L2(x)))
        x = self.encoder_R3(self.dropout(self.encoder_L3(x)))
        x = self.encoder_R4(self.dropout(self.encoder_L4(x)))
        x = self.encoder_R5(self.dropout(self.encoder_L5(x)))
        x = self.encoder_R6(self.dropout(self.encoder_L6(x)))
        x = self.encoder_R7(self.dropout(self.encoder_L7(x)))
        x = self.encoder_R8(self.dropout(self.encoder_L8(x)))
        x = self.encoder_R9(self.encoder_L9(x))
        return x
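
As a rough sanity check on the size of this network, the following sketch counts the trainable parameters implied by the layer widths listed above. It is plain Python rather than PyTorch, and the helper name count_linear_parameters is hypothetical; it only assumes that each nn.Linear layer has a weight matrix and a bias vector, as in the class above.

```python
# layer widths of the encoder: 618 inputs down to a 3-dimensional latent code
layer_sizes = [618, 512, 256, 128, 64, 32, 16, 8, 4, 3]


def count_linear_parameters(sizes):
    """Weights plus biases of a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix + bias vector
    return total


encoder_params = count_linear_parameters(layer_sizes)
# the decoder mirrors the encoder, so its widths are the same list reversed
decoder_params = count_linear_parameters(layer_sizes[::-1])
print(encoder_params, decoder_params)
```

Each half of the autoencoder ends up with roughly half a million parameters, and most of them sit in the 618-by-512 layer, which is typical for wide one-hot encoded inputs.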

Instantiate the encoder and put it on GPU.


# init training network classes / architectures
encoder_train = encoder()
# push to cuda if cudnn is available
if (torch.backends.cudnn.version() != None and USE_CUDA == True):
    encoder_train = encoder().cuda()

Now, the decoder network implementation which is the symmetric mirror of the encoder. (3–4–8–16–32–64–128–256–512–618)


# implementation of the decoder network
class decoder(nn.Module):
    def __init__(self):
        super(decoder, self).__init__()
        # specify layer 1 - in 3, out 4
        self.decoder_L1 = nn.Linear(in_features=3, out_features=4, bias=True) # add linearity
        nn.init.xavier_uniform_(self.decoder_L1.weight)  # init weights according to [9]
        self.decoder_R1 = nn.LeakyReLU(negative_slope=0.4, inplace=True) # add non-linearity according to [10]
# specify layer 2 - in 4, out 8
        self.decoder_L2 = nn.Linear(4, 8, bias=True)
        nn.init.xavier_uniform_(self.decoder_L2.weight)
        self.decoder_R2 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 3 - in 8, out 16
        self.decoder_L3 = nn.Linear(8, 16, bias=True)
        nn.init.xavier_uniform_(self.decoder_L3.weight)
        self.decoder_R3 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 4 - in 16, out 32
        self.decoder_L4 = nn.Linear(16, 32, bias=True)
        nn.init.xavier_uniform_(self.decoder_L4.weight)
        self.decoder_R4 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 5 - in 32, out 64
        self.decoder_L5 = nn.Linear(32, 64, bias=True)
        nn.init.xavier_uniform_(self.decoder_L5.weight)
        self.decoder_R5 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 6 - in 64, out 128
        self.decoder_L6 = nn.Linear(64, 128, bias=True)
        nn.init.xavier_uniform_(self.decoder_L6.weight)
        self.decoder_R6 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        
        # specify layer 7 - in 128, out 256
        self.decoder_L7 = nn.Linear(128, 256, bias=True)
        nn.init.xavier_uniform_(self.decoder_L7.weight)
        self.decoder_R7 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 8 - in 256, out 512
        self.decoder_L8 = nn.Linear(256, 512, bias=True)
        nn.init.xavier_uniform_(self.decoder_L8.weight)
        self.decoder_R8 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
# specify layer 9 - in 512, out 618
        self.decoder_L9 = nn.Linear(in_features=512, out_features=ad_subset_transformed.shape[1], bias=True)
        nn.init.xavier_uniform_(self.decoder_L9.weight)
        self.decoder_R9 = nn.LeakyReLU(negative_slope=0.4, inplace=True)
        # init dropout layer with probability p
        self.dropout = nn.Dropout(p=0.0, inplace=True)

    def forward(self, x):
# define forward pass through the network
        x = self.decoder_R1(self.dropout(self.decoder_L1(x)))
        x = self.decoder_R2(self.dropout(self.decoder_L2(x)))
        x = self.decoder_R3(self.dropout(self.decoder_L3(x)))
        x = self.decoder_R4(self.dropout(self.decoder_L4(x)))
        x = self.decoder_R5(self.dropout(self.decoder_L5(x)))
        x = self.decoder_R6(self.dropout(self.decoder_L6(x)))
        x = self.decoder_R7(self.dropout(self.decoder_L7(x)))
        x = self.decoder_R8(self.dropout(self.decoder_L8(x)))
        x = self.decoder_R9(self.decoder_L9(x))
        
        return x

Instantiate the decoder and put it on GPU.


# init training network classes / architectures
decoder_train = decoder()
# push to cuda if cudnn is available
if (torch.backends.cudnn.version() != None) and (USE_CUDA == True):
    decoder_train = decoder().cuda()

Now setting the loss function and some hyperparameters.


# define the optimization criterion / loss function
loss_function = nn.BCEWithLogitsLoss(reduction='mean')
# define learning rate and optimization strategy
learning_rate = 1e-3
encoder_optimizer = torch.optim.Adam(encoder_train.parameters(), lr=learning_rate)
decoder_optimizer = torch.optim.Adam(decoder_train.parameters(), lr=learning_rate)
# specify training parameters
num_epochs = 8
mini_batch_size = 128

Load the data into a tensor and onto GPU.


# convert pre-processed data to pytorch tensor
torch_dataset = torch.from_numpy(ad_subset_transformed.values).float()
# wrap the tensor in a data loader - CPU version (CUDA not enabled)
dataloader = DataLoader(torch_dataset, batch_size=mini_batch_size, shuffle=True, num_workers=0)
# note: we set num_workers to zero to retrieve deterministic results
# determine if CUDA is available at compute node
if (torch.backends.cudnn.version() != None) and (USE_CUDA == True):
    dataloader = DataLoader(torch_dataset.cuda(), batch_size=mini_batch_size, shuffle=True)

Now to our training. (Note: I advise against copy-pasting the code below, as the formatting may break. Please get the code from the GitHub link mentioned below.)


# init collection of mini-batch losses
losses = []
# convert encoded transactional data to torch Variable
data = autograd.Variable(torch_dataset)
# train autoencoder model
for epoch in range(num_epochs):
# init mini batch counter
    mini_batch_count = 0
    
    # determine if CUDA is available at compute node
    if(torch.backends.cudnn.version() != None) and (USE_CUDA == True):
        
        # set networks / models in GPU mode
        encoder_train.cuda()
        decoder_train.cuda()
# set networks in training mode (apply dropout when needed)
    encoder_train.train()
    decoder_train.train()
# start timer
    start_time = datetime.now()
        
    # iterate over all mini-batches
    for mini_batch_data in dataloader:
# increase mini batch counter
        mini_batch_count += 1
# convert mini batch to torch variable
        mini_batch_torch = autograd.Variable(mini_batch_data)
# =================== (1) forward pass ============================
# run forward pass
        z_representation = encoder_train(mini_batch_torch) # encode mini-batch data
        mini_batch_reconstruction = decoder_train(z_representation) # decode mini-batch data
        
        # =================== (2) compute reconstruction loss ======
# determine reconstruction loss
        reconstruction_loss = loss_function(mini_batch_reconstruction, mini_batch_torch)
        
        # =================== (3) backward pass ====================
# reset graph gradients
        decoder_optimizer.zero_grad()
        encoder_optimizer.zero_grad()
# run backward pass
        reconstruction_loss.backward()
        
        # =================== (4) update model parameters =========
# update network parameters
        decoder_optimizer.step()
        encoder_optimizer.step()
# =================== monitor training progress ===================
# print training progress each 1'000 mini-batches
        if mini_batch_count % 1000 == 0:
            
            # print the training mode: either on GPU or CPU
            mode = 'GPU' if (torch.backends.cudnn.version() != None) and (USE_CUDA == True) else 'CPU'
            
            # print mini batch reconstuction results
            now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
            end_time = datetime.now() - start_time
            print('[LOG {}] training status, epoch: [{:04}/{:04}], batch: {:04}, loss: {}, mode: {}, time required: {}'.format(now, (epoch+1), num_epochs, mini_batch_count, np.round(reconstruction_loss.item(), 4), mode, end_time))
# reset timer
            start_time = datetime.now()
# =================== evaluate model performance ================
    
    # set networks in evaluation mode (don't apply dropout)
    encoder_train.cpu().eval()
    decoder_train.cpu().eval()
    # evaluate without tracking gradients to keep memory usage down
    with torch.no_grad():
        # reconstruct encoded transactional data
        reconstruction = decoder_train(encoder_train(data))
        # determine reconstruction loss - all transactions
        reconstruction_loss_all = loss_function(reconstruction, data)
            
    # collect reconstruction loss
    losses.extend([reconstruction_loss_all.item()])
    
    # print reconstuction loss results
    now = datetime.utcnow().strftime("%Y%m%d-%H:%M:%S")
    print('[LOG {}] training status, epoch: [{:04}/{:04}], loss: {:.10f}'.format(now, (epoch+1), num_epochs, reconstruction_loss_all.item()))
# =================== save model snapshot to disk ================
    
    # save trained encoder model file to disk (create the target directory if needed)
    os.makedirs("./models", exist_ok=True)
    encoder_model_name = "ep_{}_encoder_model.pth".format((epoch+1))
    torch.save(encoder_train.state_dict(), os.path.join("./models", encoder_model_name))
# save trained decoder model file to disk
    decoder_model_name = "ep_{}_decoder_model.pth".format((epoch+1))
    torch.save(decoder_train.state_dict(), os.path.join("./models", decoder_model_name))

Plotting the losses.


# plot the training progress
plt.plot(range(0, len(losses)), losses)
plt.xlabel('[training epoch]')
plt.xlim([0, len(losses)])
plt.ylabel('[reconstruction-error]')
#plt.ylim([0.0, 1.0])
plt.title('AENN training performance')

This completes our training. Now let’s look at how to leverage our models to get predictions.

Load the pre-trained models.


# restore pretrained model checkpoint
encoder_model_name = "ep_8_encoder_model.pth"
decoder_model_name = "ep_8_decoder_model.pth"
# init training network classes / architectures
encoder_eval = encoder()
decoder_eval = decoder()
# load trained models
encoder_eval.load_state_dict(torch.load(os.path.join("models", encoder_model_name)))
decoder_eval.load_state_dict(torch.load(os.path.join("models", decoder_model_name)))

Perform the reconstruction for the whole dataset.


# convert encoded transactional data to torch Variable
data = autograd.Variable(torch_dataset)
# set networks in evaluation mode (don't apply dropout)
encoder_eval.eval()
decoder_eval.eval()
# reconstruct encoded transactional data
reconstruction = decoder_eval(encoder_eval(data))

Get the reconstruction loss for the whole dataset.


# determine reconstruction loss - all transactions
reconstruction_loss_all = loss_function(reconstruction, data)
print('reconstruction loss: {:.10f}'.format(reconstruction_loss_all.item()))
reconstruction loss: 0.0034663924

Determine reconstruction loss for individual transactions.


# init binary cross entropy errors
reconstruction_loss_transaction = np.zeros(reconstruction.size()[0])
# iterate over all detailed reconstructions
for i in range(0, reconstruction.size()[0]):
# determine reconstruction loss - individual transactions
    reconstruction_loss_transaction[i] = loss_function(reconstruction[i], data[i]).item()

Plot the data points according to their reconstruction losses, colored by their labels.


# prepare plot
fig = plt.figure()
ax = fig.add_subplot(111)
# assign unique id to transactions
plot_data = np.column_stack((np.arange(len(reconstruction_loss_transaction)), reconstruction_loss_transaction))
# obtain regular transactions as well as global and local anomalies
regular_data = plot_data[label == 'regular']
global_outliers = plot_data[label == 'global']
local_outliers = plot_data[label == 'local']
# plot reconstruction error scatter plot
ax.scatter(regular_data[:, 0], regular_data[:, 1], c='C0', alpha=0.4, marker="o", label='regular') # plot regular transactions
ax.scatter(global_outliers[:, 0], global_outliers[:, 1], c='C1', marker="^", label='global') # plot global outliers
ax.scatter(local_outliers[:, 0], local_outliers[:, 1], c='C2', marker="^", label='local') # plot local outliers
# add plot legend of transaction classes
ax.legend(loc='best')

The plot shows how the chosen approach elegantly found the anomalies from a highly biased dataset. Let’s look at how many anomalies were identified.


ad_dataset['label'] = label
ad_dataset[reconstruction_loss_transaction >= 0.1].label.value_counts()
Out[#]: global    59
        local      2
        Name: label, dtype: int64
ad_dataset[(reconstruction_loss_transaction >= 0.018) & (reconstruction_loss_transaction < 0.05)].label.value_counts()
Out[#]: local   23
        Name: label, dtype: int64

As you can see, 59 of the 70 global anomalies were detected (84.3%), and 23 of the 30 local anomalies fall into the second loss band (76.7%). That is a strong result considering that anomalies make up less than 0.02% of the whole dataset.
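
The detection rates can be recomputed from the counts printed above. The sketch below is not part of the original notebook and assumes the two value_counts outputs are complete. Since the two loss ranges do not overlap, the 2 local anomalies with a loss of at least 0.1 can be counted in addition to the 23 in the lower band.

```python
# counts reported in the outputs above
total_global, total_local = 70, 30
found_global = 59        # global anomalies with loss >= 0.1
found_local_high = 2     # local anomalies with loss >= 0.1
found_local_band = 23    # local anomalies with 0.018 <= loss < 0.05

# recall = detected anomalies / all anomalies of that class
recall_global = found_global / total_global
recall_local_band = found_local_band / total_local
recall_local_all = (found_local_high + found_local_band) / total_local

print(f"global recall:              {recall_global:.1%}")
print(f"local recall (band only):   {recall_local_band:.1%}")
print(f"local recall (both ranges): {recall_local_all:.1%}")
```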

Here is the GitHub link for the code implementation along with the dataset.

I hope this gives a clear understanding of the approach and how to implement it.

Conclusion

This shows that applying deep learning algorithms to classical machine learning problems on structured data can give promising results if the solution is designed well. Identifying the right algorithm, an appropriate loss function and a suitable dataset can help data scientists tap into deep learning and use its capabilities to improve on age-old approaches. The use case in this article deals with financial transactions, but the idea of deep anomaly detection can be extended to other domains such as manufacturing and marketing.