Using your own algorithms with hyperparameter optimization in AWS SageMaker

TL;DR

  • AWS SageMaker provides a number of standard Machine Learning algorithms in containerized form, so you can pull those algorithms down onto a large EC2 instance and just run, with minimal effort.
  • AWS SageMaker also provides a hyperparameter optimization functionality that pretty much runs ‘out-of-the-box’ with the algorithms provided.
  • You can run your own algorithms within SageMaker if you containerize your algorithm code.
  • I wanted to find out if it was possible to easily combine the ‘run-your-own-containerized-algorithm’ functionality with the ‘out-of-the-box’ hyperparameter optimization functionality in SageMaker. It is. It was a straightforward, but slightly lengthy, process.

Introduction

<DISCLAIMER> This is a blog post I started back in Autumn/Winter 2019. I knew it would be a fairly lengthy post, but one I was keen to write. But then, well, a pandemic got in the way and it's taken a while to get back to writing blog posts. I still believe there are some useful learnings here – I hope you do too </DISCLAIMER>.


Back in 2019 I was using SageMaker a lot, including running an AWS Machine Learning Immersion Day at Infinity Works. One of the things I like about SageMaker is how the resources used to do any heavy lifting in training a model are separated from the resources supporting the Jupyter notebook. The SageMaker service provides several standard Machine Learning algorithms (e.g. Random Forests, XGBoost) in containers. This means it is possible to explore a dataset and develop a modelling approach in a Jupyter notebook that runs on one EC2 instance, and then when we want to scale up the training process to the full dataset we can pull down the relevant container from ECR and run the training process on a separate, much larger instance. Provisioning of the heavier infrastructure needed for training on the full dataset is only done when it is needed, and you only pay for what you use of those larger EC2 instances. A Data Scientist like me doesn't have to worry about the provisioning of the larger EC2 instance; it is handled through a few simple configuration options when setting up the training job. It is also possible to configure a hyperparameter optimization job in a similar way, so that multiple training jobs (with different hyperparameter values) can be easily run, potentially in parallel, on large EC2 instances just by adjusting a few lines of json config.

So far, so good. As a Data Scientist the pain of getting access to or configuring compute resource has been removed, and training on really large datasets is almost as easy as exploring a smaller dataset in a Jupyter notebook running on my local machine. But are we restricted to only using the algorithms that AWS has containerized? This is where it gets more interesting and fun. You can use any algorithm that is available as a container in ECR. That means you can develop/code up your own algorithm/training process, containerize it, and then run that algorithm using multiple large EC2 instances with minimal config.

AWS have an example of how to containerize your own algorithm and deploy it to an endpoint. The git repo is here. The AWS team use the example of a scikit-learn decision tree trained on the Iris dataset (I know – why do examples never use something more original than the Iris dataset?).

What I wanted to explore was,

  • How easy was it to actually containerize my own algorithm for use in SageMaker,
  • How easy was it to combine my containerized algorithm with the easy to configure hyperparameter optimization capability already present in SageMaker.

The rest of this post is about what I learnt in exploring those two questions, in particular the second of them. The first question is essentially already answered by the original AWS repo. What I wanted to learn was whether I could easily use my own algorithm with the out-of-the-box hyperparameter optimization functionality that SageMaker provides, or whether that easy-to-use functionality is essentially restricted to the built-in SageMaker algorithms. What I'll cover is,

  1. The choice of algorithm we’re going to containerize
  2. The basics of building the Docker container
  3. Pushing the container to the AWS container registry
  4. Using the containerized algorithm within a SageMaker notebook
  5. Running hyperparameter optimization jobs using the containerized algorithm.

If you want to follow the technical details, I would suggest that you first become familiar with the basics of AWS SageMaker – tutorial here. You may also want to look at the basics of hyperparameter tuning for one of the standard machine learning algorithms within SageMaker, as I’ll be assuming some of this background knowledge is known to you or at least you can pick it up quickly – to fully explain all the SageMaker background material would make this an even longer blog. You can find explanations of how to configure and run a SageMaker hyperparameter tuning job here and here.

Now, let’s start with the first of our questions.

Algorithm choice

I wanted to use an algorithm that wasn’t already available within SageMaker, otherwise what would be the point of going through this exercise? I have been doing some work recently on Gaussian Processes (GPs), in particular with kernel functions that are composite functions.

I won't explicitly cover the basics of GPs here – the blog post is long enough already. Instead I will point you towards the excellent book by Carl Rasmussen and Chris Williams and this tutorial from Neil Lawrence. However, I will say briefly what my interest in GPs is. Gaussian Processes have an interesting connection with large (wide) Neural Networks. This connection was discovered by Chris Williams and Radford Neal. I wrote some GP code, on the basis of Williams' paper, that made it into commercial software (a first for me) back in 1999 (yes – I am that old, and have been working in Machine Learning that long). More recently, the connection has been extended to link Deep Learning Neural Networks and Gaussian Processes (see for example, here and here). Cho & Saul did some nice early work in this area, using dot-product kernels that are composite functions. It is the dot-product kernels derived by Cho & Saul that I'll use here for my example algorithm, as the kernels are of relatively simple form, and yet are specified in terms of a few simple parameters that we can regard as hyperparameters. For the purposes of this blog on AWS SageMaker it is not important to know what the Cho & Saul kernels might represent, merely how they are defined mathematically. So let's start there.

For this illustration we are focusing on datapoints on the surface of the unit hypersphere, i.e. {\bf x} \in \mathbb{R}^{d} with ||{\bf x}||_{2}^{2}\;=\;1 . We then consider a set of kernels, K_{q,l}\left (  {\bf x}_{1}, {\bf x}_{2} \right) , defined via,

K_{q,l}\left ({\bf x}_{1}, {\bf x}_{2} \right )\;=\;  k_{q,l}\left ( {\bf x}_{1}\cdot {\bf x}_{2} \right )

The dot-product kernels k_{q,l}(t) are defined iteratively,

k_{q,l+1}(t)\;=\;k_{q,0}\left ( k_{q, l}(t)\right )

The base kernels k_{q,0}(t) are constructed from,

k_{q,0}(t)\;=\; J_{q}\left ( \arccos (t)\right ) / J_{q}\left ( 0\right )

with,

J_{q}\left ( \theta \right )\;=\;(-1)^{q}\left (\sin\theta\right)^{2q+1}\left ( \frac{1}{\sin\theta}\frac{\partial}{\partial \theta}\right )^{q} \left ( \frac{\pi-\theta}{\sin\theta}\right )

Choosing a particular kernel then boils down to making a choice for q and l. Once we have made a choice of kernel, we can train our model. For simplicity, I have defined the model training here to be simply the process of constructing the Gram matrix from the training data, i.e. the process of calculating the matrix elements,

M_{ij} = \sigma^{2}\delta_{ij}\;+\;K_{q,l}\left ( {\bf x}_{i}, {\bf x}_{j}\right)

Here, σ² is the variance of the additive Gaussian noise that we consider present in the response variable, and {\bf x}_{i}\;,\; i=1,2,\ldots,N , are the feature vectors for the N datapoints in the training set. Along with the training feature vectors we also have the response variable values, y_{i} .

Whilst it may not match the more traditional concept of model training – there is no iterative process to minimize some cost function – I am using the training data to construct a mathematical object required for calculating the expectation of the response variable conditional on the input features. Within a Gaussian Process it is usual to optimize any parameters of the covariance kernel as part of the model training. In this case, for simplicity, and for the purposes of illustrating the hyperparameter tuning capabilities of SageMaker, I wanted to treat the kernel parameters q, l and σ² as hyperparameters, essentially leaving no remaining kernel parameters to be optimized during the model training.

Once we have the matrix {\bf M} defined, we can calculate a prediction for the response variable at a new feature vector {\bf x}_{\star} via the formula,

\mathbb{E}\left ( y\left ( {\bf x}_{\star}\right )\right )\;=\;{\bf v}\left ( {\bf x}_{\star}\right )^{\top}{\bf M}^{-1}{\bf y}\,\,

where {\bf y} is the vector of response values in the training set, and the vector {\bf v}\left ( {\bf  x}_{\star} \right )\;=\; (v_{1}, v_{2}, \ldots, v_{N}), with the element v_{i}\left ( {\bf x}_{\star}\right ) given by,

v_{i}\;=\; k_{q,l}\left ( {\bf x}_{\star}\cdot {\bf x}_{i}\right )

Now that we have given the mathematical definition of our algorithm, we need to focus on code. Following the example in the original AWS repo we need Python code that,

  1. Defines a class for a trained GP model. I have called my class, unsurprisingly, trainedGPModel. Instantiating an instance of this class, by passing the training data to the class constructor method, runs the Gram matrix calculation process mentioned earlier. Within my trainedGPModel class I also have a method predict(xstar) that returns the predicted expectation of the response variable given an input datapoint xstar. The code for the trainedGPModel class implements the linear algebra formulae given above and so is straightforward – a minimal sketch of what such a class can look like is given just after this list.
  2. We also need code that runs the training process. This code is held in a file called train. I made minimal modifications to the train module in the original AWS repo. The main change I made was including code to make predictions on a validation dataset, and from that calculating the Root-Mean-Squared-Error (RMSE) on the validation dataset. The validation RMSE is the metric I will use for hyperparameter tuning and so I have to write the validation RMSE value to stdout so that it can get picked up by the SageMaker hyperparameter tuning process. I had to write the RMSE value with a string prefix and delimiter, e.g.
print( "validation:RMSE=" + str(RMSE_validation) + ";" )

with a corresponding matching regex in the configuration of the hyperparameter tuning job – see the later section on running the containerized algorithm in a SageMaker notebook. It wasn't obvious that I needed to write the validation metric in this way, and it took a bit of googling to work out. Most SageMaker links on hyperparameter tuning point to this page, but the detail on how the metric is passed between your algorithm code and the SageMaker hyperparameter optimization code is actually explained in this SageMaker documentation page.
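
To make the trainedGPModel structure concrete, here is a minimal sketch of what such a class could look like for the q=0 kernel. This is an illustration written for this post rather than the exact code in my repo, and the helper name k_q0 is just a label I have chosen:

import numpy as np

def k_q0(t, l):
    # Base kernel for q=0: J_0(theta) = pi - theta, so k_{0,0}(t) = (pi - arccos(t)) / pi.
    # The iterated kernel k_{0,l} applies the base kernel a further l times,
    # following k_{q,l+1}(t) = k_{q,0}( k_{q,l}(t) ).
    k = (np.pi - np.arccos(np.clip(t, -1.0, 1.0))) / np.pi
    for _ in range(int(l)):
        k = (np.pi - np.arccos(np.clip(k, -1.0, 1.0))) / np.pi
    return k

class trainedGPModel:
    def __init__(self, X_train, y_train, l, noise):
        # "Training" is just building the Gram matrix M = sigma^2 I + K and
        # solving against the training responses once, so that predictions
        # only need a dot product.
        self.X = X_train                 # rows assumed scaled to unit length
        self.l = l
        K = k_q0(self.X @ self.X.T, l)
        M = noise * np.eye(self.X.shape[0]) + K
        self.Minv_y = np.linalg.solve(M, y_train.reshape(-1, 1))

    def predict(self, xstar):
        # E[y(x_star)] = v(x_star)^T M^{-1} y, with v_i = k_{q,l}(x_star . x_i)
        v = k_q0(self.X @ xstar.reshape(-1, 1), self.l)
        return float(v.T @ self.Minv_y)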

Docker basics

Now let's talk about putting our code in a container. We need to construct a Dockerfile. For a refresher on Docker I found this tutorial by Márk Takács to be really helpful. I actually use a Windows machine for my work, so I'm running Docker Desktop. However, I also use WSL (Windows Subsystem for Linux) for when I want a Linux-like environment. Although you can install a Docker client under WSL, you still have to make use of the native Docker daemon of Docker Desktop. I found this guide from Nick Janetakis on getting the WSL Docker client working with Docker Desktop invaluable, particularly the configuring of where WSL mounts the Windows file system (by editing the /etc/wsl.conf file) so that I can then easily mount any sub-directory of my Windows file system to any point I choose in the container image when testing the Dockerfile locally.

I won't go through the aspects of testing the container locally – you can read the original AWS repo to see that. Instead we'll just go through the Dockerfile for building the final SageMaker container. The Dockerfile is fairly simple and, other than changing it to use a Python 3 runtime (see lines 9 & 10), we have not changed anything else from the Dockerfile in the original AWS repo. Line 36 of the Dockerfile is where we copy our algorithm code into the pre-specified directory in the image that SageMaker will look in when running the containerized algorithm.


# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:18.04

RUN apt-get -y update && apt-get install -y --no-install-recommends \
wget \
python3 \
python3-pip \
nginx \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN pip3 install numpy scipy scikit-learn pandas flask gevent gunicorn && \
(cd /usr/local/lib/python3.6/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \
rm -rf /root/.cache

RUN pip3 install setuptools

# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard
# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE
# keeps Python from writing the .pyc files which are unnecessary in this case. We also update
# PATH so that the train and serve programs are found when the container is invoked.

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

# Set up the program in the image
COPY gaussian_processes /opt/program
WORKDIR /opt/program

Pushing the container to AWS

We can now push our Docker container image to AWS ECR (Elastic Container Registry). This is simple using the AWS CLI (command line interface) and the build_and_push.sh shell script provided in the original AWS repo. Within the shell script we have just modified lines 16 and 17 to reference the name of the top-level directory in which our training and prediction code resides,

image=$1

if [ "$image" == "" ]
then
    echo "Usage: $0 "
    exit 1
fi

chmod +x gaussian_processes/train
chmod +x gaussian_processes/serve

Then we just run the shell script, passing the name of the container we have just built as a command line argument,

./build_and_push.sh gpsagemaker

After running the shell script we can see the container present in the AWS ECR,

Screenshot of our Gaussian Process SageMaker Docker container in AWS Elastic Container Registry (ECR) – ready to use within a SageMaker notebook.

Using the containerized algorithm in SageMaker

Now that we have the container with our GP code in AWS ECR, we can use it within a SageMaker notebook. Let's do so. For this I'm just going to adapt the notebook within the original AWS repo. I go to SageMaker under 'ML' in the list of AWS services and from there I can create/start my SageMaker notebook instance. Once the notebook instance is ready I can open up a Jupyter notebook as usual,

The first main difference is that we’ll create some simple small-scale simulated training and validation data. Our goal here is to test how easy it is to containerize and use our own algorithm, not build a perfect model. Our generative model is a simple one – a linear model, dependent on just two features (with coefficients that we have chosen as 1.5 and 5.2 respectively). We use this simple model to create the response variable values and then add some Gaussian random noise (of unit variance).


# create training and validation sets
import numpy as np
import pandas as pd

nTrain = 100
X_train = np.random.randn( nTrain, 2 )
y_train = (1.5 * X_train[:, 0]) + (5.2*X_train[:,1]) + np.random.randn( nTrain )
y_train.shape = (nTrain, 1)
data_train = np.concatenate( (y_train, X_train), axis=1)
df_data_train = pd.DataFrame( data_train )

nValidation = 50
X_validation = np.random.randn( nValidation,2 )
y_validation = (1.5 * X_validation[:, 0]) + (5.2*X_validation[:,1]) + np.random.randn( nValidation )
y_validation.shape = ( nValidation, 1 )
data_validation = np.concatenate( (y_validation, X_validation), axis=1)
df_data_validation = pd.DataFrame( data_validation )

We then specify our account details and also the image that contains our Gaussian Process algorithm.


import boto3

account = boto3.client('sts').get_caller_identity()['Account']
region = boto3.session.Session().region_name
image = '{}.dkr.ecr.{}.amazonaws.com/gpsagemaker:latest'.format(account, region)

The next cell in our notebook then uploads the training and validation data to our s3 bucket,


# write training and validation sets to s3
from io import StringIO # python3; python2: BytesIO 
import boto3

bucket = mybucket # mybucket holds the name of your S3 bucket

# write training set
csv_buffer = StringIO()
df_data_train.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket).Object('train/train_data.csv').put(Body=csv_buffer.getvalue())
csv_buffer.close()

# write validation set
csv_buffer = StringIO()
df_data_validation.to_csv(csv_buffer, header=False, index=False)
s3_resource = boto3.resource('s3')
s3_resource.Bucket(bucket).Object('validation/validation_data.csv').put(Body=csv_buffer.getvalue())
csv_buffer.close()

Running a single training job

So first of all let’s just configure and run a single simple training job. Note the validation metric being specified along with the regex.


create_training_params = \
{
    "RoleArn": role,
    "TrainingJobName": job_name,
    "AlgorithmSpecification": {
        "TrainingImage": image,
        "TrainingInputMode": "File",
        "MetricDefinitions":[{"Name":"validation:RMSE",
                              "Regex":"validation:RMSE=(.*?);"    
        }]
    },

We also set values for the hyperparameters, which are static since we are just running a single training job and not doing any hyperparameter optimization yet.


    "HyperParameters": {
        "q":"0",
        "l":"2",
        "noise":"0.1"
    },
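
The snippets above are only part of the request body. A minimal sketch of the remaining fields, together with the call that actually launches the job, might look like the following – the instance type, volume size and S3 prefixes here are placeholder assumptions on my part rather than what was in the original notebook, and sm is the boto3 SageMaker client (as used later for the tuning job):

import boto3

sm = boto3.client('sagemaker')

create_training_params.update({
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/train/".format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            }
        },
        {
            "ChannelName": "validation",
            "ContentType": "text/csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/validation/".format(bucket),
                    "S3DataDistributionType": "FullyReplicated"
                }
            }
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://{}/output/".format(bucket)},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m5.xlarge", "VolumeSizeInGB": 10},
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600}
})

# launch the training job and wait for it to finish
sm.create_training_job(**create_training_params)
sm.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)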

We can then run a training job using our containerized Gaussian Process code, just as we would any other algorithm available in SageMaker. We can see the training job running in the AWS Management console – click under “Training jobs” on the left hand side of the console. We can see the current training job ‘in progress’ and also an earlier completed training job that I ran.

Screenshot of a single SageMaker training job running using our GP algorithm code.

Running a hyperparameter tuning job

That appears to run OK. Now that we have our algorithm running in SageMaker, we can configure the SageMaker hyperparameter optimization wrapper and run one of the out-of-the-box SageMaker hyperparameter optimization algorithms over what we have specified as hyperparameters in our Gaussian Process code. The config for the hyperparameter tuning job is below – we have largely just modified slightly the examples in the original AWS repo and also followed the guidance. You can see that we have specified the RMSE metric on the validation set as the metric to optimize with respect to the hyperparameters. For illustration purposes we have specified that we want to optimize only over the q and l hyperparameters. The σ² hyperparameter we have kept static at σ²=0.1. You can also see that we have specified to run 10 training jobs in total, i.e. we will evaluate the validation metric at 10 different combinations of the two hyperparameters, but we only run 3 training jobs in parallel at any one time.


# Define HyperParameterTuningJob
# We tune the kernel hyperparameters q and l by minimizing the RMSE on the
# validation set. The hyperparameter search is a random one, using a sample of
# 10 training jobs - better methods for searching the hyperparameter space are 
# available, but for simplicity and demonstration purposes we will use the 
# random search method. Run a max of 3 training jobs in parallel
job_name = "gpsmbyo-hp-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
response = sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name,
    HyperParameterTuningJobConfig={
        'Strategy': 'Random',
        'HyperParameterTuningJobObjective': {
            'Type': 'Minimize',
            'MetricName': 'validation:RMSE'
        },
        'ResourceLimits': {
            'MaxNumberOfTrainingJobs': 10,
            'MaxParallelTrainingJobs': 3
        },
        'ParameterRanges': {
            'IntegerParameterRanges': [
            {
              "Name": "q",
              "MaxValue": "4",
              "MinValue": "0",
              "ScalingType": "Auto"
            },
            {
              "Name": "l",
              "MaxValue": "4",
              "MinValue": "1",
              "ScalingType": "Auto"
            }    
        ]}
    },
    TrainingJobDefinition={
        'StaticHyperParameters': {
            "noise":"0.1"
        },
        'AlgorithmSpecification': {
        'TrainingImage': image,
        'TrainingInputMode': "File",
        'MetricDefinitions':[{"Name":"validation:RMSE",
                              "Regex":"validation:RMSE=(.*?);"
                             }]
        },
        # remaining TrainingJobDefinition fields (RoleArn, InputDataConfig,
        # OutputDataConfig, ResourceConfig, StoppingCondition) follow the same
        # pattern as for the single training job shown earlier
    }
)

If we then look at our AWS console (screenshot below) we can see the hyperparameter tuning job running, along with previous completed tuning jobs.

Screen shot of AWS console showing current and previous hyperparameter tuning jobs.

We can also see the individual training jobs, corresponding to that tuning job, running (screenshot below). Remember that the hyperparameter tuning job is just a series of individual evaluations of the validation metric, run at combinations of (q,l) specified by the tuning algorithm. From the screenshot we can see that there are 3 training jobs running, in accordance with what we specified in the tuning job config.

Screenshot of the 3 training jobs running as part of the hyperparameter tuning job.

Once the tuning job has completed, we can retrieve the validation metric values for the 10 different hyperparameter combinations that were tried, to see which combination of q and l gave the smallest RMSE on the validation set.
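
One way to do that retrieval, using the same sm client that launched the tuning job, is sketched below (my own illustration rather than a snippet from the original notebook):

# pull back the per-job objective values once the tuning job has finished
summaries = sm.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name,
    MaxResults=10
)['TrainingJobSummaries']

for s in summaries:
    print(s['TunedHyperParameters'],
          s['FinalHyperParameterTuningJobObjectiveMetric']['Value'])

# or ask directly for the best hyperparameter combination found
best = sm.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=job_name
)['BestTrainingJob']
print(best['TunedHyperParameters'],
      best['FinalHyperParameterTuningJobObjectiveMetric']['Value'])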

Summary

The two questions I was trying to address were,

  1. How difficult is it to create your own algorithm to use in SageMaker?
  2. How easy is it to use the hyperparameter optimization algorithms available in SageMaker with your new algorithm?

The answer to both questions is, "it is a relatively easy but lengthy process". That it is a lengthy process is understandable – SageMaker gives you the functionality to apply out-of-the-box hyperparameter tuning to an algorithm/code that it knows nothing about until runtime. Therefore there has to be a lot of standardized syntax in specifying how that algorithm is structured and called as a piece of code. Fortunately, all the details of how to structure your algorithm and create the Docker container are in the excellent example given in the AWS repo and the documentation. The only complaint I would have is that it would be good for the repo to have an example showing how your own algorithm can utilize the hyperparameter optimization functionality of SageMaker – hence this blog. Working out the few remaining steps to get the hyperparameter optimization working with my Gaussian Process code was not very difficult, but not easy either.

The example algorithm I have chosen is very simplistic – the training process literally only involves the calculation and inversion of a matrix. A full training process could involve optimization of, say, the log-likelihood with respect to the parameters of the kernel function, but explaining the extra details would make this blog even longer. Secondly, we only needed a simple/minimal training process to address the two questions above. Likewise, we have not illustrated our new trained algorithm being used to serve predictions – this is very well illustrated in the original repo and I would not be adding anything new with my Gaussian Process example.


The different types of data I have encountered as a Data Scientist

TL;DR

Where Data or Data Science is the business model or primary purpose of an organization you can expect data and the data eco-system to be properly invested in. In scientific research and industrial settings this will be common. In the commercial world this is not always the case. As a Data Scientist in the commercial world you should learn to ask, ‘Does this company need data? Does this company actually need Data Science?’ If the answers are not a resounding yes, then beware. Sometimes you won’t be able to answer those questions until you’re up close and inside the organization, but there are indicators that suggest upfront whether an organization is likely to have good quality data and a functioning data eco-system, or whether it will be a pile of trash. In the long read below I outline the different kinds of data I’ve encountered across the commercial and non-commercial sectors I’ve worked in, and what signs I’ve learnt to look for.

Long version

Over 20 years ago, as I was just starting research in the Bioinformatics field, a colleague explained to me the challenges of working with high-throughput biological experimental data. The data, he commented, was typically high-dimensional and very messy. "Messy" was new to me. With a PhD in Theoretical Physics and post-doctoral research in materials modelling, the naïve physicist in me thought, '…ah…messy…you mean Gaussian errors with large variance'. I was very wrong.

I learnt that “messy” could mean a high-frequency of missing data, missing meta-data, and mis-labelled data. Sometimes the data could be mis-labelled at source because the original biological material had been mis-labelled – I remember one interesting experience having to persuade my experimental collaborator to go back to the -80°C freezer to check the lung tissue sample because a mis-labelled sample was the only remaining explanation for the anomalous pattern we were seeing in a PCA plot, and yes something had gone wrong in the LIMS because the bar-code associated with the assay data did not match the bar-code on the sample in the freezer.

However, as I moved from the academic sphere into the commercial realm (10 years in the commercial sector now in 2021), I've learnt that data errors and issues can be much more varied and challenging, and that they vary from sector to sector, from domain to domain. BUT… I have seen that there are broad classes that explain the patterns of data issues I have experienced over 30 years of mathematical and statistical modelling. In this piece I am going to outline what those broad classes and patterns are, and more importantly how to recognize when they are likely to arise. The broad classes of data are,

  1. Scientific or Experimental – here, data is collected for the purposes of being analysed, and so therefore is optimized towards those purposes.
  2. Sensor data – here, data is collected for the purposes of detecting signals, possibly diagnostic, i.e., working out what the cause of a problem is, or being monitored, i.e., detecting a problem in the first place. It is intended primarily to be automatically monitored/processed, but not necessarily analysed by a human (Data Scientist) in the loop. The data lends itself to large-scale analysis but may not be in a form that is optimal or friendly for doing data science.
  3. Commercial operational data – this is data that is stored, rather than actively collected. It is stored, usually initially on a temporary (short-term) basis, for the purposes of running the business. It could be server event data from applications supporting online platforms or retail sites, or customer transaction data, Google ad-click data, marketing spend data, or financial data.

It is this last category that is perhaps the most interesting – quite obviously if you are a Data Scientist working in a commercial sector. For commercial operational data, not only can the data contain the usual errors but there can be a range of additional challenges – columns/fields in tables containing completely different data than they are supposed to because a field has been over-loaded, or re-purposed for a different need without going through a modification/re-design of the schema. Without an accurate updated data-dictionary, this knowledge about the true content of the fields in a table resides as, ‘folk knowledge’ in the heads of the longer serving members of staff – until they leave the company, and that knowledge is lost.

I have my own favourite stories about data issues I have encountered in various organizations I have worked for – such as the time I worked with some server-side event data whose unique id for each event turned out only to be unique for a day, because the original developer didn’t think anyone would be interested in the data beyond a day.

Every commercial data scientist will be able to tell similar ‘war stories’, often exchanged over a few beers with colleagues after work. Such war stories will not go away any time soon. The more important point is how do we recognize the situations where they are more likely to occur? The contrast I have already highlighted above, between the different classes of data, gives us a clue – where data is not seen as the most valuable part of the process, nor the primary purpose of the process, it will be relatively less invested in and typically of lower quality.

For example, in the first two categories, and in the scientific realm in particular, data is generated with the idea, from the outset, that it is a valuable asset; the data is often the end itself. For example, the data collected from a scientific experiment provides the basis for scientific insight and discoveries, and potentially future IP; the data collected to monitor complex technical systems saves operating costs by minimizing downtime of those systems and potentially identifying improved efficiencies. Poor quality data, or inadequate systems for handling data has an immediate and often critical impact upon the primary goal of the organization or project, and so there is an immediate incentive to rectify errors and data quality issues and to invest in systems that enable efficient capture and processing of the data.

Some commercial organizations also effectively fall into these first two classes. For companies where the data and/or data science is the product from the outset, or the potential value of a data stream has been realized early on, there is an incentive to invest in efficient and effective data eco-systems, with the consequent benefit in terms of data quality – if sufficient investment in data and data systems is not made, a company whose main revenue stream is the data will quickly go out of business.

In contrast, for most commercial organizations, the potential value of commercial data may be secondary to its original purpose, and so data quality from the perspective of these secondary uses may be poor, even if the value or future revenue streams attached to these secondary uses may be much greater than the original primary use of the data. For such companies, the general importance of data may be recognized, but that does not mean that data eco-systems are getting the right sort of investment. For example, for a company providing a B2C platform, data on the behaviour of consumers can provide potentially actionable and important insight, but ultimately it is the platform itself (and its continued operation 24/7) that is of prime importance. Similarly, for an online retail site, the primary concern is volume of transactions and shipping. Consequently, for these organizations the data issues that arise are richer, more colorful, and often more challenging. This is because poor quality data and systems do not immediately threaten the viability of the business, or main revenue streams. Long-term, yes, there will be an impact upon UX and overall consumer satisfaction, but many operations are happy to take that hit and counter it by attempting to increase transaction volume.

For these organizations it may be recognized that data is important enough for capital to be spent on data eco-systems, but that capital investment may be poorly thought through or not joined-up. Again, there are several symptoms I have learnt to recognize as red flags. Dividing these into symptoms related to strategy and tactical related symptoms, they are,

  • Strategy related symptoms:
    1. Lack of a Data Strategy. The value of data, even the potential of value has not been recognized or realistically thought through. Consequently, there is unlikely to be any strategy underpinning the design of the data eco-system.
    2. Any ‘strategy’ is at best aspirational, being driven top-down by an Exec. The capital investment is there, but no joined-up plan. The organization sets up a Data Science function because other organizations are – typified by the hiring of a ‘trophy Data Scientist’. Dig deeper and you will find no real pull from the business areas for a Data Science function. Business areas have been sold Data Science as a panacea, so have superficially bought into the aspirational strategy – who in a business function would not agree when told that the new junior Data Science hires will analyse the poorly curated or non-existent data and come up with new product ideas that have a perfect market fit and will be coded into a fully productionized model? I have seen product owners who have been sold a vision and believe that they can now just put their feet up because the new Data Scientist will do it all.
    3. The organization cannot really articulate why it needs a Data Science function when asked – this is one of my favourites. Asking ‘why’ here is the equivalent of applying the ‘Five Whys’ technique at the enterprise-level rather than at a project-level. Many companies may respond to the question with the clichéd answer, ‘because we want to understand our customers better, which will help us be more profitable’. OK, but where are the examples of when not understanding this particular aspect of your customers has actually hurt the business? If there is no answer given, you know that the need for a machine learning based solution to this task is perceived rather than evidenced.
    4. A mentality exists that data is solely for processing/transacting, not analysing – hence there is little investment in analytics platforms and tooling, resulting in significant friction in getting analytics tooling close to data (or vice-versa). Possibly there is not even a recognition that there is a distinction between analytics systems and reporting systems. Business areas may believe that only data querying skills and capability are needed, so the idea that you want to make the building of predictive models as frictionless as possible for the Data Science team is an alien one.
    5. Lack of balance in investments between Data Engineering and Data Science – this is not optimal whichever function gets the bigger share of the funding. This is often mirrored by a lack of linking up between the Data Science team and the Data Engineering team, sometimes resulting in open hostility between the two.
    6. Lack of Data Literacy in business area owners or product owners. Despite this, organizations often have a belief that they will be able to roll-out a ‘self-serve’ analytics solution across the enterprise that will somehow deliver magical business gains.
  • Tactical related symptoms:
    1. Users in business areas are hacking together processes, which then take on a life of their own. This leads to multiple conflicting solutions, most of which are very fragile, and some of which get re-invented multiple times.
    2. Business and product owners ask for Data Science/Machine Learning driven solutions but can’t answer how they are going to consume the outputs of the solution.
    3. The analytics function, whether ad-hoc or formal, primarily focuses on getting the numbers ready for the month-end report. These numbers are then only briefly discussed, or ignored by senior management, because they conflict with data from other ad-hoc reports built from other internal data sources, or they are too volatile to be credible, or they conflict with the prior beliefs of senior management.
    4. Business and product owners consider it acceptable/normal to use a Data Analyst or Data Scientist for, ‘can you just pull this data for me’ tasks.

As a Data Scientist, what can you do if you find yourself in a scenario where these symptoms arise? There are two potential ways to approach this,

How do we respond to it – what can we do reactively? Flag it up – complain when systems don't work and highlight when they do. Confession here – I've not always been good at complaining myself, so this is definitely a case of me recognizing what I should have done, not what I did. On the less political front, you can try to always construct your analytical processes to minimize the impact of any upstream processes not under your control. By this I mean making downstream analysis pipelines more robust by building in automated checks for data quality and domain-specific consistency. Where this is not possible, then stop or refuse to work on those processes that are so impacted as to be worthless. This last point is about making others share in your pain or shifting/transferring the pain elsewhere – onto business/product owners, from insight to operational teams. If other teams also experience the pain of poor-quality data, or poorly designed data eco-systems and analytical platforms, then they have an incentive to help you try and fix the issues. Again, on a less political front, make sure expectations take into account the reality of what you have to deal with. Get realistic SLAs re-negotiated/agreed that reflect the reality of the data you are having to work with – this will protect you and highlight what could be done if the data were better.

How can we prevent it – what can we do proactively? The most effective way for any organization to avoid such issues is to have a clear and joined-up Data Strategy and Analytics Strategy. It is important for an organization to recognize that having a Data Strategy does not mean it has an Analytics Strategy. Better still, an organization should have a clear understanding of what informed decisions underpin the future operations of its business model. Understanding this will drive/focus the identification of the correct data needed to support those decisions and hence will naturally lead (via pull) to investment in high-quality data capture processes and efficient data engineering/data analysis infrastructure. Formulating the Data Strategy may not formally be part of your brief, but it is likely that, as a Data Scientist and therefore an end-user, you can influence it. This is particularly important in order to make sure that the analytical capability is a first-class and integrated citizen of any Data Strategy. This can be done by giving examples of the cost/pain when it is not so. Very often, a CDO or CTO formulating the Data Strategy will come from an Engineering background and so the Data Strategy will be biased towards Ops. Any awareness of a need for the Data Strategy to support analytics will be focused on near-term analytics, e.g., supporting real-time reporting capabilities. The need for the Data Strategy to support future Ops by supporting insight, innovation and discovery capabilities may be omitted. If a Data Science function is considered, it will be as a bolt-on – possibly a well-funded Data Science bolt-on, but still a bolt-on.

Demand Forecasting at Amazon

AmazonDemandModelBlog_FrontPicture

This is a post that I’ve been meaning to write for a while. Having worked on demand forecasting in the past, I was intrigued when I saw this paper posted on the arXiv pre-print archive from one of the research teams at Amazon.

Although it was obvious why Amazon would be interested in forecasting demand, I was intrigued that Amazon chose to use a state space model approach. About 6 months later I attended the ISBIS2018 conference, at which Lindsay Berry from Duke University presented this paper that also used a state-space model approach to modelling the demand for supermarket goods. I also subsequently became aware of this technical report from Ivan Svetunkov at Lancaster University’s Management School.

With three pre-prints on demand forecasting that all utilised a state-space modelling approach I thought it would be interesting to do a post summarizing the work from the Amazon team. I may get round to doing a further post on the other two pre-prints at a later date.

At this point it is worth explaining a bit about demand models in general. Demand models are statistical models that are usually built from 2-5 years' worth of historical sales data. A demand model enables us to forecast how many units of a product we will sell given the price of the product and a number of other variables. The models allow us to forecast a number of ‘what-if’ scenarios, e.g. what will happen to my sales if I reduce the product price by 20%? Ultimately, the demand model can enable us to determine the optimal price we should charge for a product depending on what business KPI we want to optimize. Some traditional approaches to demand modelling use a log-log model, with the log of the demand of an item being linear in the log of the price of the item. These models are of the Working-Leser type of demand models [1,2]. For goods with large demand volumes, a log-log model form will be a reasonable assumption as the natural quantum of demand (a single unit) is much smaller than the typical level of demand, and so we can consider the demand of such goods as effectively being continuous random variables.
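
As a toy illustration of the log-log form (nothing to do with the Amazon paper itself), simulated sales generated with a constant price elasticity can have that elasticity recovered by a simple least-squares fit on the logs:

import numpy as np

rng = np.random.default_rng(0)

# simulate a constant-elasticity (log-log) demand curve:
# log(demand) = a + b*log(price) + noise, with elasticity b = -1.8
price = rng.uniform(5.0, 50.0, size=1000)
log_demand = 8.0 - 1.8 * np.log(price) + 0.2 * rng.normal(size=1000)

# ordinary least squares on the logs recovers the intercept and elasticity
A = np.column_stack([np.ones_like(price), np.log(price)])
coefs, *_ = np.linalg.lstsq(A, log_demand, rcond=None)
print(coefs)  # approximately [8.0, -1.8]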

The actual problem the Amazon team tackled was forecasting of intermittent demand, i.e. for what are commonly called ‘slow-moving goods’, whose pattern of sales is bursty. We might typically only sell, on average, fewer than say 10 units a week for these kinds of products. There may be no sales for a week or two, followed by a burst of sales concentrated in a few days. Snyder et al give a good modern review of the problem of intermittent demand forecasting [3].

For such products, the traditional log-log type demand models can perform poorly, as we are dealing with products that sell only a few units per time period. However, there is no consensus approach to modelling such products, which means it is an area ripe for novel and innovative methods. The paper by Seeger et al combines 3 interesting ideas,

  1. A multi-stage model – this means decomposing the modelling of demand into several models that cover different demand sizes. In this case separate models are constructed for when the expected demand is 0 units, 1 unit, and >1 unit.
  2. The combining of the multi-stage model with a state-space model. This has the effect of introducing exponential smoothing and hence some temporal continuity to the modelled demand.
  3. The use of a Kalman-filter approach to locating the mode when using a Laplace approximation to approximate a marginal posterior. This third innovation is the most technical, but, for me, also the most interesting.

The first of these innovations is not necessarily that much of a step-change. Other attempts to model slow-moving goods have also considered a mixture of distributions/processes to allow for the zero-inflation that one has in the weekly observed sales of a slow-moving good. Seeger et al use a three stage model, so that we have 3 latent functions,

y_{t}^{(0)}(x), which is used in modelling the probability of zero sales at time point t

y_{t}^{(1)}(x), which is used in modelling the probability of a single unit being sold at time point t

y_{t}^{(2)}(x), which is used in modelling the distribution of units sold at time point t, given the number of units is greater than 1.

The second innovation is an interesting one. Whilst I had come across the use of self-excitation (Hawkes processes) to model the bursty behaviour of intermittent demand, I hadn’t seen temporal continuity enforced via a latent state contribution to the linear predictors of the mixture components. For demand greater than a single unit Seeger et al model the demand {z_{t}} at time point t as following a Poisson distribution,

P\left ( z_{t}-2 | y^{(2)}_{t}\right )\;=\; \frac{1}{(z_{t}-2)!}\lambda( y^{(2)}_{t}  )^{z_{t}-2}\exp\left ( -\lambda ( y^{(2)}_{t} )\right )\;\;.

Here \lambda(\cdot) is a transfer function. The latent function y^{(2)}_{t} depends upon a latent state {\boldsymbol  l}_{t} and it is this latent state that is governed by a Kalman filter. Overall the latent process is,

y^{(2)}_{t}\;=\; {\boldsymbol a}^{\top}{\boldsymbol l}_{t-1}\;+\; b_{t}\;\;,\;\;b_{t}\;=\;{\boldsymbol \omega}^{\top}{\boldsymbol x}_{t}\;\;,\;\;{\boldsymbol l}_{t}\;=\;{\boldsymbol F}{\boldsymbol l}_{t-1}\;+\;{\boldsymbol g}_{t}\epsilon_{t}\;\;,\;\;\epsilon_{t}\sim N(0,1)

The latent variables \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0} have to be integrated out to yield a marginal posterior distribution that can then be maximized to obtain parameter estimates for the parameters that control the innovation vectors {\boldsymbol g}_{t}\;,t=1,\ldots,T-1.

It is the marginalization over \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0} that the third interesting technical innovation of Seeger et al is concerned with. The integration over \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0} is approximated using a Laplace approximation. The Laplace approximation simply replaces the exponent of the integrand by its second order Taylor expansion approximation, in order to approximate a complicated integration by an integration of a Gaussian. It is the simplest of a family of saddlepoint expansion techniques for obtaining asymptotic expansions of integrals (see for example the classic book by Wong).

The main task in a Laplace approximation is locating the maximum of the exponent of the integrand. Seeger et al do this via a Newton-Raphson procedure, i.e. expand the exponent to second order around the current estimate of the mode and then find the maximum of that second order approximation.

Consider a 1-dimensional example. Let q(x) be the function whose maximum, x_{*}, we are trying to locate. If the expansion of q(x) around our current estimate {\hat x}_{*} of x_{*} is,

q(x) \;=\; q( {\hat x}_{*} )\;+\; ( x - {\hat x}_{*}) q^{(1)}({\hat x}_{*})\;+\; \frac{1}{2}(x- {\hat x}_{*} )^{2}q^{(2)}({\hat x}_{*})\;+\; O\left ( (x-{\hat x}_{*})^{3}\right )

The updated estimate of x_{*} is then determined by maximizing the second order expansion above, and is given by,

{\hat x}_{*} \rightarrow {\hat x}_{*} \;-\; \frac{q^{(1)}( {\hat x}_{*})}{q^{(2)}( {\hat x}_{*})}
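
In code, the one-dimensional version of this update is only a few lines; the sketch below uses a toy concave function rather than anything from the Seeger et al paper:

def newton_raphson_max(grad, hess, x0, tol=1e-10, max_iter=50):
    # repeatedly maximize the local second-order expansion of q(x)
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# toy example: q(x) = -(x - 2)^2 + 3 has its maximum at x = 2
x_star = newton_raphson_max(grad=lambda x: -2.0 * (x - 2.0),
                            hess=lambda x: -2.0,
                            x0=0.0)
print(x_star)  # 2.0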

The schematic below depicts how the iterative Newton-Raphson procedure locates the maximum of a one-dimensional function.

NewtonRaphsonSchematic

The multi-dimensional equivalent update rule when we are maximizing a function q({\boldsymbol x}) of a vector {\boldsymbol x} is,

\hat{\boldsymbol x}_{*} \rightarrow \hat{\boldsymbol x}_{*} \;-\; {\boldsymbol H}^{-1}( \hat{\boldsymbol x}_{*})\,\nabla q (\hat {\boldsymbol x}_{*} )\;\;,

where {\boldsymbol H}( \hat{\boldsymbol x}_{*}) is the Hessian of q({\boldsymbol x}) evaluated at \hat{\boldsymbol x}_{*}\;\;.

As Seeger et al are marginalizing the posterior over \epsilon_{1}, \epsilon_{2},\ldots,\epsilon_{T-1}, {\boldsymbol l}_{0}, the Taylor expansion around any point is necessarily multi-variate, and so ordinarily, finding the maximum of that second order approximation would involve inverting the Hessian of the log-posterior evaluated at the current estimate of the mode. As the latent variables we are trying to marginalize over by doing the Laplace approximation are the T-1 innovations, \epsilon_{1},\;\ldots\;, \epsilon_{T-1} and {\boldsymbol l}_{0}, this means that each step of the Newton-Raphson procedure would involve the inversion of a T\times T matrix, i.e. an O\left(T^{3}\right) operation for each Newton-Raphson step. However, Seeger et al point out that once we have replaced the log-posterior by a second-order approximation, finding the maximum of that approximation is equivalent to finding the posterior mean of a linear-Gaussian state-space model, and this can be done using Kalman smoothing. This means in each Newton-Raphson step we need only run a Kalman filter calculation, an O\left( T \right) calculation, rather than a Hessian inversion calculation which would be O\left(T^{3}\right). When training on, say, 2 years of daily sales data with T=730, the speed-up will be significant. Seeger et al do point out that this trick of reducing the computation to one that scales linearly in T is already known within the statistics literature [4], but not widely known within machine learning.

Seeger et al apply their methodology to a number of real-world scale datasets, for example to a ~40K item dataset with almost a year of historical data at the day-level. Overall run-times for the parameter learning are impressive (typically a few seconds for each of the separate demand model stages), though admittedly this is when running on a 150 node Spark cluster.

References

  1. Working, H. (1943) Statistical laws of family expenditure. J. Am. Statist. Ass., 38:43–56.
  2. Leser, C. E. V. (1963) Forms of Engel functions. Econometrica, 31, 694–703.
  3. Snyder, R., Ord, J., and Beaumont, A. (2012), Forecasting the intermittent demand for slow-moving inventories: A modelling approach. International Journal on Forecasting, 28:485-496
  4. Durbin, J. and Koopman, S. (2012), Time Series Analysis by State Space Methods. Oxford Statistical Sciences. Oxford University Press, 2nd Edition.

Defensive & Offensive Predictive Analytics

Harvard_business_review

This article from the Harvard Business Review was a short but interesting read. The article talks about defensive and offensive data management strategies – defensive being about minimizing downside risk, and being more common in regulated industries, whilst offensive data management strategies focus on supporting business objectives and are more common in un-regulated or less regulated industries. The authors rightly point out that the correct mix of defensive and offensive strategies will be specific to each individual organization.

Having worked in a number of commercial sectors, my experience is that the use of predictive analytics within organizations also divides into defensive and offensive activities, irrespective of whether that predictive analytics and data science activity is enabled by a well thought out data management strategy. There are good reasons for this, and again it is largely determined by whether or not the activity of an organization carries a large downside risk.

Consider a company, such as a bank, whose activity has a large downside risk, and where losses due to bad loans can almost wipe out a balance sheet very quickly. My experience of doing analytics in a UK retail bank is that the predictive analytics focus is on modelling that downside risk with a view to understanding it, forecasting it and ultimately mitigating against it. The analytics effort focuses on risk minimization (still an optimization), whilst optimization of the profit side of the P&L is less computationally based, e.g. by committees of human subject matter experts deciding mortgage or savings rates for the coming year.

In contrast, in companies where the downside risk is lower, such as those where transactions with the organization’s customers are on much shorter timescales than a bank, then the use of predictive modelling tends to focus more on the optimization of revenue and profits, rather than minimization of losses from liabilities. Take grocery supermarkets, where predictive demand models are used to set product prices in order to optimize profit. Whilst getting the pricing strategy wrong will impact revenues, it does not lead to the organization holding a long-term liability and is ultimately reversible. Mistakes when using predictive models in this domain are unlikely to take a company down.  

From what I have seen the use of predictive modelling within a business is typically almost binary, i.e. either predominantly on the downside risk, or predominantly on optimizing the business objectives, even though most businesses will have both upsides and downsides to their activity. I haven't seen that many medium-scale organizations where predictive modelling is used at a mature level across the majority of the business areas or tasks. Even rarer in my experience are situations where predictive modelling of both downside risks and business objectives is done concurrently, with the optimization taking into account both sides of the P&L. It would be interesting to find good examples outside, say, the largest 50 companies in the FTSE100, Dow Jones, Nasdaq, or S&P500, where a more joined-up approach is taken to using predictive analytics for optimizing the P&L.

pracma R package

I’ve used the pracma package in R for a while now. My main use, I’ll confess, is because it provides a convenient method for sampling orthonormal matrices. The first time I was faced with this task (16 years ago) I had to code up my own version of the Stewart algorithm in Java. The algorithm works by iteratively applying a series of Householder transformations – see for example the blog post by Rick Wicklin for a description and implementation of the algorithm. However, now that I predominantly use R and python the implementation of the Stewart algorithm in pracma is extremely handy and follows a similar form to the other random variate sampling functions in base R – see below,

library( pracma )

nDim <- 100 # set dimension of matrix
orthoMatrix <- rortho( nDim )

More recently I've been exploring the pracma package further, and I've been continually amazed at how many useful little utility methods are available in the package – all the little methods for doing numerical scientific calculations that I was taught as an undergraduate – methods that I have ended up coding from scratch multiple times. OK, the package describes itself as 'practical numerical math routines' so I guess I shouldn't be surprised.

Practical PCA

PCA_Schematic

Principal Component Analysis (PCA) is a commonly applied algorithm in statistics and data science. Because it is so easy to understand at a high-level, and because it is so easy to apply, PCA has become ubiquitous. It is often applied without much thought and with the output rarely questioned. Typically, the questions I ask when applying PCA are,

  1. Do I need to do any transformation of the data before applying PCA?
  2. How many principal components should I select?
  3. How do I interpret the loadings (and scores)?

Unfortunately, I have seen a number of talks and presentations recently where PCA has been used and the impact on the analysis of not having thought about these questions was clear. Ok, I’m biased, as the behaviour of PCA, particularly when applied to high-dimensional data, is one of my areas of research. Whilst the research papers on PCA can be very complex, they do however provide some useful insight and guides on how to apply PCA to real data. In this post I’m going to look at those three questions in turn. In the following I’m also going to assume you are already familiar with PCA and that you are aware that the principal components are the eigenvectors of the sample covariance matrix corresponding to the largest eigenvalues. For a good introduction to PCA see the blog post by Laura Hamilton and also the classic book by Jolliffe.

TL;DR – the short answers to the questions above are, i) check for outliers and/or heavy tails in your data before applying PCA, ii) use the ‘knee’ in the eigenvalue scree plot to select the number of components, iii) make sure you look at the loadings of the selected principal components and that you can explain any major patterns in those loadings.

1. Transformation of the data

The original derivations of PCA, such as that by Hotelling (1933), are heuristic and make no explicit assumptions other than the requirement that the selected components retain as much of the variance in the original data as possible.  The formal derivation of PCA as the maximum likelihood estimate of parameters of a probabilistic model assumes additive Gaussian noise – see for example the paper by Tipping and Bishop. In practice, where the distributions of latent factors and additive noise are still reasonably symmetric and decay sufficiently fast we would expect deviations from perfect Gaussians to have only a minor effect. Despite this I have seen a number of talks with PCA plots similar to that shown on the left below.

PCAScores_HeavyTailExample1

I have based this example on a real commercial data set I was shown, but for privacy and confidentiality reasons I have generated a simulated data set that reproduces the issue. The majority of the PC1 scores are squashed into the left-hand side of the plot, with a few scores reaching much higher values than the rest. This can be seen more clearly if we look at the estimated density of the 1st PC scores – see the right-hand plot above. A heavy tail is clearly present, meaning we are deviating significantly from the assumptions under which the eigenvectors of the sample covariance matrix are the optimal estimators of the signal directions in the raw data.

So an obvious first step would be to take a quick look at the distribution of the raw data that we are trying to decompose using PCA. If that distribution has a heavy tail or significant outliers then there is an argument for applying a transformation (e.g. logarithm) before applying PCA. If we take the log of the example data above then we obtain much more reasonable distributions for the PC scores – see below.

PCAScores_HeavyTailExample2

Tip: If you see an elongated distribution of scores along the PCs you have selected, then it may be worth going back and looking at the distribution of the raw data going into the PCA – you should have already looked at the distribution of your raw data anyway as part of EDA best practice.
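
To make that tip concrete, below is a minimal sketch of the kind of check and transformation I have in mind. The data here is simulated (log-normal, so strictly positive and heavy tailed) and the variable names are purely illustrative,

import numpy as np
from scipy.stats import skew
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# illustrative heavy-tailed, strictly positive data (log-normal)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 20))

# EDA check: large positive skewness suggests a heavy right tail in the raw features
print("median feature skewness:", np.median(skew(X, axis=0)))

# apply a log transform before PCA (only sensible for strictly positive data)
X_log = np.log(X)

scores_raw = PCA(n_components=2).fit_transform(X)
scores_log = PCA(n_components=2).fit_transform(X_log)

# compare the shape of the 1st PC score distributions with and without the transform
print("PC1 score skewness (raw):", skew(scores_raw[:, 0]))
print("PC1 score skewness (log):", skew(scores_log[:, 0]))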

2. Selection of the number of components

There are two common methods for selecting the number of components that most people will be familiar with or encounter. Those are,

  1. Select the number of components so that a fixed proportion, e.g. 90%, of the total variance is retained.
  2. Look at a scree plot of the eigenvalues and locate the ‘elbow’ (or ‘knee’ depending on your interpretation of anatomy).

Of those two methods, it is the second that I always prefer. This is because it has a sound theoretical underpinning and is more robust when applied to the kind of high-dimensional data sets that are commonplace nowadays. Let me explain why. The goal of PCA is to select a small number of directions in the data that we believe capture the signal within it. The first approach to model selection is effectively based upon the assumption that the ‘signal’ contribution to the total variance is considerably greater than the noise contribution, and thus, by choosing to retain the majority of the total variance in the original data, we believe we are effectively selecting components that represent signal.

If our data points are p dimensional, then we have p sample covariance eigenvalues, \lambda_{i},\;i=1,\ldots,p. We consider the signal part of the data to be represented by a small number, k, of eigenvectors/eigenvalues, with the p-k-1 remaining non-zero eigenvalues corresponding to pure noise in the original data.

Typically, there is no reason to believe that the noise process affects any of the original features/variables more strongly than the others – i.e. it is reasonable to consider the noise process to be isotropic. If not, then the preferred directions are essentially some form of signal and not noise. This means we expect the noise eigenvalues to be approximately the same; let’s say they have an average value \sigma^{2}. As we look at data sets of increasing dimension p, unless the number of signal eigenvalues, k, or the signal eigenvalues \lambda_{1},\ldots,\lambda_{k} themselves, grow with p, the variance explained by the signal components, \sum_{i=1}^{k}\lambda_{i}, will remain relatively static whilst the noise eigenvalues contribute (p-k-1)\sigma^{2} to the total variance. Thus we can see that the fraction of total variance explained by the signal components is essentially,

{\rm Fraction\;of\;variance\;explained}\;=\;\frac{\sum_{i=1}^{k}\lambda_{i}}{(p-k-1)\sigma^{2}\;+\;\sum_{i=1}^{k}\lambda_{i}}\;\;,

and so decreases as the data dimension p increases. Consequently, for high-dimensional data, where p is very large (often in the thousands), the percentage of variance explained by the true signal components can be very small indeed. Conversely, if we select the number of components so as to retain, say, 90% of the total variance, we will be including a lot of noise in the retained components and not reducing the dimensionality as efficiently as we could.
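
To put some purely illustrative numbers on this, suppose we have k=5 signal eigenvalues each equal to 10 and noise variance \sigma^{2}=1. For p=100 the fraction of variance explained by the signal components is 50/(94+50)\approx 0.35, whilst for p=1000 it falls to 50/(994+50)\approx 0.05 – so a rule that retains, say, 90% of the total variance is forced to include a large number of noise components.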

In contrast, if the true eigenvalues are split into a small number of ‘signal’ eigenvalues and a much larger number of  ‘noise’ eigenvalues and we expect the noise process to be isotropic (i.e. we have a single highly degenerate noise eigenvalue), then the observed (sample) eigenvalues will also consist of a bulk of eigenvalues clustered around a single value and a small number of larger eigenvalues separated from this bulk. In other words we expect to see the distribution of sample covariance eigenvalues look something like the plot on the left below.

SP500_ScreePlot1

The eigenvalues in this example have been obtained from the sample covariance matrix of inter-day returns of closing prices over the 4-year period 2010 to 2013 for S&P500 stocks. We have actually omitted the largest eigenvalue, as this is on a different scale and represents a ‘market mode’ – see the next section.

Note that the bulk of the sample eigenvalues show some dispersion even though we had only one true, highly degenerate, noise eigenvalue. This is due to sampling variation, i.e. the fact that we have only a finite sized sample. If we plot the top 20 eigenvalues from the distribution above, ranked from largest to smallest, we get a scree plot looking like the plot on the right above.

Clearly, the sample eigenvalues corresponding to the bulk are very similar to each other, and so we see only small decreases with increasing rank in the sample eigenvalues in the bulk. In contrast, where there is an atypical jump in eigenvalue as we decrease the rank, this represents the point at which a signal eigenvalue can be detected as being separate from the bulk. This point also represents the ‘knee’ of the scree plot.

Statistical models that produce this kind of scree-plot are called ‘spiked-covariance’ models, so-called because the true population eigenvalue distribution is concentrated (or spiked) around just a small number of values. For these models we consider the data to have been produced by a small number of latent factors with isotropic noise. That is, our data point {\bf x}_{i} is given by,

{\bf x}_{i}\;=\;\sum_{j=1}^{k}z_{ij}\lambda_{j}{\bf B}_{j}\;+\; \boldsymbol{\epsilon}_{i}\;\;\;,\;\;\;\boldsymbol{\epsilon}_{i}\sim {\cal N}\left ( {\bf 0}, \sigma^{2}{\bf I} \right)\;\;\;,\;\;\;z_{ij}\sim {\cal N}\left ( 0, 1\right)\;\;\;,

with the vectors {\bf B}_{1},\ldots,{\bf B}_{k} forming an orthonormal set.
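
To make the spiked-covariance picture concrete, here is a small simulation sketch. The parameter values are my own illustrative choices (not the S&P500 analysis), and note that I scale the j-th factor by \sqrt{\lambda_{j}} so that the population covariance eigenvalues are \lambda_{j}+\sigma^{2} – an assumption on my part about the intended convention,

import numpy as np

rng = np.random.default_rng(0)

p, N, k = 200, 500, 3                  # dimension, sample size, number of spikes
sigma2 = 1.0                           # isotropic noise variance
spikes = np.array([25.0, 15.0, 8.0])   # illustrative signal eigenvalues lambda_1, ..., lambda_3

# orthonormal signal directions B_1, ..., B_k via a QR decomposition of a Gaussian matrix
B, _ = np.linalg.qr(rng.standard_normal((p, k)))

# latent factor model: each factor scaled by sqrt(lambda_j), so that the
# population covariance is sum_j lambda_j B_j B_j^T + sigma2 * I
Z = rng.standard_normal((N, k))
X = (Z * np.sqrt(spikes)) @ B.T + np.sqrt(sigma2) * rng.standard_normal((N, p))

# sample covariance eigenvalues, ranked largest to smallest
evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(np.round(evals[:6], 2))   # expect 3 eigenvalues clearly separated from a dispersed noise bulk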

From the mathematical form of the spiked-covariance models we can work out the expected distribution of sample covariance eigenvalues when our data consists of N data points (each of dimension p). A large amount of research has been done in this area over the last 10-15 years. I won’t try to summarize it here; instead I’ll point you to the excellent review by Johnstone, and the original work by Paul. This research allows us to devise methods for the automatic detection of the number of principal components. Such methods have the advantage that they can be automated, i.e. programmed as part of an algorithm. On the practical side you can always “eyeball” the scree plot. For an isolated piece of analysis this is always advisable, as it takes so little time to do.

Also on the pragmatic side, I have often found that simple ‘knee’ detection algorithms work surprisingly well, particularly for the ‘real-world’ data sets that I encounter as part of my day-to-day work. The simplest of such algorithms involves finding the maximum of a discrete approximation to the second derivative of the ranked eigenvalue plot. That is, we choose k as,

k \;=\; \underset{i}{\mathrm{argmax}} \left ( \lambda_{i+1} + \lambda_{i-1} - 2\lambda_{i}\right )
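
As a concrete illustration, a minimal (and deliberately naive) implementation of this second-difference detector is sketched below, with a toy spectrum of my own making. For that spectrum the maximum lands at rank 4, i.e. at the first eigenvalue of the bulk, immediately after the three signal eigenvalues,

import numpy as np

def knee_second_difference(eigenvalues):
    """Rank of the 'knee' in a scree plot, chosen as the maximum of the
    discrete second difference lambda_{i+1} + lambda_{i-1} - 2*lambda_{i}."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]   # rank largest to smallest
    second_diff = lam[2:] + lam[:-2] - 2.0 * lam[1:-1]          # interior ranks i = 2, ..., p-1
    return int(np.argmax(second_diff)) + 2                      # convert back to the 1-based rank i

# toy spectrum: 3 'signal' eigenvalues well separated from a noise bulk near 1
rng = np.random.default_rng(1)
evals = np.concatenate([[26.0, 16.0, 9.0], 1.0 + 0.1 * rng.standard_normal(97)])
print(knee_second_difference(evals))   # prints 4 here: the first eigenvalue of the bulk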

Improved approaches to ‘knee’ detection are based upon discrete approximations of the curvature. The paper by Satopää et al. gives a good introduction to such methods. Again, these simple ‘knee’ detection approaches have the advantage that they can be coded up and hence included as part of an automated process.

Note that in situations where the signal eigenvalues (those distinct from the bulk) do contribute the majority of the variance in the original data, selecting the number of PCs on the basis of detecting a ‘knee’ in a scree plot will have the same advantages as selecting on the basis of wanting to retain a certain fixed percentage of the total variance. Consequently, there is very little reason not to use the scree plot for selection in all cases.

3. Interpretation of the loadings

I have also found it is not uncommon for PCA to be applied blindly, with the scores, i.e. the values of the new, lower-dimensional features, plotted or fed into some downstream process without any further curiosity being applied. Where the leading sample eigenvalue is very large – that is, on a different scale to the other principal components retained or to the bulk of the sample eigenvalues – I always take a look at the loadings. This is the case for the S&P500 data discussed above. The loadings tell us the contribution each of the original variables makes to the principal component. The loadings for the 1st principal component of the S&P500 data are shown in the plot below,

SP500_1stPC_Loadings

Where the 1st sample eigenvalue is very large (compared to the others) it is not uncommon to see a loading pattern like that above – all the loading values are of the same sign (and, in this case, of comparable size). The large 1st eigenvalue tells us that variation along the 1st principal component is the dominant mode of variation present in the original data. The loading pattern tells us that in this mode of variation all the original variables increase together, or decrease together. Such a pattern is often termed a ‘global mode’, i.e. it is a mode of variation that has largely the same effect globally on all the original variables. There are several scenarios and situations where the presence of a global mode can be naturally explained or is to be expected. For example,

  • Market modes in stock prices. This is where a rising or falling market causes all stock prices to rise or fall together.
  • In gene expression data obtained from model organisms exposed to a large environmental perturbation or insult. Here, for model organisms, e.g. yeast cultures, we can shock the biological system being studied without any ethical concerns, e.g. starve the organism of its primary food/fuel source or other essential nutrients. Consequently we see a system wide response to the starvation.
  • Price sensitivities of a collection of products sold by a retailer in a store. Here we expect the price elasticities of products to reflect, in large part, the economic conditions of the local geography. Consequently, a predominant part of the variation in product elasticities will be due to store-to-store variation in economic conditions and may show up as a global mode.

Where we see a global mode in the loadings, we should ask whether we can identify a credible mechanism behind the global mode. If not, then this should make us cautious about the appropriateness of the 1st principal component and hence the complete decomposition.
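
A quick and admittedly crude check for a global mode is simply to look at the sign pattern of the 1st principal component’s loadings. A minimal sketch, assuming a data matrix X with observations in rows and variables in columns (both function names below are mine),

import numpy as np

def first_pc_loadings(X):
    """Loadings (eigenvector) of the 1st principal component of X,
    with observations in rows and variables in columns."""
    Xc = X - X.mean(axis=0)                 # centre each variable
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return evecs[:, -1]                     # eigenvector of the largest eigenvalue

def looks_like_global_mode(loadings, frac_same_sign=0.95):
    """Crude indicator: nearly all loadings share the same sign."""
    signs = np.sign(loadings[np.abs(loadings) > 1e-12])
    return max(np.mean(signs > 0), np.mean(signs < 0)) >= frac_same_sign

# e.g. loadings = first_pc_loadings(returns_matrix)   # returns_matrix is hypothetical
# print(looks_like_global_mode(loadings))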

Finally, it is worth mentioning that if a sparse loading pattern is more naturally expected, or more convenient, then sparse versions of PCA can be used. The lecture notes by Rob Tibshirani provide a good introduction to sparse PCA.

Log Partition Function of RBMs

I have been doing some work recently on Restricted Boltzmann Machines (RBMs). Specifically, I have been looking at the evaluation of the log of the partition function.

RBMs consist of a layer of visible nodes and a layer of hidden nodes, with the number of hidden nodes typically being less than the number of visible nodes.  Where both visible and hidden nodes have binary states we can think of the RBM as performing a discrete-to-discrete dimensionality reduction. Stacked RBMs provided some of the earliest examples of deep learning neural networks – see for example the work of Hinton and Salakhutdinov.

RBM

The partition function Z is the normalizing constant for the joint distribution over the states of the visible and hidden nodes, and is often used for model selection, i.e. when we want to control or select the right level of complexity in the RBM.  I wanted to test out some of the ideas behind the message passing algorithm of Huang and Toyoizumi (arxiv version here).  Huang and Toyoizumi use a Bethe approximation to develop a mean-field approximation for log Z. As usual, the self-consistent mean-field equations lead to a set of coupled equations for the expected magnetization, which are solved iteratively leading to the passing of information on local field strengths between nodes – the so-called message passing. To test the Huang and Toyoizumi algorithm I need to know the true value of log Z.

A standard, non mean-field, method for evaluation of the log-partition function is the Annealed Importance Sampling (AIS) algorithm of Salakhutdinov and Murray, who base their derivation on the generic AIS work of Neal (arxiv version). The AIS algorithm is a Monte Carlo-based approach that samples from a series of RBMs, going from completely decoupled (no visible-to-hidden node interactions) to the fully coupled RBM of interest.

I have pushed my implementations of the Huang and Toyoizumi message passing algorithm and the Salakhutdinov and Murray AIS algorithm to github. However, there is still the question of how I test the implementations, given that there are no simple closed-form analytical expressions for log Z when we have visible-to-hidden node coupling. Fortunately, as the RBMs are of finite size, then for sufficiently small hidden and visible layers we can evaluate log Z ‘exactly’ via complete enumeration of all the states of the visible and hidden layers. I say ‘exactly’ as some numerical approximation can be required when combining terms in the partition function whose energies are on very different scales. I have also included code in the github repository to do the ‘exact’ evaluation.
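
To give a flavour of the ‘exact’ evaluation, here is a minimal brute-force sketch. I assume binary \{0,1\} states and the energy parameterization E({\bf v},{\bf h}) = -{\bf a}^{T}{\bf v} - {\bf b}^{T}{\bf h} - {\bf v}^{T}W{\bf h}, and use log-sum-exp for numerical stability; the code in the repository differs in its details,

import itertools
import numpy as np
from scipy.special import logsumexp

def exact_log_partition(W, a, b):
    """Brute-force log Z for a small binary RBM with energy
    E(v, h) = -a.v - b.h - v.W.h (feasible only for ~20 or fewer total nodes)."""
    nv, nh = W.shape
    log_terms = []
    for v in itertools.product([0, 1], repeat=nv):
        v = np.array(v)
        for h in itertools.product([0, 1], repeat=nh):
            h = np.array(h)
            log_terms.append(a @ v + b @ h + v @ W @ h)   # this is -E(v, h)
    return logsumexp(log_terms)

# tiny illustrative example
rng = np.random.default_rng(0)
nv, nh = 6, 3
W = 0.1 * rng.standard_normal((nv, nh))
logZ = exact_log_partition(W, np.zeros(nv), np.zeros(nh))
print(logZ)   # with zero biases and weak weights this is close to (nv + nh) * log(2)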

China invests big in AI

This week saw interesting news and career-guide articles in Nature highlighting the Chinese government’s plans for its AI industry. The goal of the Chinese government is to become a world leader in AI by 2030. China forecasts that the value of its core AI industries will be US$157.7 billion in 2030 (based on the exchange rate at 2018/01/19). How realistic that goal is will obviously depend upon the momentum already present within China’s AI sector, but even so I was struck and impressed by the ambition of the goal – 2030 is only 12 years away, which is not long in research and innovation terms. The Nature articles are worth a read (and are not behind a paywall).

ChinaAI_screenshot
Nature news article on China’s ‘New Generation of Artificial Intelligence Development Plan’

What will be the effect of China’s investment in AI? Attempting to make technology based predictions about the future can be ill-advised, but I will speculate anyway, as the articles, for me, prompted three immediate questions:

  • How likely is China to be successful in achieving its goal?
  • What sectors will it achieve most influence in?
  • What are competitor countries doing?

How successful will China be?

Whatever your opinions on the current hype surrounding AI, Machine Learning, and Data Science, there tends to be a consensus that Machine Learning will emerge from its current hype-cycle with some genuine gains and progress. This time it is different. The fact that serious investment in AI is being made not just by corporations but by governments (including the UK) could be taken as an indicator that we are looking beyond the hype. Data volumes, compute power, and credible business models are all present simultaneously in this current AI/Machine Learning hype-cycle, in ways that they weren’t in the 1980s neural network boom-and-bust and other AI Winters. Machine Learning and Data Science are becoming genuinely commoditized. Consequently, the goal China has set itself is about building capacity, i.e. about the transfer of knowledge from a smaller innovation ecosystem (such as the academic community and a handful of large corporate labs) to produce a larger but highly-skilled bulk of practitioners. A capacity-building exercise such as this should be a known quantity and so investments will scale – i.e. you will see proportional returns on those investments. The Nature news article does comment that China may face some challenges in strengthening the initial research base in AI, but this may be helped by the presence of large corporate players such as Microsoft and Google, who have established AI research labs within the country.

What sectors will be influenced most?

One prominent area for applications of AI and Machine Learning is commerce, and China provides a large potential market place. However, access to that market can be difficult for Western companies and so Chinese data science solution providers may face limited external competition on their home soil. Equally, Chinese firms wishing to compete in Western markets, using expertise of the AI-commerce interface gained from their home market, may face tough challenges from the mature and experienced incumbents present in those Western markets. Secondly, it may depend precisely on which organizations in China develop the beneficial experience in the sector. The large US corporates (Microsoft, Google) that have a presence in China are already main players in AI and commerce in the West, and so may not see extra dividends beyond the obvious ones of access to the Chinese market and access to emerging Chinese talent. Overall, it feels that whilst China’s investment in this sector will undoubtedly be a success, and Chinese commerce firms will be a success, China’s AI investment may not significantly change the direction the global commerce sector would have taken anyway with regard to its use and adoption of AI.

Perhaps more intriguing will be newer, younger sectors in which China has already made significant investment. Obvious examples, such as genomics, spring to mind, given the scale of activity by organizations such as BGI (including the AI-based genomic initiative of the BGI founder Jun Wang). Similarly, robotics is another field highlighted within the Nature articles.

What are China’s competitors doing in this area?

I will restrict my comments to the UK, which, being my home country, I am more familiar with. Like China, the UK has picked out AI, Robotics, and a Data-Driven Economy as areas that will help enable a productive economy. Specifically, the UK Industrial Strategy announced last year identifies AI for one of its first ‘Sector Deals’ and also as one of four Grand Challenges. The benefits of AI are even called out in other Sector Deals, for example in the Sector Deal for the Life Sciences. This is on top of existing UK investment in Data Science, such as the Alan Turing Institute (ATI) and last year’s announcement by the ATI that it is adding four additional universities as partners. In addition we have capacity-building calls from research councils, such as the EPSRC call for proposals for Centres for Doctoral Training (CDTs). From my quick reading, 4 of the 30 priority areas that the EPSRC has highlighted for CDTs make explicit reference to AI, Data Science, or Autonomous Systems; the number with some implicit dependence on AI or Data Science will be greater. Overall, the scale of the UK investment is, naturally, unlikely to match that of China – the original Nature report on the Chinese plans notes that no level of funding is mentioned. However, the likely scale of the Chinese governmental investment in AI will ultimately give that country an edge, or at least a higher probability of success. Does that mean the UK needs to re-think and up its investment?


Faa di Bruno and derivatives of an iterated function

I have recently needed to do some work evaluating high-order derivatives of composite functions. Namely, given a function f(t), evaluate the n^{th} derivative of the composite function  \underbrace{\left (f\circ f\circ f \circ \ldots\circ f \right )}_{l\text{ terms}}(t). That is, we define f_{l}(t) to be the function obtained by iterating the base function f(t)=f_{1}(t) l-1 times. One approach is to make recursive use of the Faa di Bruno formula,

\displaystyle \frac{d^{n}}{dx^{n}}f(g(x))\;=\;\sum_{k=1}^{n}f^{(k)}(g(x))B_{n,k}\left (g'(x), g''(x), \ldots, g^{(n-k+1)}(x) \right )      Eq.(1)

The fact that the exponential partial Bell polynomials B_{n,k}\left (x_{1}, x_{2},\ldots, x_{n-k+1} \right ) are available within the {\tt sympy} symbolic algebra Python package makes this initially an attractive route to evaluating the required derivatives. In particular, I am interested in evaluating the derivatives at t=0 and I am focusing on odd functions of t, for which t=0 is obviously a fixed point. This means I only have to supply numerical values for the derivatives of my base function f(t) evaluated at t=0, rather than supplying a function that evaluates derivatives of f(t) at any point t.

Given the Taylor expansion of f(t) about t=0 we can easily write code to implement the Faa di Bruno formula using sympy. A simple bit of pseudo-code to represent an implementation might look like,

  1. Generate symbols.
  2. Generate and store partial Bell polynomials up to known required order using the symbols from step 1.
  3. Initialize coefficients of Taylor expansion of the base function.
  4. Substitute numerical values of derivatives from previous iteration into symbolic representation of polynomial.
  5. Sum required terms to get numerical values of all derivatives of current iteration.
  6. Repeat steps 4 & 5.

I show python code snippets below implementing the idea. First we generate and cache the Bell polynomials,

import sympy

nMax = 15  # highest derivative order required (example value)
symbols_tmp = sympy.symbols( 'x1:%d' % (nMax + 1) )  # x1, ..., x{nMax}

# generate and cache the partial Bell polynomials B_{n,k}
bellPolynomials = {}
for n in range(1, nMax+1):
    for k in range(1, n+1):
        bp_tmp = sympy.bell(n, k, symbols_tmp)
        bellPolynomials[str(n) + '_' + str(k)] = bp_tmp

Then we iterate over the levels of function composition, substituting the numerical values of the derivatives of the base function into the Bell polynomials,

for iteration in range(nIterations):
    if( verbose ):
        print( "Evaluating derivatives for function iteration " + str(iteration+1) )

    for n in range(1, nMax+1):
        sum_tmp = 0.0
        for k in range(1, n+1):
            # kth derivative of the base (outer) function, evaluated at the fixed point
            f_k_tmp = derivatives_atFixedPoint_tmp[0, k-1]

            # substitute the previous iteration's derivatives into the Bell polynomial
            bellPolynomials_key = str( n ) + '_' + str( k )
            bp_tmp = bellPolynomials[bellPolynomials_key]
            replacements = [ ( symbols_tmp[i], derivatives_atFixedPoint_tmp[iteration, i] )
                             for i in range(n-k+1) ]
            sum_tmp = sum_tmp + ( f_k_tmp * bp_tmp.subs(replacements) )

        derivatives_atFixedPoint_tmp[iteration+1, n-1] = sum_tmp

Okay – this isn’t really making use of true recursion, merely looping, but the principle is the same. The problem one encounters is that the manipulation of the symbolic representation of the polynomials is slow, and run-times increase significantly for n > 15.

However, the n^{th} derivative can alternatively be expressed as a sum over partitions of n as,

\displaystyle \frac{d^{n}}{dx^{n}}f(g(x))\;=\;\sum \frac{n!}{m_{1}!m_{2}!\ldots m_{n}!} f^{(m_{1}+m_{2}+\ldots+m_{n})}\left ( g(x)\right )\prod_{j=1}^{n}\left ( \frac{g^{(j)}(x)}{j!}\right )^{m_{j}}   Eq.(2)

where the sum is taken over all tuples of non-negative integers m_{1}, m_{2},\ldots, m_{n} that satisfy 1\cdot m_{1}+ 2\cdot m_{2}+\ldots+ n\cdot m_{n}\;=\; n. That is, the sum is taken over all partitions of n. Fairly obviously, the Faa di Bruno formula is just a re-arrangement of the above equation, obtained by collecting together the terms involving f^{(k)}(g(x)), and that rearrangement gives the fundamental definition of the partial Bell polynomial.

I’d shied away from the more fundamental form of Eq.(2) in favour of Eq.(1), as I believed that having a number of terms already collected together in the form of the Bell polynomials would make any implementation that used them quicker. However, the advantage of the form in Eq.(2) is that the summation can be entirely numeric, provided an efficient generator of the partitions of n is available to us. Fortunately, sympy also contains a method for iterating over partitions. Below are code snippets that implement the evaluation of f_{l}^{(n)}(0) using Eq.(2). First we generate and store the partitions,

from sympy.utilities.iterables import partitions

# store the partitions of 1, ..., n (n = highest derivative order required)
pStore = {}
for k in range( n ):
    # the iterator re-uses its dictionary, so take a copy of each partition
    pIterator = partitions(k+1)
    pStore[k] = [p.copy() for p in pIterator]
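
As a quick sanity check on what the iterator returns: each partition comes back as a dictionary mapping a part to its multiplicity, e.g. the five partitions of 4 are 4, 3+1, 2+2, 2+1+1 and 1+1+1+1,

from sympy.utilities.iterables import partitions

for p in partitions(4):
    print(dict(p))   # e.g. {4: 1}, {3: 1, 1: 1}, {2: 2}, {2: 1, 1: 2}, {1: 4}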

After initializing arrays to hold the derivatives of the current function iteration, we then loop over each iteration, retrieving each partition and evaluating the product in the summand of Eq.(2). It is relatively easy to work on the log scale, which helps avoid overflow from the factorials, as shown in the code snippet below,

import numpy as np
from scipy.special import gammaln

# loop over function iterations
for iteration in range( nIterations ):

    if( verbose==True ):
        print( "Evaluating derivatives for function iteration " + str(iteration+1)  )

    for k in range( n ):
        faaSumLog = float( '-Inf' )
        faaSumSign = 1

        # get partitions
        partitionsK = pStore[k]
        for pidx in range( len(partitionsK) ):
            p = partitionsK[pidx]
            sumTmp = 0.0
            sumMultiplicty = 0
            parityTmp = 1
            for i in p.keys():
                value = float(i)
                multiplicity = float( p[i] )
                sumMultiplicty += p[i]
                sumTmp += multiplicity * currentDerivativesLog[i-1]
                sumTmp -= gammaln( multiplicity + 1.0 )
                sumTmp -= multiplicity * gammaln( value + 1.0 )
                parityTmp *= np.power( currentDerivativesSign[i-1], multiplicity )	

            sumTmp += baseDerivativesLog[sumMultiplicty-1]
            parityTmp *= baseDerivativesSign[sumMultiplicty-1]

            # now update faaSum on log scale
            if( sumTmp > float( '-Inf' ) ):
                if( faaSumLog > float( '-Inf' ) ):
                    diffLog = sumTmp - faaSumLog
                    if( np.abs(diffLog) < thresholdForExp ):
                        if( diffLog >= 0.0 ):
                            # new term has the larger magnitude: re-base the running sum on it
                            faaSumLog = sumTmp
                            faaSumLog += np.log( 1.0 + (float(parityTmp*faaSumSign) * np.exp( -diffLog )) )
                            faaSumSign = parityTmp
                        else:
                            # running sum has the larger magnitude: fold the new term into it
                            faaSumLog += np.log( 1.0 + (float(parityTmp*faaSumSign) * np.exp( diffLog )) )
                    else:
                        # magnitudes differ too much for a stable exp(): keep the dominant term
                        if( diffLog > thresholdForExp ):
                            faaSumLog = sumTmp
                            faaSumSign = parityTmp
                else:
                    faaSumLog = sumTmp
                    faaSumSign = parityTmp

        nextDerivativesLog[k] = faaSumLog + gammaln( float(k+2) )
        nextDerivativesSign[k] = faaSumSign

Now let’s run both implementations, evaluating up to the 15th derivative for 4 function iterations. Here my base function is f(t)\;=\;1\;-\;\frac{2}{\pi}\arccos t. A plot of the base function is shown below in Figure 1.

FaaDiBrunoExample_BaseFunction
Figure 1: Plot of the base function f(t)\;=\;1\;-\;\frac{2}{\pi}\arccos(t)

The base function has a relatively straight forward Taylor expansion about t=0,

\displaystyle f(t)\;=\;\frac{2}{\pi}\sum_{k=0}^{\infty}\frac{\binom{2k}{k}t^{2k+1}}{4^{k}\left ( 2k+1 \right )}\;\;\;,\;\;\;|t| \leq 1 \;\;,    Eq.(3)
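
For reference, the base-function derivatives at zero follow immediately from Eq.(3): f^{(2k+1)}(0) = (2k+1)!\,a_{2k+1}, where a_{2k+1} is the Taylor coefficient, and all even-order derivatives vanish. A short sketch of how the log-scale arrays used in the snippets above could be populated (the exact array handling in my gist differs),

import numpy as np
from scipy.special import gammaln, comb

nMax = 15                                            # highest derivative order required
baseDerivativesLog = np.full(nMax, float('-Inf'))    # log |f^(m)(0)| for m = 1, ..., nMax
baseDerivativesSign = np.ones(nMax, dtype=int)       # all non-zero derivatives are positive here

for k in range((nMax - 1) // 2 + 1):
    m = 2 * k + 1                                    # only odd-order derivatives are non-zero
    # Taylor coefficient a_m = (2/pi) * C(2k, k) / (4^k * (2k+1)); derivative f^(m)(0) = m! * a_m
    logCoeff = (np.log(2.0 / np.pi) + np.log(comb(2 * k, k, exact=True))
                - k * np.log(4.0) - np.log(2.0 * k + 1.0))
    baseDerivativesLog[m - 1] = logCoeff + gammaln(m + 1.0)

print(np.round(np.exp(baseDerivativesLog[:5]), 4))   # f'(0), ..., f^(5)(0); even orders appear as 0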

and so supplying the derivatives, f^{(k)}(0), of the base function is easy. The screenshot below shows a comparison of f_{l}^{(15)}(0) for l\in \{2, 3, 4, 5\}. As you can see we obtain identical output whether we use sympy’s Bell polynomials or sympy’s partition iterator.

FaaDiBruno_TimingComparison

The comparison of the implementations is not really a fair one. One implementation is generating a lot of symbolic representations that aren’t really needed, whilst the other is keeping to entirely numeric operations. However, it did highlight several points to me,

  • Directly working with partitions, even up to moderate values of n, e.g. n=50, can be tractable using the sympy package in python.
  • Sometimes the implementation of the more concisely expressed representation (in this case, in terms of Bell polynomials) can have significantly longer run-times, even if that representation can be implemented in fewer lines of code.
  • The history of the Faa di Bruno formula, and of the various associated polynomials and equivalent formalisms (such as the Jabotinsky matrix formalism), is a fascinating one.

I’ve put the code for both methods of evaluating the derivatives of an iterated function as a gist on github.

At the moment the functions take an array of Taylor expansion coefficients, i.e. they assume the point at which derivatives are requested is a fixed point of the base function. At some point I will add methods that take a user-supplied function for evaluating the k^{th} derivative, f^{(k)}(t), of the base function at any point t and will return the derivatives, f_{l}^{(k)}(t) of the iterated function.

I haven’t yet explored whether, for reasonable values of n (say n \leq 50), I need to work on the log scale, or whether direct evaluation of the summand will be sufficiently accurate and not result in overflow error.

Manchester R User Group Meetup – May 2017

At the latest Manchester R User Group meeting (organized by Mango Solutions) Leanne Fitzpatrick from HelloSoda gave a talk on Deploying Models in a Machine Learning Environment.

Leanne spoke about how the use of Docker had speeded up the deployment of machine learning models into the production environment, and had also enabled easier monitoring and updating of the models.

One of the additional benefits, and Leanne hinted that this may even have been the original motivation, was reducing the barriers between the data scientists and software engineers in the company. Data Science is an extremely broad church, encompassing a wide range of skill-sets and disciplines. Inevitably, there can be culture clashes between those who consider themselves to be from the ‘science’ side of Data Science and those from the engineering side. Scientists are people who like to explore data and develop proof-of-concept projects, but who are often not the most disciplined in code writing and organization, and for whom operational deployment of a model is the last stage in their thinking. Scientists break things. Scientists like to break things. Scientists learn by breaking things.

xkcd_the_difference
Scientists are different (taken from xkcd.com)

Data Scientists who break things can be seen as an annoyance to those responsible for maintaining the operational infrastructure.

Obviously, in a commercial environment the data scientists and software engineers/developers need to work as efficiently together as possible. The conclusion that Leanne presented in her talk suggested that HelloSoda have taken some steps towards solving this problem through their use of containerization of the models.  I say, ‘some steps’, as I can’t believe that any organization can completely remove all such barriers. Having worked in inter-disciplinary teams in both the commercial world and in academic research I’ve seen some teams work well together and others not. What tools and protocols an organization can use to generally reduce the barriers between investigative Data Science and operational Data Science is something that intrigues me – something for a longer post maybe.