Defensive & Offensive Predictive Analytics

[Figure: Harvard Business Review article]

This article from the Harvard Business Review was a short but interesting read. It discusses defensive and offensive data management strategies: defensive strategies are about minimizing downside risk and are more common in regulated industries, whilst offensive strategies focus on supporting business objectives and are more common in unregulated or less-regulated industries. The authors rightly point out that the correct mix of defensive and offensive strategies will be specific to each individual organization.

Having worked in a number of commercial sectors, my experience is that the use of predictive analytics within organizations also divides into defensive and offensive activities, irrespective of whether that predictive analytics and data science activity is enabled by a well-thought-out data management strategy. There are good reasons for this, and again it is largely determined by whether or not the activity of an organization carries a large downside risk.

Consider a company, such as a bank, whose activity has a large downside risk, and where losses due to bad loans can almost wipe out a balance sheet very quickly. My experience of doing analytics in a UK retail bank is that the predictive analytics focus is on modelling that downside risk with a view to understanding it, forecasting it and ultimately mitigating it. The analytics effort focuses on risk minimization (still an optimization), whilst optimization of the profit side of the P&L is less computationally driven; for example, committees of human subject matter experts decide mortgage or savings rates for the coming year.

In contrast, in companies where the downside risk is lower, such as those whose transactions with customers are on much shorter timescales than a bank's, the use of predictive modelling tends to focus more on the optimization of revenue and profit than on the minimization of losses from liabilities. Take grocery supermarkets, where predictive demand models are used to set product prices in order to optimize profit. Whilst getting the pricing strategy wrong will impact revenues, it does not leave the organization holding a long-term liability and is ultimately reversible. Mistakes when using predictive models in this domain are unlikely to take a company down.

From what I have seen, the use of predictive modelling within a business is typically almost binary, i.e. focused either predominantly on the downside risk or predominantly on optimizing the business objectives, even though most businesses have both upsides and downsides to their activity. I haven't seen many medium-scale organizations where predictive modelling is used at a mature level across the majority of business areas or tasks. Even rarer in my experience are situations where predictive modelling of both downside risks and business objectives is done concurrently, with the optimization taking into account both sides of the P&L. It would be interesting to find good examples outside, say, the largest 50 companies in the FTSE 100, Dow Jones, Nasdaq, or S&P 500, where a more joined-up approach is taken to using predictive analytics for optimizing the P&L.

Practical PCA

[Figure: PCA schematic]

Principal Component Analysis (PCA) is a commonly applied algorithm in statistics and data science. Because it is so easy to understand at a high level, and because it is so easy to apply, PCA has become ubiquitous. It is often applied without much thought and with the output rarely questioned. Typically, the questions I ask when applying PCA are:

  1. Do I need to do any transformation of the data before applying PCA?
  2. How many principal components should I select?
  3. How do I interpret the loadings (and scores)?

Unfortunately, I have seen a number of talks and presentations recently where PCA has been used and the impact on the analysis of not having thought about these questions was clear. OK, I'm biased, as the behaviour of PCA, particularly when applied to high-dimensional data, is one of my areas of research. Whilst the research papers on PCA can be very complex, they do, however, provide some useful insight and guidance on how to apply PCA to real data. In this post I'm going to look at those three questions in turn. In what follows I'll assume you are already familiar with PCA and that you are aware that the principal components are the eigenvectors of the sample covariance matrix corresponding to the largest eigenvalues. For a good introduction to PCA see the blog post by Laura Hamilton and also the classic book by Jolliffe.
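As a quick refresher, here is a minimal sketch of PCA computed directly via the eigendecomposition of the sample covariance matrix (using NumPy; the function name and interface are purely illustrative):

```python
import numpy as np

def pca_eig(X, k):
    """Minimal PCA via eigendecomposition of the sample covariance matrix.

    X : (N, p) data matrix, rows are observations.
    k : number of principal components to retain.
    """
    Xc = X - X.mean(axis=0)               # centre each variable
    S = np.cov(Xc, rowvar=False)          # p x p sample covariance matrix
    evals, evecs = np.linalg.eigh(S)      # eigenvalues returned in ascending order
    order = np.argsort(evals)[::-1]       # re-order so the largest eigenvalue is first
    evals, evecs = evals[order], evecs[:, order]
    loadings = evecs[:, :k]               # the k leading principal components (p x k)
    scores = Xc @ loadings                # projections of the data (N x k)
    return evals, loadings, scores
```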

TL;DR – the short answers to the questions above are: i) check for outliers and/or heavy tails in your data before applying PCA; ii) use the 'knee' in the eigenvalue scree plot to select the number of components; iii) make sure you look at the loadings of the selected principal components and that you can explain any major patterns in those loadings.

1. Transformation of the data

The original derivations of PCA, such as that by Hotelling (1933), are heuristic and make no explicit assumptions other than the requirement that the selected components retain as much of the variance in the original data as possible. The formal derivation of PCA as the maximum likelihood estimate of the parameters of a probabilistic model assumes additive Gaussian noise – see for example the paper by Tipping and Bishop. In practice, where the distributions of the latent factors and additive noise are still reasonably symmetric and decay sufficiently fast, we would expect deviations from perfect Gaussians to have only a minor effect. Despite this, I have seen a number of talks with PCA plots similar to that shown on the left below.

[Figure: PC scores from the heavy-tailed example – scatter of scores (left) and estimated density of the 1st PC scores (right)]

I have based this example on a real commercial data set I was shown, but for privacy and confidentiality reasons I have generated a simulated data set that reproduces the issue. The majority of the PC1 scores are squashed into the left-hand side of the plot, with a few scores reaching much higher values than the rest. This can be seen more clearly if we look at the estimated density of the 1st PC scores – see the right-hand plot above. A heavy tail is clearly present, meaning we are deviating significantly from the assumptions under which the eigenvectors of the sample covariance matrix are the optimal estimators of the signal directions in the raw data.

So an obvious first step would be to take a quick look at the distribution of the raw data that we are trying to decompose using PCA. If that distribution has a heavy tail or significant outliers then there is an argument for applying a transformation (e.g. logarithm) before applying PCA. If we take the log of the example data above then we obtain much more reasonable distributions for the PC scores – see below.

[Figure: PC scores after log-transforming the data]

Tip: If you see an elongated distribution of scores along the PCs you have selected, then it may be worth going back and looking at the distribution of the raw data going into the PCA – you should have already looked at the distribution of your raw data anyway as part of EDA best practice.
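A minimal sketch of this kind of pre-PCA check, assuming a simple skewness test followed by a log transform (the threshold of 2 and the log1p shift are arbitrary illustrative choices, not a recommendation):

```python
import numpy as np
from scipy import stats

def log_transform_if_skewed(X, skew_threshold=2.0):
    """Log-transform any heavily right-skewed columns before running PCA.

    The skewness threshold of 2 is purely illustrative, not a recommendation.
    """
    X = np.asarray(X, dtype=float)
    Xt = X.copy()
    skews = stats.skew(X, axis=0)
    for j, s in enumerate(skews):
        if s > skew_threshold:
            # shift the column so it is non-negative, then take log(1 + x)
            Xt[:, j] = np.log1p(X[:, j] - X[:, j].min())
    return Xt, skews
```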

2. Selection of the number of components

There are two common methods for selecting the number of components that most people will be familiar with or will encounter:

  1. Select the number of components so that a fixed proportion, e.g. 90%, of the total variance is retained (a short sketch of this rule follows the list).
  2. Look at a scree plot of the eigenvalues and locate the ‘elbow’ (or ‘knee’ depending on your interpretation of anatomy).
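For completeness, here is a minimal sketch of the first rule (the 90% target is just the conventional example given above):

```python
import numpy as np

def k_for_variance_fraction(evals, target=0.90):
    """Method 1: the smallest k whose leading eigenvalues retain at least
    `target` of the total variance (90% here purely as the conventional example)."""
    lam = np.sort(np.asarray(evals, dtype=float))[::-1]   # rank largest first
    frac = np.cumsum(lam) / lam.sum()                     # cumulative variance explained
    return int(np.searchsorted(frac, target)) + 1         # convert 0-based index to a count
```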

Of those two methods it is the second that I always prefer. This is because it has a sound theoretical underpinning and is more robust when applied to the kind of high-dimensional data sets that are commonplace nowadays. Let me explain why. The goal of PCA is to select a small number of directions in the data that we believe capture the signal within the data. The first approach to PCA model selection is effectively based upon the assumption that the ‘signal’ contribution to the total variance is considerably greater than the noise contribution to the total variance, and thus by selecting to retain the majority of the total variance in the original data we believe we are effectively selecting components that represent signal.

If our data points are p-dimensional, then we have p sample covariance eigenvalues, \lambda_{i},\;i=1,\ldots,p. We consider the signal part of the data to be represented by a small number, k, of eigenvectors/eigenvalues, with the p-k-1 remaining non-zero eigenvalues corresponding to pure noise in the original data.

Typically, there is no reason to believe that the noise process affects any of the original features/variables more strongly than the others – i.e. it is reasonable to consider the noise process to be isotropic. If not, then the preferred directions are essentially some form of signal and not noise. This means we expect the noise eigenvalues to be approximately the same, and let’s say they have an average value \sigma^{2}. As we look at data sets of increasing dimension p, unless the number of signal eigenvalues, k, or the eigenvalues, \lambda_{1},\ldots,\lambda_{k}, themselves grow with p, then the variance explained by the signal values, \sum_{i=1}^{k}\lambda_{i}, will remain relatively static whilst the noise eigenvalues contribute (p-k-1)\sigma^{2} to the total variance. Thus we can see that the fraction of total variance explained by the signal components is essentially,

{\rm Fraction\;of\;variance\;explained}\;=\;\frac{\sum_{i=1}^{k}\lambda_{i}}{(p-k-1)\sigma^{2}\;+\;\sum_{i=1}^{k}\lambda_{i}}\;\;,

and so it decreases as the data dimension p increases. Consequently, for high-dimensional data, where p is very large (often in the thousands), the percentage of variance explained by the true signal components can be very small. If we then select the number of components so as to retain, say, 90% of the total variance, we will include a lot of noise in the retained components and will not reduce the dimensionality as efficiently as we could.
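To make this concrete, here is a small numerical illustration with invented numbers: three fixed signal eigenvalues and unit-variance isotropic noise, with only the dimension p changing.

```python
import numpy as np

signal_evals = np.array([50.0, 20.0, 10.0])   # k = 3 hypothetical signal eigenvalues
k, sigma2 = len(signal_evals), 1.0            # isotropic noise variance sigma^2 = 1

for p in (50, 500, 5000):
    noise_total = (p - k - 1) * sigma2                             # noise contribution to total variance
    frac = signal_evals.sum() / (noise_total + signal_evals.sum()) # formula above
    print(f"p = {p:5d}: fraction of variance explained by the signal = {frac:.1%}")

# p =    50: fraction of variance explained by the signal = 63.5%
# p =   500: fraction of variance explained by the signal = 13.9%
# p =  5000: fraction of variance explained by the signal = 1.6%
```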

In contrast, if the true eigenvalues are split into a small number of 'signal' eigenvalues and a much larger number of 'noise' eigenvalues, and we expect the noise process to be isotropic (i.e. we have a single, highly degenerate noise eigenvalue), then the observed (sample) eigenvalues will consist of a bulk of eigenvalues clustered around a single value and a small number of larger eigenvalues separated from this bulk. In other words, we expect the distribution of sample covariance eigenvalues to look something like the plot on the left below.

[Figure: S&P500 sample covariance eigenvalues – eigenvalue distribution (left) and scree plot of the top 20 eigenvalues (right)]

The eigenvalues in this example have been obtained from the sample covariance matrix of inter-day returns of closing prices over the four-year period 2010 to 2013 for S&P500 stocks. We have actually omitted the largest eigenvalue, as this is of a different scale and represents a 'market mode' – see the next section.

Note that the bulk of the sample eigenvalues shows some dispersion even though we had only one true, highly degenerate, noise eigenvalue. This is due to sampling variation, i.e. the fact that we have only a finite-sized sample. If we plot the top 20 eigenvalues from the distribution above, ranked from largest to smallest, we get a scree plot like the plot on the right above.

Clearly, the sample eigenvalues corresponding to the bulk are very similar to each other, and so we see only small decreases with increasing rank in the sample eigenvalues in the bulk. In contrast, where there is an atypical jump in eigenvalue as we decrease the rank, this represents the point at which a signal eigenvalue can be detected as being separate from the bulk. This point also represents the ‘knee’ of the scree plot.

Statistical models that produce this kind of scree-plot are called ‘spiked-covariance’ models, so-called because the true population eigenvalue distribution is concentrated (or spiked) around just a small number of values. For these models we consider the data to have been produced by a small number of latent factors with isotropic noise. That is, our data point {\bf x}_{i} is given by,

{\bf x}_{i}\;=\;\sum_{j=1}^{k}z_{ij}\lambda_{j}{\bf B}_{j}\;+\; \boldsymbol{\epsilon}_{i}\;\;\;,\;\;\;\boldsymbol{\epsilon}_{i}\sim {\cal N}\left ( {\bf 0}, \sigma^{2}{\bf I} \right)\;\;\;,\;\;\;z_{ij}\sim {\cal N}\left ( 0, 1\right)\;\;\;,

with the vectors {\bf B}_{1},\ldots,{\bf B}_{k} forming an orthonormal set.
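A quick way to get a feel for this behaviour is to simulate data directly from the model above and look at the resulting sample covariance eigenvalues (the dimensions, spike strengths, and noise level below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, k, sigma = 500, 200, 3, 1.0
lam = np.array([8.0, 5.0, 3.0])                   # illustrative 'spike' strengths lambda_j

B, _ = np.linalg.qr(rng.standard_normal((p, k)))  # orthonormal directions B_1, ..., B_k
Z = rng.standard_normal((N, k))                   # latent factors z_ij ~ N(0, 1)
E = sigma * rng.standard_normal((N, p))           # isotropic Gaussian noise

X = (Z * lam) @ B.T + E                           # data generated as in the model above

# Sample covariance eigenvalues: a bulk clustered near sigma^2 plus k separated values
sample_evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print(sample_evals[:6])
```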

From the mathematical form of the spiked-covariance models we can work out the expected distribution of sample covariance eigenvalues when our data consists of N data points (each of dimension p). A large amount of research has been done in this area over the last 10-15 years. I won't try to summarize it here; instead I'll point you to the excellent review by Johnstone and the original work by Paul. This research allows us to devise methods that detect the number of principal components automatically, i.e. that can be programmed as part of an algorithm. On the practical side you can always "eyeball" the scree plot. For an isolated piece of analysis this is always advisable, as it takes so little time to do.

Also on the pragmatic side, I have often found that simple 'knee' detection algorithms work surprisingly well, particularly for the real-world data sets that I encounter as part of my day-to-day work. The simplest of these involves finding the maximum of a discrete approximation to the second derivative of the ranked eigenvalue plot. That is, we choose k as,

k \;=\; \underset{i}{\mathrm{argmax}} \left ( \lambda_{i+1} + \lambda_{i-1} - 2\lambda_{i}\right )
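A minimal sketch of this rule, applied to eigenvalues ranked from largest to smallest (it simply returns the rank at which the discrete second difference is largest):

```python
import numpy as np

def knee_from_scree(evals):
    """Simple 'knee' detection: the rank i that maximizes the discrete second
    difference lambda_{i+1} + lambda_{i-1} - 2 * lambda_i of the ranked eigenvalues."""
    lam = np.sort(np.asarray(evals, dtype=float))[::-1]   # rank from largest to smallest
    second_diff = lam[2:] + lam[:-2] - 2.0 * lam[1:-1]    # defined for ranks i = 2, ..., p-1
    return int(np.argmax(second_diff)) + 2                # +2 converts back to a 1-based rank
```

Whatever such a rule returns, it is worth sanity-checking the chosen k against the scree plot itself.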

Improved approaches to 'knee' detection are based upon discrete approximations of the curvature. The paper by Satopää et al. gives a good introduction to such methods. Again, these simple 'knee' detection approaches have the advantage that they can be coded and hence included as part of some automated process.

Note that in situations where the signal eigenvalues (those distinct from the bulk) do contribute the majority of the variance in the original data, selecting the number of PCs on the basis of detecting a 'knee' in a scree plot will have the same advantages as selecting on the basis of retaining a certain fixed percentage of the total variance. Consequently, there is very little reason not to use the scree plot for selection in all cases.

3. Interpretation of the loadings

I have also found it is not uncommon for PCA to be blindly applied and the scores, i.e. the values of the new, lower-dimensional features, to be plotted or fed into some downstream process without any further curiosity being applied. Where the leading sample eigenvalue is very large – that is, on a different scale to the other retained principal components or the bulk of the sample eigenvalues – I always take a look at the loadings. This is the case for the S&P500 data discussed above. The loadings tell us the contribution each of the original variables makes to the principal component. The loadings for the 1st principal component of the S&P500 data are shown in the plot below.

[Figure: Loadings of the 1st principal component of the S&P500 data]

Where the 1st sample eigenvalue is very large (compared to the others) it is not uncommon to see a loading pattern like that above – each loading value is of the same sign (and, in this case, of comparable size). The large 1st eigenvalue tells us that variation along the 1st principal component is the dominant mode of variation present in the original data. The loading pattern tells us that in this mode of variation all the original variables increase together, or decrease together. Such a pattern is often termed a 'global mode', i.e. it is a mode of variation that has largely the same effect globally on all the original variables. There are several scenarios where the presence of a global mode can be naturally explained or is to be expected. For example,

  • Market modes in stock prices. This is where a rising or falling market causes all stock prices to rise or fall together.
  • In gene expression data obtained from model organisms exposed to a large environmental perturbation or insult. Here, for model organisms, e.g. yeast cultures, we can shock the biological system being studied without any ethical concerns, e.g. starve the organism of its primary food/fuel source or other essential nutrients. Consequently we see a system wide response to the starvation.
  • Price sensitivities of a collection of products sold by a retailer in a store. Here we expect the price elasticities of products to reflect, in large part, the economic conditions of the local geography. Consequently, a predominant part of the variation in product elasticities will be due to store-to-store variation in economic conditions and may show up as a global mode.

Where we see a global mode in the loadings, we should ask whether we can identify a credible mechanism behind the global mode. If not, then this should make us cautious about the appropriateness of the 1st principal component and hence the complete decomposition.
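A crude, illustrative check for this kind of pattern is simply to ask whether (nearly) all loadings on the leading component share the same sign (the 95% cut-off below is an arbitrary choice):

```python
import numpy as np

def looks_like_global_mode(first_pc_loadings, same_sign_frac=0.95):
    """Return True if (nearly) all loadings of the leading PC share the same sign -
    suggestive of, though not proof of, a 'global mode'."""
    v = np.asarray(first_pc_loadings, dtype=float).ravel()
    return max(np.mean(v > 0), np.mean(v < 0)) >= same_sign_frac
```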

Finally, it is worth mentioning that if a sparse loading pattern is more naturally expected or convenient, then sparse versions of PCA can be used. The lecture notes by Rob Tibshirani provide a good introduction to sparse PCA.
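If you want to experiment, scikit-learn's SparsePCA is one readily available implementation (the random data and penalty strength below are placeholders purely for illustration):

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))        # stand-in data; substitute your own matrix

spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)            # (200, 3) sparse-PCA scores

# Rows of components_ are the loading vectors; the L1 penalty drives many entries to zero
print((spca.components_ == 0).mean())
```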

China invests big in AI

This week saw interesting news and career guide articles in Nature highlighting the Chinese government's plans for its AI industry. The goal of the Chinese government is to become a world leader in AI by 2030. China forecasts that the value of its core AI industries will be US$157.7 billion in 2030 (based on the exchange rate on 2018/01/19). How realistic that goal is will obviously depend upon what momentum there already is within China's AI sector, but even so I was struck and impressed by the ambition of the goal – 2030 is only 12 years away, which is not long in research and innovation terms. The Nature articles are worth a read (and are not behind a paywall).

[Figure: Nature news article on China's 'New Generation of Artificial Intelligence Development Plan']

What will be the effect of China’s investment in AI? Attempting to make technology based predictions about the future can be ill-advised, but I will speculate anyway, as the articles, for me, prompted three immediate questions:

  • How likely is China to be successful in achieving its goal?
  • What sectors will it achieve most influence in?
  • What are competitor countries doing?

How successful will China be?

Whatever your opinions on the current hype surrounding AI, Machine Learning, and Data Science, there tends to be a consensus that Machine Learning will emerge from its current hype-cycle with some genuine gains and progress. This time it is different: the fact that serious investment in AI is being made not just by corporations but by governments (including the UK) could be taken as an indicator that we are looking beyond the hype. Data volumes, compute power, and credible business models are all present simultaneously in this current AI/Machine Learning hype-cycle, in ways that they weren't in the 1980s neural network boom-and-bust and other AI Winters. Machine Learning and Data Science are becoming genuinely commoditized. Consequently, the goal China has set itself is about building capacity, i.e. about the transfer of knowledge from a smaller innovation ecosystem (such as the academic community and a handful of large corporate labs) to produce a larger but highly skilled bulk of practitioners. A capacity-building exercise such as this should be a known quantity and so investments will scale – i.e. you will see proportional returns on those investments. The Nature news article does comment that China may face some challenges in strengthening its initial research base in AI, but this may be helped by the presence of large corporate players, such as Microsoft and Google, who have established AI research labs within the country.

What sectors will be influenced most?

One prominent area for applications of AI and Machine Learning is commerce, and China provides a large potential marketplace. However, access to that market can be difficult for Western companies, and so Chinese data science solution providers may face limited external competition on their home soil. Equally, Chinese firms wishing to compete in Western markets, using expertise of the AI-commerce interface gained from their home market, may face tough challenges from the mature and experienced incumbents in those Western markets. A second consideration is precisely which organizations in China develop that beneficial experience in the sector. The large US corporates (Microsoft, Google) that have a presence in China are already main players in AI and commerce in the West, and so may not see extra dividends beyond the obvious ones of access to the Chinese market and to emerging Chinese talent. Overall, it feels that whilst China's investment in this sector will undoubtedly be a success, and Chinese commerce firms will be a success, China's AI investment may not significantly change the direction the global commerce sector would have taken anyway with regard to its use and adoption of AI.

Perhaps more intriguing will be newer, younger sectors in which China has already made significant investment. Obvious examples, such as genomics, spring to mind, given the scale of activity by organizations such as BGI (including the AI-based genomic initiative of the BGI founder Jun Wang). Similarly, robotics is another field highlighted within the Nature articles.

What are China’s competitors investing in this area?

I will restrict my comments to the UK, which, being my home country, I am more familiar with. Like China, the UK has picked out AI, Robotics, and a Data Driven Economy as areas that will help enable a productive economy. Specifically, the UK Industrial Strategy announced last year identifies AI as one of its first 'Sector Deals' and also as one of four Grand Challenges. The benefits of AI are even called out in other Sector Deals, for example in the Sector Deal for the Life Sciences. This is on top of existing UK investment in Data Science, such as the Alan Turing Institute (ATI) and last year's announcement by the ATI that it is adding four additional universities as partners. In addition, we have capacity-building calls from research councils, such as the EPSRC call for proposals for Centres for Doctoral Training (CDTs). From my quick reading, 4 of the 30 priority areas that the EPSRC has highlighted for CDTs make explicit reference to AI, Data Science, or Autonomous Systems; the number of priority areas that will have some implicit dependence on AI or Data Science will be greater. Overall, the scale of the UK investment is, naturally, unlikely to match that of China – the original Nature report on the Chinese plans notes that no level of funding is mentioned. However, the likely scale of the Chinese governmental investment in AI will ultimately give that country an edge, or at least a higher probability of success. Does that mean the UK needs to re-think and up its investment?

Notes: