The different types of data I have encountered as a Data Scientist

TL;DR

Where Data or Data Science is the business model or primary purpose of an organization you can expect data and the data eco-system to be properly invested in. In scientific research and industrial settings this will be common. In the commercial world this is not always the case. As a Data Scientist in the commercial world you should learn to ask, ‘Does this company need data? Does this company actually need Data Science?’ If the answers are not a resounding yes, then beware. Sometimes you won’t be able to answer those questions until you’re up close and inside the organization, but there are indicators that suggest upfront whether an organization is likely to have good quality data and a functioning data eco-system, or whether it will be a pile of trash. In the long read below I outline the different kinds of data I’ve encountered across the commercial and non-commercial sectors I’ve worked in, and what signs I’ve learnt to look for.

Long version

Over 20 years ago, as I was just starting research in the Bioinformatics field, a colleague explained to me the challenges of working with high-throughput biological experimental data. The data, he commented, was typically high-dimensional and very messy. “Messy” was new to me. With a PhD in Theoretical Physics and post-doctoral research in materials modelling, the naïve physicist in me thought, ‘…ah…messy…you mean Gaussian errors with large variance’. I was very wrong.

I learnt that “messy” could mean a high frequency of missing data, missing meta-data, and mis-labelled data. Sometimes the data could be mis-labelled at source because the original biological material had been mis-labelled. I remember one interesting experience where I had to persuade my experimental collaborator to go back to the -80°C freezer to check a lung tissue sample, because a mis-labelled sample was the only remaining explanation for the anomalous pattern we were seeing in a PCA plot. Sure enough, something had gone wrong in the LIMS: the bar-code associated with the assay data did not match the bar-code on the sample in the freezer.
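Anomalies like this can often be surfaced programmatically before anyone opens a freezer. Below is a minimal sketch, using only numpy and entirely synthetic data (the tissue labels, group sizes, and dimensions are invented for illustration), of using a PCA projection to flag a sample that sits far from the centroid of its own labelled group:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic assay data: two tissue types with clearly separated profiles.
lung = rng.normal(loc=0.0, scale=1.0, size=(20, 50))
liver = rng.normal(loc=5.0, scale=1.0, size=(20, 50))

# Simulate a swapped bar-code: sample 0 of the "lung" group is really liver.
lung[0] = rng.normal(loc=5.0, scale=1.0, size=50)

X = np.vstack([lung, liver])
labels = np.array(["lung"] * 20 + ["liver"] * 20)

# PCA via SVD: centre the data, then project onto the first two components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T

# Distance of each sample from the centroid of its own labelled group;
# samples far from their centroid are candidate mislabels.
for group in ("lung", "liver"):
    members = scores[labels == group]
    dists = np.linalg.norm(members - members.mean(axis=0), axis=1)
    print(group, "suspect index:", int(dists.argmax()))
```

The swapped sample dominates the within-group distances for its labelled group, which is exactly the kind of separation that jumps out of a PCA plot by eye.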

However, as I moved from the academic sphere into the commercial realm (10 years in the commercial sector now in 2021), I’ve learnt that data errors and issues can be much more varied and challenging, and that they vary from sector to sector and from domain to domain. But I have also seen that there are broad classes that explain the patterns of data issues I have experienced over 30 years of mathematical and statistical modelling. In this piece I am going to outline what those broad classes and patterns are, and more importantly how to recognize when they are likely to arise. The broad classes of data are,

  1. Scientific or Experimental – here, data is collected for the purpose of being analysed, and is therefore optimized towards that purpose.
  2. Sensor data – here, data is collected for the purpose of detecting signals, either diagnostic (working out the cause of a problem) or monitoring (detecting a problem in the first place). It is intended primarily to be automatically monitored/processed, not necessarily analysed by a human (Data Scientist) in the loop. The data lends itself to large-scale analysis but may not be in a form that is optimal or friendly for doing data science.
  3. Commercial operational data – this is data that is stored, rather than actively collected. It is stored, usually initially on a temporary (short-term) basis, for the purposes of running the business. It could be server event data from applications supporting online platforms or retail sites, customer transaction data, Google ad-click data, marketing spend data, or financial data.

It is this last category that is perhaps the most interesting – quite obviously so if you are a Data Scientist working in a commercial sector. Commercial operational data can contain not only the usual errors but also a range of additional challenges – for example, columns/fields in tables containing completely different data from what they are supposed to, because a field has been over-loaded, or re-purposed for a different need without going through a modification/re-design of the schema. Without an accurate, updated data-dictionary, knowledge about the true content of the fields in a table resides as ‘folk knowledge’ in the heads of the longer-serving members of staff – until they leave the company, and that knowledge is lost.
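One cheap defence against silently over-loaded fields is to profile each column against its documented type. A minimal sketch, assuming pandas; the column name and values are hypothetical:

```python
import pandas as pd

# An 'order_value' field that has quietly been re-purposed: some rows
# now carry a promo code instead of a monetary amount.
df = pd.DataFrame(
    {"order_value": ["19.99", "4.50", "PROMO-XMAS", "12.00", "FREESHIP"]}
)

# Coerce to numeric; anything that fails to parse becomes NaN.
parsed = pd.to_numeric(df["order_value"], errors="coerce")
bad_fraction = parsed.isna().mean()
print(f"{bad_fraction:.0%} of rows are not numeric")
```

A scheduled check like this, run per column against the data-dictionary’s declared types, turns ‘folk knowledge’ into an alert the day a field is re-purposed, rather than a surprise years later.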

I have my own favourite stories about data issues I have encountered in the various organizations I have worked for – such as the time I worked with some server-side event data whose supposedly unique event id turned out to be unique only within a single day, because the original developer didn’t think anyone would be interested in the data beyond a day.
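That particular trap is easy to test for once you suspect it. A minimal sketch, assuming pandas; the column names and values are invented to mimic the situation:

```python
import pandas as pd

# Hypothetical event log: 'event_id' resets every day, so it is only
# unique within a day, not across the whole table.
events = pd.DataFrame({
    "event_id": [1, 2, 3, 1, 2, 3],
    "date": ["2021-06-01"] * 3 + ["2021-06-02"] * 3,
})

# Globally unique? No - ids repeat across days.
globally_unique = events["event_id"].is_unique

# Unique within each day? Yes - no day contains a repeated id.
per_day = events.groupby("date")["event_id"]
unique_per_day = bool((per_day.nunique() == per_day.size()).all())

print(globally_unique, unique_per_day)  # False True
```

Running a check like this before joining or deduplicating on an id field costs a few lines and can save weeks of silently wrong analysis.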

Every commercial data scientist will be able to tell similar ‘war stories’, often exchanged over a few beers with colleagues after work. Such war stories will not go away any time soon. The more important question is: how do we recognize the situations where they are more likely to occur? The contrast I have already highlighted above, between the different classes of data, gives us a clue – where data is not seen as the most valuable part of the process, nor the primary purpose of the process, it will be relatively less invested in and typically of lower quality.

For example, in the first two categories, and in the scientific realm in particular, data is generated with the idea, from the outset, that it is a valuable asset; the data is often the end itself. The data collected from a scientific experiment, for instance, provides the basis for scientific insight and discoveries, and potentially future IP; the data collected to monitor complex technical systems saves operating costs by minimizing downtime of those systems and potentially identifying improved efficiencies. Poor quality data, or inadequate systems for handling data, have an immediate and often critical impact upon the primary goal of the organization or project, and so there is an immediate incentive to rectify errors and data quality issues and to invest in systems that enable efficient capture and processing of the data.

Some commercial organizations also effectively fall into these first two classes. For companies where the data and/or data science is the product from the outset, or the potential value of a data stream has been realized early on, there is an incentive to invest in efficient and effective data eco-systems, with the consequent benefit in terms of data quality – if sufficient investment in data and data systems is not made, a company whose main revenue stream is the data will quickly go out of business.

In contrast, for most commercial organizations, the potential value of commercial data may be secondary to its original purpose, and so data quality from the perspective of these secondary uses may be poor, even if the value or future revenue streams attached to these secondary uses may be much greater than the original primary use of the data. For such companies, the general importance of data may be recognized, but that does not mean that data eco-systems are getting the right sort of investment. For example, for a company providing a B2C platform, data on the behaviour of consumers can provide potentially actionable and important insight, but ultimately it is the platform itself (and its continued operation 24/7) that is of prime importance. Similarly, for an online retail site, the primary concern is volume of transactions and shipping. Consequently, for these organizations the data issues that arise are richer, more colourful, and often more challenging. This is because poor quality data and systems do not immediately threaten the viability of the business, or its main revenue streams. Long-term, yes, there will be an impact upon UX and overall consumer satisfaction, but many operations are happy to take that hit and counter it by attempting to increase transaction volume.

For these organizations it may be recognized that data is important enough for capital to be spent on data eco-systems, but that capital investment may be poorly thought through or not joined-up. Again, there are several symptoms I have learnt to recognize as red flags. Dividing these into strategy-related and tactics-related symptoms, they are,

  • Strategy related symptoms:
    1. Lack of a Data Strategy. The value of data, even the potential of value has not been recognized or realistically thought through. Consequently, there is unlikely to be any strategy underpinning the design of the data eco-system.
    2. Any ‘strategy’ is at best aspirational, being driven top-down by an Exec. The capital investment is there, but no joined-up plan. The organization sets up a Data Science function because other organizations are doing so – typified by the hiring of a ‘trophy Data Scientist’. Dig deeper and you will find no real pull from the business areas for a Data Science function. Business areas have been sold Data Science as a panacea, so have superficially bought into the aspirational strategy – who in a business function would not agree when told that the new junior Data Science hires will analyse the poorly curated or non-existent data and come up with new product ideas that have a perfect market fit and will be coded into a fully productionized model? I have seen product owners who have been sold a vision and believe that they can now just put their feet up because the new Data Scientist will do it all.
    3. The organization cannot really articulate why it needs a Data Science function when asked – this is one of my favourites. Asking ‘why’ here is the equivalent of applying the ‘Five Whys’ technique at the enterprise level rather than at a project level. Many companies may respond to the question with the clichéd answer, ‘because we want to understand our customers better, which will help us be more profitable’. Ok, but where are the examples of when not understanding this particular aspect of your customers has actually hurt the business? If no answer is given, you know that the need for a machine-learning-based solution to this task is perceived rather than evidenced.
    4. A mentality exists that data is solely for processing/transacting, not analysing – hence there is little investment in analytics platforms and tooling, resulting in significant friction in getting analytics tooling close to data (or vice-versa). Possibly there is not even a recognition that there is a distinction between analytics systems and reporting systems. Business areas may believe that only data-querying skills and capability are needed, so the idea that you want to make the building of predictive models as frictionless as possible for the Data Science team is an alien one.
    5. Lack of balance in investments between Data Engineering and Data Science – this is not optimal whichever function gets the bigger share of the funding. This is often mirrored by a lack of linking up between the Data Science team and the Data Engineering team, sometimes resulting in open hostility between the two.
    6. Lack of Data Literacy in business area owners or product owners. Despite this, organizations often have a belief that they will be able to roll-out a ‘self-serve’ analytics solution across the enterprise that will somehow deliver magical business gains.
  • Tactical related symptoms:
    1. Users in business areas are hacking together processes, which then take on a life of their own. This leads to multiple conflicting solutions, most of which are very fragile, and some of which get re-invented multiple times.
    2. Business and product owners ask for Data Science/Machine Learning driven solutions but can’t answer how they are going to consume the outputs of the solution.
    3. The analytics function, whether ad-hoc or formal, primarily focuses on getting the numbers ready for the month-end report. These numbers are then only briefly discussed, or ignored by senior management, because they conflict with data from other ad-hoc reports built from other internal data sources, or they are too volatile to be credible, or they conflict with the prior beliefs of senior management.
    4. Business and product owners consider it acceptable/normal to use a Data Analyst or Data Scientist for, ‘can you just pull this data for me’ tasks.

As a Data Scientist, what can you do if you find yourself in a scenario where these symptoms arise? There are two potential ways to approach this,

How do we respond to it – what can we do reactively? Flag it up – complain when systems don’t work and highlight when they do. Confession here – I’ve not always been good at complaining myself, so this is definitely a case of me recognizing what I should have done, not what I did. On the less political front, you can try to always construct your analytical processes to minimize the impact of any upstream processes not under your control. By this I mean making downstream analysis pipelines more robust by building in automated checks for data quality and domain-specific consistency. Where this is not possible, stop or refuse to work on those processes that are so impacted as to be worthless. This last point is about making others share in your pain, or shifting/transferring the pain elsewhere – onto business/product owners, from insight teams to operational teams. If other teams also experience the pain of poor-quality data, or of poorly designed data eco-systems and analytical platforms, then they have an incentive to help you try to fix the issues. Again, on the less political front, make sure expectations take into account the reality of what you have to deal with. Get realistic SLAs re-negotiated/agreed that reflect the reality of the data you are having to work with – this will protect you and highlight what could be done if the data were better.
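Those automated checks need not be elaborate. A minimal sketch of the idea, assuming pandas; the column names, thresholds, and checks are all hypothetical and would be replaced by whatever your domain demands:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality failures; an empty list means the
    batch passes. Checks are illustrative, not exhaustive."""
    problems = []
    # Missingness: more than 1% null customer ids suggests an upstream fault.
    if df["customer_id"].isna().mean() > 0.01:
        problems.append("more than 1% of customer_id values missing")
    # Domain consistency: transaction amounts should be plausible.
    if not df["amount"].between(0, 1_000_000).all():
        problems.append("amount outside plausible range")
    # Key integrity: transaction ids should not repeat within a batch.
    if df.duplicated(subset=["transaction_id"]).any():
        problems.append("duplicate transaction_id values")
    return problems

# A deliberately broken batch tripping all three checks.
batch = pd.DataFrame({
    "transaction_id": [101, 102, 102],
    "customer_id": [1, None, 3],
    "amount": [25.0, -5.0, 40.0],
})
for p in validate(batch):
    print("FAIL:", p)
```

Gating each pipeline run on a function like this means a bad upstream feed produces a loud, specific failure instead of a quietly wrong report.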

How can we prevent it – what can we do proactively? The most effective way for any organization to avoid such issues is to have a clear and joined-up Data Strategy and Analytics Strategy. It is important for an organization to recognize that having a Data Strategy does not mean it has an Analytics Strategy. Better still, an organization should have a clear understanding of what informed decisions underpin the future operations of its business model. Understanding this will drive/focus the identification of the correct data needed to support those decisions and hence will naturally lead (via pull) to investment in high-quality data capture processes and efficient data engineering/data analysis infrastructure. Formulating the Data Strategy may not formally be part of your brief, but it is likely that, as a Data Scientist and therefore an end-user, you can influence it. This is particularly important in order to make sure that the analytical capability is a first-class and integrated citizen of any Data Strategy. This can be done by giving examples of the cost/pain when it is not so. Very often, a CDO or CTO formulating the Data Strategy will come from an Engineering background, and so the Data Strategy will be biased towards Ops. Any awareness of a need for the Data Strategy to support analytics will be focused on near-term analytics, e.g., supporting real-time reporting capabilities. The need for the Data Strategy to support future Ops by supporting insight, innovation and discovery capabilities may be omitted. If a Data Science function is considered, it will be as a bolt-on – possibly a well-funded Data Science bolt-on, but still a bolt-on.

Defensive & Offensive Predictive Analytics

Harvard Business Review

This article from the Harvard Business Review was a short but interesting read. The article talks about defensive and offensive data management strategies – defensive being about minimizing downside risk, and being more common in regulated industries, whilst offensive data management strategies focus on supporting business objectives and are more common in un-regulated or less regulated industries. The authors rightly point out that the correct mix of defensive and offensive strategies will be specific to each individual organization.

Having worked in a number of commercial sectors, my experience is that the use of predictive analytics within organizations also divides into defensive and offensive activities, irrespective of whether that predictive analytics and data science activity is enabled by a well thought out data management strategy. There are good reasons for this, and again it is largely determined by whether or not the activity of an organization carries a large downside risk.

Consider a company, such as a bank, whose activity has a large downside risk, and where losses due to bad loans can almost wipe out a balance sheet very quickly. My experience of doing analytics in a UK retail bank is that the predictive analytics focus is on modelling that downside risk with a view to understanding it, forecasting it and ultimately mitigating it. The analytics effort focuses on risk minimization (still an optimization), whilst optimization of the profit side of the P&L is less computationally based, e.g. by committees of human subject matter experts deciding mortgage or savings rates for the coming year.

In contrast, in companies where the downside risk is lower, such as those where transactions with the organization’s customers are on much shorter timescales than a bank, then the use of predictive modelling tends to focus more on the optimization of revenue and profits, rather than minimization of losses from liabilities. Take grocery supermarkets, where predictive demand models are used to set product prices in order to optimize profit. Whilst getting the pricing strategy wrong will impact revenues, it does not lead to the organization holding a long-term liability and is ultimately reversible. Mistakes when using predictive models in this domain are unlikely to take a company down.  

From what I have seen, the use of predictive modelling within a business is typically almost binary, i.e. focused either predominantly on the downside risk, or predominantly on optimizing the business objectives, even though most businesses will have both upsides and downsides to their activity. I haven’t seen many medium-scale organizations where predictive modelling is used at a mature level across the majority of business areas or tasks. Even rarer in my experience are situations where predictive modelling of both downside risks and business objectives is done concurrently, with the optimization taking into account both sides of the P&L. It would be interesting to find good examples outside, say, the largest 50 companies in the FTSE100, DowJones, Nasdaq, or S&P500, where a more joined-up approach is taken to using predictive analytics for optimizing the P&L.