Data Science Notes: 3. Choosing good sentinel values

Summary

  • A sentinel value is a special value of a variable that is used to signify something, or to draw the user’s attention to something.
  • A sentinel value is often used to distinguish between two states: “it worked ok”, “it didn’t work ok”.
  • Don’t choose a sentinel value that can inadvertently be interpreted or processed as a normal value of the variable in which it is stored.
  • Choose a sentinel value such that if it becomes corrupted it can’t accidentally still be interpreted as indicating either of the states it is intended to distinguish between.
  • If a sentinel value is processed by a downstream calculation, the sentinel value should be such that the downstream calculation generates an exception.
  • If processing a sentinel value generates an exception, the choice of sentinel value should be such that the exception is generated as close as possible (in code) to the place where the sentinel value is first inappropriately processed.

Introduction

All the posts in my Data Science Notes series are based on techniques I use regularly, or on conversations I’ve had with other Data Scientists where I’ve explained why experience has taught me to do something in a particular way. This post is no exception. It stemmed from someone asking why I initialize all my arrays with NaNs, particularly when I’m going to later reinitialize each array element to zero when I start adding to it. It’s because I want the NaN value to act as a sentinel value.
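In Python/NumPy, that initialization pattern looks something like the following sketch (the array size and which elements get written are invented for illustration):

```python
import numpy as np

n_features = 5
# NaN acts as the sentinel: any element never written to stays detectably NaN.
variances = np.full(n_features, np.nan)

# Suppose the calculation only ever touches features 0, 2 and 3.
for i in [0, 2, 3]:
    variances[i] = 0.0   # reinitialize to zero before accumulating
    variances[i] += 1.5  # stand-in for the real accumulation

unset = np.isnan(variances)  # True exactly where nothing was ever computed
```

Any downstream sum or product over an unset element propagates the NaN, so the omission is hard to miss.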

What is a sentinel value, I can hear you asking? Well, a sentinel value watches. It warns. A bit like a lighthouse. It communicates that something has happened or gone wrong. Okay, but isn’t that just a flag? Can’t I just use a Boolean variable to record that something went wrong? Here’s the thing: often we want to store the sentinel value in a variable that is already processed as part of a calculation.

Here’s an example. Say I have an array of variances that I need to calculate. That array of variances might be the variances of a set of features I’m using in a predictive model. But for some of the features I might not be able to calculate the variance. The feature value might be missing across all but one of the training data points. No problem, I’ll just set the variance of that feature equal to -1. Since every Data Scientist knows a variance has to be non-negative, a value of -1 clearly indicates that the variance for that feature has not been calculated and is not available for any downstream calculation. That is a sentinel value in action. The -1 is the sentinel value. It has communicated to me that the variance calculation has not worked, but it is not a flag variable. It is just a different value stored inside a normal variable, the feature variance in this case.
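A minimal sketch of that -1 convention, assuming a small helper function (`feature_variance` is my invention here, not from any particular library):

```python
import numpy as np

def feature_variance(x):
    """Sample variance of x, or -1.0 (sentinel) if too few values to calculate."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]      # drop missing values
    if x.size < 2:
        return -1.0          # sentinel: variance not available
    return x.var(ddof=1)

features = [[1.0, 2.0, 3.0],        # enough data
            [4.0, np.nan, np.nan],  # all but one value missing
            [5.0, 5.0]]             # enough data (variance happens to be zero)
variances = np.array([feature_variance(f) for f in features])
usable = variances >= 0.0           # screens out the -1 sentinel downstream
```

The non-negativity of a genuine variance is what makes the `>= 0.0` screen reliable.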

In the example above there are clear and obvious reasons why I chose -1 as the sentinel value for my array of variances. Now, here’s the thing; choosing good sentinel values can be a bit of an art learnt through experience. To demonstrate this, I’m going to give two real anecdotes from my career as a Data Scientist where good sentinel values were not used. From these two anecdotes we can distil a number of useful lessons about what makes a good sentinel value.

First anecdote:

On one project I worked on we had a piece of code that produced some parameter estimates for a predictive model, and some associated parameter uncertainties. Interestingly, the parameter uncertainties had a sentinel value of 9999.99. A parameter uncertainty of 9999.99 was supposed to indicate that we hadn’t been able to estimate the parameter – there were a number of genuine reasons why this could happen. The value of 9999.99 was considered too large to be confused with a genuine parameter uncertainty. So whenever we saw a parameter uncertainty of 9999.99 we knew that the parameter had in fact not been estimated. And whenever we saw a parameter uncertainty less than 9999.99 we knew that the parameter had been estimated. A few years later I discovered that it was possible for a downstream processing step, under certain circumstances, to modify those 9999.99 values to something different. The final result was a parameter uncertainty which was large, e.g. 9548.17, but not 9999.99. We’d lost the ability to interpret not being 9999.99 as an indication of an estimated parameter. Fortunately, there were ways other than the sentinel value to tell whether the parameter had been estimated, but the lessons learnt from this anecdote are two-fold,

  1. Don’t choose sentinel values that can inadvertently be interpreted as something different from what the sentinel value was intended to indicate. In this case the 9999.99 was intended to be implausibly large for a parameter uncertainty. However, although implausible it is not impossible, and so it can be accidentally processed in a way it shouldn’t be. And accidents will happen. In fact, you should assume that if an accident can happen, it will happen.
  2. The second lesson learnt here is that a sentinel value flags a condition or state, so we use sentinel values to infer one of two situations. In this example, even with the mistaken downstream processing of the sentinel value, a final result of 9999.99 would still have indicated an un-estimated parameter, so it would look like the sentinel value of 9999.99 was doing its job. However, we couldn’t reliably infer the converse case. A large parameter uncertainty that was less than 9999.99 could not be used on its own to reliably infer that the parameter had been estimated. So when designing a good sentinel value we need to think about how it will be used to infer either of the two scenarios, and whether correct inference of either scenario is always possible.
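The failure mode is easy to reproduce. In this sketch (the numbers are invented) a downstream rescaling step silently destroys the sentinel:

```python
SENTINEL = 9999.99
uncertainties = [0.42, SENTINEL, 1.7]

# A downstream step that, incorrectly, rescales every value, sentinel included.
rescaled = [u * 0.955 for u in uncertainties]

# The "was it estimated?" test now passes for everything: the corrupted
# sentinel (about 9549.99) just looks like a large but plausible uncertainty.
estimated = [u != SENTINEL for u in rescaled]
```

Nothing crashes and no warning is raised, which is exactly why the corruption went unnoticed for years.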

Second anecdote:

This second anecdote is not about what happens when you have a bad choice of sentinel value, but about what happens when you don’t have a sentinel value at all, and what that reveals about what you want to happen when a sentinel value is accidentally processed.

I was working on an academic project with a software developer. The developer was a C++ expert with many years of experience. We were modifying an existing very large academic codebase that analyzed Genome-Wide Association Study (GWAS) data. I was supplying the additional equations needed; they were implementing them. During a particular three-week period the developer had been chasing down a corner-case bug in the existing code. After three weeks they had managed to track down the bug to a particular subroutine producing -inf values in its output for this particular corner-case. They wanted advice on how to handle this scenario from a scientific perspective. What should we replace the -inf values with, if anything? The first thing I suggested was simply to look at what the input to the subroutine had been in this case. Yep, it was -inf as well. In fact, we traced the -inf values back through three further subroutine calls. The -inf values had first arisen from zero values being present in an array where they shouldn’t have been. When first processed, the zeros had generated the -inf values, which then got further processed by three subroutine calls. Okay, part of the issue here is that there should have been a sentinel value used in place of the default value of zero. Ideally, one would want that sentinel value to be incapable of being processed by any of the downstream subroutine calls. But here’s the thing; you would have thought that the -inf value, once created, would serve as some sort of sentinel value for later subroutine calls. The lessons I learnt from this anecdote were also two-fold, namely,

  1. A good sentinel value that is accidentally processed by a function should cause your program to crash. Sorry, “ahem, cough”, cause your program to generate an instance of the custom exception class you’ve beautifully written, which your code then catches and gracefully handles.
  2. The exception raised by the inadvertent/inappropriate processing of a well chosen sentinel value should occur as close as possible to the point where the sentinel value is first inappropriately processed, not three subroutine calls down the line. That way the sentinel value also provides useful debugging information.
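One way to get both properties in Python is to use `None` as the sentinel (the `variance_or_sentinel` helper below is hypothetical): any arithmetic on `None` raises a `TypeError` at the first line that misuses it.

```python
def variance_or_sentinel(values):
    """Sample variance, or None (sentinel) if it cannot be calculated."""
    vals = [v for v in values if v is not None]
    if len(vals) < 2:
        return None
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / (len(vals) - 1)

var = variance_or_sentinel([3.0])  # only one value, so we get the sentinel

# Accidental processing fails immediately, at the point of first misuse,
# not three subroutine calls later.
try:
    weight = 1.0 / var
except TypeError:
    weight = None  # caught and handled gracefully, close to the misuse
```

Unlike -inf, the `None` sentinel cannot silently flow through a chain of numerical subroutines, so the traceback points at the offending line.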

Conclusion

Choosing good sentinel values is a bit of an art form, learnt from hard-won experience. But there are some general, high-level rules and guidelines we can give. In order of importance these are,

  • Don’t choose a sentinel value that can inadvertently be interpreted or processed as a normal value of the variable in which it is stored. A sentinel value is meant to be exceptional, not just different.
  • A sentinel value is often used to distinguish between two states. Choose a sentinel value such that if it becomes corrupted it can’t accidentally still be interpreted as indicating either of those states. If a sentinel value becomes corrupted it should become meaningless.
  • If a sentinel value is processed by a downstream calculation, the sentinel value should be such that the downstream calculation generates an exception.
  • If processing a sentinel value generates an exception, the choice of sentinel value should be such that the exception is generated as close as possible (in code) to the place where the sentinel value is first inappropriately processed.

Try using sentinel values in your coding. The only way to get better at using them is to try.

© 2026 David Hoyle. All Rights Reserved
