[Figure: Comparison of Benford’s Law and the proportion of first digits of file sizes of files on my laptop hard drive.]

A Christmas Cracker Puzzle – Part 2

Before Christmas I set a little puzzle. The challenge was to calculate the proportion of file sizes on your hard drive that start with the digit 1. I predicted that the proportion you got would be around 30%. I’ll now explain why.

Benford’s Law

The reason around 30% of the file sizes on your hard disk start with 1 is Benford’s Law: computer file sizes approximately follow it.

What is Benford’s Law?

Benford’s Law says that for many datasets the first digits of the numbers in the dataset follow a particular distribution. Under Benford’s Law, the probability of the first digit being equal to d is,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \log_{10} \left ( 1 + \frac{1}{d} \right )\;\;\;.\;\;\;\;\;\; Eq.1

So, in a dataset that follows Benford’s Law, the probability that a number starts with a 1 is \log_{10} 2 \approx 0.301 – around 30%. Hence, the percentage of file sizes that start with 1 is around 30%.
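As a quick check, Eq.1 is easy to evaluate numerically. A minimal sketch in Python (assuming only numpy):

import numpy as np

digits = np.arange(1, 10)
benford = np.log10(1.0 + 1.0 / digits)

for d, p in zip(digits, benford):
    print(f"P(first digit = {d}) = {p:.4f}")

# The nine probabilities sum to 1, and P(first digit = 1) is roughly 0.301
print(benford.sum())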

The figure below shows a comparison of the distribution in Eq.1 and the distribution of first digits of file sizes for files in the “Documents” folder of my hard drive. The code I used to calculate the empirical distribution in the figure is given at the end of this post. You can see that the distribution derived from my files is in close, but not perfect, agreement with the distribution predicted by Eq.1 – more on that later.

Benford’s Law is named after Frank Benford, who discovered the law in 1938 – see the later section for some of the long history of Benford’s Law. Because Benford’s Law is concerned with the distribution of first digits in a dataset, it is also commonly referred to as the ‘first-digit law’ or the ‘significant-digit law’.

Benford’s Law is more than just a mathematical curiosity or Christmas cracker puzzle. It has some genuine applications – see later. It has also fascinated mathematicians and statisticians because it applies to so many diverse datasets. Benford’s Law has been shown to apply to datasets as different as the size of rivers and election results.

What’s behind Benford’s Law

The intuition behind Benford’s Law is that if we think there are no a priori constraints on what value a number can take, then we can make the following statements,

  1. We expect the numbers to be uniformly distributed in some sense – more on that in a moment.
  2. There is no a priori scale associated with the distribution of numbers. So, I should be able to re-scale all my numbers and have a distribution with the same properties.

Overall, this means we expect the numbers to be uniformly distributed on a log scale.

This intuition helps us identify when we should expect a dataset to follow Benford’s Law, and it also gives us a hand-waving way of deriving the form of Benford’s Law in Eq.1.
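We can test this intuition with a quick simulation. The sketch below (again assuming numpy) draws numbers that are uniformly distributed on a log scale and compares the resulting first-digit proportions with Eq.1. Re-scaling all the numbers by a constant leaves the proportions essentially unchanged:

import numpy as np

rng = np.random.default_rng(42)

# Numbers uniformly distributed on a log scale, spanning 12 orders of magnitude
x = 10.0 ** rng.uniform(-6, 6, size=100_000)

def first_digit_proportions(values):
    # Leading base-10 digit of each value
    exponents = np.floor(np.log10(values))
    firsts = (values / 10.0 ** exponents).astype(int)
    return np.bincount(firsts, minlength=10)[1:10] / len(values)

benford = np.log10(1.0 + 1.0 / np.arange(1, 10))
print(first_digit_proportions(x))        # close to Benford's Law
print(first_digit_proportions(3.7 * x))  # re-scaling barely changes the proportions
print(benford)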

Deriving Benford’s Law

First, we restrict our numbers to be positive and derive Benford’s Law. The fact that Benford’s Law would also apply to negative numbers once we ignore their sign should be clear.

We’ll also have to restrict our positive numbers to lying in some range [\frac{1}{x_{max}}, x_{max}] so that the probability distribution of x is properly normalized. We’ll then take the limit x_{max}\rightarrow\infty at the end. For convenience, we’ll take x_{max} to be of the form x_{max} = 10^{k_{max}}, and so the limit x_{max}\rightarrow\infty corresponds to the limit k_{max}\rightarrow\infty.

Now, if our number x lies in [\frac{1}{x_{max}}, x_{max}] and is uniformly distributed on a log-scale, then the probability density for x is,

p(x) = \frac{1}{2\ln x_{max}} \frac{1}{x}\;\;.

The probability of getting a number between, say, 1 and 2 is then,

{\rm{Prob}} \left ( 1 \le x < 2 \right ) = \frac{1}{2\ln x_{max}} \int_{1}^{2} \frac{1}{x} dx \;\;.

Now numbers which start with a digit d are of the form a10^{k} with a \in [d, d+1) and k = -k_{max},\ldots,-2,-1,0,1,2,\ldots,k_{max}-1 – the exponent stops at k_{max}-1 so that these numbers do not exceed x_{max} = 10^{k_{max}}, giving 2k_{max} values of k in total. So, the total probability of getting such a number is,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \sum_{k=-k_{max}}^{k_{max}-1}\frac{1}{2\ln x_{max}}\int_{d10^{k}}^{(d+1)10^{k}} \frac{1}{x} dx\;\; ,

and so after performing the integration and summation we have,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \frac{2k_{max}}{2\ln x_{max}} \left [ \ln ( d+1 ) - \ln d\right ]\;\;.

Recalling that we have chosen x_{max} =10^{k_{max}}, we get,

{\rm{Prob}} \left ( {\rm{first\;digit}}\;=\;d \right ) = \frac{1}{\ln 10} \left [ \ln ( d+1 ) - \ln d\right ]\;=\;\log_{10} \left ( 1 + \frac{1}{d}\right )\;\;.

Finally, taking the limit k_{max}\rightarrow\infty gives us Benford’s Law for positive valued numbers which are uniformly distributed on a log scale.
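As a sanity check on the algebra, we can evaluate the finite-k_{max} sum numerically and confirm it matches Eq.1. A sketch using only the Python standard library:

import math

d = 3       # any first digit, 1 to 9
k_max = 6   # the sum runs over k = -k_max, ..., k_max - 1

prob = sum(math.log((d + 1) * 10.0 ** k) - math.log(d * 10.0 ** k)
           for k in range(-k_max, k_max)) / (2 * k_max * math.log(10))

print(prob)                   # agrees with the value below
print(math.log10(1 + 1 / d))  # Benford's Law, Eq.1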

Ted Hill’s proof of Benford’s Law

The derivation above is a very loose (lacking formal rigour) derivation of Benford’s Law. Many mathematicians have attempted to construct a broadly applicable and rigorous proof of Benford’s Law, but it was not until 1995 that a widely accepted proof was derived by Ted Hill of the Georgia Institute of Technology. You can find Hill’s proof here. Ted Hill’s proof also seemed to reinvigorate interest in Benford’s Law. In the following years there were various popular science articles on Benford’s Law, such as this one from 1999 by Robert Matthews in New Scientist. That article was where I first learned about Benford’s Law.

Benford’s Law is base invariant

The approximate justification we gave above for why Benford’s Law works made no explicit reference to the fact that we were working with numbers expressed in base 10. Consequently, the justification – and a similar derivation – would be equally valid for numbers expressed in any base b. This means that Benford’s Law is base invariant.

The distribution of first digits given in Eq.1 is for numbers expressed in base 10. If we express those same numbers in base b and write a number x in base b  as,

x = x_{1}x_{2}x_{3}\ldots x_{K}\;\;,

then the first digit x_{1} is in the set \{1,2,3,\ldots,b-1\} and the probability that the first digit has value d is given by,

{\rm{Prob}} \left ( x_{1}\;=\;d \right ) = \log_{b}\left ( 1 + \frac{1}{d}\right )\;\;\;,\; d\in\{1,2,3,\ldots,b-1\}\;\;\;.\;\;\;\;\; Eq.2
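A sketch of the base-b version in Python (assuming numpy): draw log-uniform numbers, extract the first digit in base b, and compare with Eq.2. The choice b = 7 below is just an example:

import numpy as np

rng = np.random.default_rng(0)
x = 10.0 ** rng.uniform(-6, 6, size=100_000)

b = 7  # any base

# Leading digit of each value when written in base b
exponents = np.floor(np.log(x) / np.log(b))
firsts = (x / b ** exponents).astype(int)

empirical = np.bincount(firsts, minlength=b)[1:b] / len(x)
benford_b = np.log(1.0 + 1.0 / np.arange(1, b)) / np.log(b)

print(empirical)
print(benford_b)  # the two should be close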

When should a dataset follow Benford’s Law?

The broad intuition behind Benford’s Law that we outlined above also gives us an intuition about when we should expect Benford’s Law to apply. If we believe the process generating our data is not restricted in the scale of the values that can be produced and there are no particular values that are preferred, then a large dataset drawn from that process will be well approximated by Benford’s Law. These considerations apply to many different types of data generating processes, and so with hindsight it should not come as a surprise to us that many different datasets appear to follow Benford’s Law closely.

This requires that no scale emerges naturally from the data generating process. This means the data values shouldn’t be clustered around a particular value, or particular values. So, data drawn from a Gaussian distribution would not conform to Benford’s Law. From the perspective of the underlying processes that produce the data, this means there shouldn’t be any equilibrium process at work, as that would drive the measured data values towards the value corresponding to the equilibrium state of the system. Likewise, there should be no constraints on the system generating the data, as constraints will drive the data towards a specific range of values.

However, no real system is truly scale-free – the finite system size always imposes a scale. If we have data that can vary over many orders of magnitude, though, then from a practical point of view we can regard that data as effectively scale free. I have found that data that varies over 5 orders of magnitude is usually well approximated by Benford’s Law.

Because a real-world system cannot really ever truly satisfy the conditions for Benford’s Law, we should expect that most real-world datasets will only show an approximate agreement with Benford’s Law. The agreement can be very close, but we should still expect to see deviations from Benford’s Law – just as we saw in the figure at the top of this post. Since real-world datasets won’t follow Benford’s Law precisely, we also shouldn’t expect to see a real-world dataset follow the base-invariant form of Benford’s Law in Eq.2 for every choice of base. In practice, this means that there is usually a particular base b for which we will see closer agreement with the distribution in Eq.2, compared to other choices of base.

Why computer file sizes follow Benford’s Law

Why does Benford’s Law apply to the sizes of the files on your computer? Those file sizes can span several orders of magnitude – from a few hundred bytes to several hundred megabytes. There is also no reason why my files should cluster around a particular file size – I have photos, scientific and technical papers, videos, small memos, slides, and so on. It would be very unusual if all those different types of files, from different use cases, ended up having very similar sizes. So, I expect Benford’s Law to be a reasonable description of the distribution of first digits of the file sizes of files in my “Documents” folder.

However, if the folder I was looking at just contained, say, daily server logs, from a server that ran very unexciting applications, I would expect the server log file size to be very similar from one day to the next. I would not expect Benford’s Law to be a good fit to those server log file sizes.

A significant deviation from Benford’s Law in the file size distribution would indicate a file size generation process that is very different from a normal human user going about their normal business. That may be entirely innocent, or it could be indicative of some fraudulent activity. In fact, fraud detection is one of the practical applications of Benford’s Law.

The history of Benford’s Law

Simon Newcomb

One of the reasons why the fascination with Benford’s Law endures is the story of how it was discovered. With such an intriguing mathematical pattern, it is perhaps no surprise to learn that Frank Benford was not the first scientist to spot the first-digit pattern. The astronomer Simon Newcomb had published a paper on “the frequency of use of the different digits in natural numbers” in 1881. Before the advent of modern computers, scientists and mathematicians computed logarithms by looking them up in mathematical tables – literally books of logarithm values. I still have such a book from when I was in high school. The story goes that Newcomb noticed that in a book of logarithms the pages were grubbier, i.e. used more, for numbers whose first significant digit was 1. From this Newcomb inferred that numbers whose first significant digit is 1 must be more common, and he supposedly even inferred the approximate frequency of such numbers from the relative grubbiness of the pages in the book of logarithms.

In more recent years the first-digit law is also referred to as the Newcomb-Benford Law, although Benford’s Law is still more commonly used because of Frank Benford’s work in popularizing it.

Benford’s discovery

Frank Benford rediscovered the law in 1938, but also showed that data from many diverse datasets – from the surface area of rivers to population sizes of US counties – appeared to follow the distribution in Eq.1. Benford then published his now famous paper, “The law of anomalous numbers”.

Applications of Benford’s Law

There are several books on Benford’s Law. One of the most recent, and perhaps the most comprehensive, is Benford’s Law: Theory and Applications. It is divided into two sections on general theory (a total of 6 chapters) and four sections on applications (a total of 13 chapters). Those applications cover the following:

  • Detection of accounting fraud
  • Detection of voter fraud
  • Measurement of the quality of economic statistics
  • Uses of Benford’s Law in the natural sciences, clinical sciences, and psychology
  • Uses of Benford’s Law in image analysis

I like the book because it has extensive chapters on applications written by practitioners and experts on the uses of Benford’s Law, with the application chapters linking back to the theory.

Many of the applications, such as fraud detection, are based on the idea of detecting deviations from the Benford Law distribution in Eq.1. If we have data that we expect to span several orders of magnitude, and we have no reason to suspect the data values should naturally cluster around a particular value, then we might expect the data to follow Benford’s Law closely. This could be sales receipts from a business that has clients of very different sizes and sells goods or services that vary over a wide range of values. Any large deviation from Benford’s Law in the sales receipts would then indicate the presence of a process that produces very specific receipt values. That process could be data fabrication, i.e. fraud. Note, this doesn’t prove the data has been fraudulently produced; it just means that the data has been produced by a process we wouldn’t necessarily expect.
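As an illustration of what such a deviation test might look like, here is a minimal sketch (assuming scipy is available; the receipts array is hypothetical data, generated here purely for illustration):

import numpy as np
from scipy.stats import chisquare

def benford_test(values):
    # Chi-squared test of the observed first-digit counts against Eq.1
    firsts = np.array([int(str(v)[0]) for v in values])
    observed = np.bincount(firsts, minlength=10)[1:]
    expected = len(values) * np.log10(1.0 + 1.0 / np.arange(1, 10))
    return chisquare(observed, expected)

# Hypothetical receipt values, drawn from a log-uniform distribution
# so that they roughly follow Benford's Law
rng = np.random.default_rng(1)
receipts = (10.0 ** rng.uniform(0, 5, size=5000)).astype(int)

stat, p_value = benford_test(receipts)
print(stat, p_value)  # a very small p-value would flag a deviation worth investigating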

My involvement with Benford’s Law

In 2001 I published a paper demonstrating that data from high-throughput gene expression experiments tended to follow Benford’s Law. The reasons why mRNA levels should follow Benford’s Law are ultimately those we have already outlined – mRNA levels can range over many orders of magnitude and there are no a priori molecular biology reasons why, across the whole genome, mRNA levels should be centered around a particular value.

In December 2007 a conference on Benford’s Law was organized by Prof. Steven Miller of Brown University and others. The conference was held in a hotel in Santa Fe and was sponsored/funded by Brown University, the University of New Mexico, Universidade de Vigo, and the IEEE. Because of my 2001 paper, I received an invitation to talk at the workshop.

For me, the workshop was very memorable for many reasons,

  1. I had a very bad cold at the time.
  2. Due to snow in both the UK and US, I was snowbound overnight in Chicago airport (sleeping in the airport), and only just made a connecting flight in Dallas to Albuquerque. Unfortunately, my luggage didn’t make the connection, got lost in all the flight re-arrangements, and didn’t show up in Santa Fe until two days later.
  3. It was the first time I’d seen snow in the desert – this really surprised me. I don’t know why, it just did.
  4. Because of my cold, I hadn’t finished writing my presentation, so I stayed up late the night before my talk to finish my slides. To sustain my energy levels through the night, I bought a Hershey bar, thinking it would be similar to chocolate bars in the UK. I took a big bite from the Hershey bar. Never again.
  5. But this was all made up for by the fact that I got to sit next to Ted Hill during the workshop dinner. Ted was one of the most genuine and humble scientists I have had the pleasure of talking to. And the red wine at the workshop dinner was superb.

From that workshop Steven Miller organized and edited the book on Benford’s Law I referenced above, which is partly why I think that book is one of the best on Benford’s Law – although I am biased, as I contributed one of the chapters: Chapter 16, “Benford’s Law in the Natural Sciences”.

My Python code solution

To end this post, I have given below the code I used to calculate the distribution of first digits of the file sizes on my laptop hard drive.

import numpy as np

# We'll use the pathlib library to recurse over
# directories
from pathlib import Path

# Specify the top level directory
start_dir = "C:\\Users\\David\\Documents"

# Use a list comprehension to recursively loop over all sub-directories
# and get the file sizes. We keep actual files only (not directories)
# and exclude zero-byte files, since 0 has no leading digit in 1 to 9.
filesizes = [path.lstat().st_size
             for path in Path(start_dir).glob('**/*')
             if path.is_file() and path.lstat().st_size > 0]

# Now count what proportion of the file sizes start with a 1
proportion1 = np.sum([str(size)[0] == '1' for size in filesizes]) / len(filesizes)

# Print the result
print("Proportion of filesizes starting with 1 = " + str(proportion1))
print("Number of files = " + str(len(filesizes)))

# Calculate the first-digit proportions for all digits 1 to 9
proportions = np.zeros(9)
for size in filesizes:
    proportions[int(str(size)[0]) - 1] += 1.0

proportions /= len(filesizes)
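To reproduce the comparison in the figure at the top of this post, we can print the empirical proportions alongside the Benford probabilities. This short snippet assumes the proportions array from the code above,

# Compare the empirical proportions with Benford's Law
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))

for d in range(1, 10):
    print(f"digit {d}: empirical = {proportions[d - 1]:.4f}, Benford = {benford[d - 1]:.4f}")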

© 2025 David Hoyle. All Rights Reserved

A Christmas Cracker Puzzle – Part 1

Here’s an interesting little coding challenge for you, with a curious data twist at the end. Less of an Advent of Code challenge and more of a Christmas Cracker puzzle.

Without knowing anything about you, I will make a prediction about the size of the files you have on your computer. Christmas magic? Maybe. Read on to find out.

There are two parts to this puzzle

Part 1 – Obtain a list of file sizes (in bytes) of all the files on your computer that are in a high-level directory and its sub-directories. We’re talking here about files with a non-zero file size. We’re also talking about actual files, so the folders/directories themselves should not be included, just the files they contain. If this is your own laptop, the high-level directory could be the Documents directory (if on Windows). Obviously, you’ll need to have permission to recurse over all the sub-directories. You should choose a starting directory that has a reasonably large number of files in total across all its sub-directories.

Part 2 – Calculate the proportion of files whose file size starts with a 1. So, if you had 10 files with sizes 87442, 78922, 3444, 9653, 197643, 26768, 8794787, 22445, 7654, and 56573, then the proportion of files whose file size starts with a 1 is 0.1, or 10%.
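For the ten example file sizes above, a quick sketch of the Part 2 calculation in Python (any language will do) looks like this,

sizes = [87442, 78922, 3444, 9653, 197643, 26768, 8794787, 22445, 7654, 56573]
proportion = sum(str(s)[0] == '1' for s in sizes) / len(sizes)
print(proportion)  # 0.1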

The overall goal

The goal is to write code that performs the two parts of the challenge in as efficient a way as possible. You can use whatever programming languages you want to perform the two parts – you might use the command line for the first part and a high-level programming language for the second, or you might use a high-level programming language for both parts.

I used Python for both parts and used the Documents directory on my laptop as my starting directory. I had 96351 files in my Documents folder and its sub-directories. The proportion of my files that have a size starting with 1 is approximately 0.31, or 31%.

The twist

Now for the curious part. You should get a similar proportion – around 30%. I will explain why in a separate post (part 2) in the new year, and also explain why, if you don’t, that can tell us something interesting about the nature of the files on your computer.

© 2024 David Hoyle. All Rights Reserved

Multiplication palindromes

A bit of fun mathematics here. On a cross-country train journey recently, I saw this post from @abakcus on X (formerly Twitter), showing that 111111111 x 111111111 = 12345678987654321.

Be careful if you do the above calculation yourself. The result has 17 digits, so we’re hitting the limit of what a double-precision 64-bit floating-point number can represent. This means you may get some error in the least significant digits of the calculation, depending on which language you use. Python uses arbitrary-precision (bignum) integer arithmetic, so we don’t have to worry about precision issues when multiplying integers in Python. To do the calculation in R I used the Ryacas package, which allows me to do arbitrary-precision calculations. The R code snippet below shows how,

library(Ryacas)
yac_str("111111111^2")

which gives us the output below.

"12345678987654321"

The result of the multiplication, with its palindromic structure, is intriguing. But then you remember that we also have,

11 x 11 = 121

This made me wonder if there was a general pattern here. So I tried a few multiplications and got,

1 x 1 = 1
11 x 11 = 121
111 x 111 = 12321
1111 x 1111 = 1234321
11111 x 11111 = 123454321
111111 x 111111 = 12345654321
1111111 x 1111111 = 1234567654321
11111111 x 11111111 = 123456787654321
111111111 x 111111111 = 12345678987654321
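If you want to generate this table yourself, a short Python loop does it (exactly, thanks to Python’s arbitrary-precision integers):

for k in range(1, 10):
    repunit = int("1" * k)   # the number made of k 1s
    print(f"{repunit} x {repunit} = {repunit ** 2}")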

Pretty interesting, and I wondered why I had never spotted or been shown this pattern before.

On the return train journey a few days later I decided to occupy a bit of the time by looking for an explanation. With hindsight, the explanation was obvious (as is anything in hindsight), and I’m sure many other people have worked this out before. So my apologies if what I’m about to explain seems trivial, but it was something I hadn’t spotted or encountered before.

The first question that sprang to mind is, does it generalize to other bases? That is, in base n do we always get a pattern,

1_n x 1_n = 1_n
11_n x 11_n = 121_n
111_n x 111_n = 12321_n

and so on. Obviously, there is a limitation here. In base n, the highest digit we can see in any position is n-1. So a more precise statement of the question is: for any base n and for any integer k < n, do we always have,

\underbrace{11\ldots1}_{k\;{\rm ones}} \times \underbrace{11\ldots1}_{k\;{\rm ones}} \;=\; 123\ldots k\ldots321\;\;\;\;{\rm (both\;sides\;written\;in\;base\;}n{\rm )}\;?

Let’s try it out. Remember for each base n we will have n-1 examples. The base 2 case is trivial, but it is still worth explicitly stating to highlight the pattern,

1_2 x 1_2 = 1_2

The base 3 case gives two examples,

1_3 x 1_3 = 1_3
11_3 x 11_3 = 121_3

Converting to a base 10 calculation helps us check the result on the right-hand side: 11_3 = 4, and 4 x 4 = 16 = 1x9 + 2x3 + 1 = 121_3.

The base 4 case gives three examples, the largest being,

111_4 x 111_4 = 12321_4

Again, checking in base 10: 111_4 = 21, and 21 x 21 = 441 = 1x256 + 2x64 + 3x16 + 2x4 + 1 = 12321_4.

The base 5 case gives four examples, the largest being,

1111_5 x 1111_5 = 1234321_5

Here 1111_5 = 156, and 156 x 156 = 24336 = 1234321_5.

Ok, so there does appear to be a pattern emerging. Can we prove the general case? Let’s do the equivalent of what we have done above, i.e. convert the calculations from a place-value representation to a numeral. The base n number consisting of k 1s is,

\underbrace{11\ldots1}_{k\;{\rm ones}} \;=\; \sum_{j=0}^{k-1} n^{j} \;=\; \frac{n^{k}-1}{n-1}\;\;.\;\;\;\;\;\; Eq.1

Meaning that when we square the base n number that consists of k 1s, we get,

\left ( \frac{n^{k}-1}{n-1} \right )^{2} \;=\; \frac{n^{2k}-2n^{k}+1}{(n-1)^{2}}\;\;.\;\;\;\;\;\; Eq.2

Similarly, we can write the base n number 1234…k….4321 as,

\sum_{i=0}^{2k-2} \min \left ( i+1,\, 2k-1-i \right ) n^{i} \;=\; \sum_{j=0}^{k-1}(j+1)\,n^{j} \;+\; \sum_{j=0}^{k-2}(j+1)\,n^{2k-2-j} \;=\; \frac{n^{2k}-2n^{k}+1}{(n-1)^{2}}\;\;.\;\;\;\;\;\; Eq.3

The two sums in Eq.3 are arithmetic–geometric series, and evaluating them gives the final expression, which is the same as Eq.2. So yes, we have proved the pattern holds for any base n and for any k < n.

However, the above calculation feels a bit disappointing. By converting from the original place-value representation we hide what is really going on. Is there a way we can prove the result just using the place-value representations?

Remember we can break down any multiplication a x b as the sum of separate multiplications, each of which involves just one of the digits of b multiplying a. So we can write 11…1 x 11…1 as,

11…1 x 11…1 = 11…1 x 00…01 + 11…1 x 00…10 + ⋯ + 11…1 x 10…00

We can make the idea more concrete by looking at a specific example. Take the five digit number 11111. We can write the multiplication of 11111 by itself in the following way,

11111 x 11111 = 11111 x 00001 + 11111 x 00010 + 11111 x 00100 + 11111 x 01000 + 11111 x 10000

Note we haven’t said what base we’re in. That is because the decomposition above holds for any base. Now remember that when we multiply by a number of the form 00010, we shift the digits of whatever number we’re multiplying one place to the left. So, 11111 x 00010 = 111110. Likewise, when we multiply with 00100, we shift all digits two places to the left. If we multiply with 01000, we shift the digits three places to the left.

Now, with all that shifting of digits to the left, we’re going to need a bigger register to align all of our multiplications. We’ll add zeros at the front of each multiplication. For example, 11111 x 00100 = 1111100 now becomes 11111 x 00100 = 001111100. With the extra zeros in place, we can finally write 11111 x 11111 as,

  000011111
+ 000111110
+ 001111100
+ 011111000
+ 111110000
-----------
  123454321

You can now see how all the 1s on the right-hand side line up in columns, allowing us to count how many we’ve got. Now we must impose the restriction that we are in base n > 5 so that when we add up all the 1s in a column we don’t get any carry over. If we put the above result in a table format, the column totals become clearer – see below.

Place value:   n^8  n^7  n^6  n^5  n^4  n^3  n^2  n^1  n^0
Column total:   1    2    3    4    5    4    3    2    1

Now it becomes clearer why we get the result 11111 x 11111 = 123454321 for any base n > 5. It is due to the shifting and aligning of multiple copies of the original starting number 11111.
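We can also verify the pattern numerically for a range of bases. A short Python sketch (the digits helper below is just for illustration):

def digits(x, n):
    # Digits of x in base n, most significant first
    out = []
    while x:
        out.append(x % n)
        x //= n
    return out[::-1]

for n in range(2, 11):
    for k in range(1, n):
        repunit = (n ** k - 1) // (n - 1)   # the base-n number made of k 1s
        palindrome = list(range(1, k + 1)) + list(range(k - 1, 0, -1))
        assert digits(repunit ** 2, n) == palindrome

print("Pattern verified for all bases n up to 10 and all k < n")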

The generalization to any number of 1s in the number we are squaring is obvious. This means we can also extend the palindromic pattern beyond base 10, for example to hexadecimal. Multiplying the hexadecimal number 111111111111111_16 (fifteen 1s) by itself gives,

111111111111111_16 x 111111111111111_16 = 123456789ABCDEFEDCBA987654321_16

which means we get a pleasing palindrome that includes the alphabetic hexadecimal digits A, B, C, D, E, F. Try it in Python using the code line below,

hex(0x111111111111111 * 0x111111111111111)

You should get the output ‘0x123456789abcdefedcba987654321’

© 2023 David Hoyle. All Rights Reserved