Benford's Law Testing: Estimating Required Sample Size for Leading Digits Test

September 15, 2024

Introduction

This is the second of two posts related to Benford's Law. For more context, check out my previous post.

Aside from the rule of thumb that bigger is better, there is limited literature regarding optimal sample sizes for running a Benford’s Law analysis. At the very least, we want to know the minimum number of samples required for Benford’s Law to be observed in a data set. This number will be a moving target depending on the required accuracy of the analysis.

Conducting a Benford’s Law conformity analysis often requires the selection of some critical value that indicates conformity or nonconformity to the Benford curve. In his book Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations, Professor Mark J. Nigrini presents an analysis of 25 data sets used to develop a set of critical values for the mean absolute deviation (MAD) statistic. For a first-two digits analysis using the mean absolute deviation test, Nigrini proposes a MAD score of 0.0022 as a reasonable threshold between conformity and nonconformity to Benford’s Law. While its name is relatively self-explanatory, the formula for calculating the MAD score is:

MAD = (1/K) × Σ |AP − EP|

where AP is the observed proportion for a given first-two digit combination, EP is the expected Benford’s Law proportion for that combination, and K is the number of first digit combinations (e.g., K is 90 for the possible first-two digit combinations of 10 through 99); the absolute deviations are summed over all K combinations.
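
To make the formula concrete, here is a minimal vectorized sketch of the same calculation in numpy. The variable names (digits, expected_proportions, observed_proportions) are mine for illustration, and the observed proportions are a stand-in sampled from the Benford distribution itself so the snippet runs end to end:

import numpy as np

# Expected Benford's Law proportions for the first-two digits 10 through 99
digits = np.arange(10, 100)
expected_proportions = np.log10(1 + 1 / digits)

# Observed proportions would normally come from your data; as a stand-in,
# draw 1,000 simulated records from the Benford distribution itself
rng = np.random.default_rng(0)
sample = rng.choice(digits, size=1000, p=expected_proportions)
observed_proportions = np.bincount(sample, minlength=100)[10:] / sample.size

# MAD = (1/K) * sum of |AP - EP| over all K digit combinations
mad = np.mean(np.abs(observed_proportions - expected_proportions))
print(mad)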

Since I utilized the MAD score in my initial Benford’s Law article, I was curious how large a sample should be in order to run a reliable first-two digits analysis using Nigrini’s critical values. To figure that out, I devised a small experiment where a random sample is drawn from a distribution that we already know conforms to Benford’s Law. We would naturally hypothesize that this sample also conforms to Benford’s Law. If, however, calculating the MAD score for the sample leads to the conclusion that the sample does not conform to Benford’s Law, then we’ll call that a false positive (since we already know the underlying distribution does conform to Benford’s Law). If we decide on a tolerable rate of false positives, then we can repeat this experiment thousands of times at increasing sample sizes, counting our false positive results along the way, until the observed false positive rate drops below that tolerance. This should approximate a target minimum sample size at the given MAD cutoff of 0.0022.

Code

The process described above is translated to Python code below, where our base Benford’s Law distribution is the actual set of Benford’s Law expected probabilities for the first-two digits, and our random sample from the distribution is drawn using numpy’s random number generator. While not very efficient, incrementing the sample size for each iteration of the simulation loop will allow us to see the relationship between sample size and false positive rate. If I were for some reason running this experiment over and over again programmatically, I would consider implementing some divide-and-conquer logic rather than simply growing the sample size until is_running is False.

import numpy as np
import pandas as pd

"Initial setup of an array of possible first two digit combintations and their expected frequencies according to Benford's Law"
first_two_digits = np.arange(10, 100)
benford_frequencies = np.log10(1 + (1 / first_two_digits))

"Generate a set of first digit frequencies based on Benford's Law probabilities"
def roll_new_dataset(size):
    rng = np.random.default_rng()
    simulated_data = pd.Series(rng.choice(first_two_digits, size, p=benford_frequencies))
    simulated_proportions = simulated_data.value_counts(normalize=True)
    return simulated_proportions

"Calculate mean absolute deviation of a single simulated dataset"
def measure_mad(data):
    abs_deviation = []
    keys = data.keys()

    for idx, digits in enumerate(first_two_digits):
        expected = benford_frequencies[idx]

        if digits in keys:
            actual = data.get(digits)

        else:
            actual = 0
        
        deviation = abs(actual - expected)
        abs_deviation.append(deviation)
      
    mad = np.average(abs_deviation)
    return mad

"Simulation setup"
is_running = True

"Number of trials, i.e., number of synthetic distributions tested"
num_trials = 2500

"Tested sample size n"
n = 1000

"Critical value for mean absolute deviation"
mad_threshold = 0.0022

"Acceptable false positive rate"
fp_rate = 0.05

"Simulation loop"
while is_running:

    mad_count = 0

    for _ in range(num_trials):
        simulated_proportions = roll_new_dataset(n)
        mad = measure_mad(simulated_proportions)
        if mad >= mad_threshold:
            mad_count += 1

    print(f"Sample Size: {n}, False positive rate: {mad_count / num_trials}")

    if mad_count / num_trials <= fp_rate:
        is_running = False
    else:
        n += 50

Results

From one run, we can see that the sample size of 1000 is too small for the requirements that we defined (MAD critical value of 0.0022 and an acceptable false positive rate of 5%). A sample size of 1700 is the first tested sample size to generate a false positive rate under 5%. As written, the loop stops when we encounter a sample size that generates a tolerable false positive rate. Because we increment by 50 for each iteration of the loop, this code won't find the sample size where the false positive rate first dips below 5%, but we do know that this number is somewhere between 1650 and 1700 records (see the bisection sketch after the output below for one way to narrow this down). If we were dealing with a real-world data set containing between 1650 and 1700 records, we could repurpose the code to test the actual number of records to confirm that it satisfies the required false positive cutoff.

Sample Size: 1000, False positive rate: 0.9332
Sample Size: 1050, False positive rate: 0.8996
Sample Size: 1100, False positive rate: 0.8264
Sample Size: 1150, False positive rate: 0.7556
Sample Size: 1200, False positive rate: 0.6812
Sample Size: 1250, False positive rate: 0.606
Sample Size: 1300, False positive rate: 0.4888
Sample Size: 1350, False positive rate: 0.4108
Sample Size: 1400, False positive rate: 0.334
Sample Size: 1450, False positive rate: 0.2528
Sample Size: 1500, False positive rate: 0.1972
Sample Size: 1550, False positive rate: 0.1372
Sample Size: 1600, False positive rate: 0.0916
Sample Size: 1650, False positive rate: 0.0704
Sample Size: 1700, False positive rate: 0.0432
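
As a follow-up to the divide-and-conquer idea mentioned earlier, here is a minimal bisection-style sketch (not part of the original script) that reuses roll_new_dataset and measure_mad to narrow down the smallest sample size meeting the 5% cutoff between the two bracketing sizes found above:

# Assumes the false positive rate decreases as n grows, and that
# roll_new_dataset and measure_mad from the script above are already defined
def false_positive_rate(n, num_trials=2500, mad_threshold=0.0022):
    exceed = sum(measure_mad(roll_new_dataset(n)) >= mad_threshold for _ in range(num_trials))
    return exceed / num_trials

low, high = 1650, 1700  # bracket found by the incremental search above
while high - low > 1:
    midpoint = (low + high) // 2
    if false_positive_rate(midpoint) <= 0.05:
        high = midpoint
    else:
        low = midpoint

print(f"Estimated minimum sample size: {high}")

Because each estimate of the false positive rate carries Monte Carlo noise, the boundary found this way will wobble a bit from run to run; increasing num_trials tightens it.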

Things to Keep in Mind

  1. This experiment takes random draws from the Benford’s Law first-two digit distribution, which perfectly conforms to the law. Real-world data virtually never perfectly conforms to Benford’s Law, so the estimated sample size might be best considered as a reasonable minimum to achieve before running any testing.

  2. This experiment tells us exactly what we asked and nothing more, which is how large a sample ought to be in order to use a MAD cutoff of 0.0022 when conducting a first-two digits analysis. You can still run the test with a smaller sample size, but in that case you’ll need to tolerate higher MAD scores or live with a higher chance of incorrectly concluding that the data does not conform to Benford’s Law (the sketch at the end of this post shows one way to estimate what cutoff a smaller sample can support).

  3. By taking random draws from the expected Benford’s Law distribution, we’ve effectively modeled the “null hypothesis” of a Benford’s Law analysis, which is that the data conforms to Benford’s Law. In my opinion, modeling the null hypothesis in this manner was reasonable for estimating a minimum sample size, but similar simulations are not necessarily as useful if you’re trying to determine the statistical significance of a real-world analysis. In the real world, it's common for data to show both close conformity to Benford’s Law and statistically significant deviations from the expectation under traditional statistical testing. This is where the analyst needs to use professional judgment to separate statistical significance from practical significance.
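
On the second point, the same simulation machinery can be turned around: instead of growing the sample size until a fixed cutoff is reliable, fix the sample size you actually have and estimate what MAD cutoff it can support at a 5% false positive rate. Here is a minimal sketch (reusing roll_new_dataset and measure_mad from the script above; the sample size of 800 is an illustrative assumption):

import numpy as np

# For a fixed sample size, estimate the MAD cutoff consistent with a 5% false
# positive rate: the 95th percentile of MAD scores simulated under the null
def mad_cutoff_for_sample_size(n, num_trials=2500, fp_rate=0.05):
    mads = [measure_mad(roll_new_dataset(n)) for _ in range(num_trials)]
    return np.quantile(mads, 1 - fp_rate)

# Illustrative example: a data set of 800 records
print(mad_cutoff_for_sample_size(800))

For sample sizes below the roughly 1700-record threshold found above, a cutoff estimated this way will come out larger than 0.0022, which is just another way of saying that smaller samples force you to tolerate higher MAD scores.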