SUMMARY: Examining vote totals across ~6,000 Florida precincts from the 2016 presidential election shows that a Benford-type analysis of the first digit cannot be relied upon to indicate fraud when precinct totals have both normal and log-normal distribution components. Without prior knowledge of the expected frequency distribution of precinct-level vote totals in the absence of fraud, I see no way to apply a Benford’s Law analysis of the first digit to infer fraud. A similar digit-based analysis would have the same problem, because it also depends on the expected frequency distribution of vote totals, which is difficult to estimate since doing so is tantamount to knowing the vote result in the absence of fraud. Instead, it might make more sense to simply examine the precinct-level vote distributions directly, rather than a Benford-type analysis of those data, and to compare one candidate’s distribution with those of other candidates.
It’s only been a week since someone introduced me to Benford’s Law as a way to identify election fraud. The method looks at the first digit of the vote totals reported across many (e.g., thousands of) precincts. If the vote totals can be assumed to follow either a log-normal distribution or a 1/x distribution, with no fraudulently inflated values, then the relative frequencies of the first digits (1 through 9) take very specific values, and deviations from those values could indicate fraud.
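The “very specific values” come from the standard Benford formula, in which the expected share of leading digit d is log10(1 + 1/d): roughly 30.1% for a leading 1, falling to about 4.6% for a leading 9. A minimal sketch of that calculation follows; the function names are my own and purely illustrative.

```python
import math

def benford_expected():
    """Expected relative frequency of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(count):
    """Leading decimal digit of a positive vote total."""
    if count < 1:
        raise ValueError("a leading digit is only defined here for totals >= 1")
    return int(str(int(count))[0])

if __name__ == "__main__":
    # Print the expected percentage for each leading digit.
    for digit, share in benford_expected().items():
        print(f"{digit}: {share:.1%}")
```

Tallying `first_digit` over a list of precinct totals and dividing by the number of precincts gives the observed frequencies to set against these expected ones.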
After a weekend spent examining vote totals from Philadelphia in the 2020 presidential primary, my results were mixed. Next, I decided to examine data from the 2016 election in Florida (data from the 2020 general election is hard to find). I wanted to find out whether Benford’s Law really does apply to vote totals when there is no evidence of widespread fraud. For Trump votes in the 2020 Philadelphia primary, the answer was yes: the data closely followed Benford. But that was just one election, one candidate, and one city.
When I analyzed the 2016 general election data for Florida, I saw deviations from Benford’s Law in the vote totals for both Trump and Clinton:
For at least the first-digit values “3” and “4”, the results are far outside what would be expected if the underlying frequency distribution of vote totals were truly log-normal.
This prompted me to examine the raw frequency distributions of the vote totals, and there I saw the reason: both the Trump and Clinton frequency distributions have elements of both a log-normal and a normal distribution.
This undermines the basis for a Benford’s Law-type analysis of voting data. Such an analysis assumes that vote totals follow a specific frequency distribution (log-normal or 1/x), and that if votes are fraudulently added (AND those fake additions are approximately normally distributed!), then the first-digit analysis will deviate from Benford’s Law.
Since a Benford’s Law analysis depends on the underlying distribution being purely log-normal (or of the 1/x power-law form), interpreting its results depends on knowing the expected shape of these vote distributions, and that is not an easy task. Is the expected distribution of vote totals really log-normal?
Why would precinct-level vote distributions have a log-normal shape?
A Benford’s Law analysis of voting data depends on the expectation that there will be many more precincts with low vote totals than precincts with high vote totals. Obviously, polling places in rural areas and small towns will not have as many voters as polling places in large cities, and there will probably be more of them.
As a result, precinct-level vote totals tend to have a frequency distribution with many low totals and fewer high totals. To produce Benford’s Law-type results, the distribution must have either a log-normal or a power-law (1/x) form.
However, there are reasons why we might expect vote totals to have a more normal (rather than log-normal) distribution.
Why might precinct-level vote totals depart from log-normal?
While I don’t know the details, I would expect the number of voting locations to be scaled so that each location can handle a decent amount of voter traffic, right?
To illustrate the point, imagine a system in which ALL polling places, whether urban or rural, are optimally sized to handle approximately 1,000 voters at the expected turnout.
In cities, these might be every few blocks. In rural Montana, some voters might have to travel 100 miles to vote. In this imaginary system, I think you can see that precinct-level vote totals would be more normally distributed, with an average of around 1,000 votes and as many precincts with 500 votes as with 1,500 votes (instead of many more low-vote precincts than high-vote precincts, as is currently the case).
But we don’t want rural voters to have to drive 100 miles to vote, do we? And there may not be enough public space to put polling places every two blocks in a city, so some VERY high vote totals can be expected from overcrowded urban polling places.
What we have instead is a combination of the two distributions: log-normal (because there are many rural locations with few voters, and some urban polling places are overcrowded) and normal (because cities tend to site their polling places to serve a certain number of voters, as best they can).
Benford-type analysis of synthetic normal and log-normal distributions
If I create two sets of synthetic data with 100,000 values each, one with a normal distribution and one with a log-normal distribution, the relative frequencies of the first digit of these vote totals are as follows:
The results for a normal distribution vary considerably depending on the assumed mean and the standard deviation of this distribution.
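The synthetic experiment can be sketched in a few lines of Python. This is a sketch under my own assumptions, not the exact setup used for the results discussed here: the normal mean and standard deviation (1,000 and 250), the log-normal median and spread (300 and 1.5 in natural-log units), and the random seed are all illustrative choices.

```python
import math
import random

# Benford's expected share for leading digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit_freqs(values):
    """Relative frequency of leading digits 1..9 among values that round to >= 1."""
    tally = [0] * 10
    for v in values:
        n = round(v)
        if n >= 1:  # zero or negative totals have no leading digit
            tally[int(str(n)[0])] += 1
    total = sum(tally)
    return [tally[d] / total for d in range(1, 10)]

def max_deviation(freqs):
    """Largest absolute gap between observed frequencies and Benford's."""
    return max(abs(f - b) for f, b in zip(freqs, BENFORD))

def simulate(n=100_000, seed=0):
    """First-digit frequencies for one normal and one log-normal sample (assumed parameters)."""
    rng = random.Random(seed)
    normal = [rng.gauss(1000, 250) for _ in range(n)]
    lognormal = [rng.lognormvariate(math.log(300), 1.5) for _ in range(n)]
    return first_digit_freqs(normal), first_digit_freqs(lognormal)

if __name__ == "__main__":
    normal_f, lognormal_f = simulate()
    print("digit  Benford  normal  log-normal")
    for d in range(1, 10):
        print(f"{d:>5}  {BENFORD[d - 1]:7.3f}  {normal_f[d - 1]:6.3f}  {lognormal_f[d - 1]:10.3f}")
```

Changing the mean and standard deviation passed to `rng.gauss` is a quick way to see how strongly the normal-distribution result depends on those two numbers.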
I believe what is going on in the Florida precinct data is simply a combination of normal and log-normal distributions of the vote totals. For whatever reasons, the vote totals do not follow a log-normal distribution, and so they cannot be interpreted with a Benford’s Law-type analysis.
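To see how such a combination behaves, here is a hedged sketch that mixes the two kinds of synthetic totals and reports the largest gap from the Benford frequencies. The share of normally distributed precincts, and every distribution parameter, is an assumption chosen for illustration; nothing here is fitted to the Florida data.

```python
import math
import random

# Benford's expected share for leading digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def first_digit_freqs(values):
    """Relative frequency of leading digits 1..9 among values that round to >= 1."""
    tally = [0] * 10
    for v in values:
        n = round(v)
        if n >= 1:  # skip totals with no leading digit
            tally[int(str(n)[0])] += 1
    total = sum(tally)
    return [tally[d] / total for d in range(1, 10)]

def mixture_deviation(normal_share, n=50_000, seed=0):
    """Largest gap from Benford for a mix of normal and log-normal totals.

    normal_share is the assumed fraction of precincts drawn from the normal
    component; the remainder come from the log-normal component.
    """
    rng = random.Random(seed)
    values = [
        rng.gauss(1000, 250) if rng.random() < normal_share
        else rng.lognormvariate(math.log(300), 1.5)
        for _ in range(n)
    ]
    freqs = first_digit_freqs(values)
    return max(abs(f - b) for f, b in zip(freqs, BENFORD))

if __name__ == "__main__":
    for share in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"normal share {share:.2f}: max deviation {mixture_deviation(share):.3f}")
```

In this toy setup the gap grows as the normal component takes up more of the mix, with no fraud modeled at all.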
One can easily imagine other reasons for the frequency distribution of precinct-level vote totals to depart from log-normal.
What one would need is convincing evidence of what the frequency distribution should look like in the absence of fraud. However, I don’t see how this can be done unless one candidate’s vote distribution is extremely skewed compared to another candidate’s totals, or compared to primary election totals.
And that is exactly what happened in Milwaukee (and other cities) in the last election: a Benford’s Law analysis revealed very different frequency distributions for Trump than for Biden.
I would think it makes more sense to just look at the raw precinct-level vote distributions (e.g., as in Figure 2) than at a Benford analysis of those data. The Benford technique suggests some sort of magical, universal relationship, but it is simply the result of a log-normal distribution of the data. Any deviation from the Benford percentages simply reflects an underlying frequency distribution that departs from log-normal, and does not necessarily indicate fraud.
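One way to look at the raw distributions directly is to bin each candidate’s precinct totals on a logarithmic scale and compare the shares bin by bin. The sketch below is one possible approach, not the method behind any figure in this post; the function names, the bin width, and the toy totals are all hypothetical.

```python
import math
from collections import Counter

def log_bin_counts(totals, bins_per_decade=4):
    """Count positive vote totals in bins of equal width in log10 space."""
    tally = Counter()
    for t in totals:
        if t > 0:  # log10 is undefined for zero or negative totals
            tally[math.floor(math.log10(t) * bins_per_decade)] += 1
    return tally

def compare(totals_a, totals_b, bins_per_decade=4):
    """Rows of (bin_lower_edge, share_a, share_b) over every occupied bin."""
    a = log_bin_counts(totals_a, bins_per_decade)
    b = log_bin_counts(totals_b, bins_per_decade)
    n_a, n_b = sum(a.values()), sum(b.values())
    if n_a == 0 or n_b == 0:
        raise ValueError("each candidate needs at least one positive total")
    return [
        (10 ** (k / bins_per_decade), a[k] / n_a, b[k] / n_b)
        for k in sorted(set(a) | set(b))
    ]

if __name__ == "__main__":
    # Made-up precinct totals for two hypothetical candidates.
    candidate_a = [120, 340, 560, 910, 1500, 2300]
    candidate_b = [80, 150, 420, 700, 1200, 4100]
    for edge, share_a, share_b in compare(candidate_a, candidate_b):
        print(f">= {edge:8.0f}: A {share_a:.2f}   B {share_b:.2f}")
```

Log-spaced bins are used so that a log-normal shape shows up as a roughly symmetric hump, which makes a normal component or a skewed tail easier to spot by eye.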