In a recent webinar for the International Association of Forensic Toxicology Consultants, nationally recognized DUI expert Josh Ott delivered a comprehensive analysis of Standardized Field Sobriety Tests (SFSTs) that challenges common assumptions about these roadside evaluations.
Drawing on over a decade of experience conducting thousands of DUI investigations as a law enforcement officer and his current work providing expert testimony, Ott examined the actual research behind field sobriety testing and revealed troubling gaps between what officers are taught and what the science actually shows.
The Fundamental Misunderstanding
Perhaps the most critical point Ott emphasizes is that field sobriety tests were never designed to measure impairment. This statement contradicts what officers testify to in courtrooms across America every day.
The tests were developed and validated solely for one purpose: to discriminate drivers at or above the legal BAC limit from those below it. They have never been studied for their ability to measure driving impairment, drug impairment, or even alcohol impairment itself.
As Ott notes, “I would say probably throughout the United States right now, officers are testifying in court they observed X number of validated clues of impairment for each of these tests. And that ends up being a very big problem.”
The Development and Validation Studies
The standardized field sobriety tests emerged from research conducted by the Southern California Research Institute with funding from NHTSA starting in 1975. Three major tests were identified as most accurate:
Horizontal Gaze Nystagmus (HGN)
Walk and Turn
One Leg Stand
However, Ott points out a concerning pattern: Dr. Marcelline Burns authored five of the six major studies used to develop and validate these tests. This concentration of authorship raises questions about independent verification of the research.
The San Diego Study: What Officers Don’t Learn
The 1998 San Diego field validation study is the most commonly cited research when officers testify about SFST accuracy. Officers learn that the tests showed 91% overall accuracy, with HGN at 88% accurate, Walk and Turn at 79% accurate, and One Leg Stand at 83% accurate.
What they don’t learn is equally important.
The False Positive Problem
Ott calculated false positive rates that were never published in the study itself:
HGN: 37% false positive rate (one in three people below 0.08 showed 4+ clues)
Walk and Turn: 52% false positive rate (statistically equivalent to flipping a coin)
One Leg Stand: 41% false positive rate (nearly the same as guessing)
Overall arrest decision: 28% false positive rate (more than one in four drivers incorrectly arrested)
To calculate these rates, you have to dig into the raw data matrices provided in the study. They are not prominently featured in the conclusions.
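As a minimal sketch of that arithmetic (the variable names are ours), here is the overall arrest-decision calculation using the raw counts Ott reads from the San Diego matrices later in the transcript: 83 drivers below 0.08, of whom 24 were arrested, and 214 drivers at or above 0.08, of whom 4 were released.

```python
# Sketch of the false positive / false negative arithmetic, using the
# San Diego arrest-decision counts cited in the presentation.
below_limit = 83       # drivers with BAC < 0.08
false_positives = 24   # below-limit drivers who were nonetheless arrested
at_or_above = 214      # drivers with BAC >= 0.08
false_negatives = 4    # at-or-above drivers who were released

fp_rate = false_positives / below_limit   # 24 / 83
fn_rate = false_negatives / at_or_above   # 4 / 214

print(f"False positive rate: {fp_rate:.1%}")  # 28.9%
print(f"False negative rate: {fn_rate:.1%}")  # 1.9%
# The exact ratio is about 15.5; rounding the rates to 28% and 2% gives
# the "14 times" figure quoted in the webinar.
print(f"Ratio: {fp_rate / fn_rate:.1f}x")
```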
Study Design Concerns
Several aspects of the San Diego study raise red flags:
Sample characteristics: Of 297 drivers in the study, only ONE refused chemical testing (less than 1%). In actual practice, refusal rates are typically around 50%. This extreme compliance suggests the sample may not be representative.
BAC distribution: The average BAC of arrested drivers was 0.15%, while non-arrested drivers averaged below 0.05%. This wide separation makes discrimination easier: the tests were supposedly validated at the 0.08% threshold, yet most subjects were far from it (a toy simulation after this list illustrates the effect).
PBT influence: Officers had access to preliminary breath tests. While NHTSA assures us officers recorded their estimates before viewing PBT results, the data raises questions. Of 30 people who were false positives on HGN, only 16 were actually arrested. Why would officers disregard what they’ve been trained is a “silver bullet” test in nearly 50% of cases?
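To see why the BAC spread matters, consider a toy simulation (our illustration, with invented parameters, not data from any study). It models the officer's roadside judgment as a noisy estimate of true BAC and scores the same arrest rule against a widely separated sample versus one clustered near the limit.

```python
# Toy simulation: wide separation between arrested and released groups
# inflates apparent accuracy. All parameters here are invented.
import random

random.seed(1)
LIMIT, NOISE_SD, N = 0.08, 0.02, 10_000

def accuracy(bacs):
    """Score an arrest-if-estimate-over-limit rule against true BAC."""
    correct = 0
    for bac in bacs:
        estimate = bac + random.gauss(0, NOISE_SD)  # noisy roadside judgment
        correct += (estimate >= LIMIT) == (bac >= LIMIT)
    return correct / len(bacs)

# San Diego-like spread: one group averaging ~0.15, the other ~0.04
wide = [random.gauss(0.15, 0.02) for _ in range(N // 2)] + \
       [random.gauss(0.04, 0.01) for _ in range(N // 2)]
# Hard case: everyone hovering near the 0.08 threshold
narrow = [random.uniform(0.06, 0.10) for _ in range(N)]

print(f"Widely separated BACs: {accuracy(wide):.0%}")   # roughly 99%
print(f"BACs near the limit:   {accuracy(narrow):.0%}") # roughly 68%
```

The decision rule never changes; only the sample does, which is the sense in which a wide BAC distribution flatters the reported accuracy.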
The Training Disconnect
Perhaps most troubling is what officers are taught about errors. The ARIDE instructor manual tells officers that “research has demonstrated that officers are more likely to err on behalf of the defendant.”
The actual data from San Diego shows exactly the opposite. There were 24 false positives (out of 83 drivers below the limit) compared to only 4 false negatives (out of 214 at or above it), a false positive rate roughly 14 times the false negative rate. Officers are trained to believe the opposite of what the research actually demonstrates.
The Robustness Study: Changing the Rules
In 2007, Dr. Burns published the “Robustness of the Horizontal Gaze Nystagmus Test” study, funded by NHTSA. The purpose was to address defense arguments that incorrect administration compromises HGN accuracy.
Shocking False Positive Rates
When HGN was administered correctly:
67% false positive rate at BAC < 0.08 (two out of three people)
65% false positive rate at BAC < 0.05
85% false positive rate at BAC < 0.03
When administered incorrectly:
Stimulus too high: 91% false positive rate
Stimulus too low: 79% false positive rate
Stimulus too close: 92% false positive rate
Stimulus too far: 84% false positive rate
Despite these extraordinarily high false positive rates, Dr. Burns concluded that HGN is a “robust procedure,” meaning administration variations don’t compromise accuracy.
The Standard That Never Was
How did Dr. Burns reach this conclusion with such damning data? Ott discovered she had changed the validation standard.
Since the 1998 San Diego study, the criterion has been that 4+ clues on HGN indicate a BAC of 0.08 or higher. Dr. Burns taught this standard to officers herself.
But in the robustness study, she stated that the criteria “as defined in the SFST curriculum” were that 4 clues indicate a BAC of 0.03 or higher, a standard that has never appeared in any SFST curriculum.
By lowering the threshold from 0.08 to 0.03, she dramatically reduced the number of results classified as false positives. As Ott states, “Dr. Burns took her opinions and then took the data and changed the standard to make it fit her opinions instead of taking the data, applying the correct standard and forming her opinions on that.”
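The mechanism is easy to demonstrate with a hypothetical sketch (the records below are invented for illustration, not data from the study): the same clue observations produce far fewer counted false positives once the criterion drops from 0.08 to 0.03.

```python
# Each invented record is (true BAC, HGN clues observed); 4+ clues = "positive".
subjects = [(0.02, 4), (0.04, 5), (0.05, 4), (0.06, 6), (0.07, 4), (0.09, 6)]

def false_positives(records, criterion_bac):
    """Count positives (4+ clues) whose BAC falls below the criterion."""
    return sum(1 for bac, clues in records if clues >= 4 and bac < criterion_bac)

print(false_positives(subjects, 0.08))  # 5 under the curriculum's 0.08 standard
print(false_positives(subjects, 0.03))  # 1 under the substituted 0.03 standard
```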
The DRE Technical Advisory Panel later removed this study from its manuals, though the study has never been formally retracted.
The JAMA Study: Sober People Fail These Tests
The 2023 Journal of the American Medical Association study on field sobriety tests and cannabis provides crucial data: it included a placebo-dosed control group, allowing researchers to see how completely sober people perform on these tests.
Participants were marijuana users who were screened physically and mentally, tested for drugs and alcohol on the study day, and performed on a driving simulator before receiving placebo doses. There was no evidence of residual effects from prior marijuana use.
Results for the 63 placebo-dosed (sober) individuals:
Walk and Turn: 56% false positive rate
One Leg Stand: 37% false positive rate
Combined (false positive on both tests): 28% false positive rate
More than one in four completely sober people were deemed impaired based on these tests. These rates closely match the San Diego study findings.
What This Means for the Courts
The implications of this research are profound:
These tests were never validated for impairment. The San Diego study explicitly states, “the only appropriate criteria and measure to assess the accuracy of the standardized field sobriety test is BAC, and measures of impairment are irrelevant.”
False positive rates are substantial and largely unknown. Officers receive no training on false positive rates, and even if they read the studies, the rates aren’t published in the conclusions.
Officers are trained incorrectly about error direction. They’re told errors favor defendants when the data shows false positives far exceed false negatives.
The peer review problem. The San Diego study has never been peer reviewed, yet it’s used to satisfy Daubert or Frye standards. A paper about the study was peer reviewed, but not the study itself.
Not all officers are equal. A boating under the influence study found that 20% of experienced officers with an average of nearly 10 years administering SFSTs were less than 50% accurate with HGN. Yet courts treat all officers as equally reliable.
Critical Knowledge Gaps
Ott identifies several factors that have never been properly studied:
Age effects: A 21-year-old and a 64-year-old are evaluated identically, despite obvious differences in balance and physical ability
Weather conditions: Despite claims that these tests work in any conditions, no controlled studies exist
Surface conditions: The impact of uneven surfaces, grades, or different materials hasn’t been systematically studied
Footwear: Tennis shoes versus flip-flops versus heels versus barefoot
Injuries and medical conditions: Beyond basic exclusions, the impact hasn’t been quantified
The Challenge Ahead
As Ott notes, “If you tell a lie long enough, it becomes the truth.”
The belief that field sobriety tests measure impairment is so deeply ingrained in the criminal justice system that challenging it faces enormous resistance.
Every day, officers testify that these tests indicate impairment. Prosecutors rely on this testimony. Judges accept it. Even many defense attorneys believe it.
The uphill battle is not just about correcting the science; it’s about overcoming decades of institutional momentum.
Josh Ott is a nationally recognized expert in DUI investigations and SFST administration with over a decade of law enforcement experience. He currently works with Case Lock Incorporated, providing expert testimony and case review services in alcohol and drug impairment cases.
The International Association of Forensic Toxicology Consultants hosts monthly webinars on topics relevant to forensic toxicology. For more information about upcoming presentations, visit iaftc.org.
Related podcasts
The Accuracy of Standardized Field Sobriety Testing (SFST) with Dr. Greg Kane
Challenging the Validity of SFSTs and Drug Recognition Testing with David Rosenbloom
Are Standardized Field Sobriety Tests (SFSTs) Scientific? - Josh Ott
The Eyes Have It (Wrong): Horizontal Gaze Nystagmus Through an Ophthalmologist’s Skeptical Lens
Transcript (automated, lightly edited for clarity)
[00:00:00] Introduction and Speaker Background
Aaron Olson: Today we've got Josh Ott. He's gonna present to us about field sobriety testing and the work that he's been doing on this. He's a nationally recognized expert in DUI investigations, SFST administration, and HGN interpretation, and he draws on more than a decade of experience at the Roswell, Georgia Police Department.
He's conducted literally thousands of DUI investigations. He served as a drug recognition expert instructor and trained officers across multiple disciplines and jurisdictions. His background includes work in the motor and traffic enforcement unit and investigation of serious injury and fatality collisions, and he's also received numerous awards for his contributions. Today he works for Case Lock Incorporated and provides expert testimony and case review services in alcohol and drug impairment cases. Josh, thanks so much for doing this for us today.
Joshua Ott: Well, thanks for the invite. Good afternoon, everyone. If anyone has any questions as we go, just feel free to unmute and immediately ask the question so we can address it at that time.
[00:01:06] Purpose and Importance of Standardized Field Sobriety Tests
Joshua Ott: But what we're gonna be talking about are the studies that have been conducted into the standardized field sobriety tests. Some of the topics we're gonna address are the purpose and the development and validation.
And the three major studies that we're gonna focus in on are the San Diego study, the Robustness of the Horizontal Gaze Nystagmus Test study, and the recent 2023 Journal of the American Medical Association study. So what is the purpose of the standardized field sobriety tests? Number one is to have standardized tests.
And this is important because we want all officers administering and interpreting the tests the same way across the United States. So ultimately, when we watch a video, if we're being objective with it, we can determine, number one, whether the officer administered the test correctly, and number two, how the defendant actually performed on the test, so we can interpret it for ourselves and say, on the walk and turn, we did see two or more clues.
And then by having them validated, we now know how accurate the test is supposed to be and what the results actually indicate to us. These are intended to be screening tests (they're obviously used in court well beyond the scope of what they were intended for), but they were intended to help an officer determine whether or not they should make an arrest based on the standard of probable cause.
And so for that purpose, in my opinion, these tests are absolutely beneficial, because if we looked at a normal DUI investigation, between the vehicle in motion and personal contact phases, when the officer is first making contact with the driver, probably within about 30 seconds in the majority of investigations the officer already has probable cause.
And so without having any of these tests, a lot more drivers would be going to jail than what ends up happening after the tests. Because there are numerous drivers whom the officer initially suspects of DUI; they administer the field sobriety tests, and ultimately the officer determines they should not be arrested.
And in my personal experience, about 50% of the people I did field sobriety tests on were not charged with DUI. So in that respect, these tests are beneficial. But when we look at what these tests have been validated for, at least according to NHTSA, and then at the false positive rates of these tests, I think we're all going to agree that to use these tests to determine or show or prove whether someone is or is not impaired, the research just doesn't back up that purpose for them.
Then lastly, and this is very important to understand, these tests have only been studied for their ability to discriminate drivers at or above the legal limit from those who are below it. They have never been designed or studied for their ability to measure driving impairment, drug impairment, or even alcohol impairment, which is a very common incorrect assumption and belief about these tests.
And I would say, you know, probably throughout the United States right now, officers are testifying in court that they observed X number of validated clues of impairment for each of these tests. And that ends up being a very big problem. And I say it's one of those things that if you tell a lie long enough, it becomes the truth.
And that's the uphill battle that all of us as experts are really facing with these tests: judges, prosecutors, law enforcement officers, even defense attorneys a lot of times believe that these tests indicate impairment, when that's absolutely not the case.
[00:04:51] Development and Validation of Sobriety Tests
Joshua Ott: So let’s talk about the overview of the development and validation of these tests.
So prior to the early 1970s, there were no standardized tests for officers to use to help them determine whether or not they should arrest a driver for DUI. So what was occurring was officers were using many different tests. There was no standardization for the instructions or the clues that they were using.
And on top of that, and probably most importantly, there was no validation of those tests, which means it was unknown how accurate or inaccurate the tests were. So one of the problems that would occur is that when they went to court, judges and juries had no idea how much weight to give to the tests.
So starting in 1975, the Southern California Research Institute, with funding from the National Highway Traffic Safety Administration (NHTSA), decided to look into which of the tests officers were using were the most accurate, and they ended up releasing three final reports. The first report is the 1977 report, which was a lab study only, and this is the report that ultimately determined that the three most accurate tests were what was labeled as alcohol gaze nystagmus,
now changed to horizontal gaze nystagmus, the walk and turn, and the one leg stand. As you can see, the false positive rate of this study was 27%. And all three of these studies that we're gonna be talking about were dealing with a 0.1 BAC, which was the legal limit throughout the United States at that time period.
And then we had two additional studies that occurred after they created the standardization for the instructions and the clues for the tests. The first one was 1981, which was a lab and field study. As you can see, the false positive rate in the lab with placebo-dosed individuals was 18%, so roughly one out of every five placebo-dosed individuals was incorrectly identified as being a 0.1 or higher in the study.
And then the last study was the 1983 study, which was a field study only. This one, I was never able to see and calculate what the actual false positive rate was. All I can tell you is that 94% of the officers' incorrect decisions were false positives. So if an officer made a mistake or an error in this study, it was likely going to be a false positive.
A couple additional things I wanna point out with these studies, without going into the weeds with them: Dr. Marcelline Burns authored both the '77 and '81 studies. So she was one of the authors on both of those studies. She did not author the '83 study. And this is gonna become a very important theme as we go through these tests and studies and get specifically into the Robustness of the Horizontal Gaze Nystagmus Test study.
So what the original data found is that when four or more clues were observed on HGN, it was 77% accurate that the person was a 0.1 or higher. When two or more clues on the walk and turn were observed, it was 68% accurate that the person was a 0.1 or higher, and two or more clues on the one leg stand was 65% accurate that the person was a 0.1 or higher.
So this is the original data that they came out with for these tests. Now with all of those original studies, it was inexperienced officers that were being used. Obviously you can’t have an experienced officer if you just created these tests. So now in the nineties we have officers with a decade or more experience administering the field sobriety test.
So at that point, they wanted to do large field validation studies to determine whether, with experienced officers, these tests were just as accurate in the field as they were in the laboratory.
[00:08:37] Key Studies and Their Findings
Joshua Ott: So the first study occurred in Colorado in 1995. The second study was in Florida in 1997, and the third study was in San Diego in 1998.
Now, just briefly talking about the Colorado study: this one was again dealing with a 0.1 BAC, but, in NHTSA's insanity, because Colorado has a law where it's driving while impaired if you're a 0.05 or above, even though the study was to validate the tests at a 0.1 threshold, if the officers arrested you and you were anywhere from a 0.05 and above, it was considered a correct arrest decision.
So ultimately, to me, it just shows a lack of impartiality in the study, trying to validate these tests by lowering the threshold for what was a correct decision versus an incorrect decision to a 0.05 while validating at a 0.1. The Florida study was the first study dealing with a 0.08 BAC,
and then the San Diego study was also dealing with a 0.08 BAC. Additionally, all three of these studies were authored by Dr. Marcelline Burns. So of the six studies that were used to develop and validate the field sobriety tests, Dr. Marcelline Burns was an author on five out of the six. Before we get into the actual data from these last studies, we have to understand what constitutes a correct
versus incorrect arrest decision. And this isn't "correct" as in what is legally justified, as in was there probable cause or not; that doesn't apply here. We all know that you could be below the legal limit and an officer can arrest you for being less safe, and it's ultimately a correct decision.
But for the validation studies, what constituted a correct arrest decision was, number one, the top left box: if the person was at or above the legal limit and the officer arrested them, that's obviously correct. That's what we want to occur: a person that's supposed to go to jail goes to jail. Bottom right:
this is a person who is below the legal limit and the officer releases them. That is also a correct decision. Top right is a false negative: this is a person who is at or above the legal limit and should go to jail, but the officer incorrectly lets them go. In my opinion, these are the people that we absolutely do not want to have on the side of the road,
'cause you don't want the officer to suspect somebody is DUI, investigate them for DUI, let them go, and, worst case scenario, the person drives down the road and kills somebody after they'd already been investigated for DUI. So we don't want false negatives on the side of the road. Now, bottom left: we also don't want these on the side of the road, but it's better to have them on the side of the road than having false negatives, because we absolutely do not, and in my opinion, in the United States of America cannot, have false positives in court.
We can't have innocent people being convicted of a crime they did not commit. So these are people that were below the legal limit and were incorrectly arrested by the officer. So again, as we can see, the correct decisions come down solely to what the person's BAC is.
[00:12:07] Understanding False Positives and Test Accuracy
Joshua Ott: What is a false positive? It is a test that incorrectly indicates a condition exists when it actually does not.
An easy way that I explain this: if you went to the doctor, and the doctor tested you for a disease, and the test came back and said that you have the disease, but in reality you don't, that would be considered a false positive. So the last thing I wanna discuss before we get into the actual studies is understanding how false positives are calculated, because one of the things that you'll see NHTSA rely on is the overall accuracy of a test.
And in my opinion, this can be a very, very misleading statistic. So what we're gonna have is a hundred people who are going to the doctor to get tested for a disease. Out of these hundred people, 99 of 'em actually have the disease, and all 99 test positive for the disease. So the correct positive rate for this test is 100%: it got it right on 99 out of 99 people.
Now, the last person does not have the disease, but they also test positive for the disease. The test was wrong on one out of one people, so the false positive rate is 100%. But the overall accuracy rate of this test: it was right on 99 out of a hundred people, so it is 99% accurate overall.
And so if a doctor gave you this test and said, hey, you just tested positive for this disease, and the test is 99% accurate, you're gonna be scared, because you're like, oh my God, I just tested positive for this disease. But then if you actually looked at the data, what it would tell you is that this test cannot discriminate anything.
It comes down to the whole idea that even a broken clock is right twice a day: regardless of what the person's condition is, the test is always going to be positive. Now, obviously this is a very small sample size, but if we blew up the sample size and had, let's say, 10,000 people and ran the exact same numbers, we'd realize that this test is absolutely worthless.
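In code, the 100-patient example works out as follows (a sketch of the arithmetic in the talk, not data from any study):

```python
# A test that calls everyone positive: 99% "accurate" when 99 of 100
# patients have the disease, yet a 100% false positive rate.
has_disease = [True] * 99 + [False]
test_result = [True] * 100  # the test says "positive" for everyone

correct = sum(r == d for r, d in zip(test_result, has_disease))
healthy_results = [r for r, d in zip(test_result, has_disease) if not d]
fp_rate = sum(healthy_results) / len(healthy_results)

print(f"Overall accuracy:    {correct / len(has_disease):.0%}")  # 99%
print(f"False positive rate: {fp_rate:.0%}")                     # 100%
```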
So that ultimately shows the problem with talking about the overall accuracy of a test: it can be very misleading. To me, we should be focused on what is important. So, in my opinion, when we're talking about beyond a reasonable doubt, we should be worried about the false positive rate of the test:
how often is the test incorrectly indicating that a person is at 0.08 or above when in reality they're actually below that level? That's what I think we should be focused in on. So now let's get into the Colorado field validation study. Again, this is the first field validation study using experienced officers.
Now, the officers were 86% correct in their arrest-release decisions based on the field sobriety tests. Again, that is using the threshold of a 0.05 BAC: 93% of the people arrested had a BAC of 0.05 or above. That is the only information officers receive about this study according to the actual SFST training. Here's some additional information they don't learn about.
Number one, the false positive rate was 24%. So one out of every four people below a 0.05 was a false positive in this study, which I believe is a significant false positive rate, especially at that low of a BAC. Now, additionally, the only time that you really might hear this study mentioned is if a defense attorney is arguing that the weather conditions in which the walk and turn and one leg stand were performed were an issue, or the surface was on a grade or wasn't a smooth, flat surface, things like that.
Well, one of the things that I remember teaching officers when I was training them was that if you can do field sobriety in Colorado, where can't you do it? Because the way I picture Colorado is, I imagine, you know, mountains, snow, windy, cold conditions. So, you know, if you can do the tests there, where can't you do them?
And that’s really the information that the, um, authors of this study proposed with the test is that they said that, you know, there was really no effect that the weather or the surface conditions had. But the problem is, is they did not do any controls. So to make that statement, you actually have to have data and science and research behind it.
So what I believe should have occurred is you have the person do the test in a perfectly controlled environment, and now take them to a bad environment of, you know, snow cold. On level surface, things like that, and compare the results of the test. So in the controlled environment, we had zero clues out in the field.
We had two clues. Obviously the field had an effect on the test, or if there were zero clues in the controlled environment and zero clues in the field, then yes, there was no impact. But again, this is one of those times where Nitsa makes statements without any science data or research to actually back it up.
The next study is the Florida field validation study. You really don't hear about this one at all, but ultimately, the officers were 95% correct in their arrest decisions. This was the first study dealing with a BAC of 0.08 or above. The false positive rate of this study, which officers also do not learn about, was 18%.
So basically, one out of every five people who could be a false positive was. And if you notice, the common theme here is that the false positive rate is usually running at about 20% or above for all of the studies we've mentioned so far.
[00:18:02] Detailed Analysis of the San Diego Study
Joshua Ott: And then the last study, which is the most important, and the reason that this is the most important study is because if an officer is testifying about the accuracy of the standardized field sobriety test, it is almost guaranteed that this is the study that they are testifying about.
Every so often you might hear an officer testifying to the original numbers, but the new training has been out for so long. They started really hammering the San Diego study in the 2013 manual; back in the 2006 manual it was addressed, but not as heavily and as much as it is now.
So basically, you’re almost always gonna hear officers referencing the study. Overall officers were 91% correct in their arrest decisions. Now, this doesn’t mean that if you take HGN walk and turn and one leg stand and add them together, that it’s 91% accurate. What this means is that when the officers got to the end where they made their decision of whether or not to arrest the person, whatever they based that arrest decision on, they were ultimately right 91% of the time.
And then also this study stated that HGN is the most accurate of the three standardized field sobriety tests. The last information that they provide officers about this study is that they break down the accuracy of each one of the tests. So for HGN, the criterion was still four more clues, but now it would indicate a 0.08 or higher.
So using that, it was 88% accurate. Walk and turn and one leg stand. It was still two or more clues on each of the tests, but they would now indicate 8.08 or higher. And walk and turn was 79% accurate and one leg stand was 83% accurate. So now let’s dig into the study, and this is the information that officers don’t learn about with the study is number one that they used seven officers from the San Diego Police Department’s Alcohol Enforcement Unit.
And to my understanding, this is basically their DUI task force. These officers were already trained and experienced in administering the field sobriety tests. But even with that, they received a four-hour refresher course. It was taught by Dr. Burns, who, number one, was making sure that the officers were administering the tests correctly,
'cause that's a very important key to these tests: they have to be administered correctly. Number two, she was teaching them the new criteria for the tests: that the clues on HGN, walk and turn, and one leg stand, instead of indicating a 0.1 or higher, would now indicate a 0.08 or higher. So again, when we get into the robustness study, remember that it's Dr.
Burns who actually taught the officers that the standard is four or more clues indicates a 0.08 or higher. That's gonna be very important and relevant. Now, after they received this four-hour refresher course, for the next several months the officers who were involved in the study went out and made traffic stops on the general public.
So they were doing the same thing a DUI task force officer does on a normal night of looking for DUIs. If they observed any objective signs that the driver had consumed alcohol, they were to administer the three standardized field sobriety tests. They didn't have to observe any indicators of possible impairment, just that the person had consumed alcohol, and they were only able to administer the three standardized field sobriety tests,
no other tests. So now let's dig into the numbers. Number one, there was a total of 297 drivers involved in the final data of this study. There was one additional driver who was initially involved in the study, but they were removed because they refused to submit to any chemical testing.
Now, in my opinion, that’s an absolute correct decision by the auth authors because if we’re determining the accuracy of these tests based on the person’s BAC, then we have to know what the person’s BAC is. So that’s not an issue. Here’s where the issue is, and this is my first red flag of this study, and there are many, but there are almost 300 drivers involved and only one, one.
Refuse to submit to any chemical testing that is a percentage less than. 1% of drivers refuse chemical testing. Now, in my experience, both as a law enforcement officer and now being an expert, I would say it’s around 50% of drivers refuse to submit to the chemical test. So this is one of those times where you immediately just raise your eyebrows and say, this doesn’t match up.
How did they get that many drivers to consent to testing? And these are people legitimately being arrested for DUI. This isn’t one of those studies where. NHTSA says that, um, if you blow over, we’re not arresting you. You know, the roadside studies, things like that. These are people legitimately being arrested for DUI.
Now, additionally, of the 297 drivers, their average BAC was a 0.12 of the drivers arrested. Their average BAC was a 0.15, and if the driver’s not arrested, their average BAC was below a 0.05. The reason that I’m bringing up all of these numbers is because we are trying to validate the test at their ability to discriminate who is at or above this threshold of 0.08 from those who are below it.
And so it stands to reason that the further a person gets away from this line, the easier we would expect the officer's arrest decision to become. So let's take a person with a 0.15 BAC, almost two times the legal limit. There's a good chance that this person is showing gross signs of intoxication: slurred speech, unsteadiness, staggering, things like that. So before the officer ever gets into the field sobriety tests, they're already suspecting, or maybe they already know, that they're arresting this driver for DUI.
And then we do the field sobriety tests and, lo and behold, the person gets arrested. Or, on the other side of the coin, we have the person who is below a 0.05, almost half the legal limit. Maybe the only thing they're showing is an odor of an alcoholic beverage. So before the officer administers the field sobriety tests, they already highly suspect that they are not arresting this driver for DUI.
And this is a big additional issue with not just this study but almost all of the studies that have been conducted on the field sobriety tests: the wide range of BACs likely makes the officer's arrest decision easy. And what I believe would occur is, if we took people between a 0.06 and a 0.1, the overall accuracy that occurred in this study would drastically drop, because in my experience that is the hardest area for an officer to discriminate: people at that very marginal BAC, where the question is are they slightly over or are they slightly under.
Now, what were the false positive rates for the San Diego study? For HGN, it was 37%. This means that one out of every three drivers who was below a 0.08 had four or more clues. Walk and turn, it was 52%. This means that if you were below a 0.08, it would be statistically more accurate for an officer to flip a coin than to administer that test.
One leg stand: 41% false positive rate. So again, it's almost as statistically accurate for an officer to flip a coin as to administer that test. And then lastly, when the officers made their ultimate arrest decision, the false positive rate was 28%. So this means that more than one out of every four drivers who, based on the parameters of this study, should not have been arrested were incorrectly arrested by the officers.
Now, additionally, officers did have access to PBTs, but NHTSA basically assures us that these PBTs did not impact the officers' opinions. How they assured that is that they had officers fill out this form. As you look at the form, what you can see is that they have to list how many clues they observed on each one of the tests as they're doing it, and then
we get to number four, where the officer has to write the estimate of the person's BAC, and below that they write the time of the estimation. Then we get to number five, and number five is where they write the PBT result and the time of the PBT result. So by ensuring that the time listed on number four was before the time listed on number five, that's how we assured that officers did not use the PBT to impact their opinions whatsoever.
To me, that's just insane, because to trust that officers could not have fudged the numbers, used the PBT, looked at their watch, said it's 12:15, and then written 12:14 for the estimation, is putting a whole lot of trust in the officers. And now let's look at the data that definitely brings questions into this.
And so what we have is, on HGN, there were 30 people who were false positives on that test. Now, in the study they addressed that officers mentioned that HGN was the silver bullet, and I know I was trained, and I trained other officers, that the eyes never lie, the eyes are the window of the soul. And I, as well as other officers, put a lot of weight on HGN in making arrest decisions.
Of these 30 people that HGN is telling officers to arrest, how many were actually arrested? Only 16. Almost 50% of the people that HGN is telling officers to arrest, for some reason, were not arrested. Now, is it possible that these people did perfectly on the walk and turn and one leg stand and the officer just had doubts about it?
Yes, that's absolutely possible. But to have that occur in almost 50% of the cases is definitely questionable. And if, ultimately, in even just one case the officer used a PBT to influence their opinion, it compromises the data of the whole study. On top of that, it also makes us question: is it possible that there were even more than 30 false positives, and the officers used the PBT and, after getting a PBT result,
wrote that they observed two clues or zero clues on HGN? This is, again, something we will never know, but like I said, the data definitely brings questions into whether the PBTs impacted the officers' opinions. Now, what are officers taught about the false positive rates from the San Diego study and all the studies in general?
So number one, yes, they are taught that false positives do occur, but they're not taught what the rate of false positives is. And even if the officer went above and beyond their training and actually read the San Diego study, which is not required reading, they would still not learn what the false positive rate is, because it's not actually published in the study.
[00:29:25] Understanding False Positive Rates
Joshua Ott: The only way to determine the false positive rate is for the officer to take the matrices provided in the study with the raw data and calculate the false positive rate for themselves. So, number one, they would have to understand how to calculate a false positive rate. Number two, they would have to really question what they've been taught and say, hey, I just wanna see these numbers for myself, and go above and beyond and do that.
[00:29:52] Misleading Training in ARIDE Manual
Joshua Ott: But here’s where it becomes the, the biggest problem of all is that in the ARIDE instructor manual. This box right there at the bottom that I have highlighted, it teaches officers that research has, um, demonstrated that officers are more likely to err on behalf of the defendant. So ultimately, what they’re teaching officers is that if an error occurs with these tests, it is most likely going to be you letting somebody go who should have been arrested versus incorrectly arresting somebody that should have been released.
Let’s look at what the data actually says.
[00:30:28] San Diego Study Data Analysis
Joshua Ott: So this comes from the San Diego study. This is the box actually dealing with the overall estimate of the person's BAC by the officer, so this is the officer's ultimate arrest decision. What we see is that there were 214 drivers involved in the study who had a BAC of 0.08 or above.
Out of those 214, only four of them were not arrested, which gives us a false negative rate of 1.9%; let's call it 2%, so a 2% false negative rate in the San Diego study. 83 people in the study were below a 0.08. Out of those 83, 24 of them were false positives. That gives us a false positive rate of 28.9%.
So ultimately, it shows that it is 14 times more likely that a false positive is going to occur than a false negative, and officers are trained the exact opposite. So we can obviously see the snowball effect that this has: an officer standing on the side of the road knows that they've been taught, hey, if these tests get it wrong, it's going to get it wrong to the benefit of the violator.
So I have leeway on this test: hey, the test is telling me to arrest the person; it's not going to be wrong that way, so I'm going to go ahead and arrest. And the reality is, it is clear that false positives far outpace false negatives. Additionally, like we've already stated, the SFSTs are not validated for impairment.
The San Diego study makes this very, very clear. It states that the only appropriate criteria and measure to assess the accuracy of the standardized field sobriety test is BAC, and measures of impairment are irrelevant because the test must be correlated with BAC rather than driving performance.
[00:32:24] HGN Test Validity Issues
Joshua Ott: It also talks about HGN having a lack of face validity, which means that HGN is not related to actually needing to operate a vehicle safely.
And it states that that is not the purpose of the test; the purpose of the test is, again, to discriminate who is at or above the legal limit from those who are below. And then, additionally, it again says that it lacks face validity because it's not required to operate a vehicle safely. And that reasoning is correct: they are, in this study, saying that HGN is not a requirement of operating a vehicle safely. But
it is based on the incorrect assumption that that's what the tests are designed to measure. It is very clear throughout this study that these tests are not validated for impairment, and yet every single day, like I said, officers are in court testifying to the exact opposite. And now I'm starting to see officers testifying that horizontal gaze nystagmus doesn't just indicate impairment,
it actually is impairment. There is absolutely no research that they have to back up those statements. And the research that they're using for these tests in general actually says the exact opposite: that HGN is not required to operate a vehicle safely. And these are the problems: officers are able to basically get away with this testimony,
number one, because I don't think judges understand how to truly gatekeep this information, but number two, because the lie has been told for so long that everyone thinks it's the truth that these tests indicate impairment. So this is the official name of the San Diego study: Validation of the Standardized Field Sobriety Test Battery at BACs Below 0.10 Percent.
[00:34:15] Peer Review and Credibility Concerns
Joshua Ott: Number one is this is not peer reviewed. The San Diego study has never been peer reviewed, which brings a serious question into why is this study then utilized to justify the standardized field sobriety test for a Dalbert standard or a fry standard when it has no peer reviewed to it. But there is a stu, oh, there is a paper written about the San Diego study that has been peer reviewed, and the name of that paper is what you see written below.
And if you look at that and you look at it quickly and you don’t truly compare the official name of the San Diego study in this paper, they look exactly the same, or at least they sound the same. And so I think that this confusion has led a lot of people to believe that the San Diego study has actually been peer reviewed.
The paper was only authored by Dr. Jack Ster, who was one of the two authors with Dr. Marline Burns, who, um, authored the San Diego study. But it brings into question, why was Dr. Burns not one of the authors of this peer review paper? And then on top of it, why would you need to write a paper to get peer reviewed and not just peer review the whole study.
You put all this work and effort into this whole study. If there’s no issues in the study, why would you not have the, the study peer reviewed? So I think those are major questions as well about the the San Diego study. Some additional things is the false positive rates were not published in the study.
So again, the only way to determine what the false positive rates are is to go in, look at the data, and calculate it for yourself. And then they made multiple statements to explain the false positives. The first statement is that they said that in several cases the officers were correct in identifying impairment.
Number one, they provide absolutely no science, data, or research to back up the statement. Number two, if we remember what I just talked about, the authors are the ones that state the only objective criterion to measure or assess the accuracy of the standardized field sobriety tests is BAC, and measures of impairment are irrelevant.
Well, it appears that we only apply that rule when we get the results that we want from the study; when we get false positives, now all of a sudden we throw that rule out the window. And on top of that, if we're going to err to the side that if an officer says a person's impaired, they're impaired, then there was no need for the study in the first place, because we don't need to study it.
We’ll just always say the officers are right. So obviously that’s an absolute. Insane statement that they’re making in that. And then number two is that they say case number 16 was a juvenile who was a 0.069, and that rendered the difference between their estimated BAC and measured BACS irrelevant and a zero tolerance jurisdiction.
That is, it was a correct arrest decision despite the BAC estimate. Here’s the deal. Yes, that was a correct arrest by the officers. The legal limit. This person was above the legal limit as a juvenile. But the test was still wrong. If the test is designed to discriminate, if a person is a 0.08 or above and it’s saying that this person was a 0.08 or above and they weren’t, then the test is wrong.
So again, as you see, they’re just making multiple, um, justifications of the false positives instead of just in a non-biased fashion reporting. Here’s the false positives based on our objective measurement. And then digging further into it and further studies to try and lower these false positives. But ultimately, NITSA has just sat on their laurels and not appeared to have any concern with what the false positive rates are of these tests whatsoever.
And obviously they don’t go out and ever tell anybody what the false positive rates are. And then the last thing that kind of comes out of the San Diego study, but it doesn’t officially come outta the San Diego study, but when an officer testifies to HGN and they say that it’s 88% accurate, the courts apply that across the board that this officer is just as accurate as any other officer.
Even when the officer is clearly not knowledgeable on the stand. We treat them all the same. So should that accuracy rate apply to all officers.
[00:38:36] Boating Under the Influence Study
Joshua Ott: This study, which was done for boating under the influence investigations when they were creating the seated battery, really illustrates that all officers are not even close to the same in their accuracy rate.
What it found was the officers had an average of almost 10 years' experience administering the roadside standardized field sobriety tests. So again, we're not talking about inexperienced officers; we're talking about only very experienced officers. How accurate were they? 20% of the officers involved in this study were less than 50% accurate
with HGN. That is insane. And yet in court we're treating all officers exactly the same every single time. And it's not one of those cases where some officers were 88% accurate and some were 85% accurate. We are talking about less than a coin flip of accuracy for 20% of the officers in this study.
[00:39:35] Robustness of HGN Test
Joshua Ott: So now moving on to the robustness of the Horizontal Gaze Nystagmus Test.
This study was published in 2007. Again, it was funded by the National Highway Traffic
Safety Administration and authored by Dr. Marcelline Burns. The purpose of this study was to address defense attorney arguments that when HGN is not administered correctly, it compromises the accuracy of the test. So what they did was they used, again, experienced officers, a total of seven officers, the same number as San Diego, and in a laboratory they dosed volunteers to different blood alcohol concentrations.
They then had the officers administer HGN to those volunteers. Now, there were three parts of the HGN test that they were looking at, to see, when an officer did it correctly versus incorrectly, if and how it would affect the results. The first was the speed for lack of smooth pursuit, so they had one officer do it at the correct speed
and one officer do it too fast. They looked at the height of the stimulus: one officer held it at what they considered the standard of two inches above eye level, which is gonna be right about the center of a person's forehead; one officer held it above the standard elevation, which was considered four inches above eye level, right about the top of a person's head; and another officer held it below the standard, at eye level, so zero inches.
And the last thing they looked at was the distance of the stimulus from the person's face. Correct was 12 to 15 inches, too close was 10 inches, and too far away was 20 inches.
Again, as you can see, these are the things that they were testing. What were the results? Just looking at the times in which the test was administered correctly, the false positive rate was 67%. So two out of every three people below a 0.08 had four or more clues on HGN. Don't worry, it gets even worse.
Below a 0.05, 65% of the people still had four or more clues on HGN. And where it gets very disturbing, and for some reason very weird, is that at the lowest BACs, people below a 0.03, 85% of them had four or more clues on HGN. So the false positive rate, for some reason, actually increases at the lowest BACs versus the range that's just slightly below the 0.05 range.
So for some reason it gets super high below a 0.03.
Aaron Olson: Where do you find this, Josh? In the paper?
Joshua Ott: What'd you say?
Aaron Olson: Where do you find that? Uh, is it in the table that she provides?
Joshua Ott: No, no. We'll get into her crappy table in a second. But it's actually a matter of taking the actual raw data and looking at each individual person.
So you're gonna have to take the people that are a 0.029 and below and take that total number of people, which gets you seven people to whom the test was administered correctly. And then when you look at what their BACs are and how many clues were observed, that's where you'll get that: there were six of them out of the seven that had four or more clues below a 0.03.
Additionally, six out of six clues was observed at a BAC as low as 0.029, which is very consistent with the San Diego study, which had six out of six at a BAC as low as 0.028. Now, looking at the variance of stimulus positioning:
when the stimulus was held too high, it was a 91% false positive rate. So basically the test is almost completely worthless, because it's almost always gonna be positive when the stimulus is held too high, regardless of what the person's BAC is. When it's held too low: 79% false positive rate. When it's held too close: again, pretty much worthless at a 92% false positive rate.
And when held too far away: an 84% false positive rate. Number one, I think this should make us ask: why is HGN still treated as this great test when the false positive rates are through the roof even when it's administered correctly? And number two: why is the manual not very specific with officers that, hey, you have to position the stimulus correctly, because if you do not, the false positive rate becomes extremely high?
Well, that's because Dr. Burns's ultimate statement in this study was that HGN is a robust procedure. Now, a couple things to point out with that: she does not say accurate, and she does not say reliable. But what she means by robust is, ultimately, regardless of how you administer the test, it's not going to compromise or affect the accuracy of the test.
How does Dr. Burns accomplish that, after you just saw all the data? Well, we've already talked about it multiple times: four or more clues indicates a BAC of 0.08 or higher. That has been the standard since the 1998 San Diego study. And remember, Dr. Burns is the one that taught officers this standard, so there's no chance that Dr.
Burns doesn't understand the standard. As of the 2025 edit to the SFST manual, this is still the same standard that applies today. In this study, Dr. Burns changed the standard, and this is copied and pasted directly from the study; I've highlighted the key areas. She stated that the criteria by which scores have been classified as correct, false negative, or false positive, and this is
the biggest lie possible, are "as defined in the SFST curriculum" and appear below, and that's the box that she supplies below. That box has never appeared in any SFST curriculum. Dr. Burns, I can't say that she came up with the curriculum, but she's definitely intimately familiar with the curriculum, and probably was helping create it if she didn't create it herself.
So she knows that this does not appear in the curriculum, and she knows that this is not the standard. But as you can see where the two arrows are, she says that four clues can indicate a BAC of 0.03 or higher. So by doing this, it drastically lowered the number of false positives that she had to report occurring in this study, and that is how she was able to come out with all the statements that she makes in this study.
That’s why we don’t have to talk about the 67% false positive rate. That’s why we don’t have to talk about the, the stimulus has to be positioned correctly because she fudged the ultimate data. And so in my opinion, what occurred in this study is that Dr. Burns took her opinions and then took the data and changed the standard to make it fit her opinions instead of.
Taking the data, applying the correct standard and forming her opinions on that. And so this brings serious questions into the credibility and integrity of Dr. Burns. And if that is an issue, all of a sudden we go back to these five studies out of six that were used to develop and validate the field sobriety test, and now we have an issue with the credibility and integrity of the author of all of those studies.
Now, what happens with the Robustness of the Horizontal Gaze Nystagmus Test study when we talk about it in court? What'll happen is they'll point out that in 2018, the Drug Recognition Expert TAP committee, the Technical Advisory Panel, decided to retract any mention of this study from all of their manuals.
So now if you mention this study, you’ll get attacked in court that you’re mentioning a study that has been retracted. Number one, it has only been retracted from the manuals. It has not been officially retracted. But number two is that it’s very self-serving of the DRE TAP committee to retract this study.
Because ultimately, in my opinion, this study is very exculpatory, and they are removing exculpatory evidence from being allowed. And what do we know from this study that we can verify? Number one, we have experienced officers. Number two, we know that the officers administered the test correctly.
Number three, we know what the person's BAC was. And number four, we know how many clues the officer observed on that person. The numbers are what they are: this is the number of false positives that occurred, any which way you wanna look at it. And the biggest issue that I have is, if you want to attack the study, then you have to attack the author of the study as well.
And they refuse to do that.
[00:48:47] 2023 Field Sobriety Test and Cannabis Study
Joshua Ott: So now the last study that we're gonna talk about is the most recent that I'm aware of: the 2023 field sobriety test and cannabis study. This was published in the Journal of the American Medical Association. To my knowledge, it's a peer-reviewed journal, a very important distinction from the San Diego study that, again, was never peer reviewed.
And one of the questions you might have is, well, why are we talking about a cannabis study for DUI tests in general? That's because in this study there's a placebo-dosed group. Looking at the placebo-dosed group, we're able to see how sober people perform on the field sobriety tests. So what they did in this study was they used marijuana users who, before being allowed in the study, had to be screened both physically and mentally.
Then, on the day of the study, they were given a drug and alcohol test. After that, they had to perform on a driving simulator. And then, after performing on the driving simulator, they were broken up into three different groups: one was a placebo-dose group, and of the other two groups, one received a high dose of THC and the other group received a low dose of THC.
We’re not gonna focus in on those two groups, just the placebo dose group. Then for the remainder of the day, they had to perform both the driving simulator and have multiple field sobriety tests administered to them throughout the day. The evaluators were certified drug recognition expert instructors.
So the top level of training that officers can have for DUI enforcement are the officers that they utilize in this study. These are the tests that were administered. So every one of the field sobriety tests, minus horizontal gaze nystagmus, was used for this study. The reason that horizontal gaze nystagmus was not used is that it’s not expected to be present for somebody who’s possibly under the influence of marijuana.
So it was removed from the study. Or not used in the study. And ultimately we have 63 placebo dosed individuals. Now, the one question we do have to ask ourselves is these are marijuana users. So even though they got placebo doses, were they high or were they having residual effects when they showed up?
Thankfully the study answers that question for us, and what it stated was that based on that initial driving simulator that occurred prior to any dosing, there was no evidence of these people having residual effects from prior marijuana usage. So all of the evidence of this study indicates that these people were sober when the field sobriety tests were administered to them.
So how did they do? Walk and turn: 56% false positive rate. So when you are sober, it is statistically more accurate for an officer to flip a coin than to administer the walk and turn. This is very consistent with the 52% in the San Diego study. Next we get to the one leg stand, where it's 37%. So one out of every three sober people was a false positive on this test as well.
Again, very consistent with the 41% false positive rate in the San Diego study. And lastly, when combined, the false positive rate was 28%. So more than one out of every four sober people was a false positive on both the walk and turn and the one leg stand. But ultimately, yes, this gives us data to prove it, but the majority of people thinking with common sense already know that sober people are going to have difficulty on the walk and turn and one leg stand; this just gives us data to support that.
Now, what's kind of funny in this study is that they make this statement that they were kind of surprised that, with officers knowing there were gonna be placebos, basically every time the person was ruled "FST impaired" on the field sobriety tests, the officers believed it was because they were under the influence of cannabis.
They were surprised by that. And I just think it's funny, because if you go through the training, especially ARIDE and DRE, that's basically what they're hammering home: that when these tests tell you to take somebody to jail, the test is right, because they indicate impairment, when obviously we know that they don't.
And they're surprised when officers arrest people because they do poorly on the field sobriety tests. And again, as we can see, normal sober people are going to have issues on these tests.
[00:53:17] Final Thoughts on Field Sobriety Tests
Joshua Ott: So my final thoughts are, number one, the SFSTs have not been validated to indicate impairment, but that testimony is provided throughout the United States daily. And every time, as an expert,
when you get on the stand and you talk about the field sobriety tests, this is a mistake that cannot be made. You cannot make that slip-up of saying "impairment" in regard to these tests. Do these tests indicate that there's possible impairment? Yes, but they do not actually indicate impairment. And every time I teach the standardized field sobriety test course to attorneys, I'm making sure to hammer that point home, because if judges are hearing it, regardless of who they're hearing it from, it's strengthening
that lie that's been told for a long time. And again, it's a huge uphill battle that we have to fight, and we might never be successful in it because it's been told for so long, but we can't have slip-ups and have anyone saying that these tests indicate impairment, because, again, they clearly do not.
The tests have very significant false positive rates, and officers, prosecutors, and judges are not aware of them. And again, this is a huge problem, that so much reliance is put on these tests. Every time I go to court and I'm testifying to the false positives, I am in a knock-down, drag-out fight with the prosecutor.
And a lot of times the judges aren't even believing it, because this is contrary to what they've been trained for so long. And then lastly, the JAMA study provides a lot of insight into how a sober person performs on the walk and turn and one leg stand, but ultimately, many elements still remain unaddressed. Age:
we can all probably agree that 40-year-olds, 50-year-olds, and 60-year-olds cannot balance and do things with their balance as well as they could when they were in their twenties. But we treat a 21-year-old the exact same way that we treat a 64-year-old on these tests. Additionally, just when I teach these classes and I'm demonstrating the walk and turn, the amount of difficulty I have now, in my forties, maintaining the instructional position, versus when I was an officer in my twenties and thirties and could just stand there all day, is very eye-opening.
Additionally, we have no idea how weather actually impacts the tests. What about surface conditions? Footwear? A person wearing tennis shoes versus a person wearing flip-flops, or heels that they choose not to take off, or barefoot. All of those things could possibly, and likely do, impact the tests, and yet we treat people exactly the same regardless.
And then lastly: injuries. Yes, the original research states that people with leg, back, or inner ear problems would have difficulty on the tests, but the argument that's always made to that is, well, they didn't actually research it. Well then, let's research it.
Let's find out what impact all of these different things have on the tests, because, yes, we have the JAMA study, but those people were screened ahead of time. Those are, you know, the best of the best of society, physically and mentally. That isn't, a lot of times, what officers are encountering on the side of the road.
So what happens to the false positive rates when we’re dealing with people with other conditions or in other surface or weather elements? Any questions?