Voiceprint identification can be defined as a combination of
both aural (listening) and spectrographic (instrumental) comparison of one or
more known voices with an unknown voice for the purpose of identification or
elimination. The technique was developed by Bell Laboratories in the late 1940s
for military intelligence purposes, but its modern forensic use did not begin
until the late 1960s, following its adoption by the Michigan State
Police. From 1967 until the present, more than 5,000 law enforcement related
voice identification cases have been processed by certified voiceprint
examiners.
Voice identification has been used in a variety of criminal
cases, including murder, rape, extortion, drug smuggling, wagering-gambling
investigations, political corruption, money-laundering, tax evasion, burglary,
bomb threats, terrorist activities and organized crime activities. It is part of
a larger forensic discipline known as acoustic analysis, which involves tape filtering
and enhancement, tape authentication, gunshot acoustics, reconstruction of
conversations and the analysis of any other questioned acoustic event.
The fundamental theory for voice identification rests on the
premise that every voice is individually characteristic enough to distinguish it
from others through voiceprint analysis. There are two general factors involved
in the process of human speech. The first factor in determining voice uniqueness
lies in the sizes of the vocal cavities, such as the throat, nasal and oral
cavities, and the shape, length and tension of the individual's vocal cords
located in the larynx. The vocal cavities are resonators, much like organ pipes,
which reinforce some of the overtones produced by the vocal cords; these
reinforced overtones appear as formants, or voiceprint bars. The likelihood that
two people would have all their vocal cavities the same size and configuration
and coupled identically appears very remote.
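The organ-pipe comparison above can be illustrated with the textbook uniform-tube approximation, which is not part of the original article and is offered only as a rough sketch. It treats the vocal tract as a tube closed at one end, with resonances at F_n = (2n - 1) * c / (4 * L). The function name, the tube lengths and the speed of sound used here are illustrative assumptions, not measurements of any speaker.

```python
def tube_resonances(length_m, speed_of_sound=350.0, count=3):
    """Resonant frequencies (Hz) of a uniform tube closed at one end.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    Both the model and the default speed of sound are simplifying assumptions.
    """
    return [(2 * n - 1) * speed_of_sound / (4 * length_m)
            for n in range(1, count + 1)]


# Assumed, illustrative tube lengths (meters); real vocal tracts are not uniform.
for length in (0.175, 0.150):
    freqs = tube_resonances(length)
    print(f"L = {length:.3f} m -> " + ", ".join(f"{f:.0f} Hz" for f in freqs))
```

Under these assumptions a 17.5 cm tube resonates near 500, 1500 and 2500 Hz, and a shorter tube shifts every resonance upward, which is the sense in which cavity dimensions shape the formant bars.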
The second factor in determining voice uniqueness lies in the
manner in which the articulators or muscles of speech are manipulated during
speech. The articulators include the lips, teeth, tongue, soft palate and jaw
muscles whose controlled interplay produces intelligible speech. Intelligible
speech is developed by the random learning process of imitating others who are
communicating. The likelihood that two people could develop identical use
patterns of their articulators also appears very remote.
Therefore, the chance that two speakers would have identical
vocal cavity dimensions and configurations coupled with identical articulator
use patterns appears extremely remote. While there have been claims that several
voices have been found to be indistinguishable, no evidence to support such
allegations has been published, offered for examination or demonstrated to the
Several studies have been published evidencing the ability to
reliably identify voices under certain conditions, and a Federal Bureau of
Investigation survey of its own performance in the examination of 2,000 forensic
cases revealed an error rate of 0.31 percent for false identifications and 0.53
percent for false eliminations. (See Koenig, B.E., 1986, Spectrographic Voice
Identification: a forensic survey, Journal of the Acoustical Society of America,
While there is disagreement in the so-called "scientific
community" on the degree of accuracy with which examiners can identify speakers
under all conditions, there is agreement that voices can, in fact, be identified.
To facilitate the visual comparison of voices, a sound
spectrograph is used to convert the complex speech waveform into a pictorial
display referred to as a spectrogram. The spectrogram displays the
speech signal with the time along the horizontal axis, frequency on the vertical
axis, and relative amplitude indicated by the degree of gray shading on the
display. The resonance of the speaker's voice is displayed in the form of
vertical signal impressions or markings for consonant sounds, and horizontal
bars or formants for vowel sounds. The visible configurations displayed are
characteristic of the articulation involved for the speaker producing the words
and phrases. The spectrograms serve as a permanent record of the words spoken
and facilitate the visual comparison of similar words spoken by an unknown
and a known speaker.
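The time, frequency and amplitude layout described above can be sketched in a few lines of code. The following is a minimal, dependency-free illustration of a short-time Fourier analysis, not the method used by any forensic spectrograph. The sample rate, frame length, hop size and the synthetic two-tone test signal are all assumptions chosen for the example, and the function names are hypothetical.

```python
import cmath
import math


def make_tone(freq_hz, sample_rate, n_samples):
    """Synthesize a pure sine tone (a stand-in for a recorded voice sample)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def spectrogram(samples, frame_len=64, hop=32):
    """Return one list of bin magnitudes per time frame.

    Frames run along time (the horizontal axis), bins along frequency
    (the vertical axis), and each value is a relative amplitude
    (the gray shading of a printed spectrogram).
    """
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        mags = []
        for k in range(n_bins):
            # Direct DFT of a single bin: slow, but needs no external library.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def peak_frequency(mags, sample_rate, frame_len):
    """Frequency in Hz of the strongest bin in one frame."""
    k = max(range(len(mags)), key=lambda i: mags[i])
    return k * sample_rate / frame_len


SAMPLE_RATE = 8000  # Hz; an assumed telephone-band sampling rate
FRAME_LEN = 64      # samples per analysis frame (125 Hz per bin here)

# Assumed test signal: a 1000 Hz tone followed by a 2000 Hz tone.
signal = make_tone(1000, SAMPLE_RATE, 512) + make_tone(2000, SAMPLE_RATE, 512)
spec = spectrogram(signal, FRAME_LEN, hop=32)
first_peak = peak_frequency(spec[0], SAMPLE_RATE, FRAME_LEN)
last_peak = peak_frequency(spec[-1], SAMPLE_RATE, FRAME_LEN)
print(len(spec), "frames;", first_peak, "Hz at start,", last_peak, "Hz at end")
```

In this toy signal the strongest bin moves from 1000 Hz to 2000 Hz between the first and last frames, the numerical counterpart of a horizontal bar stepping upward on the printed display. Real speech produces several such bars at once, and examiners compare their patterns rather than a single peak.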
The acoustic environment in many cases can be controlled at the
receiving end of the speech signal. Shutting off the radio, television or other
noise-generating devices will reduce or eliminate unwanted background
speech or noise. While not always possible, the investigator should attempt to
select a reasonably quiet environment for controlled activities such as drug
buys or other illegal operations being investigated. Many times these types of
activities are carried out in bars, restaurants, car washes, billiard rooms and
the like, and the investigator cannot always dictate the location.
An investigation may require the recording of telephone conversations or
face-to-face encounters under a variety of acoustic conditions in which someone
is wearing a body recorder or transmitting the conversation via radio frequency
to a remote location. Unfortunately, in many cases the investigators cannot
control the acoustic environment. In situations involving an adverse
environment, investigators should use high technology stereo equipment to
optimize recording capability.
The attempt to produce samples as parallel to the unknown as
possible actually assists the examiner in his task because speaker variables are
reduced to a minimum. Numerous studies have been conducted that indicate very
reliable decisions can be made by trained professional examiners when samples
are obtained in the manner described.
The notion proposed by some opponents that duplicating the
unknown as closely as possible may cause error is not supported by any available
evidence. Research studies have produced strong evidence that even very good
mimics cannot duplicate another's speech patterns.
In an attempt to obtain proper speech samples, investigators
should not hesitate to ask suspects for the samples they need. Surprisingly,
many suspects will voluntarily give a sample of their voice for comparison.
In the event you are dealing with some type of vocal disguise,
attempt to obtain a similarly produced known exemplar in addition to the
suspect's normal voice. It should be noted that vocal disguises can be very
difficult for the examiner to deal with and the probability of determination is
less than with normal voice samples.
If a suspect refuses to cooperate with the investigator, a
court order may be acquired compelling the suspect to produce voice recordings
for the purpose of comparison. Courts have repeatedly held that requiring the
accused to submit voice exemplars for the purpose of comparison for
identification or elimination does not violate the suspect's Fifth Amendment
rights. In United States v. Wade, 388 U.S. 218 (1967), the Court held that the
privilege against self-incrimination offers no protection from compulsion to
submit to speaking for purposes of voice identification, or to writing,
photographing, fingerprinting and measurements.
Several problems have been encountered in obtaining known
voice exemplars even with the use of a court order. If the court order is vague,
the suspect may utter a few words of the text involved, speak too softly, too
fast, or too slowly, or otherwise disguise the sample and claim compliance with
the order.
To prevent such problems, the investigator is wise to request
that the court order specify in detail that the suspect give a sample of his or
her voice, repeating the phrases of the questioned call in a natural
conversational voice (or in a similar disguise, if that is the case) and that
such sample shall be given at least three times and to the reasonable
satisfaction of the investigator. Voice exemplars obtained with such specific
instructions are usually very satisfactory for comparison purposes.
Before terminating the recording session, check the recording
to determine whether or not a satisfactory exemplar was obtained. Remember
that once a suspect is released, a second known sample may be very difficult to
obtain.
Whatever the recording circumstance, background noise and the
distance between the talker and the receiving device should be minimized for
optimal recording. Good quality tape recording equipment should be used, as well
as magnetic recording tape. As a rule of thumb, recording tape with standard 120 µs
equalization, normal bias and no more than a 5 dB drop at 6 kHz should be used.
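To put the 5 dB figure in linear terms, the short sketch below converts a decibel change into amplitude and power ratios and checks a measured response against the rule of thumb. The function names are hypothetical, and the example measurement values are invented for illustration. The article does not state what reference level the drop is measured against, so the comparison with a lower-frequency reference level is an assumption.

```python
def db_to_amplitude_ratio(db):
    """Convert a level change in dB to a linear amplitude (voltage) ratio."""
    return 10 ** (db / 20)


def db_to_power_ratio(db):
    """Convert a level change in dB to a linear power ratio."""
    return 10 ** (db / 10)


def meets_rule_of_thumb(reference_level_db, level_at_6khz_db, max_drop_db=5.0):
    """True if the response at 6 kHz is no more than max_drop_db below reference.

    The reference level (for example, the response at a lower test frequency)
    is an assumption; the article does not specify it.
    """
    return (reference_level_db - level_at_6khz_db) <= max_drop_db


print(f"-5 dB amplitude ratio: {db_to_amplitude_ratio(-5):.3f}")
print(f"-5 dB power ratio:     {db_to_power_ratio(-5):.3f}")

# Invented example measurements, in dB relative to the same reference point.
print(meets_rule_of_thumb(0.0, -3.5))  # a 3.5 dB drop is within the limit
print(meets_rule_of_thumb(0.0, -7.0))  # a 7 dB drop exceeds the limit
```

A 5 dB drop leaves roughly 56 percent of the original signal amplitude, or about 32 percent of the power, at 6 kHz.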
After the development of a suspect, the next task is to
properly obtain known voice samples for comparison purposes. Do not hesitate to
ask a suspect for a speech sample. If the suspect refuses, a court order may be
obtained requiring compliance with the request. See Schmerber v. California,
384 U.S. 757 (1966), and Gilbert v. California, 388 U.S. 263 (1967).
Both are landmark cases. There are also many additional decisions at both
state and federal court levels that may be cited to support such a request.
Court orders should clearly spell out the minimum number of samples to be
obtained, the manner of speech, and the method to be employed.
The next task for the investigator is to obtain proper speech
samples for comparison purposes. Probably the best guide here is attempting to
duplicate the recording of the questioned call. Known samples should be obtained
via the telephone and recorded in the same manner as the questioned call. If
possible, the same recorder and telephone pickup should be used. In some cases,
even the same telephone has been employed. If there is room on the questioned
tape, the known sample may be placed on it. If there is not, another tape of the
same type and brand should be used if at all possible.
Speech samples obtained should contain exactly the same words
and phrases as those in the questioned sample because only like speech sounds
are used for comparison. Because the voice, like handwriting, is dynamic and
variant, several samples of each spoken phrase are desired for analysis. Unless
the questioned call sounds like a read statement, the suspect should not be
allowed to read the phrases from a transcript but should repeat each phrase
after it is spoken by someone else. To avoid an unnatural verbal response, the
suspect should repeat the first phrase and proceed in the same manner with each
When all phrases have been recorded, the same procedure should
be repeated at least two more times beginning with the first word or phrase. The
suspect may be asked to read the phrases if a very poor job of repeating is
done. Some people do a better job of reading than repeating the phrases.
It is important that the known sample be spoken in the same
manner as the questioned sample; therefore, the investigator should be familiar
with the voice, manner of speech and the text. If the caller's voice was
disguised, the suspect should give a normal sample and a disguised one as in the
Recorded evidence should be wrapped in tinfoil to protect it
from possible contact with a magnetic field if it is submitted by mail. The
evidence should be shipped in a secure container that will prevent the evidence
from tearing through the packaging material. Do not submit a copy of your
investigative report with the evidence. The examiner does not want to know the
details of the case. It is important, however, to provide the examiner with
information regarding the recording method, the number of calls and suspects
involved, and any other information that may assist the examiner in the
examination of the evidence.
Upon receipt of the evidence by the laboratory, it is properly
marked and a case number is assigned. The analysis and comparison of known and
questioned voice samples may take several hours or days to complete, depending
on the number of samples involved and the complexity of the examination. Both
aural (listening) and visual (spectrographic) examinations and comparisons are
conducted. Aural and spectrographic cues examined should complement one another
in the event the voices are in fact the same.
As with the identification of fingerprints, there is presently
no universal standard for the number of words required for identification. The
minimum does, however, vary from 10 for some agencies to 20 for others.
The Internal Revenue Service has chosen to use 20 or more like speech sounds
between an unknown and known sample with the degree of certainty based on
quality and excellence of the evidence examined. Obtaining a second, independent
decision is standard practice in this field as in other forensic sciences.
Visual comparison of spectrograms involves, in general, the
examination of spectrograph features of like sounds as portrayed in spectrograms
in terms of time, frequency and amplitude. Specific features, the result of
producing consonants, vowels and semi-vowels in isolation or in combination
(co-articulation), include, but are certainly not limited to, the following clues:
pitch, bandwidth, mean frequency, trajectory of vowel formants, distribution of
formant energy, nasal resonance, stops, plosives, fricatives, pauses,
interformant features and other idiosyncratic and pathological features.
Special aural comparison tapes are prepared facilitating
comparison of psycholinguistic features via short-term memory. Aural cues
compared include resonance quality, pitch, temporal factors, inflection,
dialect, articulation, syllable grouping, breath pattern, disguise, pathologies
and other peculiar speech characteristics.
Some agencies offer court testimony, others do not. The IRS
laboratory is the only federal agency that presently offers testimony. All other
certified examiners, whether in state agencies or in private practice, also
offer court testimony.
Court testimony involving aural-spectrographic voice
comparison essentially started having an impact on the courts after the Tosi
Study in December 1970. Since then there have been between 150 and 200 trials in
local, state or federal courts. Because of differences in evidentiary
philosophy, some courts have admitted aural-spectrographic voice
evidence and others have not.
There are two general "rules" or "standards" by which
scientific evidence is accepted in courts of law in the United States. The
first, commonly referred to as the Frye "rule" or "test," is based on a 1923
District of Columbia case and basically requires "general acceptance in the
particular field in which it belongs." See Frye v. United States, 54 App. D.C.
46, 293 F. 1013 (1923). The second is based on the argument of McCormick (see
"McCormick on Evidence," 3rd Ed., § 203 at 608). McCormick states: "General
scientific acceptance is a proper condition for taking judicial notice of
scientific facts, but it is not a suitable criterion for the admissibility of
scientific evidence. Any relevant conclusion supported by a qualified expert
witness should be received unless there are distinct reasons for exclusion." See
Rule 702 of the Federal Rules of Evidence.
Many state and federal courts have abandoned Frye and adopted
the argument of McCormick. The supreme courts of Minnesota, Maine, Ohio and
Rhode Island have admitted aural-spectrographic voice evidence following
McCormick. Intermediate appellate courts in California, Maryland and Michigan
admitted such evidence following Frye but were reversed by their respective
supreme courts, which held that the Frye test had not been met. The
Massachusetts Supreme Court held aural-spectrographic voice evidence admissible
applying the Frye test, while those of Arizona, Indiana and Pennsylvania did
not.
In the federal court system, we are aware of 30 trials in which
the question of aural-spectrographic voice evidence was addressed. All but three
admitted the evidence based on Frye or McCormick. On appeal, the Second, Fourth
and Sixth Circuits held the evidence admissible, applying McCormick, while the
District of Columbia Circuit did not, applying Frye. See United States v. Williams,
583 F.2d 1194 (2d Cir.), cert. denied 439 U.S.
1117 (1978); United States v. Baller, 519 F.2d 463 (4th Cir.), cert. denied
423 U.S. 1019 (1975); United States v. Franks, 511 F.2d 25 (6th
Cir.), cert. denied 422 U.S. 1042 (1975); and United States v. McDaniel,
538 F.2d 408 (D.C. Cir. 1976).
In United States v. Williams, supra at 1198, the court
said: "The 'Frye' test is usually construed as necessitating a survey and
categorization of the subjective views of a number of scientists, assuring
thereby a reserve of experts available to testify. Difficulty in applying the
'Frye' test has led a number of courts to its implicit modification." Also see
United States v. Baller, supra at n.6.
Since 1970, aural-spectrographic voice identification has been
reliably applied in the forensic investigation of several
thousand cases. While there is disagreement on the reliability of the method
under all conditions, there is agreement that voices can be identified and
eliminated when the proper conditions exist and the analysis is carefully
conducted by qualified examiners.
Several state appellate and supreme courts have admitted the
evidence, as have three of four federal appellate courts. The United States
Supreme Court has refused to review and decide the three cases brought before
it. While the admission of aural-spectrographic voice evidence continues to be
decided in various courts, the method continues to be a very important tool in
the arsenal against crime.
Other areas of acoustic analysis include, in part, gunshot
analysis, tape enhancement and tape authentication. While not discussed in this
article, it should be noted that laboratory analysis related to these problems
is available in some laboratories.
By: Steve Cain