The Receiver Operating Characteristic (ROC) curve is a well-known tool used to evaluate the performance of binary classifiers. Its history is clear. According to Wikipedia, it
"was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects in battlefields, starting in 1941, which led to its name ('receiver operating characteristic')"
Another source also states that the ROC curve was first used during the second world war to distinguish between enemy targets, or just noise by radar receiver operators (utah.edu, 2007).
A recent paper keeps the history more subtle by stating that
"The development of receiver operating characteristic (ROC) curves comes out of signal detection theory, which arose in part as a method to improve the accuracy of radar detection during World War II." (Junge & Dettori, 2018)
And they use two citations for this. One does contain curves (Marcum, 1947) but since none of the curves plot the true positive rate against the false positive rate, they are not ROC curves. The other does mention ROC curves (Peterson et al., 1954, Section 2.7), but not its history. So maybe the history is actually not clear?
It does provide one of the clearest explanations for the need of the ROC curve that I have seen:
"The problem of signal detectability treated in this paper is the following: Suppose an observer is given a voltage varying with time during a prescribed observation interval and is asked to decide whether its source is noise or is signal plus noise. What method should the observer use to make this decision, and what receiver is a realization of that method?"
And a clear explanation for the plot axes:
"By the very definition of the ROC curve, the X coordinate is the conditional probability, F, of false alarm, and the Y coordinate is the conditional probability of a hit."
What is nice about this definition is that it uses terms that are, in my opinion, more clear than true positive and false positive. This paper shows more generally that the authors understood ROC curves well:
"An optimum observer required to give a yes or no answer simply chooses an operating level and concludes that the receiver input required to give a yes or no answer simply chooses an operating level and concludes that the receiver input arose from signal plus noise only when this level is exceeded by the output of his likelihood ratio receiver."
As I understand it, the authors here describe a solution to the problem: determine a threshold (operating level) for the receiver input and respond yes if the input is above the threshold, otherwise no. Put differently, say the receiver input is a value of between 0 and 1, then the "optimum observer" can choose a threshold of, say, 0.55, and if the input is above 0.55, the receiver says yes, otherwise no.
The authors continue:
"Associated with each such operating level are conditional probabilities that the answer is a false alarm and the conditional probability of detection."
So here we do see some evidence that the ROC curve was used for interpreting radar signals since they use the words "false alarm" and "detection". Furthermore, they state that
"Graphs of these quantities, called receiver operating characteristic, or ROC, curves are convenient for evaluating a receiver. If the detection problem is changed by varying, for example, the signal power, then a family of ROC curves is generated."
From this, it sounds like the ROC curves were not necessarily used by radar operators but by engineers. I suspect this because the task of "evaluating a receiver" sounds like a one-time task and because the procedure is quite complex; especially in the description as provided in this paper.
Relatedly, engineers also have to decide on the waveform of the radar signal, how to amplify and denoise the response signal, and to convert this signal to a video signal (Cook, 1967). In Fig 1.1 of the book by Cook, there is a "detector and video amplifier" step before the "output display" step. This "detector" suggests that the radar signal is converted from a continuous signal to a binary signal. Although the word "detector" can also be used to measure how much of something there is, I think that's unlikely. It is unlikely that the "receiver operator" would need to inspect the strenght of the video signal and convert that to a binary decision via a threshold.
Furthermore, one of the references of the paper is a book published by Lawson and Uhlenbeck (1950) at MIT. Chapter 7 slightly argues in favor of the operator doing the thresholding. The book states:
"Since in the last analysis a human observer must judge, either visually or aurally, whether the signal is present or not, it is clear that some of the psychophysiological properties of the eye or the ear will influence the signal threshold. For instance, in the visual observation of a radar signal on an A-scope or PPI, enough light must be produced on the screen to make the display visible. In other words there is a brightness limit for the detection of a signal."
Interestingly, this book then continues by depicting a "betting curve" in Fig 7.2. This curve plots the signal strength on the x-axis against the successes in percent on the y-axis. So here we have almost a ROC curve because the y-axis seems to be identical to the true positive rate. The x-axis is not the false positive rate, but the signal strength. This suggests that the ROC curve was not known yet in 1950, or at least not with that name.
Another reference mentioned by Peterson is Davies (1952). Davies denotes \(\overline{p}(E|S)\) as the mean probability of received signal energy given that the signal is transmitted. This is effectively the true positive rate. Davies also used \(E/N_0\) to denote the received signal energy to noise ratio. This is effectively the \(\frac{\text{true positive rate}}{\text{false positive rate}}\). Davies then plots these in Fig. 1 and 2 of the paper. This also suggests that the ROC curve was not known yet in 1952, at least not at the Telecommunications Research Establishment in the UK, where the author worked at the time.
Dilip Sarwate (a retired professor of electrical and computer engineering) on stats.stackexchange.com agrees that the earliest reference that he knows of is from 1953, namely Woodward (1953). Indeed, in that book Chapter 8.2 dicusses F and L, where F is the
"probability, when the signal is absent, of what is called in radar a 'false alarm'"
and L is the
"probability, when the signal is present, of missing it. We may call L the 'loss probability'. The value 1-L is often called the probability of detection, [...] Radar designers are much concerned with F and L. [...] The right balance may well be influenced by prior probabilities, taken in conjunction with the kind of disaster which would result from mistakes in one or the other kind."
In other words, F is the false positive rate and L is the false negative rate. Next, these are plotted against each other in Fig. 22 of the book. Although Woodward calls this a LF error diagram, I think most people would agree that this is essentially a ROC curve. The only difference is the name and plotting the false negative rate instead of the true positive rate. Furthermore, Prof. Sarwate also adds that the concept was developed during World War II.
The Wikipedia page on the Chain Home (CH) radar system has some more information about the war period. It states that the first military radar system was operational in August 1938. To calibrate the cathode-ray tube (CRT) (read the display of the oscilloscope), known aircraft flew over a known landmark.
Contrary to my earlier argument saying that the conversion from continuous to binary was likely not done by the operators, Neale (1985) writes:
"It should be mentioned at this point that the great success of CH was due in no small measure to the incredible acquired skill of experienced operators, particularly the WAAFS (Women's Auxiliary Air Force). Signals at extreme ranges, well below 'noise' level, were detected and tracked. The mechanism by which this was achieved is still not fully understood but believed to be due to an unconscious form of pattern recognition within the noise structure, somewhat analogous to the 'cocktail party' effect. Also, unlike scanning (searchlight) radars, CH, being a 'floodlit' system, provided up-date at p.r.f. rate with a corresponding integration gain when using a CRT with a long persistence phosphor. Figure 6 illustrates typical performance achieved by experienced operators on an average size bomber such as the Heinkel 111."
From this writing, it appears that ROC curves were not used at the time to find the right threshold. Instead, the operators looked at the continuous signal and made the decision themselves.
Next, this signal would be sent from the radar station to the filter room. The filter room would combine the information and prefix it by an X. Once identified, the plane would be given an H for hostile or F for friendly prefix.
In some way, this makes sense since picking a threshold would remove a lot of information. Instead, the operators could use the low signal information, which would technically be below the threshold value, to make a decision. They most likely did this by not looking at one point in time but by noticing that some "noise" was actually having a regular pattern and behaving like a plane.
For a bit of background, this is what the operators would see:
By Radar Museum, NEDAD.2013.047.058A, CC BY-SA 3.0, Link
The radars at the time were unable to rotate, so the operators would see one line on the CRT. This line would could show a blip at multiple distances depending on whether the signal was reflected by an object (or noise).
Furthermore, Neale writes:
"Many ingenious devices, including optical converters, and calculators, too numerous to describe here, were introduced in the latter stages of the war which made the Chain Home system extremely efficient and reliable."
And that a, so called, Fruit Machine was used to make the work of the operators easier. Based on the writing by Neale, this machine would do various calculations for determining the location of the plane, but not for determining whether the signal was noise or a plane.
Another reference mentioned by Peterson et al. is Fox (1953). To reiterate, Peterson, Birdsall, and Fox appear to be the first mention of ROC curves, but they do not provide a history. In Fox's paper, however, the following is written:
"This curve will be called the Receiver Operating Characteristic (briefly, ROC) curve [...]"
It then continues to give an extensive definition and describes the value of the ROC curve. For the definition, the x-axis is the conditional probability of false alarm (false positive rate) and the y-axis is the conditional probability of a hit (true positive rate).
In the conclution, Fox further defends the use of the ROC curve by writing:
"In the absense of experimental verification of the accuracy of the ROC curve in predicting the performance of the optimum receiver, there is one remaining fact which could be interpreted as casting doubt on the reliability of the theory so far developed."
In the references, Fox attributes the idea of false alarm and detection probabilities to an example given by Kaplan & Fall (1951). Kaplan & Fall write:
"The advent of high-speed scanning radars which yield but a few pulses per scan element and displays which collapse the range co-ordinate has necessitated an analytical approach to the problem of radar range performance, as empirical methods have proven inadequate."
Furthermore, they write:
"It is the primary purpose of this paper to demonstrate a method of calculating the range performance of a radar on a statistical basis, that is, for a given range, the probability of detection of a target and the probability of occurrence of false target echoes may be stated. It is no longer necessary to use such quantities as maximum effective range or maximum useful range which are subject to wide variations in calculation and interpretation, or to ignore false target echoes and their consequences."
In the plots, various curves are shown; none of which are ROC curves. Like what Davies published a year later, most plots have the signal-to-noise ratio on one of the plot axes, and a clear focus on the applicability to radar systems.
The authors conclude by saying:
"The authors feel that the interpretation of range performance of a radar as a statistical quantity is a new and valuable one. Using the methods presented in this paper, the probability of detection of a given target at a particular range and the probability of occurence of false echoes may be determined."
And argue that their statistical method can be used for more accurate comparisons between existing radars and result in a systematic optimization of the radar system parameters.
Finally, one of the references by Fox in 1953 is a technical report by Peterson & Birdsall. This report, called technical report no. 13, is from June 1953. Here they talk about the receiver operating characteristic without clarifying whether they invented it or whether they are re-using it.
Contrary to what is stated on Wikipedia and other recent sources, there is no evidence to support that the ROC curve was used during World War II. Instead, British operators would manually look at the signal and determine whether they were looking at an enemy plane or noise. Furthermore, in 1950 a 400 pages book on threshold signals does mention a curve that is almost a ROC curve, but is not exactly it. Furthermore, other papers also mention similar ideas, but often have one or two differences in the choice of axes. Only in 1951, Kaplan & Fall introduce a statisical way to determine the range performance of a radar system. This example then provides Fox with the ideas of "false alarm probability" and "probability of detection". In 1953, Woodward also uses these ideas to plot a LF error diagram, which is essentially a ROC curve.
In June and December 1953, Peterson, Birdsall, and Fox appear to be the first to talk about a "ROC curve" and show its value. Therefore, the ROC curve was likely invented in the US somewhere between 1945 and 1953 by Fox, Peterson, and Birdsall at the US Army Signal Corps. Given that researchers in the UK were publishing similar ideas, it is likely that the ROC curve was known more widely, but not yet called as such.