Assessing the Accuracy of Nitrate Concentration Data for Water Quality Monitoring Using Visual and Cell Phone Quantification Methods

In this methodological project we tested the accuracy of two systems used to quantify results obtained using the Hach© nitrate strip for water quality volunteer monitoring programs. The test strip determines nitrate concentration in accordance with the Beer-Lambert law, as increased nitrate concentrations result in greater color intensity on the strip’s sampling pad. In this study, first-time volunteers estimated nitrate concentrations with the test strip, either visually or using the Deltares Nitrate App, a smartphone application that uses the phone’s camera as a spectrometer. Results from two different series of tests indicate that volunteers using visual methods produce the more accurate results. Although cell phone apps might have the potential to increase data quality for colorimetric assays such as the one employed by the Hach© nitrate test strip, the current technology was not an improvement relative to visual interpolation.

CORRESPONDING AUTHOR: Melissa Topping, University of Idaho, US


INTRODUCTION
Cultural eutrophication, or the enrichment of water bodies with nitrogen and phosphorus from anthropogenic sources, is a major water quality problem worldwide (Carpenter et al. 1998; Malone and Newton 2020), with eutrophic waterways increasing in frequency and severity (Malone and Newton 2020). Unfortunately, many monitoring agencies are facing resource constraints in the form of personnel, time, and budget limitations that prevent them from adequately addressing pressing environmental issues (Conrad and Hilchey 2011; Wyeth et al. 2019) such as cultural eutrophication. The recruitment of citizen scientists has been suggested as one method to supplement agency monitoring (Cohn 2008; Hadj-Hammou et al. 2017; Thornhill, Chautard, and Loiselle 2018; Wyeth et al. 2019); however, if such data are to be trusted by researchers and regulatory agencies, it is necessary that volunteers produce data that are as accurate as possible (Jollymore et al. 2017).
One tool that is currently available for public evaluation of nitrate concentration is the Hach© Nitrate strip. The test strip is a colorimetric assay that is a modification of the Griess reaction (Nelson, Kurtz, and Bray 1954), performing according to the Beer-Lambert law, such that an increase in color intensity is proportional to an increase in nitrate concentration. As such, the reported accuracy of the strip is primarily limited by the sensor that quantifies the change in intensity, in this case the human eye, not by the chemical reaction within the strip itself. In this methodological paper, we tested the accuracy of the human eye versus a camera-based smartphone application in judging the intensity of color of nitrate strips.
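For reference, the Beer-Lambert law invoked here is conventionally written as shown below, where $A$ is absorbance, $\varepsilon$ is the molar absorptivity of the colored product, $\ell$ is the optical path length, and $c$ is the analyte concentration. These symbols are the standard textbook ones and are not taken from the original text; treating a reflective test-strip pad this way is an idealization.

```latex
A = \varepsilon \, \ell \, c
```

With $\varepsilon$ and $\ell$ held fixed, $A$ scales linearly with $c$, which is the proportionality between color intensity and nitrate concentration described above.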
Citizen scientists have commonly quantified results obtained with the Hach© Nitrate test strip visually (Loperfido et al. 2010; Muenich et al. 2016; Ali et al. 2019). Recently, it has been proposed that cell phones can increase the quality of data collected by citizen scientists (Burke et al. 2006), which may include increasing the accuracy of data collected using the Hach© Nitrate strip. This assertion is supported by the fact that technological advancements in smartphone cameras now allow them to be used as relatively low-cost spectrometers (McGonigle et al. 2018).
The objective of this work was to assess the accuracy of citizen scientists quantifying results from the Hach© Nitrate test strips with and without the addition of the Deltares smartphone application. In one set of tests, results from volunteers who visually estimated two different nitrate concentrations were compared with the results of volunteers who used the Deltares smartphone app, a platform designed by Deltares, a surface and subsurface water research institute (https://www.deltares.nl/en/). In a second series of tests, volunteers quantified nitrate concentration in a number of solutions that spanned the range that the Hach© Nitrate strip can detect (0-50 mg/L or ppm NO3-N). Results suggest that, in both cases, the phone app did not increase the accuracy of the results.

RECRUITMENT OF CITIZEN SCIENTISTS
Citizen scientists were recruited from various populations through the coordination of 12 separate testing events in Idaho and Washington. These events were hosted on university campuses, at local scientific meetings, and with high school groups from February 2019 to February 2020. In a manner consistent with Ali et al. (2019), participating volunteers had varied skill levels and backgrounds (Table 1). Furthermore, while water resource professionals (experts) were included in the population of volunteers, it was not expected that their findings would be more accurate than those of volunteers with less experience in the water resources field (Ali et al. 2019). The events ranged in participation from 3 to 23 volunteers, with a total of 142 citizen scientists participating in the testing.

NITRATE TEST STRIPS
Volunteers measured nitrate concentrations using Hach© test strips. When quantified visually, these test strips have been used in a variety of citizen science monitoring programs (Loperfido et al. 2010; Muenich et al. 2016) and have been validated previously in laboratory-controlled experiments (Ali et al. 2019).

SAMPLE PREPARATION
Each testing event required volunteers to quantify nitrate concentrations in prepared spiked deionized water samples. All the nitrate samples were prepared using KNO3 and preserved with sulfuric acid. New solutions were made for each of the 12 testing events, and each concentration was confirmed analytically either by the University of Idaho's analytical laboratory or through use of an in-house discrete analyzer (Seal AQ400; Method: EPA-114-C).

STUDY DESIGN
Two experiments were conducted to address the objectives of this study. The first aimed to assess the accuracy of data produced by volunteers using app and visual methods when evaluating two different laboratory-prepared nitrate concentrations. The second experiment quantified a continuum of laboratory-prepared nitrate concentrations with the intent of evaluating how well visual volunteers and those using the Deltares Nitrate App could quantify the results.

Experiment 1: humans versus smartphone app
To accomplish the goal of the first experiment, 132 volunteers were provided a nitrate sample and instructed to quantify the concentration either visually or using the app. The app volunteers used an iPad equipped with the Deltares Nitrate App to quantify the concentration of their sample. The visual volunteers quantified their sample visually using the colorimetric scale provided by the test strip instructions. All of the volunteers conducted their testing indoors, under similar lighting conditions, because the Deltares Nitrate App can be influenced by poor or inconsistent light. Volunteers were given a water sample and written, but not verbal, instructions to follow. The written instructions emphasized among other things that strip intensity was time sensitive, and that failure to accurately control the timing of the incubation could lead to inaccurate results.
When quantifying the nitrate strip, visual volunteers assigned the color to one of the distinct color categories provided on the side of the bottle. In contrast, volunteers quantifying the strip using the app generated concentrations along a continuous scale. To directly compare the results from both tools, it was necessary that both categorical and continuous data be collected by volunteers using each of the app and visual methods. Consequently, in an initial battery of tests, some of the volunteers that collected their data visually were asked to interpolate between the different categorical bins and generate an integer concentration between 1 and 50 ppm, while the remaining volunteers categorized their data into one of the discrete bins. Similarly, half of the app volunteers recorded continuous data, while the remaining volunteers assigned their data to one of the discrete categories indicated on the Hach© bottle.
Of the recruited volunteers, 66 app and 66 visual samplers participated in this part of the project (Table 2). All of the volunteers were asked to quantify the concentration of a water sample, which was either 2 or 15 ppm NO3. The 2 ppm concentration was chosen because it was a categorical option for those testing visually, while 15 ppm was chosen because it was equidistant between two categorical options: 10 and 20 ppm. For the 2 ppm solution, responses were considered accurate if the result fell between the two flanking categories (i.e., 1.1 to 4.9 ppm). Values below 1.1 ppm were considered underestimates, and values above 4.9 ppm were considered overestimates. For the 15 ppm solution, responses were considered accurate if the result fell within the range bounded by the two categories that flanked 15 (i.e., 10 to 20 ppm). Values below 10 ppm were considered underestimates, and those above 20 ppm were considered overestimates.
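The accuracy rules above can be sketched as a small function. This is only an illustration of the stated thresholds: the function name and the choice to treat both ends of each window as inclusive are assumptions, not part of the original protocol.

```python
def classify_response(response_ppm: float, true_ppm: float) -> str:
    """Label one volunteer response using the accuracy windows given in the text."""
    # Accuracy windows from the study design, keyed by the true concentration (ppm).
    # ASSUMPTION: both ends of each window count as accurate.
    windows = {2: (1.1, 4.9), 15: (10.0, 20.0)}
    if true_ppm not in windows:
        raise ValueError("only the 2 ppm and 15 ppm test solutions are covered")
    low, high = windows[true_ppm]
    if response_ppm < low:
        return "underestimate"
    if response_ppm > high:
        return "overestimate"
    return "accurate"


# A categorical response of 5 ppm for the 2 ppm solution is an overestimate,
# while a response of 20 ppm for the 15 ppm solution still counts as accurate.
print(classify_response(5, 2))
print(classify_response(20, 15))
```

Under these rules a categorical volunteer can only be accurate at 15 ppm by choosing the 10 or 20 ppm bin, whereas a continuous response anywhere from 10 to 20 ppm qualifies.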
Experiment 2: quantifying a continuum of nitrate concentrations
The second experiment was conducted to understand how each tool performed when tested on a continuum of nitrate samples. Ten individuals were tasked with quantifying a continuum of 25 randomized nitrate samples (

STATISTICAL ANALYSIS
In the first experiment, Fisher's exact tests were used to determine if data type had any impact on the accuracy of results, and to determine whether there was any difference in accuracy between the app and visual volunteers when combining results across both concentrations. The second experiment was conducted with the intent of understanding how each analytic tool performed when quantifying a continuum of nitrate samples. It was necessary to transform the continuous samples used by the volunteers using the app into the same categories that the visual volunteers used. To do this, the continuous values were binned into corresponding categories as per Muenich et al. (2016), who similarly compared continuous lab samples to categorical field samples (Table 3). These data were plotted and analyzed statistically using Spearman's correlation to determine the relationship between the binned continuum samples and the volunteers' recorded categories.
For the app volunteers, the responses did not need binning because both data types were continuous. The volunteer data were plotted against the true nitrate sample concentrations and fitted with a linear regression model.
All statistical analyses for this project were performed using either JMP (v. 14.0) or Microsoft Excel (v. 16.33) software with α = 0.05.

EXPERIMENT 1: HUMANS VERSUS SMARTPHONE APP
Visual volunteers
Thirty-four volunteers quantified a 2-ppm nitrate solution using visual methods. Of the 34 volunteers, 18 estimated concentration by category, and 16 estimated concentration by interpolation to a continuous scale. Categorical volunteers were accurate 89% of the time, and continuous volunteers were accurate 75% of the time. The proportions of accurate to inaccurate results were not significantly different (Fisher's exact test, p = 0.3872) between the two quantification methods. Interestingly, all the volunteers that were incorrect overestimated the concentration. The app volunteers produced a wider range of overestimation values (5-36 ppm) than the visual volunteers, who were closer to the true concentration (5 ppm).
Thirty-two volunteers quantified a 15-ppm nitrate solution using visual methods. Of the 32 volunteers, 17 estimated concentration by category, and 15 estimated concentration by interpolation to a continuous scale. Continuous volunteers were accurate 80% of the time, and categorical volunteers produced 100% accurate results. The proportions of accurate to inaccurate results were not significantly different (Fisher's exact test, p = 0.0917) between the two quantification methods. In contrast to the samplers testing at 2 ppm, continuous volunteers both underestimated and overestimated the nitrate concentration.
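The two Fisher's exact tests reported for the visual volunteers can be illustrated with a short, standard-library-only sketch. The 2x2 counts used here are inferred from the reported group sizes and percentages (16 of 18 and 12 of 16 accurate at 2 ppm; 17 of 17 and 12 of 15 accurate at 15 ppm) and are not stated as counts in the text. The function is a generic two-sided implementation, not a description of the JMP or Excel procedure the authors used.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    # Number of ways to fill the first column given the fixed margins.
    total = comb(row1 + row2, col1)

    def weight(x: int) -> int:
        # Unnormalised hypergeometric weight of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed table.
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / total


# Rows: categorical, continuous. Columns: accurate, inaccurate.
# Counts are INFERRED from the reported percentages, not quoted from the paper.
p_2ppm = fisher_exact_two_sided(16, 2, 12, 4)
p_15ppm = fisher_exact_two_sided(17, 0, 12, 3)
print(round(p_2ppm, 4))   # 0.3872
print(round(p_15ppm, 4))  # 0.0917
```

Working with integer binomial coefficients keeps the "no more probable than observed" comparison exact, which avoids the floating-point tie problems that can affect two-sided exact tests.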
When the results from all 66 volunteers were combined and analyzed together, there were no significant differences in accuracy between the visual volunteers who estimated nitrate concentration by category relative to those who estimated concentration by interpolation to a continuous scale (Fisher's exact test, p = 0.719).

App volunteers
Thirty-three volunteers quantified a 2-ppm nitrate solution using the Deltares Nitrate App, with 17 volunteers
When the results from all 66 volunteers were combined and analyzed together, there were no significant differences in accuracy between the app users who estimated nitrate concentration using categories relative to those who estimated concentration by interpolation to a continuous scale (Fisher's exact test, p = 0.1387).

Visual versus app volunteers
To compare the two analytic tools, all the responses were pooled into accurate response or inaccurate response from the 66 visual and 66 app volunteers, regardless of data type or concentration. The proportions of accurate to inaccurate responses were determined for each tool and then were analyzed using a Fisher's exact test. The results indicate that volunteers using visual methods are statistically more likely to be accurate than their app-testing counterparts (Fisher's exact test, p < 0.00001) (Figure 1). The data were further broken down into the proportion of accurate to inaccurate responses at the two concentrations, and were then analyzed using a Fisher's exact test. The findings indicate that at 2 ppm, results from the visual and app volunteers were not statistically different (Fisher's exact test, p = 0.1036) from each other, whereas at 15 ppm, the visual volunteers were statistically more likely to be accurate than their app-testing counterparts (Fisher's exact test, p < 0.00001*).

EXPERIMENT 2: QUANTIFYING A CONTINUUM OF NITRATE CONCENTRATIONS
App volunteers
The second experiment was conducted with the intent of understanding how each analytic tool performed when quantifying a continuum of nitrate samples. The continuous data produced by the app volunteers were plotted against the true nitrate sample concentration (Figure 2). A linear regression model explained more than 75% of the total variation in the data (y = 1.0369x + 4.9569, R² = 0.77, p = 0.0011*).
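As a rough illustration of what the fitted line implies, the sketch below evaluates it at a few concentrations. It assumes that x is the true concentration and y is the app reading, both in ppm, which follows from the plotting description but is not stated explicitly; the function name is hypothetical.

```python
SLOPE = 1.0369      # fitted slope reported in the text
INTERCEPT = 4.9569  # fitted intercept reported in the text (ppm)


def fitted_app_reading(true_ppm: float) -> float:
    """App reading predicted by the reported regression line (ASSUMES y = app, x = true)."""
    return SLOPE * true_ppm + INTERCEPT


for true_ppm in (0, 2, 15, 50):
    reading = fitted_app_reading(true_ppm)
    # The offset is how far the fitted reading sits above the true value.
    print(f"true {true_ppm:>2} ppm -> fitted {reading:.2f} ppm "
          f"(offset {reading - true_ppm:+.2f})")
```

Because the slope is close to 1 and the intercept is close to 5 ppm, the fitted line sits roughly 5 ppm above the true value across the tested range.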
Regression residuals were calculated by subtracting the volunteer's continuous response from the true nitrate concentration. Of the 125 concentration estimates, 18 were underestimations, 14 were accurate, and 93 were overestimations. These observations were compared against a uniform proportion of expected results (33% for each category). A chi-square test of independence was then performed on these proportions to examine the relationship between the sign of regression residuals (positive, negative, or zero) and the expected values. The relationship between these variables was significant (χ² = 42.12, p < 0.00001*), indicating that the distribution of residual signs differed significantly from the uniform expectation. Most of the residuals were negative, indicating that the app tends to overestimate.

Visual volunteers
To determine the accuracy of volunteers visually measuring nitrate across a continuum of concentrations, the continuous sample concentrations were binned into one of the six existing Hach© categories as per Muenich et al. (2016) (Table 3, Figure 3). The categorical results that the volunteers generated were then compared with the actual concentrations after the data were binned (Figure 4). There was a strong positive correlation (Spearman rank correlation, ρ = 0.8735, p < 0.0001*) between the true binned concentrations and the categorical estimates of the volunteers.
Regression residuals were calculated by subtracting the volunteer's response category from the true concentration bin. These data were plotted against the actual concentrations. For the lowest four categories, not including zero (1-4), residuals were off by only one category. The higher concentration bins displayed higher residual ranges, indicating that as nitrate concentration increased, so did the range of categories that were recorded by the volunteers.
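The binning and category residual described above can be sketched as follows. The category labels (0, 1, 2, 5, 10, 20, and 50 ppm) are the Hach values; the bin edges are an assumption (midpoints between neighboring labels, with ties assigned to the lower category) because the boundaries in Table 3 are not reproduced in this text. Function names are hypothetical.

```python
from bisect import bisect_left

# Category index -> labelled concentration in ppm, as defined by Hach.
HACH_PPM = [0, 1, 2, 5, 10, 20, 50]

# ASSUMED bin edges: midpoints between neighbouring labels.
# These are illustrative only and are NOT the boundaries from Table 3.
ASSUMED_EDGES = [(low + high) / 2 for low, high in zip(HACH_PPM, HACH_PPM[1:])]


def true_bin(true_ppm: float) -> int:
    """Category index for a true concentration; a value on an edge goes to the lower bin."""
    return bisect_left(ASSUMED_EDGES, true_ppm)


def category_residual(true_ppm: float, response_category: int) -> int:
    """Residual as defined in the text: true bin minus the volunteer's category."""
    return true_bin(true_ppm) - response_category


# A 17 ppm sample falls in category 5 (20 ppm) under the assumed edges, so a
# volunteer who records category 6 (50 ppm) has a residual of -1 (overestimate).
print(true_bin(17), category_residual(17, 6))
```

With this sign convention, negative residuals indicate overestimation and positive residuals indicate underestimation, matching the convention used for the app data.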

DISCUSSION
The objective of this project was to assess the accuracy of citizen scientists measuring nitrate concentrations using the Hach© Nitrate test strips with and without the addition of the Deltares smartphone application. The results do not suggest that using a cell phone app increases the accuracy of first-time volunteers.
In the first experiment, volunteers that quantified the concentration on their nitrate strips by eye were more accurate than the volunteers who used the app (Figure 1). Because all of the volunteers were first-time volunteers, it is possible that an increase in volunteer experience may have increased data accuracy (see Kosmala et al. 2016).
It was also observed that volunteers using both the app and visual quantitative methods tended to overestimate nitrate concentrations. These findings are consistent with those of Ali et al. (2019), who suggested that improper timekeeping may have been responsible for the overestimations of their volunteers. Given that the Deltares Nitrate App has a built-in timer, it is possible that the timer actually worked against the volunteers, as it was noticed that some volunteers hesitated between immersing the strip and observing the time on the continually rolling timer. Other factors, such as lighting variations or the angles at which the device was positioned, might also be responsible for the overestimations. Cell phone apps that generate continuous results from colorimetric assays can be biased due to lighting variations, angles, and device type (Shen et al. 2012; Yetisen et al. 2014; Karlsen and Dong 2015). The findings from this study suggest that the Deltares Nitrate App might also be sensitive to changes in ambient lighting, which could be problematic for volunteers recording data in the field under varying weather conditions and light intensities.
In the second experiment, volunteers were tasked with quantifying 25 randomized water samples that ranged from 0 to 50 ppm nitrate. Volunteers that visually quantified the strips categorized their results into one of the seven concentration bins as per the Hach© instructions, whereas volunteers using the app produced data on a continuous scale ranging from 0 to 50 ppm. Results from both groups increased in variation with increasing sample concentration. For both groups, concentrations between 1 and 15 ppm (categories 0-4) experienced lower variability, and estimates of concentrations between 17 and 49 ppm (categories 5 and 6) were decidedly more variable.
In the second experiment, the volunteers that used the app were prone to overestimation, which was consistent with the results from the first test. Unlike the first round of testing, these volunteers were more experienced with the platform after testing 25 consecutive samples, so inexperience with timekeeping is less likely responsible for their overestimations. Instead, these overestimations are likely the result of the app software consistently overestimating the test results. In contrast to the overestimations produced by app volunteers, the volunteers who visually estimated their 25 samples were more likely to underestimate. These findings are more difficult to explain, as both Ali et al. (2019) and our findings from the first test indicate that novice volunteers tend to overestimate. Once again, these were not inexperienced volunteers, as they tested 25 samples in a row, so inaccuracies due to timekeeping errors were likely not the explanation for these findings. These results could be due to difficulties perceiving slight color changes between the higher categories of 10, 20, and 50, which are less stark than the color changes for the lower ranges.
Figure: The relationship between nitrate concentrations and corresponding response categories produced by visual volunteers. The blue color corresponds to the samples in the lower range with smaller residuals, and the orange corresponds to the samples in the higher range with residuals across three categories. The size of the circles provides a visual approximation of the proportion of samples that fall within each response concentration. In each case the total proportion equals 100%. The response concentration categories are defined by Hach as: 0 = 0 ppm, 1 = 1 ppm, 2 = 2 ppm, 3 = 5 ppm, 4 = 10 ppm, 5 = 20 ppm, and 6 = 50 ppm.
Citizen scientists benefit from the use of cell phone apps, as they gain a powerful analytic tool right in their hands that allows for the incorporation of GPS information and rapid data transmission to be combined with human observation (Burke et al. 2006). If these apps are to be useful outside a controlled setting, they must be flexible enough to accommodate external variability (Karlsen and Dong 2015; Shen et al. 2012; Yetisen et al. 2014) and must be approachable for first-time users. Unfortunately, the results from this study suggest that further refinement of the tool will be necessary for cell phone apps to reach their full potential relative to crowdsourced data recovery.

DATA ACCESSIBILITY STATEMENTS
The data used in this research project have not been made publicly available but could be made available upon request.