Introduction

Cultural eutrophication, or the enrichment of water bodies with nitrogen and phosphorus from anthropogenic sources, is a major water quality problem worldwide (; ), with eutrophic waterways increasing in frequency and severity (). Unfortunately, many monitoring agencies are facing resource constraints in the form of personnel, time, and budget limitations that prevent them from adequately addressing pressing environmental issues (; ) such as cultural eutrophication. The recruitment of citizen scientists has been suggested as one method to supplement agency monitoring (; ; ; ); however, if such data are to be trusted by researchers and regulatory agencies, it is necessary that volunteers produce data that are as accurate as possible ().

One tool that is currently available for public evaluation of nitrate concentration is the Hach© Nitrate strip. The test strip is a colorimetric assay that is a modification of the Griess reaction (), performing according to the Beer-Lambert law, such that an increase in color intensity is proportional to an increase in nitrate concentration. As such, the reported accuracy of the strip is primarily limited by the sensor that quantifies the change in intensity, in this case the human eye, not by the chemical reaction within the strip itself. In this methodological paper, we tested the accuracy of the human eye versus a camera-based smartphone application in judging the intensity of color of nitrate strips.

Citizen scientists have commonly quantified results obtained with the Hach© Nitrate test strip visually (; ; ). Recently, it has been proposed that cell phones can increase the quality of data collected by citizen scientists (), which may include increasing the accuracy of data collected using the Hach© Nitrate strip. This assertion is supported by the fact that technological advancements in smartphone cameras now allow them to be used as relatively low-cost spectrometers ().

The objective of this work was to assess the accuracy of citizen scientists quantifying results from the Hach© Nitrate test strips with and without the addition of the Deltares smartphone application. In one set of tests, results from volunteers who visually estimated two different nitrate concentrations were compared with the results of volunteers who used the Deltares smartphone app, a platform designed by Deltares (https://www.deltares.nl/en/), a surface and subsurface water research institute. In a second series of tests, volunteers quantified nitrate concentration in a number of solutions that spanned the range that the Hach© Nitrate strip can detect (0–50 mg/L or ppm NO3-N). Results suggest that, in both cases, the phone app did not increase the accuracy of the results.

Methods

Recruitment of citizen scientists

Citizen scientists were recruited from various populations through the coordination of 12 separate testing events in Idaho and Washington. These events were hosted on university campuses, at local scientific meetings, and with high school groups from February 2019 to February 2020. In a manner consistent with , participating volunteers had varied skill levels and backgrounds (Table 1). Furthermore, while water resource professionals (experts) were included in the population of volunteers, it was not expected that their findings would be more accurate than those of volunteers with less experience in the water resources field (). The events ranged in participation from 3 to 23 volunteers, with a total of 142 citizen scientists participating in the testing.

Table 1

Organizations that participated in testing events with their corresponding test dates and sample sizes.


| TESTING EVENT | DATE | SAMPLE SIZE | VOLUNTEER TYPE |
|---|---|---|---|
| ORED staff | 2/26/19 | 23 | University of Idaho staff |
| Idaho commons | 3/22/19 | 23 | University of Idaho college students, staff |
| ORED open house | 4/4/19 | 11 | General college population |
| Spokane River forum | 4/16/19 | 16 | Water professionals, educators, general public |
| Columbia High School | 5/28/19 | 15 | High school students |
| Palouse Basin aquifer committee meeting | 10/10/19 | 13 | General public, water professionals, students |
| OurGem symposium | 11/6/19 | 9 | General public, water professionals |
| Idaho Water Institute symposium | 11/12/19 | 3 | Water resources graduate students, faculty |
| Idaho commons | 12/6/19 | 9 | University of Idaho college students |
| Idaho water quality workshop | 2/11/20 | 10 | Water professionals, students, general public, faculty |
| Continuum test: visual samplers | 6/24/19 | 5 | Idaho Water Institute staff and interns |
| Continuum test: app samplers | 1/24/20 | 5 | Water resources graduate students |

Notes: ORED = Office of Research and Economic Development.

Nitrate test strips

Volunteers measured nitrate concentrations using Hach© test strips. When quantified visually, these test strips have been used in a variety of citizen science monitoring programs (; ) and have been validated previously in laboratory-controlled experiments ().

Sample preparation

Each testing event required volunteers to quantify nitrate concentrations in prepared spiked deionized water samples. All the nitrate samples were prepared using KNO3 and preserved with sulfuric acid. New solutions were made for each of the 12 testing events, and each concentration was confirmed analytically either by the University of Idaho’s analytical laboratory or through use of an in-house discrete analyzer (Seal AQ400: Method: EPA-114-C).

Study design

Two experiments were conducted to address the objectives of this study. The first aimed to assess the accuracy of data produced by volunteers using app and visual methods when evaluating two different laboratory-prepared nitrate concentrations. The second experiment quantified a continuum of laboratory-prepared nitrate concentrations with the intent of evaluating how well visual volunteers and those using the Deltares Nitrate App could quantify the results.

Experiment 1: humans versus smartphone app

To accomplish the goal of the first experiment, 132 volunteers were provided a nitrate sample and instructed to quantify the concentration either visually or using the app. The app volunteers used an iPad equipped with the Deltares Nitrate App to quantify the concentration of their sample. The visual volunteers quantified their sample visually using the colorimetric scale provided by the test strip instructions. All of the volunteers conducted their testing indoors, under similar lighting conditions, because the Deltares Nitrate App can be influenced by poor or inconsistent light. Volunteers were given a water sample and written, but not verbal, instructions to follow. The written instructions emphasized among other things that strip intensity was time sensitive, and that failure to accurately control the timing of the incubation could lead to inaccurate results.

When quantifying the nitrate strip, visual volunteers assigned the color to one of the distinct color categories provided on the side of the bottle. In contrast, volunteers quantifying the strip using the app generated concentrations along a continuous scale. To directly compare the results from both tools, it was necessary that categorical data and continuous data were collected by volunteers using both the app and visual methods. Consequently, in an initial battery of tests, some of the volunteers that collected their data visually were asked to interpolate between the different categorical bins and generate an integer concentration between 1 and 50 ppm, while the remaining volunteers categorized their data into one of the discrete bins. Consistently, half of the app volunteers recorded continuous data, while the remaining volunteers put their data into one of the bins as discrete categories indicated on the Hach© bottle.

Of the recruited volunteers, 66 app and 66 visual samplers participated in this part of the project (Table 2). All of the volunteers were asked to quantify the concentration of a water sample, which was either 2 or 15 ppm NO3. The 2 ppm concentration was chosen because it was a categorical option for those testing visually, while 15 ppm was chosen because it was equidistant between two categorical options: 10 and 20 ppm. For the 2 ppm solution, responses were assumed to be accurate if the result fell between the two flanking categories of 1 and 5 ppm (i.e., 1.1 to 4.9 ppm). Values below 1.1 ppm were considered underestimates, and values above 4.9 ppm were considered overestimates. For the 15 ppm solution, responses were assumed to be accurate if the result fell within the range bounded by the two categories that flanked 15 (i.e., 10 to 20 ppm, inclusive). Values below 10 ppm were considered underestimates, and those above 20 ppm were considered overestimates.
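The accuracy rule described above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds; the function name `classify_response` and the dictionary layout are hypothetical and are not taken from the study's own analysis files.

```python
# Accuracy windows taken from the text: 1.1-4.9 ppm for the 2 ppm sample,
# 10-20 ppm (inclusive) for the 15 ppm sample.
ACCURACY_WINDOWS_PPM = {
    2: (1.1, 4.9),
    15: (10.0, 20.0),
}


def classify_response(response_ppm, true_ppm):
    """Label one volunteer response as an underestimate, accurate, or an overestimate."""
    low, high = ACCURACY_WINDOWS_PPM[true_ppm]
    if response_ppm < low:
        return "underestimate"
    if response_ppm > high:
        return "overestimate"
    return "accurate"


# A categorical response of 5 ppm for the 2 ppm sample counts as an overestimate,
# while a response of 10 ppm for the 15 ppm sample counts as accurate.
print(classify_response(5, 2))
print(classify_response(10, 15))
```

The same function handles categorical and continuous responses, because both are compared against the same numeric window.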

Table 2

Breakdown of volunteers using each quantification method using either categorical or continuous instructions to test their sample.


| NITRATE CONCENTRATION (PPM) | TOOL | CATEGORICAL INSTRUCTIONS N = | CONTINUOUS INSTRUCTIONS N = |
|---|---|---|---|
| 2 | App | 17 | 16 |
| 2 | Eye | 18 | 16 |
| 15 | App | 17 | 16 |
| 15 | Eye | 17 | 15 |

Experiment 2: quantifying a continuum of nitrate concentrations

The second experiment was conducted to understand how each tool performed when tested on a continuum of nitrate samples. Ten individuals were tasked with quantifying a continuum of 25 randomized nitrate samples (Table 1). Half of these samplers were instructed to visually quantify their test strip and categorize their samples into the discrete bins on the Hach© bottle. The other half were instructed to quantify their 25 samples on a continuous scale using the Deltares Nitrate App. The 25 samples covered every odd-integer concentration from 1 to 49 ppm nitrate (Table 3).

Statistical analysis

In the first experiment, Fisher’s exact tests were used to determine if data type had any impact on the accuracy of results, and to determine whether there was any difference in accuracy between the app and visual volunteers when combining results across both concentrations.
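As a worked illustration of the test, the sketch below computes a two-sided Fisher's exact p-value for a 2 × 2 table using only the Python standard library. The function name is hypothetical, and the two-sided convention used here (summing all tables that are no more probable than the observed one) is an assumption about how the software defined the test. The counts in the example are inferred from the percentages reported in the Results for visual volunteers (16 of 18 and 12 of 16 accurate at 2 ppm; 17 of 17 and 12 of 15 accurate at 15 ppm), not copied from a published data table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised hypergeometric weight of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    extreme = sum(w for w in (weight(x) for x in range(lowest, highest + 1)) if w <= observed)
    return extreme / total_tables


# Rows: categorical vs. continuous instructions; columns: accurate vs. inaccurate.
p_2ppm = fisher_exact_two_sided(16, 2, 12, 4)
p_15ppm = fisher_exact_two_sided(17, 0, 12, 3)
print(f"{p_2ppm:.4f}")   # 0.3872
print(f"{p_15ppm:.4f}")  # 0.0917
```

Under these inferred counts the sketch returns the p-values reported for the visual volunteers, which suggests the two-sided convention assumed here matches the one used in the analysis.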

The second experiment was conducted with the intent of understanding how each analytic tool performed when quantifying a continuum of nitrate samples. For the visual volunteers, it was necessary to transform the continuous sample concentrations into the same categories that the volunteers used to record their responses. To do this, the continuous values were binned into corresponding categories as per Muenich et al. (), who similarly compared continuous lab samples to categorical field samples (Table 3). These data were plotted and analyzed statistically using Spearman’s correlation to determine the relationship between the binned continuum samples and the volunteers’ recorded categories.

Table 3

Test strip categories used by visual volunteers and the ranges of continuous nitrate samples placed in those bins.


| HACH© TEST STRIP SCALE (PPM) | CONTINUOUS SAMPLES ASSIGNED TO EACH CATEGORY (PPM) |
|---|---|
| 0.0 | n/a |
| 1.0 | 1 |
| 2.0 | 3 |
| 5.0 | 5, 7 |
| 10.0 | 9, 11, 13, 15 |
| 20.0 | 17, 19, 21, 23, 25, 27, 29, 31, 33, 35 |
| 50.0 | 37, 39, 41, 43, 45, 47, 49 |
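The assignment in Table 3 can be expressed as a simple lookup. The sketch below encodes only the 25 sample concentrations listed in the table; it deliberately does not guess how values between those concentrations would be binned. The names `HACH_BINS_PPM` and `bin_sample` are hypothetical.

```python
# Table 3 as a mapping from Hach category (ppm) to the sample concentrations (ppm)
# assigned to it. The 0.0 category has no samples and is omitted.
HACH_BINS_PPM = {
    1.0: [1],
    2.0: [3],
    5.0: [5, 7],
    10.0: [9, 11, 13, 15],
    20.0: list(range(17, 36, 2)),  # 17, 19, ..., 35
    50.0: list(range(37, 50, 2)),  # 37, 39, ..., 49
}

# Invert the table so each sample concentration points at its category.
SAMPLE_TO_CATEGORY = {
    sample: category
    for category, samples in HACH_BINS_PPM.items()
    for sample in samples
}


def bin_sample(sample_ppm):
    """Return the Hach category (ppm) assigned to one of the 25 sample concentrations."""
    return SAMPLE_TO_CATEGORY[sample_ppm]


print(bin_sample(15))  # 10.0
print(bin_sample(35))  # 20.0
```

Using an explicit lookup rather than numeric cut points keeps the sketch faithful to the table, at the cost of raising a `KeyError` for any concentration that was not part of the continuum.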

For the app volunteers, the responses did not need binning because both data types were continuous. The volunteer data were plotted against the true nitrate sample concentrations and fitted with a linear regression model.
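For readers who want to reproduce this kind of fit outside JMP or Excel, a minimal ordinary least-squares sketch follows. The function name `fit_line` is hypothetical, and the response values are made-up numbers with a constant bias, used only to show the calculation; they are not the study data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, returning R^2 as well."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# True concentrations: every odd value from 1 to 49 ppm, as in the continuum test.
true_ppm = list(range(1, 50, 2))
# Hypothetical app responses that read 5 ppm too high at every concentration.
app_ppm = [c + 5.0 for c in true_ppm]

slope, intercept, r_squared = fit_line(true_ppm, app_ppm)
print(slope, intercept, r_squared)
```

In a fit of recorded values against true values, a slope near 1 with a positive intercept corresponds to a roughly constant overestimate across the range.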

All statistical analyses for this project were performed using either JMP (v. 14.0) or Microsoft Excel (v. 16.33) software with α = 0.05.

Results

Experiment 1: humans versus smartphone app

Visual volunteers

Thirty-four volunteers quantified a 2-ppm nitrate solution using visual methods. Of the 34 volunteers, 18 estimated concentration by category, and 16 estimated concentration by interpolation to a continuous scale. Categorical volunteers were accurate 89% of the time, and continuous volunteers were accurate 75% of the time. The proportions of accurate to inaccurate results were not significantly different (Fisher’s exact test, p = 0.3872) between the two quantification methods. Interestingly, all the volunteers that were incorrect overestimated the concentration. The app volunteers produced a wider range of overestimation values (5–36 ppm) than the visual volunteers, who were closer to the true concentration (5 ppm).

Thirty-two volunteers quantified a 15-ppm nitrate solution using visual methods. Of the 32 volunteers, 17 estimated concentration by category, and 15 estimated concentration by interpolation to a continuous scale. Continuous volunteers were accurate 80% of the time, and categorical volunteers produced 100% accurate results. The proportions of accurate to inaccurate results were not significantly different (Fisher’s exact test, p = 0.0917) between the two quantification methods. In contrast to the samplers testing at 2 ppm, continuous volunteers both underestimated and overestimated the nitrate concentration.

When the results from all 66 volunteers were combined and analyzed together, there were no significant differences in accuracy between the visual volunteers who estimated nitrate concentration by category relative to those who estimated concentration by interpolation to a continuous scale (Fisher’s exact test, p = 0.719).

App volunteers

Thirty-three volunteers quantified a 2-ppm nitrate solution using the Deltares Nitrate App, with 17 volunteers who estimated concentration by category, and 16 who estimated concentration by interpolation to a continuous scale. Categorical volunteers, who were instructed to estimate their sample to the nearest concentration bin, were accurate 52% of the time, and continuous volunteers, who were allowed to interpolate their sample concentration, were accurate 75% of the time. There were no significant differences in accuracy between the categorical and continuous results (Fisher’s exact test, p = 0.2818).

Thirty-three volunteers quantified a 15-ppm nitrate solution using the Deltares Nitrate App. Of the 33 volunteers, 17 estimated concentration by category, and 16 estimated concentration by interpolation to a continuous scale. Continuous volunteers were accurate 44% of the time, and categorical volunteers produced accurate results 24% of the time. There were no significant differences in accuracy between the categorical and continuous results (Fisher’s exact test, p = 0.2818).

When the results from all 66 volunteers were combined and analyzed together, there were no significant differences in accuracy between the app users who estimated nitrate concentration using categories relative to those who estimated concentration by interpolation to a continuous scale (Fisher’s exact test, p = 0.1387).

Visual versus app volunteers

To compare the two analytic tools, all the responses were pooled into accurate response or inaccurate response from the 66 visual and 66 app volunteers, regardless of data type or concentration. The proportions of accurate to inaccurate responses were determined for each tool and then were analyzed using a Fisher’s exact test. The results indicate that volunteers using visual methods are statistically more likely to be accurate than their app-testing counterparts (Fisher’s exact test, p < 0.00001) (Figure 1). The data were further broken down into the proportion of accurate to inaccurate responses at the two concentrations, and were then analyzed using a Fisher’s exact test. The findings indicate that at 2 ppm, results from the visual and app volunteers were not statistically different (Fisher’s exact test, p = 0.1036) from each other, whereas at 15 ppm, the visual volunteers were statistically more likely to be accurate than their app-testing counterparts (Fisher’s exact test, p < 0.00001*).

Figure 1 

Graphical representation of the proportions of accurate and inaccurate responses for both analytic tools.

Experiment 2: quantifying a continuum of nitrate concentrations

App volunteers

The second experiment was conducted with the intent of understanding how each analytic tool performed when quantifying a continuum of nitrate samples. The continuous data produced by the app volunteers were plotted against the true nitrate sample concentration (Figure 2). A linear regression model explained more than 75% of the total variation in the data (y = 1.0369x + 4.9569, R² = 0.77, p = 0.0011*).

Figure 2 

A scatter plot of data produced by continuous app users that compares the actual concentration of nitrate to the volunteers’ recorded values.

Regression residuals were calculated by subtracting the volunteer’s continuous response from the true nitrate concentration. Of the 125 concentration estimates, 18 were underestimations, 14 were accurate, and 93 were overestimations. These observations were compared against a uniform proportion of expected results (33% for each category). A chi-square test of independence was then performed on these proportions to examine the relationship between the sign of regression residuals (positive, negative, or zero) and the expected values. The relationship between these variables was significant (χ² = 42.12, p < 0.00001*), indicating that the observed distribution of residual signs differed from the uniform expectation. Most of the residuals were negative, indicating that the app tends to overestimate.
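Because the residual is defined as the true concentration minus the response, a negative residual marks an overestimate. The short sketch below makes that sign convention explicit; the function name `tally_residual_signs` and the four example values are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter


def tally_residual_signs(true_ppm, response_ppm):
    """Count under-, over-, and exact estimates using residual = true - response."""
    counts = Counter()
    for truth, response in zip(true_ppm, response_ppm):
        residual = truth - response
        if residual < 0:
            counts["overestimate"] += 1   # response exceeded the true value
        elif residual > 0:
            counts["underestimate"] += 1  # response fell short of the true value
        else:
            counts["accurate"] += 1
    return counts


# Hypothetical values: two responses too high, one exact, one too low.
example = tally_residual_signs([1, 3, 5, 7], [2, 3, 9, 6])
print(dict(example))
```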

Visual volunteers

To determine the accuracy of volunteers visually measuring nitrate across a continuum of concentrations, the continuous sample concentrations were binned into one of the six existing Hach© categories as per (Table 3, Figure 3). A comparison was then made of the categorical results that the volunteers generated to the actual concentrations after the data were binned (Figure 4). There was a strong positive correlation (Spearman rank correlation, ρ = 0.8735, p < 0.0001*) between the true binned concentrations and the categorical estimates of the volunteers.
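For reference, Spearman's rank correlation can be computed as the Pearson correlation of the ranks, with tied values sharing their average rank. Ties are unavoidable here because many samples fall into the same category. The sketch below uses only the standard library; the function names and the six example values are hypothetical and are not the study data.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sum((a - mean_x) ** 2 for a in rx) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in ry) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical category codes (1-6): the two highest categories are swapped.
true_category = [1, 2, 3, 4, 5, 6]
reported_category = [1, 2, 3, 4, 6, 5]
rho = spearman_rho(true_category, reported_category)
print(round(rho, 4))
```

A rank-based measure suits this comparison because the Hach categories are ordered but unevenly spaced (1, 2, 5, 10, 20, 50 ppm).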

Figure 3 

The dots represent continuous sample concentrations and their designated Hach© category bins. For example, concentrations that fall between 15.1 and 35.9 ppm (or mg/L) would be binned into category 5.

Figure 4 

The relationship between nitrate concentrations and corresponding response categories produced by visual volunteers. The blue color corresponds to the samples in the lower range with smaller residuals and the orange corresponds to the samples in the higher range with residuals across three categories. The size of the circles provides a visual approximation of the proportion of samples that fall within each response concentration. In each case the total proportion equals 100%. The response concentration categories are defined by Hach as: 0 = 0 ppm, 1 = 1 ppm, 2 = 2 ppm, 3 = 5 ppm, 4 = 10 ppm, 5 = 20 ppm, and 6 = 50 ppm.

Regression residuals were calculated by subtracting the volunteer’s response category from the true concentration bin. These data were plotted against the actual concentrations. For the lowest four categories, not including zero (1–4), responses were off by only one category. The higher concentration bins displayed wider residual ranges, indicating that as nitrate concentration increased, so did the range of categories that were recorded by the volunteers.

Categorical responses were broken into two groups: orange, which corresponded to 0–15 ppm (binned concentrations 0–4), and blue, which corresponded to 17–49 ppm (binned categories 5 and 6), in the same manner as with the continuous analysis. The Spearman’s rank correlation was statistically significant for categories 0–4 (ρ = 0.8523, p < 0.0001*) as well as for categories 5 and 6 (ρ = 0.6431, p < 0.0001*; Figure 4).

Discussion

The objective of this project was to assess the accuracy of citizen scientists measuring nitrate concentrations using the Hach© Nitrate test strips with and without the addition of the Deltares smartphone application. The results do not suggest that using a cell phone app increases the accuracy of first-time volunteers.

In the first experiment, volunteers that quantified the concentration on their nitrate strips by eye were more accurate than the volunteers who used the app (Figure 1). Because all of the volunteers were first-time volunteers, it is possible that an increase in volunteer experience may have increased data accuracy (see Kosmala et al. 2016).

It was also observed that volunteers using both the app and visual quantitative methods tended to overestimate nitrate concentrations. These findings are consistent with those of Ali et al. (), who suggested that improper timekeeping may have been responsible for the overestimations of their volunteers. Given that the Deltares Nitrate App has a built-in timer, it is possible that the timer actually worked against the volunteers, as it was noticed that some volunteers hesitated between immersing the strip and observing the time on the continually rolling timer. Other factors, such as lighting variations or the angles at which the device was positioned might also be responsible for the overestimations. Cell phone apps that generate continuous results from colorimetric assays can be biased due to lighting variations, angles, and device type (; ; ). The findings from this study suggest that the Deltares Nitrate App might also be sensitive to changes in ambient lighting, which could be problematic for volunteers recording data in the field under varying weather conditions and light intensities.

In the second experiment, volunteers were tasked with quantifying 25 randomized water samples that ranged from 0 to 50 ppm nitrate. Volunteers that visually quantified the strips categorized their results into one of the seven concentration bins as per the Hach© instructions, whereas volunteers using the app produced data on a continuous scale ranging from 0 to 50 ppm. Results from both groups increased in variation with increasing sample concentration. For both groups, concentrations between 1 and 15 ppm (categories 0–4) experienced lower variability, and estimates of concentrations between 17 and 49 ppm (categories 5 and 6) were decidedly more variable.

In the second experiment, the volunteers that used the app were prone to overestimation, which was consistent with the results from the first test. Unlike the first round of testing, these volunteers were more experienced with the platform after testing 25 consecutive samples, so inexperience with timekeeping is less likely responsible for their overestimations. Instead, these overestimations are likely the result of the app software consistently overestimating the test results. In contrast to the overestimations produced by app volunteers, the volunteers who visually estimated their 25 samples were more likely to underestimate. These findings are a bit more difficult to explain, as both Ali et al. () and our findings from the first test indicate that novice volunteers tend to overestimate. Once again, these were not inexperienced volunteers, as they tested 25 samples in a row, so inaccuracies due to timekeeping errors were likely not the explanation for these findings. These results could be due to difficulties perceiving slight chromatic color changes between the higher categories of 10, 20, and 50, which are less stark than the color changes for the lower ranges.

Citizen scientists benefit from the use of cell phone apps, as they gain a powerful analytic tool right in their hands that allows for the incorporation of GPS information and rapid data transmission to be combined with human observation (). If these apps are to be useful outside a controlled setting, they must be flexible enough to accommodate external variability (; ; ) and must be approachable for first-time users. Unfortunately, the results from this study suggest that further refinement of the tool will be necessary for cell phone apps to reach their full potential relative to crowdsourced data recovery.

Data Accessibility Statements

The data used in this research project have not been made publicly available but could be made available upon request.