CASE STUDIES Determining the Accuracy of Crowdsourced Tweet Verification for Auroral Research

The Aurorasaurus project harnesses volunteer crowdsourcing to identify sightings of an aurora (the “northern/southern lights”) posted by citizen scientists on Twitter. Previous studies have demonstrated that aurora sightings can be mined from Twitter with the caveat that there is a large background level of non-sighting tweets, especially during periods of low auroral activity. Aurorasaurus attempts to mitigate this, and thus increase the quality of its Twitter sighting data, by using volunteers to sift through a pre-filtered list of geolocated tweets to verify real-time aurora sightings. In this study, the current implementation of this crowdsourced verification system, including the process of geolocating tweets, is described and its accuracy (which, overall, is found to be 68.4%) is determined. The findings suggest that citizen science volunteers are able to accurately filter out unrelated, spam-like, Twitter data but struggle when filtering out somewhat related, yet undesired, data. The citizen scientists particularly struggle with determining the real-time nature of the sightings, so care must be taken when relying on crowdsourced identification.


Introduction
The citizen science project Aurorasaurus (MacDonald et al. 2015) has two main goals: improving the "nowcasting" of a visible aurora (commonly known as the "northern/southern lights") and improving the ability to accurately model both the size and strength of an aurora. To do this, the project collects observations of the aurora made by the general public. These observations can be submitted directly to the project, via its website (http://aurorasaurus.org) and mobile apps, and are also found by searching Twitter for possible sightings.
Twitter can be a useful source of data for many citizen science projects because information is freely shared by millions of users distributed around the globe. Indeed, previous studies have shown that Twitter users, who post short updates (of a maximum 140 characters in length) known as "tweets," will often share details about the conditions around them. This is especially true for large-scale events such as earthquakes (Earle et al. 2010; Crooks et al. 2013), influenza outbreaks (Culotta 2010; Lampos et al. 2010), and service outages (Motoyama et al. 2010). Case et al. (2015a) showed that Twitter can also be a useful source of data for studying the aurora by comparing the number of tweets relating to an aurora with auroral activity (or, more specifically, to common auroral activity indices). However, these authors also noted that Twitter data are particularly noisy and that many tweets containing aurora-related keywords (e.g., "aurora" and "northern lights") are not actually sightings. Often such tweets are about a person or place or the desire to witness an aurora.
The Aurorasaurus project enlists volunteers, both registered and anonymous, to sort through pre-filtered, aurora-related tweets to identify and positively verify real-time aurora sightings. While combining Twitter data with other citizen science data may be a new form of crowdsourcing, many previous studies have demonstrated that crowdsourcing can be used for data classification, often using Amazon's Mechanical Turk (Kittur et al. 2008; Ipeirotis et al. 2010). In fact, studies have shown that the crowd is sometimes more accurate than experts at identification tasks (Alonso and Mizzaro 2009).
Once a tweet has been verified as a positive sighting by the Aurorasaurus volunteers, it is treated in the same way as a direct report via the project's website or apps. The combined observations, both direct reports and positively verified tweets, are displayed on the project home page on a real-time map alongside a modeled auroral oval (i.e., the extent to which an aurora is visible directly overhead). These observations serve several different functions, including demonstrating where the aurora is currently being observed (Priedhorsky et al. 2012), providing data points for scientific investigation (Case et al. 2016), and providing the basis for a hybrid alert system (Lalone et al. 2015) that is analogous to disaster early warning systems (Tapia et al. 2014).
This study investigates the accuracy of volunteers in filtering useful data from a stream of tweets in an existing citizen science project. The results provide insights into the accuracy of volunteers in analysing Twitter data that may be applied to other citizen science projects.

Tweet Verification
Aurorasaurus exploits the Twitter Search API to identify publicly accessible tweets that contain any one of several different aurora-related keywords (e.g., "aurora," "northern lights"). The returned tweets are then filtered further on the Aurorasaurus servers to exclude most retweets, tweets from Twitter users with "aurora" in their username (although a whitelist is maintained to allow tweets from some users to go through), and tweets containing profanity or other common "spam" terms.
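The pre-filtering described above could be sketched roughly as follows. This is an illustration only, not the project's server code: the keyword, spam-term, and whitelist values and the dictionary field names (`text`, `username`, `is_retweet`) are hypothetical placeholders.

```python
# Illustrative sketch only; every list and field name below is a hypothetical placeholder.
KEYWORDS = ("aurora", "northern lights")
SPAM_TERMS = ("free followers", "click here")     # placeholder spam terms
USERNAME_WHITELIST = {"aurora_alerts_example"}     # placeholder whitelist entry


def passes_prefilter(tweet: dict) -> bool:
    """Return True if a tweet survives the keyword, retweet, username, and spam filters."""
    text = tweet["text"].lower()
    username = tweet["username"].lower()

    # Keep only tweets containing at least one aurora-related keyword.
    if not any(keyword in text for keyword in KEYWORDS):
        return False
    # Drop retweets (the text says "most" retweets are excluded; this sketch drops all it detects).
    if tweet.get("is_retweet", False) or text.startswith("rt @"):
        return False
    # Drop users with "aurora" in their username unless they are whitelisted.
    if "aurora" in username and username not in USERNAME_WHITELIST:
        return False
    # Drop tweets containing any spam term.
    if any(term in text for term in SPAM_TERMS):
        return False
    return True
```

For example, `passes_prefilter({"text": "Just saw the northern lights!", "username": "skywatcher"})` returns `True`, whereas the same text from a non-whitelisted user named `Aurora_Fan` would be rejected.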
A location extraction process is then undertaken on the filtered tweets. Location is determined either by using the embedded GPS metadata, if the Twitter user has opted to share their location, or through the geo-parsing software CLAVIN (https://clavin.bericotechnologies.com), which attempts to extract a location for a tweet based upon its text (D'Ignazio et al. 2014). Using these processes, approximately 15% of the tweets can be associated with a location (with extraction through CLAVIN accounting for approximately 80% of the associations). Further filtering takes place to remove tweets whose location is determined to be anywhere containing the term "Aurora" (e.g., Aurora, CO, USA).
These "unverified tweets" are then presented to the Aurorasaurus community for verification as pins on the main map and as a list on the "Verify Tweets" page (see Figure 1). The community is asked "Did they just see the aurora?" (where "they" refers to the tweet's author) and is provided only two choices for a vote ("yes" or "no"). This subjective task allows automatic aggregation of the votes into a score and a classification based upon that score (Iren and Bilgen 2014).
For every "yes" vote a tweet receives, a value of 1 is added to its score. Conversely, for every "no" vote a tweet receives, a value of 1 is subtracted from the score. Votes from both registered and anonymous users are treated equally (i.e., there is no weighting applied to the vote based upon the user or their credentials). Once the tweet's score reaches a certain positive threshold (currently set to +3), it is categorized as a "positively verified tweet," its marker is updated on the map to show this new status, and votes are no longer taken on it. Similarly, once a tweet reaches a certain negative threshold (currently set to -3), the tweet is categorized as a "negatively verified tweet," the marker is removed from the map, and the tweet is no longer presented to the community for verification.
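The scoring rule described above can be illustrated with a short sketch. The thresholds of +3 and -3 come from the text; the function name and the representation of votes as a list of "yes"/"no" strings are assumptions made for illustration.

```python
POSITIVE_THRESHOLD = 3    # score at which a tweet becomes "positively verified"
NEGATIVE_THRESHOLD = -3   # score at which a tweet becomes "negatively verified"


def classify_tweet(votes):
    """Tally "yes"/"no" votes in the order cast and return the resulting status.

    Voting stops as soon as either threshold is reached, so any later votes
    in the list are ignored. All votes carry equal weight.
    """
    score = 0
    for vote in votes:
        score += 1 if vote == "yes" else -1
        if score >= POSITIVE_THRESHOLD:
            return "positively verified"
        if score <= NEGATIVE_THRESHOLD:
            return "negatively verified"
    return "unverified"
```

Note that a tweet receiving a mix of votes can stay unverified indefinitely: `classify_tweet(["yes", "no", "yes"])` returns `"unverified"` because the score never reaches either threshold.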
To reduce the barriers of entry for users to start verifying tweets, no compulsory training is required. However, help in verifying tweets is provided by a pop-out help menu, which opens if the user clicks on the question mark in the tweet window (see Figure 1). Additionally, a blog post and quiz are available, both of which guide the voter through examples of tweets and how they should be voted upon. Approximately half the respondents to a recent Aurorasaurus survey indicated that they had read at least some of this guidance (Lalone pers. comm. 2015).

Results
This study analyzes the verified tweets posted during March and April 2015. This two-month period represents a subset of the larger Aurorasaurus data set (which spans from November 2014 to present) and includes several large auroral events, including the largest event this decade (Case et al. 2015). It is important to note that large auroral events, where an aurora can be seen from the mid-United States and central Europe, are relatively infrequent and are dependent upon several factors including solar activity, time of day/year, and local conditions (e.g., cloud cover). Additionally, an aurora can be a widespread phenomenon, with sightings of the same event spanning multiple continents (Case et al. 2015).
The distribution of the tweets and their verified status is shown in Figure 2. The number of each type of tweet ("total," "with location," "positively verified," "negatively verified," and "unverified") is shown by the filled bars. Note the logarithmic scale on the y-axis.
Each of the positively verified tweets was then independently manually inspected by two members of the Aurorasaurus team. This inspection involved analyzing the text of the tweets in detail to identify any signs of non-originality and to compare the location and time of the supposed sighting with auroral models and other citizen science observations. The verified tweets were categorized primarily into "valid" (where the tweet was indeed a real-time aurora sighting made by the tweet's author) or "invalid" (where the tweet was incorrectly positively verified by the users). Using an open-coding method, the following categories for the invalid positively verified tweets were created:
• "Not real-time": a sighting of an aurora by the tweet's author, but the tweet was posted at least several hours after the sighting took place (often the next morning).
• "Not original": the sighting was not made by the tweet's author (usually "retweets" or "mentions" of someone else's tweet).
• "Overlap": the sighting was neither real-time nor made by the tweet's author. This would often be the retweeting of someone else's aurora photograph.
• "Wrong location": the location extraction algorithm (CLAVIN) failed to determine the location correctly. These failures are particularly difficult for voters to spot, because the location of the tweet is not shown on the tweet (see Figure 1).
• "Not positive sighting": the tweet did not contain a sighting of an aurora but may have been related to one (e.g., "Seeing an aurora is on my bucket list").
• "Junk": these tweets had nothing to do with an aurora (e.g., "Went to Aurora last night").
The distribution of these categories is shown in Figure 3.
Of the 475 positively verified tweets, 176 (37%) are valid. The precision, or positive predictive value (PPV), as calculated using Equation 1, of the positively verified tweets is therefore 37.1%.
$$\mathrm{PPV} = \frac{\Sigma TP}{\Sigma TP + \Sigma FP} \qquad (1)$$
where ΣTP is the number of true positives (i.e., positively verified tweets that are valid) and ΣFP is the number of false positives (i.e., positively verified tweets that are invalid).
The process was then repeated for a sample of the negatively verified tweets. This randomly selected sample included 475 negatively verified tweets (chosen to match the number of positively verified tweets). All but two of the tweets in the sample were correctly identified as negatively verified tweets. Thus, the "negative precision," or negative predictive value (NPV), as calculated using Equation 2, was 99.6%.
$$\mathrm{NPV} = \frac{\Sigma TN}{\Sigma TN + \Sigma FN} \qquad (2)$$
where ΣTN is the number of true negatives (i.e., negatively verified tweets that are not valid sightings) and ΣFN is the number of false negatives (i.e., negatively verified tweets that are actually valid sightings).
The overall accuracy of the verified tweets, in which all of the positively verified tweets and a same-sized sample of negatively verified tweets are included, can now be determined. Using Equation 3, the overall accuracy is found to be 68.4%.
$$\mathrm{ACC} = \frac{\Sigma TP + \Sigma TN}{N} \qquad (3)$$
where N is the total number of verified tweets in this sample (i.e., N = 950).
Furthermore, these results can be broken up based upon periods of when auroral activity was particularly elevated (which is when most sightings would be expected to occur). Three such events occurred during this time period: March 01-03, March 17-19, and April 10-12. The distributions of the previous categories are shown, for each of these periods, along with the distribution of "non-elevated" periods, in Figure 4.
The negatively verified tweets also were split by storm period. Both of the invalid negatively verified tweets occurred during the March 17-19 storm (which is not particularly surprising due to the majority of tweets occurring during this time). The PPV, NPV, and ACC are calculated for each of these storm periods and are presented in Table 1.
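As a quick check, the overall figures can be recomputed from the counts reported in this section, using the usual definitions of PPV, NPV, and accuracy. This is a minimal sketch; the function and variable names are illustrative and not part of the study.

```python
def ppv(tp, fp):
    """Positive predictive value: true positives over all positive classifications."""
    return tp / (tp + fp)


def npv(tn, fn):
    """Negative predictive value: true negatives over all negative classifications."""
    return tn / (tn + fn)


def accuracy(tp, tn, n):
    """Overall accuracy: correct classifications over all N inspected tweets."""
    return (tp + tn) / n


# Counts reported in the Results section.
TP = 176         # positively verified tweets judged valid
FP = 475 - TP    # positively verified tweets judged invalid
FN = 2           # negatively verified tweets that were actually valid sightings
TN = 475 - FN    # negatively verified tweets correctly rejected
N = 950          # 475 positively verified + 475 sampled negatively verified

print(f"PPV = {ppv(TP, FP):.1%}")
print(f"NPV = {npv(TN, FN):.1%}")
print(f"ACC = {accuracy(TP, TN, N):.1%}")
```

Run on these counts, the sketch gives a PPV of 37.1% and an NPV of 99.6%, matching the reported values. The accuracy computed this way is about 68.3%, marginally below the quoted 68.4%; the small gap is presumably a rounding effect.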

Discussion
Approximately 17.4% of the 227,280 tweets collected during this case study had a location associated with them, which is consistent with other studies (e.g., Vieweg et al. 2010). Thus, nearly 40,000 tweets were available for the Aurorasaurus community to vote on. Approximately 75% of the locations obtained were determined using the CLAVIN geo-location extraction algorithm; therefore, only a small percentage of the total tweets contained an embedded GPS location. Again, this result is consistent with other studies (e.g., Cheng et al. 2010, Lee et al. 2013).
The community cast more than 70,000 votes and verified over 4,500 tweets. The majority, around 80%, of verified tweets were negatively verified, i.e., the Aurorasaurus community voted that the tweet was not a real-time sighting of an aurora made by the tweet's author. This result is perhaps unsurprising, because it is only when auroral activity is high (which occurred three times during this case study) that increased numbers of people tweet sightings of an aurora (Case et al. 2015a). Indeed, the percentage of positively verified tweets (i.e., N_pos/N) rises from around 20% during non-storm times to around 70% during active times (Table 1).
Notably, nearly 90% of tweets with locations went unverified (i.e., they were not positively or negatively verified). These tweets are most likely not aurora sightings; rather, they are tweets that contain aurora-related keywords. However, we cannot rule out that this set of tweets contains sightings that have simply been overlooked. While this does not affect the accuracy of the verification system, it does mean that some scientifically useful observations, such as rare sightings during low auroral activity, might be missed. Further investigation into the exact nature of the unverified tweets, and what effect the number of unverified tweets may have on citizen science data collection on Twitter, should therefore be undertaken.

Verification Accuracy
The Aurorasaurus community was able to negatively verify tweets with extremely high accuracy. In fact, of the 475 negatively verified tweets analyzed, only two were incorrectly classified, resulting in an overall NPV of nearly 100%. The community was, however, much less accurate when positively verifying tweets. The overall PPV (or precision) was 37%, though significant variance occurred in the PPVs when splitting by event (with the highest PPV of 59% occurring during the March 01-03 storm and the lowest PPV of 27% occurring during the April 10-12 storm). At this time no reason is known for this variance unless it is attributable to differences in the sample sizes.
The overall accuracy of the verification system in this case study was 68%. Had all of the negatively verified tweets been analysed, and subsequently used in the accuracy calculation, the overall accuracy would probably have been much higher. However, because the number of negatively verified tweets was so much greater than the number of positively verified tweets, a representative sample was chosen instead. Note that the positively verified tweets (i.e., actual sightings) hold the most scientific value, so the PPV may be more important than the NPV or overall accuracy.

What affected the community's precision?
Spotting spam-like tweets that have nothing to do with sightings of an aurora is relatively easy. Much harder is differentiating tweets that are real-time aurora sightings from those that are merely related to the aurora or are true sightings that occurred several hours earlier. Indeed, our analysis showed that the primary reason the community positively verified tweets incorrectly was that the community incorrectly identified the tweets as being real-time.
Identifying whether a sighting posted in a tweet is real-time can be a complex task, even for the Aurorasaurus team members. The tweet has a timestamp associated with it, but the tweet's author may be posting about a sighting that occurred several hours ago or perhaps even the day before. Unless the author explicitly uses words or phrases that chronologically identify when the sighting occurred, e.g., "just seen" or "spotted 10 mins ago," knowing exactly when the sighting occurred is difficult. In fact, even if the author includes a time, e.g., "aurora seen at 21:30," the verifier would need to know the offset between their current time zone and the time zone of the tweet's author to determine how long ago the aurora was sighted. Such detailed investigation is probably too much for most of the community to engage in, especially when they are voting on many tweets at once.
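To make the time-zone problem concrete, the sketch below estimates how long before the tweet a stated clock time occurred. It is purely illustrative: it assumes the author's UTC offset is known (which, as noted above, is rarely the case for a volunteer) and that the stated time refers to its most recent occurrence before the tweet. The function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def hours_since_sighting(tweet_time_utc, stated_local_time, author_utc_offset_hours):
    """Estimate hours between a stated local sighting time ("HH:MM") and the tweet.

    Assumes a fixed UTC offset for the author (daylight saving is ignored) and
    that the sighting is the most recent occurrence of the stated clock time.
    """
    author_tz = timezone(timedelta(hours=author_utc_offset_hours))
    tweet_local = tweet_time_utc.astimezone(author_tz)
    hour, minute = map(int, stated_local_time.split(":"))
    sighting = tweet_local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if sighting > tweet_local:
        # The stated time has not yet occurred today, so it must refer to yesterday.
        sighting -= timedelta(days=1)
    return (tweet_local - sighting).total_seconds() / 3600


# A tweet stamped 04:00 UTC from an author at UTC-6 who writes "aurora seen at 21:30".
tweet_time = datetime(2015, 3, 18, 4, 0, tzinfo=timezone.utc)
elapsed = hours_since_sighting(tweet_time, "21:30", -6)
```

In this example the tweet was posted at 22:00 in the author's local time, so `elapsed` is 0.5 hours; the same text posted at 14:00 UTC would correspond to a sighting 10.5 hours earlier.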
The second most common reason for incorrectly positively verifying a tweet was that the sighting was "not original." From this category we identified two themes: The tweet was of someone else's aurora photograph (85%) or the tweet was a retweet of somebody else's sighting (15%). Both of these errors likely stem from unfamiliarity with Twitter's nomenclature. For example, most of the "not original" tweets contained signs of the non-originality, i.e., the text "RT" (an acronym for retweet) or tagging of other users (which will always start with the @ symbol). We note, however, that many original real-time sightings may also tag other users, often as a way of alerting them, so this method to determine originality cannot be used on its own.
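The two textual signs of non-originality mentioned above can be detected with simple pattern matching. The following is a rough sketch under the assumption that only the raw tweet text is available; the patterns and names are illustrative, and, as the text cautions, a mention alone does not show that a tweet is unoriginal.

```python
import re

RETWEET_PATTERN = re.compile(r"(^|\s)RT\b")   # standalone "RT" marker
MENTION_PATTERN = re.compile(r"@\w+")         # tagged users, e.g. @someone


def originality_flags(text):
    """Return the non-originality signals found in a tweet's text.

    These are hints for a human verifier, not a classification: original
    sightings may also tag other users.
    """
    return {
        "has_rt_marker": bool(RETWEET_PATTERN.search(text)),
        "has_mention": bool(MENTION_PATTERN.search(text)),
    }
```

For instance, `originality_flags("RT @someone: aurora over Alaska")` sets both flags, while `originality_flags("Aurora right now! @friend look up")` sets only `has_mention`, which is the ambiguous case described above.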

Improving the voting system
When the community incorrectly positively verifies a tweet we assume an "honest mistake" rather than a "cheater" (i.e., someone with malicious intent) because there is no gain to poor verification (Hirth et al. 2013, Iren and Bilgen 2014). Therefore, a primary way to improve the accuracy of the crowd is to improve the information provided about the task and the desired outcome (Iren and Bilgen 2014). Aurorasaurus currently provides its community with instructions/guidance via a help page, blog post, and a quiz (where members of the community can test their voting skill and receive feedback on their choices). These are all "hidden elements," however, as a user may not have seen them before beginning to vote. Indeed, a recent survey of Aurorasaurus users showed that 40% did not know that instructions on how to verify tweets were available (Lalone pers. comm. 2015).
Enforcing training upon community members before they are able to vote has been shown to improve the quality of voting (e.g., Le et al. 2010). In some implementations, training results in a pass/fail that screens out untrustworthy or inaccurate users (Downs et al. 2010, Le et al. 2010). In others, the score attributed to each user's vote is weighted based upon how well they perform during the training (Sheng et al. 2014). We note, however, that these studies often employ contributors through Amazon's Mechanical Turk rather than volunteers in citizen science projects.
Because the Aurorasaurus project, like all citizen science projects, is reliant on volunteers, adding such compulsory activities might reduce the number of people who are willing to participate. Therefore, training that is not compulsory but that could be used to better inform the voting system on a user's trustworthiness might be desirable. For example, votes from anonymous users might be weighted to score 1, votes from registered users who have not taken the training might be weighted to score 2, votes cast by those who have taken the quiz but did not score highly might be weighted to 3, and votes from users who scored highly in the quiz might be weighted to 5. Project staff, or trusted super-users, might then have an even higher voting weight. This approach has the benefit of determining a pseudo-confidence level for each vote without erecting barriers to participation. Vuurens et al. (2011) demonstrated that a "combined consensus algorithm," which generally used a majority vote but then took into account the voters' trustworthiness in a tie situation, consistently provided the most accurate results. A tied result, with respect to the Aurorasaurus crowdsourcing system, would be one where the number of votes cast is over the verification threshold but the score has not reached that threshold (e.g., 10 users vote, five "yes" and five "no," resulting in a score of 0).
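The weighting scheme suggested above could be sketched as follows. The weights of 1, 2, 3, and 5 are the example values from the text; the category labels, function names, and the simple tie-break rule (a loose reading of the combined consensus idea, not the algorithm of Vuurens et al.) are assumptions made for illustration.

```python
# Example weights from the text; the category labels are hypothetical.
VOTE_WEIGHTS = {
    "anonymous": 1,
    "registered_untrained": 2,
    "quiz_low_score": 3,
    "quiz_high_score": 5,
}


def weighted_score(votes):
    """Sum weighted votes; each vote is a (voter_category, "yes" | "no") pair."""
    score = 0
    for category, vote in votes:
        weight = VOTE_WEIGHTS[category]
        score += weight if vote == "yes" else -weight
    return score


def decide(votes):
    """Use a plain majority vote, falling back to the weighted score only on a tie."""
    raw = sum(1 if vote == "yes" else -1 for _, vote in votes)
    if raw == 0:
        raw = weighted_score(votes)
    if raw > 0:
        return "yes"
    if raw < 0:
        return "no"
    return "tie"
```

With these weights, one anonymous "yes" against one high-scoring "no" is a tie on the raw count, but `decide` resolves it to `"no"` because the weighted score is 1 - 5 = -4.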
The training, and subsequent vote weighting, is likely to be a one-time effort (although, in practice, users could be allowed to complete it more than once). One-time training could lead to situations where users forget what they have been taught or their voting is affected by other factors (e.g., fatigue or lack of concentration). To help mitigate the effect of "bad votes" from a trained user, an adaptation of the "majority decision" cheat-detection method (Hirth et al. 2013) could be employed. If a member of the community votes against the current majority decision or the decision of a trusted voter (e.g., staff or super-user), they are advised in real time and offered training/guidance on how they should vote. The frequency with which a user matches or does not match the majority can be stored, allowing a hybrid voting reputation to be built (Voyer et al. 2010). Based on this reputation, voting weights could again be applied.
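One way such a reputation might be tracked is sketched below. This is an illustrative data structure only, not the method of Hirth et al. or Voyer et al.; the class and method names are hypothetical, and the "reference decision" stands in for either the current majority or a trusted voter's decision.

```python
from collections import defaultdict


class ReputationTracker:
    """Track how often each user's vote agrees with a reference decision."""

    def __init__(self):
        self._agree = defaultdict(int)
        self._total = defaultdict(int)

    def record(self, user_id, vote, reference_decision):
        """Record one vote; return True if the user should be offered guidance."""
        self._total[user_id] += 1
        if vote == reference_decision:
            self._agree[user_id] += 1
            return False
        return True

    def agreement_rate(self, user_id):
        """Fraction of the user's votes matching the reference, or None if no votes."""
        total = self._total.get(user_id, 0)
        if total == 0:
            return None
        return self._agree.get(user_id, 0) / total
```

The agreement rate could then feed into a vote weight, although how a rate should map to a weight is a design decision the text leaves open.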
In addition to improving the voting mechanism itself, another way to increase the quality of the verification process could be to improve the chance of a tweet being a valid sighting before presenting it to the community for validation. The current system simply uses a set of keywords for searching and another set for filtering. Machine learning, based on either a gold standard set or the community's voting, might improve the quality of the tweets being served to the community (Wang 2010, Becker et al. 2011, Truong et al. 2014). This approach was tested early in the Aurorasaurus project; however, it failed to yield any noticeable improvements (MacDonald, pers. comm. 2015), indicating that further refinement may be needed before such an approach could be applied to this task successfully.

Conclusion
Like many citizen science projects, Aurorasaurus is heavily reliant upon a community of volunteers for providing data and for validating/classifying data. To complement the aurora sightings reported directly to the project, Aurorasaurus also systematically searches for observations of an aurora posted on Twitter, using the Twitter Search API and several rudimentary filters. A location is required for all sightings, so those tweets that do not contain an embedded location are passed through a location extraction algorithm that attempts to resolve a location for the tweet based upon its text. This process, while not always accurate, increases the number of usable tweets four-fold. Using a similar location extraction process is therefore recommended for other citizen science projects needing location data from tweets. Including Twitter as a data source has increased the number of observations for the Aurorasaurus project by nearly 100%. Exploiting Twitter as an available data source is therefore recommended for other citizen science projects that collect observational data.
Twitter observations are noisier than traditional citizen science reports, however, so they need more curation by both the volunteers and project staff. The Aurorasaurus community is therefore encouraged to verify these potential sightings using a simple crowdsourcing scoring system. The community is rewarded for its participation by a leader board, where each vote earns the volunteer 5 points, and by increased accuracy in localized auroral visibility alerts.
This Aurorasaurus case study has shown that volunteer citizen scientists are extremely adept at filtering out spam-like tweets and other non-aurora sightings. These tweets tend to form the majority of tweets presented to the Aurorasaurus community, especially during times with little auroral activity. For the random sample studied, the NPV of the "negatively verified" tweets was almost 100%. A good NPV is perhaps unsurprising, as filtering spam is a relatively easy task, though such a high score was somewhat unexpected. The volunteer community proved to be less accurate when identifying the true aurora sightings. The PPV, or precision, of the positively verified sightings was somewhat poor at 37%. The most common reason for the community incorrectly positively verifying a tweet was that the tweet was not real-time, followed by the tweet not being an original sighting.
While positively verifying tweets requires more detailed investigation than filtering out spam-like tweets, the PPV achieved certainly could be improved. As discussed, incorrect identifications were likely the result of honest mistakes, so the primary way to reduce them is to provide training for the community. Aurorasaurus does provide some training, although it is not compulsory. The "verifying tweets quiz," which is the only interactive training offered, is detached from the verification process in that it is a completely separate entity and is not linked in the "help" pop-up text (see Figure 1) when verifying tweets. Making any training compulsory will likely reduce the number of users who then participate in the verification process (Lintott, pers. comm. 2015). This is a quality-control cost that many projects must deal with (Iren and Bilgen 2014). However, small improvements, such as providing a link to the quiz during the verification process, are likely to increase the community's accuracy, even if just a little, without affecting the number who are willing to participate.
Larger, systematic improvements, such as implementing vote weighting algorithms or adapting a real-time majority decision cheat-detection system, are likely to significantly improve the quality (particularly the PPV) of the community's verification efforts. Such improvements will take time and resources to implement but should be on the future road map for the project.
The results of this case study suggest that other citizen science projects that plan to use volunteer crowdsourcing for data validation, especially for "noisy" data (e.g., tweets), should consider using some of the training or quality-control methods that we describe here. The information provided on Twitter by citizen scientists, and then verified by other volunteers, can be extremely useful. However, consideration must be given to training those volunteers who validate the data or else the accuracy of the crowd may be poor.

Figure 1: a) An example tweet as presented to the Aurorasaurus community for verification. The volunteers are asked "Did they just see the aurora?" and are given the two simple options of "yes" (for a positive, real-time, aurora sighting) or "no." b) Once a threshold positive score is reached, the tweet is confirmed as a "positive sighting" and becomes known as a "positively verified tweet." It is then no longer available for further voting.

Figure 2: The distribution of tweets collected during March and April 2015. The first (blue) bar indicates the total number of tweets collected. The second (orange) shows the number of tweets with an associated location and thus available for the Aurorasaurus community to vote on. The third (green) bar shows the number of positively verified tweets, while the fourth (red) shows the number of negatively verified tweets. The final (gray) column is the number of tweets that were not verified (i.e., "unverified").

Figure 3: The distribution of positively verified tweets collected during March and April 2015. The tweets are grouped by the previous categories: valid (green), not real-time (red), not original (yellow), overlap (orange), wrong location (blue), not a positive sighting (black), and junk (purple).

Figure 4: The positively verified tweets have been split into three active auroral time periods and one non-storm period.For each period, the percentage share of each category listed earlier is shown.

Table 1: Tweet numbers and verification accuracy, split by periods of auroral activity.