Open Access Research Infrastructures are Critical for Improving the Accessibility and Utility of Citizen Science: A Case Study of Australia’s National Biodiversity Infrastructure, the Atlas of Living Australia (ALA)

Erin Roger; Dax Kellie; Cameron Slatyer; Peter Brenton; Olivia Torresan; Elycia Wallis; Andre Zerger

Introduction

Globally, citizen science has experienced massive growth and national support over the past few decades (; ; ; ), made possible by the availability and advancement of technology such as smart phones and machine learning, that allows millions of people to contribute to science (; ). Citizen science growth has also been supported by global efforts to open up science, such as UNESCO’s Open Science Recommendation (). Compared with other applied sciences, citizen science is particularly dominant in the environmental and ecological sciences (; ), with the most common form of citizen science participation being the collection of species occurrence records, or observations of one or more species in a geographic place and time (). While technological developments have facilitated substantial improvements in the ability to record observations, the rapid pace of this growth has brought about new challenges for citizen science information management. The benefit of citizen-contributed data is most evident when these data are integrated with other datasets to create “big data” that is subsequently used to make large-scale science-based decisions, to answer ecological questions, and to solve problems by helping make sense of the data (; ). Therefore, the ability to access and process data is increasingly important as advances in digital technology result in ever-larger datasets, and citizen science grows in its popularity, both for participants and as a way to collect data.

On a project or application basis, however, creating custom infrastructure to standardise thousands, or even millions, of data points each day becomes a complex and constant task that can incur time, cost, and maintenance overheads. To avoid these overheads, practitioners are increasingly interested in established frameworks and infrastructure for designing, managing and communicating projects (). Research infrastructures (RIs) are facilities providing resources and services to support research and innovation at large scales (). In the field of biodiversity data, RI can be defined as a collection of digital tools and services that allow users to submit, query, extract, and analyse big data (). RI can amplify the value of small citizen science projects by aggregating multiple data types (e.g., observational, bio-acoustic, genetic, media) into larger units of analysis, enabling users to look (for example) at population trends or distribution, thereby enhancing the scientific value of the citizen science contribution. In the field of biodiversity informatics, emerging evidence suggests that established RI eliminates key barriers of closed and siloed data by making data open and accessible, and increases access to information on biodiversity (; ).

These findings emphasise that RIs are vital enablers of citizen science, facilitating more robust workflows, data standards, and data quality (). We suggest that RIs are citizen science enablers by: 1) aggregating data from multiple citizen science projects and platforms; 2) combining citizen science with sources, such as collections and scientific survey data; 3) providing the technological capacity to handle the enormous influx that some citizen science sources (e.g., biodiversity recording apps) generate; and 4) making data centrally available to research and decision-making in a consolidated form (; ). Investment in well-designed and well-resourced digital RI is, therefore, crucial to improve trust and to assure scientific quality of environmental citizen science as it continues to grow globally (, ; ).

Major global RI with substantial citizen science biodiversity contributions include the Atlas of Living Australia (ALA), Global Biodiversity Information Facility, the Living Atlases, iDigBio, LifeWatch, iBOL, and eBird (). Other broader domain examples include eu-citizen.science (), which is a pan-European platform for sharing citizen science projects, resources, training, and tools, and CitSci.org (), a United States–based multi-project data collection platform specifically developed for citizen science data collection. Many smaller platform examples also exist to support individual projects in collecting and distilling biodiversity information (e.g., ). Aggregators regionally may include citizen science records—or citizen science records that have undergone additional validation—in their datasets, but there are relatively few examples (namely those listed above) at national or continental scale that support open source citizen science projects with collection, discoverability, and aggregation ().

Here, we select one digital research infrastructure, the ALA, and present an overview of how it supports environmental citizen science at a continental scale (i.e., Australia). To demonstrate the value of RI for citizen science, we visualise the volume of data supplied through citizen science in the ALA over time. We then provide five species distribution maps to demonstrate the value of large data aggregation in presenting a richer picture for both common and range-restricted species. We also present statistics for government and industry usage to provide evidence of the benefits of aggregation and discoverability. We present some of the common challenges of citizen science, and detail how the ALA provides data services and platform development to help address them, as well as providing cost benefit by aggregating services for the benefit of multiple users. Our aim is to demonstrate the value of sustained, long term RI support to maximise the potential of citizen science.

The Atlas of Living Australia (ALA)

Established in 2010, the ALA is Australia’s national biodiversity aggregator. It collates environmental data from varied sources including museums, herbaria, government monitoring programs, research projects, Indigenous knowledge, and citizen science projects and platforms (). The ALA also delivers research services allowing data download, visualisation, manipulation, and ecological analyses through a “spatial portal” to more than 80,000 Australian and international users annually. ALA end users include communities, non-governmental organisations, government, industry, and research institutions. The ALA was established on open access principles () and has developed processes to ensure that its infrastructure is accessible to users, connected with international databases, and interoperable with many online data services. Approximately 40 staff work for the ALA with a budget of over $5 million (AUD) annually. The ALA receives most of its funding from the National Collaborative Research Infrastructure Strategy program () and the Commonwealth Scientific and Industrial Research Organisation (), with additional revenue obtained through partnerships with other government departments and initiatives. At a global scale, the ALA engages in technical and strategic collaboration with (and is the Australian node for) the Global Biodiversity Information Facility (GBIF), an international biodiversity data repository (). To support other countries, a community of national atlases based on the ALA’s open-source code has also been developed. Twenty-seven Living Atlas sites are now online with more under development (). At present, the ALA contains over 115 million species occurrence records of more than 150,000 species (including animal, fish and reef species, and land and marine plants) ().
About one quarter of research and monitoring projects that supply data to the ALA are citizen science (Figure 1a), and about half (50.4%) of all species occurrence records in the ALA are derived from citizen science projects (Figure 1b). These numbers demonstrate the enormous contribution citizen science makes to overall biodiversity records in Australia.

Figure 1

Proportion of (a) research or monitoring projects that provide data to the ALA that use citizen science or not, and (b) the proportion of species occurrence records derived from citizen science or non–citizen science projects. Data displayed includes both publicly available and embargoed project data held in BioCollect as of February 2022. The BioCollect platform is an event-based data recording system in which individual recording events can yield many occurrence records. Total numbers of species occurrence records from citizen science projects in BioCollect projects were estimated by aggregating the counts of embargoed and unembargoed occurrence records for each project.

Growth and utility of citizen science data in the ALA

Although the ALA has grown to provide more services that support citizen science, there has been limited analysis of the extent of growth of citizen science data within the ALA and its implications. To explore the proportion of data contributed by citizen science in the ALA from 2010 to 2021 (Figure 2), we extracted the number of openly available species occurrence records supplied to the ALA from research or monitoring projects using the statistical computing program R (version 4.1.1) and the code-based galah package (; ). Projects were categorised into citizen science and non–citizen science by matching the project name with dataset categories. Projects that did not match were manually categorised by checking project descriptions on respective project websites. We defined non–citizen science data sources as those that have only experts or professionals collecting data and do not typically engage non-expert public members to do so (e.g., museum and herbaria specimens, professional monitoring programs), although we acknowledge that citizen scientists may also contribute specimens to museums and thus contribute through this mechanism. Code to reproduce our figures can be found on the Open Science Framework ().
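The categorisation and yearly proportion steps described above can be sketched as follows. This is a minimal illustration in Python rather than the R and galah code used for the figures; the lookup table, the record counts, and the function names are hypothetical and stand in for the ALA dataset categories and the manual checks of project websites.

```python
from collections import defaultdict

# Hypothetical lookup of project names to categories (an assumption for
# illustration; the study matched names against ALA dataset categories).
KNOWN_CATEGORIES = {
    "iNaturalist Australia": "citizen science",
    "FrogID": "citizen science",
    "Museum specimen collection": "non-citizen science",
}


def categorise(project_name, manual_overrides=None):
    """Return the category for a project, or 'uncategorised' if nothing matches."""
    manual_overrides = manual_overrides or {}
    if project_name in KNOWN_CATEGORIES:
        return KNOWN_CATEGORIES[project_name]
    # Projects without an automatic match fall back to a manual decision.
    return manual_overrides.get(project_name, "uncategorised")


def yearly_proportions(counts, manual_overrides=None):
    """counts: iterable of (year, project_name, n_records) tuples.

    Returns {year: fraction of that year's records from citizen science}.
    """
    totals = defaultdict(int)
    citizen = defaultdict(int)
    for year, project, n_records in counts:
        totals[year] += n_records
        if categorise(project, manual_overrides) == "citizen science":
            citizen[year] += n_records
    return {year: citizen[year] / totals[year] for year in totals if totals[year]}
```

Keeping the manual decisions in a separate mapping makes it easy to audit which projects were categorised by hand.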

Figure 2

Proportion of the total number of records added to the Atlas of Living Australia from 2010 to 2021 collected using citizen science or non–citizen science methods.

Here we show that while the comparative contribution of citizen science–gathered observational records grows in volume over time, sources of specimen-based data (e.g., from museums and herbaria) are increasing at a slower rate (Figure 2). That specimen-based collections data should grow in volume more slowly than observational records is unsurprising, given the additional labour involved and that the rate of species discovery has slowed. It does indicate, however, that the number of species occurrence records contributed through citizen science can supplement data from museums, collections, and professional monitoring programs. Similar trends were reported in Heberling et al. ().

To demonstrate the value of large data aggregation, we downloaded available ALA records of five species: the shingleback lizard, Peron’s tree-frog, red-browed firetail, Richmond birdwing, and Black Rockcod. The first three are widespread, commonly encountered species (N > 10,000 records) with varying detectability, whereas the last two are range-limited species (N < 1,000).

The shingleback (Tiliqua rugosa) is a recognisable large ground skink that occurs across much of the continent. This is the only taxon in the group that does not have one or more major citizen science projects monitoring it. Peron’s tree-frog (Litoria peronii) is a southeastern Australian frog and subject to one major national citizen science project and several longer-term regional projects. The red-browed firetail (Neochmia temporalis) is a less immediately recognisable small bird occurring in forests and woodlands of eastern Australia and subject to two major avian citizen science projects, one that has been running since the early 1980s (). The last two species are range-limited species of variable detectability: the Richmond Birdwing (Ornithoptera richmondia) is a showy threatened butterfly from central eastern Australia, and the Black Rockcod (Epinephelus daemelii) is a less obvious threatened marine fish occurring down the east coast and in several external Australian territories.

We partitioned the data into citizen science and non–citizen science (using the project categorisations used for Figure 2). For T. rugosa, L. peronii, and N. temporalis we display locations where there are more records from one source than another (Figure 3). For T. rugosa (Figure 3a), distribution data was strongly driven by professionally acquired data (state agencies and museums), with 18.5% of the distribution principally being contributed by citizen science. Effectively, the two sources are complementary, with aggregated citizen science records providing supplemental data on distribution that would otherwise be underrepresented by data from only non–citizen science sources. We suspect the absence of a major reptile-focused citizen science program to be a major factor in this result. However, FrogID (), a major citizen science project in Australia, has profoundly influenced the distribution of L. peronii (Figure 3b). Citizen science records contribute the majority of available records for around one-third of its distribution area (33.0% of hexagons). Across the continent, citizen science is the major source of L. peronii data overall. Combining citizen science and non–citizen science data sources has major potential implications for population monitoring in species such as this. Tulloch et al. () noted the requirements for citizen science to usefully contribute to population monitoring, and it is pertinent to note that the bulk of the citizen science data collected within the FrogID project has been constrained to expert identification and accurate time and geocoding in line with these requirements. For N. temporalis, subject to multiple, large-scale, mature citizen science projects, this trend is continued (Figure 3c). 
Over 86.5% of hexagons are principally contributed by citizen science projects, and the scale of the avian dataset (close to 50% of the total records in the ALA) demonstrates the ability of research infrastructures to combine and interrogate multiple large datasets. Again, the data show that over much of the species range it is large volumes of citizen science data that contribute the majority of information on occurrences, which could inform population monitoring.

Figure 3

Distribution of (a) Tiliqua rugosa (shingleback skink); (b) Litoria peronii (Peron’s tree frog); and (c) Neochmia temporalis (Red-browed firetail). Map displays locations with a greater number of species observation records collected using citizen science methods (green) or non–citizen science methods (purple). A darker colour of hexagon corresponds to a greater difference in the number of records between citizen science and non–citizen science records. Record counts are pseudo-log transformed to allow for log standardisation of both positive and negative numbers.
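The caption does not state which pseudo-log transform was applied. A minimal sketch, assuming a commonly used inverse hyperbolic sine form, shows how a signed difference in record counts per hexagon can be placed on a log-like scale. The function names and the `sigma` and `base` parameters are illustrative assumptions.

```python
import math


def pseudo_log(x, sigma=1.0, base=10.0):
    """Signed, log-like transform that is defined for zero and negative values.

    Behaves roughly linearly near zero and like log_base(|x|) for large |x|.
    """
    return math.asinh(x / (2.0 * sigma)) / math.log(base)


def hexagon_difference(citizen_count, non_citizen_count):
    """Positive when citizen science records dominate a hexagon, negative otherwise."""
    return pseudo_log(citizen_count - non_citizen_count)
```

An ordinary logarithm would fail here because the difference between the two sources is zero or negative in many hexagons.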

The results for O. richmondia and E. daemelii (Figure 4a,b), alternatively, show the contribution of RI aggregated citizen science in the case of species with limited distribution information (N < 1000 records). In both examples, aggregated citizen science data has contributed to at least half of the data for these species, and significantly added to the spatial coverage of known locations. In Figure 4a, aggregated citizen science records (particularly iNaturalist [] and Butterflies Australia []) contribute 55% of data for O. richmondia and add additional new locations in the north, centre, and south of the total range. In Figure 4b, aggregated citizen science data comprises 63% of observations for E. daemelii, particularly the Australasian Fishes project (within iNaturalist) and Reeflife Survey (). The citizen science data significantly infills known locations for the central third of the species range.

Figure 4

Distribution of two range-limited species, the (a) Ornithoptera richmondia (Richmond Birdwing); and (b) Epinephelus daemelii (Black Rockcod). Map displays observations of both citizen science and non–citizen science records as well as the total number of records for each species by reporting category.

Distribution data on threatened species such as the above examples underpin conservation management decisions and the environmental approvals process for development. Internal ALA data analysis of the period 2016–2020 shows the importance of aggregation to end users (C. Slatyer, unpublished data). Government and business downloads from the ALA average around 565,0000 records per annum, with over 98% of downloads for the purposes of environmental assessment. Of these, 2,000 downloads per annum were from a wide range of business sectors; however, over 72% came from businesses engaged in natural resource management or environmental assessment. Government usage was even stronger. About half of this usage (N = 2,959 per annum) represents users from all of the land management agencies (federal and state/territory) in Australia. The number of land management agencies that use the ALA is 187. This demonstrates that by acting as a discoverable aggregator, RI exposes citizen science data to both reporting and policy-making in a demonstrably meaningful way, contributing directly to researchers, decision-makers, and public users to an extent that individual projects cannot match.

Addressing common challenges to citizen science

Data quality

Citizen science can generate large volumes of data over space and time. However, much of the reluctance associated with the use of citizen science data within the scientific community surrounds uncertainty about its accuracy (; ). For example, observers vary in skill and proficiency in species identification, which can affect the reliability of their data (). Variation in skill may also influence how broadly observers identify species, resulting in taxonomic biases, with experts more likely to specialise in one particular taxonomic group than non-experts (). Many citizen science projects also function independently and sometimes do not adequately ascribe metadata to describe the datasets and methods (). Trust and credibility in the accuracy of citizen science is vital for its uptake as a reliable data source for long-term ecological research and informed decision-making.

To begin to address concerns over data quality, the ALA undertook a project focused on improving the display of data attributes. In the ALA’s occurrence data repository, we made data-quality attributes visible and any uncertainty about taxonomic and spatial accuracy of aggregated data more explicit through the use of automated data filters (). This project led to the creation of a toolbar that enables users to filter species occurrence records on the ALA website’s search results page according to data-quality criteria. The toolbar improves the visibility of metrics and metadata that allow users to assess whether data are fit for purpose, and allows users to exclude records based on system flags such as location uncertainty, environmental outliers, duplicate records, scientific name quality, and spatial quality issues. The system flags were generated by the ALA and produced by picking up on spatial inaccuracies (e.g., outliers) as well as employing basic statistics to address discrepancies. By creating data-quality filters, the ALA has added more information for users to understand what constitutes high-quality data while continuing its support for data providers to improve collection and curation of data. A full description of the automated data quality tests is available ().
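Flag-based filtering of the kind described above can be sketched as follows. This is a simplified illustration, not the ALA implementation: the flag names and the record layout are hypothetical and only mirror the categories of system flags listed in the text.

```python
# Hypothetical flag names mirroring the categories named in the text.
DEFAULT_EXCLUDED_FLAGS = frozenset({
    "high_location_uncertainty",
    "environmental_outlier",
    "duplicate_record",
    "poor_name_match",
    "spatially_suspect",
})


def filter_records(records, excluded_flags=DEFAULT_EXCLUDED_FLAGS):
    """Split records into those kept and a tally of why others were excluded.

    Each record is a dict with an optional 'flags' collection of flag names.
    """
    kept = []
    dropped_by_flag = {}
    for record in records:
        hits = set(record.get("flags", ())) & set(excluded_flags)
        if hits:
            for flag in hits:
                dropped_by_flag[flag] = dropped_by_flag.get(flag, 0) + 1
        else:
            kept.append(record)
    return kept, dropped_by_flag
```

Returning the tally alongside the kept records keeps the filtering transparent, so users can see how many records each criterion removed rather than only the filtered result.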

Data standardisation

The data management needs of citizen science programs vary widely. Some projects record additional data along with species occurrences that have additional data requirements. For example, some projects collect multiple observations for a single collection event, record additional event and observation-level metadata, and provide data for specific uses, species, or maps (). These different types of data collection result in data that require different data structures, including auxiliary environmental data, survey effort information, species attributes, and site characteristics (). It is not unusual for projects to need more than one type of survey and, therefore, require multiple descriptive schemas. RIs must be able to support these differing survey and data structures through adequate standardisation methods for these data to be stored, combined, and used correctly.

The ALA’s data model is underpinned by Darwin Core, the most common global standard for exchanging biodiversity occurrence data (). Darwin Core requires that data be supplied in a consistent format with standardised column names that allow many sources of data to be aggregated efficiently (e.g., using Ecological Metadata Language files, ). The global biodiversity informatics community is targeting the need to provide standards and vocabularies for augmenting stand-alone observation data with the evidence used to assert those observations, including the measurement of effort. Long-term solutions that meet both local and international needs are being explored in partnership with several data communities via GBIF’s Unified Data Model (). The ALA is developing a new system for navigating ecological survey event data that includes new vocabulary terms developed in collaboration with global partners.
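As an illustration of why standardised column names make aggregation easier, the sketch below renames a project's own column names to Darwin Core terms and checks that a minimal set of terms is present. The source column names, the choice of required terms, and the function name are assumptions made for this example; the Darwin Core terms themselves (`scientificName`, `decimalLatitude`, `decimalLongitude`, `eventDate`) are standard.

```python
# Hypothetical project-specific column names mapped to Darwin Core terms.
COLUMN_MAP = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
}

# Minimal set of terms this sketch treats as required (an assumption).
REQUIRED_TERMS = ("scientificName", "decimalLatitude", "decimalLongitude", "eventDate")


def to_darwin_core(row, column_map=COLUMN_MAP):
    """Rename a row's keys to Darwin Core terms and check required terms exist."""
    mapped = {column_map.get(key, key): value for key, value in row.items()}
    missing = [term for term in REQUIRED_TERMS if mapped.get(term) in (None, "")]
    if missing:
        raise ValueError("missing Darwin Core terms: " + ", ".join(missing))
    return mapped
```

Once every contributing project is expressed in the same terms, records from different sources can be concatenated without project-specific handling.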

The intention of these data standards is to promote ongoing data quality and to ensure that citizen science data are consistent for research and government use. Improving how we quantify uncertainty with citizen science data is essential to its use. In the case of citizen science, uncertainty originates from methodological errors or biases in data collection, classification, or processing, as well as from expected natural variation in ecological data (). Tracking uncertainty may ensure that the related variables and biases are findable, accessible, interoperable, and reusable (). Confidence in input data (and complexity) is required by governments to meet legislative requirements or for listing species using the IUCN Red List () criteria, for example. One way to make uncertainty transparent is to integrate uncertainty associated with species identification into a data standard like Darwin Core, making this metadata easier to find when aggregated by RI like the ALA. Species distribution modelling can tolerate and account for certain levels of error in the modelling data (), but data aggregators have a large role to play in making complexities about data uncertainty clearer to continually improve confidence and fit-for-purpose use of citizen science data in the future.

To address the needs of complex systematic surveys that record hierarchical data that link observations to samples, field sites, and surveys over time, the ALA developed BioCollect (), a universal field data collection web-based platform that supports both citizen science and non–citizen science users. BioCollect allows people to create projects with one or many surveys, and enables the surveys to be configured according to the specific requirements for each given application. Configurable parameters include: spatial, temporal, and taxonomic scope of the survey; how sites can be created; who can participate in the survey and how; how records can be verified; and selection of the data schema itself. Attribute-level mappings to the Darwin Core standard are built into each schema. In this way, BioCollect stores custom data structures that link multiple species observations to a single survey event, along with contextual information on the site, sample, or event itself. An example of a citizen science activity in Australia using the BioCollect platform is the Waterwatch program () (additional examples at ).
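The event-based structure described above, in which one survey event carries many species observations, can be sketched as follows. The class and field names are hypothetical and do not reflect BioCollect's actual schema; the output keys are standard Darwin Core terms.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    scientific_name: str
    individual_count: int = 1


@dataclass
class SurveyEvent:
    event_id: str
    site_name: str
    event_date: str
    observations: list = field(default_factory=list)


def flatten(event):
    """Expand one survey event into one Darwin Core style record per observation.

    Each record keeps the shared eventID so observations remain linked to
    the survey context they came from.
    """
    return [
        {
            "eventID": event.event_id,
            "occurrenceID": f"{event.event_id}-{index}",
            "eventDate": event.event_date,
            "locality": event.site_name,
            "scientificName": obs.scientific_name,
            "individualCount": obs.individual_count,
        }
        for index, obs in enumerate(event.observations, start=1)
    ]
```

Carrying the event identifier on every occurrence record is what allows survey effort and site context to be recovered after aggregation.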

BioCollect has also been used by numerous Indigenous ranger communities to standardise and aggregate their field expedition data and has assisted in collective implementation of common data collection protocols. Citizen science activities with Indigenous people require careful consultation with communities and consideration of the CARE and FAIR principles () to ensure that data sovereignty remains in the hands of custodians. Platform configurations for these projects enable each ranger community to preserve its own data sovereignty and to manage information sharing. These are first steps; however, much remains to be done to develop more equitable and inclusive means of displaying and recognising the enormous data holdings already within ALA contributed by Indigenous peoples, such as biodiversity observations.

Sensitive species data

RI support for citizen science creates the opportunity for standardised solutions and improved real-world outcomes across multiple projects. However, there are a wide range of ethical and scientific dilemmas around open-access species data and methods (; ). For example, healthy populations of sensitive species can be potentially jeopardised if exact locational details are exposed (i.e., species at higher risk of over-collection, damage, or disturbance) (). Most government agencies and collections manage this risk by withholding or obscuring record locations, which, while necessary, may counterintuitively inhibit good conservation outcomes by limiting availability for research, community, or government action (). ALA has found lists of sensitive species are rarely universally agreed-upon between data custodians, leading one data provider to protect species locations that another data set is exposing. This is a challenging issue. For example, despite the common list of sensitive species adopted between iNaturalist and the ALA based on Australian expert-derived or statutory lists, these lists are often not consistent with third-party expert-derived lists. Effectively, an unscrupulous individual can compare exposed public datasets and find enough unobscured points to locate a narrow range endemic species. Another issue is multiple obfuscation, where data custodians restrict or blur the exact location of sensitive species coordinates. Different data custodians rarely use consistent methodologies for obfuscation, and in some cases, do not include metadata to indicate obfuscation has been applied. Sometimes, obfuscations are unknowingly applied a second time in aggregated datasets leading to the same observation occurring at multiple geographic locations. These data issues remain at best only partly addressed by most large data aggregators and citizen science projects.

The ALA actively acknowledges data issues with sensitive species and obscures locality data for those species appearing on government-provided sensitive species lists as well as on third-party sensitive lists (as is the policy for other large data aggregators). These filters remove the need for individuals to provide their own data sensitivity methods in favour of a consistent national approach. The ALA is leading a multi-partner project to develop an agreed-upon national framework and data service for the handling and sharing of sensitive species records by developing a shared set of definitions of what can be called sensitive data, national protocols for sharing data, common approaches to obfuscating data publicly, and a centralised national point from which to request data from multiple custodians. We anticipate once in place, these will greatly enhance the Australian data custodians’ ability to share and analyse sensitive species information, a large percentage of which has been contributed by citizen scientists.
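One way to avoid the repeated obfuscation problem described in this section is to record, on the record itself, that a generalisation has already been applied. The sketch below is a simplified illustration and not the ALA's method: it rounds coordinates to a coarser precision and notes this in the Darwin Core `dataGeneralizations` term, so a second pass leaves the record unchanged. The function name and the default precision are assumptions.

```python
def generalise_location(record, decimals=1):
    """Return a copy of the record with coordinates rounded to fewer decimals.

    If the record already declares a generalisation, it is returned unchanged
    so that the same observation is not moved a second time.
    """
    if record.get("dataGeneralizations"):
        return dict(record)
    generalised = dict(record)
    generalised["decimalLatitude"] = round(record["decimalLatitude"], decimals)
    generalised["decimalLongitude"] = round(record["decimalLongitude"], decimals)
    generalised["dataGeneralizations"] = (
        f"Coordinates rounded to {decimals} decimal place(s)"
    )
    return generalised
```

Because the rounding is deterministic and documented in the record, aggregators receiving the same observation from two custodians can recognise that it has been generalised instead of treating the copies as separate locations.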

Aggregating tools and platforms—cost-benefits to small projects

Funding for citizen science has prioritised the development of new or scaled citizen science projects, usually through individual, modest project grants (). While this approach has successfully increased the number of projects, it has also created hundreds of bespoke, disconnected applications and platforms that can hinder the usefulness of citizen science datasets and can create duplication of effort (). Sometimes tools such as basic spreadsheets can be satisfactory for a citizen science project’s data collection needs, but these tools rapidly reach practical and functional limitations that eventually require project owners to seek more sophisticated solutions. Developing and managing customised project-specific platforms, or user interfaces for viewing and downloading data, is a costly and time-consuming undertaking. It also creates confusion amongst citizen science communities as to what are the best tools and platforms to adopt for projects. Many platforms devise their own unique data “standards” for data collections and processing. Without using a recognised data standardisation procedure, the result of this is many siloed, small datasets that cannot be used alongside one another. Projects set up in this way reduce the value proposition for citizen scientists who are motivated to contribute to a greater scientific endeavour. Using established research infrastructure platforms can therefore provide a viable alternative to creating custom project-specific platforms for citizen science data. We have provided a few examples below of how we have partnered to prevent duplication and to support open access tools and platforms.

The ALA has provided support for record-level observations since going live in 2010. At that time, there were very few Australia-specific mobile applications and tools for citizen scientists to easily record observations. In May 2019, the ALA began collaborating with iNaturalist, a globally leading biodiversity recording platform for citizen science, to form iNaturalist Australia (), a local node of the iNaturalist platform. iNaturalist allows participants to record opportunistic observations of any living organism with a date, time, and spatial coordinates via evidential photos and/or sound recordings (). This partnership has made it easier to report biodiversity observations to “research grade” (a qualified data quality standard based on an observation of a species in the wild with multiple confirmations of species identification). Around 8,278 million observation records of more than 416,000 species have been added to the ALA via iNaturalist () as of June 2023, and this number continues to grow weekly.

The Australian Citizen Science Project Finder () was developed by the ALA to fulfil an increasing demand from Australian communities for a searchable catalogue of citizen science projects. The goals of the Australian Citizen Science Project Finder are to: a) improve project discoverability; b) increase public participation through improved project discovery; and c) minimise project duplication, thus saving time, effort, and cost for participants. Since its launch in 2017, the Finder has grown to include 646 projects, with 518 of those listed as “active.” Through Application Programming Interfaces (APIs), the database of projects is also shared with other citizen science project search engines like SciStarter (scistarter.org), where projects that are not geographically bound (such as online projects) are made available to an international audience.

Finally, to provide a crowdsourcing solution to digitising records, the ALA in partnership with the Australian Museum developed DigiVol (), an online crowdsourcing platform with the goal of increasing accessible environmental biodiversity data through digitisation of historical records (e.g., specimen labels, hand-written field notes, and journals). DigiVol is an open-infrastructure source application that allows institutions from all over Australia, and globally, to create opportunities for volunteer citizen scientists to contribute to transcription and/or data capture from images for scientific research by transcribing analogue historical records into structured digital data. This includes museum and herbarium specimen labels, field notebooks, diaries and journals, and more recently, identifying animals in camera trap images and via sound recordings. Institutions or organisations can create an expedition (project) on DigiVol, thus preventing the duplication of investment in similar platforms by individual institutions and condensing effort into one standardised platform. Since its development in 2015, the number of registered citizen scientists using DigiVol has increased to more than 9,000 individuals.

Discussion

We have shown that the ALA derives more than half of its observations from citizen science projects (Figure 1), and have revealed the rapid growth of citizen science records since 2010 (Figure 2). By selecting a mix of common and range-restricted species we then showed that RIs act as a citizen science enabler (Figures 3 and 4) by aggregating and combining formal data sources to provide a much richer picture of species distribution. Our data breakdowns show the potential contribution of citizen science to understanding population size by providing data over significant proportions of species ranges (although there was variability in the strength of this contribution amongst species). The ALA has worked to address common criticisms of citizen science data by implementing tools to make data quality more explicit, to create platforms that standardise complex survey data structures and ensure that relevant metadata is recorded with datasets, to safely store data on sensitive species, and to establish ways for people to collect and discover citizen science data.

Our findings demonstrate the importance of investment in digital open access research infrastructure to optimise the scientific value of the citizen science movement globally. The value of digital RI lies in its low-cost accessibility and in its ability to implement national frameworks that target data gaps in a robust form across multiple projects at a continental scale (see for a conceptual framework). We have provided several examples of how RIs remove barriers to submitting and accessing citizen science data by addressing weaknesses like inconsistent metadata and data quality through standardisation. Although it carries costs, investment in digital RI harnesses the potential of citizen science and represents a high-impact opportunity for Australia’s biodiversity research and management at a relatively low operating cost.

Next steps for research infrastructure

Many citizen science projects function independently. However, a balance is required between supporting data collection needs of individual citizen science projects and supporting the data robustness needs of citizen science as a whole (e.g., integrating data, synthesising data across programs). For example, as a result of each project’s independence, there may be a lack of consistent metadata to describe the datasets and their methods adequately to people outside of the project (), an issue that eventually flows on to RI if the data are merged. If citizen science projects communicate their data management practices to large aggregators, then data quality can be assessed by what is appropriate for the data type (). The utility of citizen science data may be improved by establishing more universal criteria for metadata with the goal to synthesise independent project data into RI; this should be a focus of citizen science projects going forward.

The ability of RI to ingest, standardise, and store data from many unique sources makes them ideal for improving methods for species identification from images, videos, and sounds (). Projects like iNaturalist have demonstrated how emerging technologies like machine learning can improve data quality in species identification, which could be applied across multiple projects and platforms (). Citizen science can also draw on social media (), where content shared publicly outside of dedicated citizen science platforms can be used to contribute photographic records of species for identification (). By integrating data from diverse sources, research infrastructures facilitate the ability to capture richer information that may promote more complex scientific outcomes (). For example, RI have a role in supporting disaster resilience and recovery by helping to direct effort to under-surveyed or priority taxa for management action post disaster (). In order to support the sector, RI must be built to handle large data processing needs, with high-throughput processes and adaptable common infrastructure elements that can reduce project costs while improving data standards, consistency, and the ability to feed data into research and decision-making.

Despite the ALA’s success in supporting citizen science so far, the diverse nature of data types and their structural format has created digital management challenges for maintaining data standards, for sharing these standards with data providers, for ensuring proper attribution and privacy, and for budgeting for added service costs. For example, the principal data exchange standard for occurrences, Darwin Core, is not static but undergoes regular reviews and updates. Keeping both database formats and vocabularies up to date is a challenge for all data providers. There is also a need for improved consideration and policy of Indigenous data, and a question of how to better recognise and safeguard the data holdings contributed by Indigenous people (). A fundamental issue for research infrastructure providers is to deliver tailored services around individuals or project-level/community needs as well as to develop workflows and standards that can assist in best-practice data management (). The integration across data and platforms is crucial to enable more effective meta-analyses across citizen science projects, and to minimise duplication of projects and programs. With the adoption of global targets () and sustainable development goals (), country-led national initiatives that can be aggregated and linked to global facilities are becoming all the more critical in order to meet global targets and reporting requirements. RI remain the best mechanisms to help countries meet these global challenges.
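As a minimal sketch of the standardisation task described above, the following Python example renames one project's column names to Darwin Core terms and reports which terms are still missing. The term names (occurrenceID, scientificName, eventDate, decimalLatitude, decimalLongitude) are Darwin Core terms, but the project column names, the choice of which terms to treat as required, and both function names are assumptions made for illustration; they do not describe the ALA's actual ingestion pipeline.

```python
# Hypothetical mapping from one project's column names to Darwin Core terms.
FIELD_MAP = {
    "record_id": "occurrenceID",
    "species": "scientificName",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
}

# Terms this sketch treats as required; an assumption, not an ALA rule.
REQUIRED_TERMS = (
    "occurrenceID",
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
)


def to_darwin_core(record, field_map=FIELD_MAP):
    """Rename project-specific keys to Darwin Core terms; keep unmapped keys."""
    return {field_map.get(key, key): value for key, value in record.items()}


def missing_terms(dwc_record, required=REQUIRED_TERMS):
    """Return the required terms that are absent or empty in a record."""
    return [term for term in required if dwc_record.get(term) in (None, "")]


# An invented observation that lacks a longitude.
raw = {
    "record_id": "obs-001",
    "species": "Dacelo novaeguineae",
    "date": "2021-03-14",
    "lat": -33.87,
}

dwc = to_darwin_core(raw)
print(missing_terms(dwc))  # prints ['decimalLongitude']
```

Keeping the mapping and the required-term list as data rather than code means that a revision to the standard, or a new data provider, needs only a change to those two structures.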

Most RIs have automatic data pipelines that harvest data directly from other online platforms and portals, thereby negating the need for users to familiarise themselves with multiple interfaces. However, barriers to access still exist, including the relatively slow rate at which the vast amount of information held in non-digital formats is ingested. For example, many museum and herbarium specimens, as well as field data collection sheets, have yet to be digitised. There are initiatives to fund data mobilisation schemes for groups where time and money are often the overriding digitisation constraints. Other barriers to RI include a lack of familiarity with online portals and data entry mechanisms, a lack of skills in the manipulation and analysis of large datasets, and online access constraints, which are likely to manifest along sociological and economic lines (). Despite these challenges, RIs remain the best places to remove barriers to citizen science; removal may include providing data in a format where barriers to use, such as incomplete metadata and concerns over data quality, are addressed (). Digital RI have a responsibility, though, to continue to invest in improvements to intuitive user interfaces to try and minimise barriers to use and access as much as possible.

Conclusion

The past 10 years have witnessed impressive growth in the contribution of citizen science to biodiversity data, with citizen science now a permanent and expanding feature of the biodiversity data science landscape. Citizen science remains one of the most effective mechanisms for organisations to bring citizen amateur scientists closer to professional science and to make the results of science universally available (). As highlighted by the ALA case study, citizen science contributions can grow further with enough support over time from RI and national frameworks. RI cost-effectively provides data from multiple citizen science projects in a centralised digestible form, which is being picked up by government and business as well as by researchers. The benefit of citizen science data is most evident when these data are integrated with other datasets to create “big data.” Future opportunities exist to harvest citizen science contributions from new sources and to take advantage of new technologies that improve data quality and data aggregation. As such, RIs are an important tool for recognising and amending biases (taxonomic and spatial) in data collection over time. RIs have a role in helping to design national programs that are informed by known gaps and science needs (). A new frontier for citizen science could be working through processes and guidelines around how we can direct citizen science efforts to operate in a national framework around filling data gaps. The advancement of citizen science is interconnected with the advancement of digital research infrastructure, and resourcing both will ultimately lead to greater scientific value and use of citizen science data.

Data Availability Statement

The data and code are openly available on the Open Science Framework at https://osf.io/kh5rv/. We have added a renv.lock file in the OSF repository which records the R environment and package versions used to make the figures. This information was also saved as a text file – renv_text – as a reference document which can be opened and read within the repository.
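For readers who want to inspect the recorded environment without opening R, the sketch below shows one way to list package versions from a lockfile. It assumes the JSON layout that renv commonly writes (a top-level "R" entry with a "Version" field, and a "Packages" entry keyed by package name). The function name is hypothetical, and the example lockfile content is invented for illustration; it is not the content of the file in the OSF repository.

```python
import json


def summarise_lockfile(lock_text):
    """Return the R version and a {package: version} dict from renv.lock text."""
    lock = json.loads(lock_text)
    r_version = lock.get("R", {}).get("Version")
    packages = {
        name: info.get("Version")
        for name, info in sorted(lock.get("Packages", {}).items())
    }
    return r_version, packages


# Invented example content; the real renv.lock lists different packages.
EXAMPLE_LOCK = """
{
  "R": {"Version": "4.2.1"},
  "Packages": {
    "ggplot2": {"Package": "ggplot2", "Version": "3.4.0", "Source": "Repository"},
    "dplyr": {"Package": "dplyr", "Version": "1.0.10", "Source": "Repository"}
  }
}
"""

r_version, packages = summarise_lockfile(EXAMPLE_LOCK)
print(f"R {r_version}")
for name, version in packages.items():
    print(f"{name} {version}")
```

To read the actual file, pass the text of a downloaded renv.lock to the function in place of the example string. Within R itself, calling `renv::restore()` in the project directory is the usual way to reinstall the recorded package versions.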

Citizen Science: Theory and Practice

Essays

Abstract