Introduction

The 2030 Agenda for Sustainable Development was adopted by 193 countries at the 70th Session of the United Nations General Assembly. With its adoption, the United Nations member states and multilateral organizations recognized the need for data to monitor progress towards the 169 targets and 17 goals. The global framework for the Sustainable Development Goals (SDGs) comprises 231 unique indicators (counting sub-indicators and disaggregations, the total exceeds one thousand) (). As the most relevant global policy agenda until 2030, it has increased the demand for quantitative measures on a scale not previously experienced, posing additional challenges for national statistical systems (NSSs) worldwide and opening opportunities for innovation in sources and methods to satisfy data needs not met by traditional ones.

Considering the additional data needs for the SDGs adopted at a global level and defined as part of the national SDGs policy, set out in the Public Policy Document CONPES 3918 of 2018, the National Statistical Office of Colombia (DANE, acronym in Spanish) has been working with custodian agencies on strategies to fill data gaps (). Two of these identified gaps are SDG 16 indicators: SDG 16.b.1, Proportion of the population reporting having personally felt discriminated against or harassed in the previous 12 months on grounds of discrimination prohibited by international human rights law (); and SDG 16.7.2, Proportion of population who believe decision-making is inclusive and responsive, by sex, age, disability, and population group ().

DANE has deployed a strategy to measure some aspects of these indicators using traditional sources like the Victimization Survey; the Coexistence and Citizen Security Survey, carried out every two years; and the Political Culture Survey (ECP, by its acronym in Spanish), the latter of which has become a barometer of the perceived impact of public policies on the consolidation of democracy in the country (). However, the periodicity of these sources limits a fuller description of the discrimination phenomenon in the country: no annual data is available from DANE, and no other national institution reports this information for the forms of discrimination as defined in the metadata. The same is true in other parts of the world. Discrimination is defined as “any distinction, exclusion, restriction or preference or other differential treatment that is directly or indirectly based on prohibited grounds of discrimination, and which has the intention or effect of nullifying or impairing the recognition, enjoyment, or exercise, on an equal footing, of human rights and fundamental freedoms in the political, economic, social, cultural or any other field of public life” (); only 31 countries reported information on it over the period 2014–2019. In that period, one in five persons reported having personally experienced discrimination on at least one ground of discrimination prohibited by international human rights law (). On the one hand, this is a high figure, considering how few countries report; on the other hand, the small number of reporting countries is itself a challenge for global monitoring of the phenomenon.

According to Nicolas Fasel, chief statistician at UN Human Rights, “States need to tackle discrimination more comprehensively and address its overlapping and cumulative forms as well as its consequences on everyday life. The collection of disaggregated data, using a human rights approach is a first step that can go a long way to tackling this” ().

Given this context, citizen-generated data such as social network content represent a potential statistical data source for measuring this phenomenon. In the Colombian case, the adoption of social networks such as Facebook is high, as is internet penetration. According to the National Quality of Life Survey, internet usage in Colombia in 2021 was 79.9%, and 76% of people aged 5 and over used the internet every day of the week. In that year, there were 39 million social media users in the country (around 78% of the total population), and 91.4% of people aged 14 to 64 used Facebook as their main social platform. In 2021, the potential Facebook audience was 36 million people ().

This use of citizen-generated data from social media is understood as part of a wider set of processes and techniques that includes the well-known concept of citizen science (), defined herein as “people, who are not professional scientists, taking part in research, i.e., co-producing scientific knowledge. This involves collaborations between the public and researchers/institutes but also engages governments and funding agencies” (). Therefore, citizen-generated data from social media shares some of the same issues as citizen science, such as the need for scientific standards, ethical considerations, and data management, among others (). These concerns are as old as citizen science itself. According to Droege (in ), public participation in scientific research, at least in fields such as bird watching, dates back to the end of the 19th century and the beginning of the 20th, and it includes participation in different stages of scientific work, such as collection, processing, and analysis, as well as the assessment of results.

It is worth noting that some ethical considerations arise from embracing this definition: Citizen-generated data, as defined above, including data collected from informed and uninformed citizens, poses questions about data privacy, about the public’s right to be informed of how its data is used, and about the active role citizens play in this new conception of the scientific endeavor. Concerns about the quality of results and their statistical relevance also arise in this field ().

It should also be emphasized that social media is a space where people can express themselves freely. It is usually thought of and used as a place that makes people feel free to express themselves, air their grievances (), engage in self-identification in a broader public sphere (), and find community through online contact () in order to create safe spaces from discrimination, especially for population groups that historically have been victims of bigotry, like the LGBTIQ+ community (). But these groups also experience discriminatory attitudes at various levels on social media, as posed in Lucero (), even as online intergroup contact makes individuals more sensitive to detecting discrimination (). In this context, discussions about discrimination on Facebook may reflect people’s perceptions in the offline world, as suggested by Marciano and Antebi-Griszca (), including possible associated mental health issues.

Because of this, social networks like Facebook can be one of the tools for understanding marginalized communities (). One of the unique factors of internet communication is anonymity (), which creates a protective environment for people to express themselves. The authors write, “the protective cloak of anonymity allows people to share aspects of the self online with far fewer costs and risks” (). In a quantitative study, Mancini and Imperato () found similar results: online intergroup contact on Facebook makes people more attentive to detecting sexual discrimination.

However, little research has addressed discrimination in social media as a natural language processing problem. Some studies are qualitative and focused on particular population groups: For example, social media has been used as a new space to humiliate the Dalit community in India, with hate speech used against them and legal repercussions following (); and there has been discrimination against the Muslim community on Facebook (). Another study is that of Ben-David and Matamoros (), who studied political violence and its different characteristics in Spain, including discrimination. Their approach is based on Latour’s actor-network theory, in which humans and nonhumans alike have agency, implying that technological devices such as the like and share buttons play a prominent role in identifying the different aspects of political discrimination in Spain.

From the quantitative point of view, the literature related to hate speech is vast, but the differences between that subject and forms of discrimination are not clearly defined. It is worth mentioning the paper by Marciano and Antebi-Griszca (), in which different types of discrimination (e.g., political or sexual identity) are identified as prevalent in several contexts, such as Facebook interactions, online dating, and the offline world. This is contrary to the results of Lucero (), who reports that the LGBTQ population feels this social network is a safe place to interact with other members of the community. Mancini and Imperato () also used Facebook as their data source, studying the behavior of different online groups in that network to understand the process by which online intergroup contact makes individuals more sensitive to discrimination (p. 8). Brooks, Shmargad, and Williams () researched discrimination by the algorithms themselves, studying how bias, lack of data, and audits paint a clear picture of how data systems and algorithms can, in fact, make discriminatory decisions against people.

As can be seen so far, no study has addressed the use of Facebook as a statistical data source for official information about discrimination, especially as a source for estimating SDG indicators, first and foremost following people’s lived experience in the definition of metrics associated with the SDGs ().

Therefore, the question that motivates our study is whether Facebook is a useful and feasible source from which to generate official statistics, both broadly speaking and specifically on discrimination. To address this question, we propose a deep learning methodology to obtain complementary measurements for SDG indicators 16.b.1 and 16.7.2 from Facebook data, which can be used to contrast and complement information from Colombia’s Political Culture Survey.

The remainder of this paper is organized as follows: The Methods section explains the proposed method and strategies in detail. The Dataset section presents the dataset and preprocessing strategy. Experimental Evaluation reports the experimental evaluation of the method for SDG indicators 16.b.1 and 16.7.2, along with the experimental setup, results, and discussion. Finally, Conclusions and Future Work reports the main conclusions and future work.

Methods

The concepts, tasks, and models associated with language modelling are vast, spanning discrete and probabilistic approaches such as n-gram models, vector semantics, neural language models, and deep learning methods for language processing. Key to all of them is the notion that data quality is dynamic and changes as the data undergo transformations: the same data can be both the output of a data source and a source of data in its own right. The methodology therefore consists of two principal components. On one hand, data collection concentrates on measures taken to assess and increase the quality of the data collection process: the resilience and reliability of data-gathering procedures and the fitness-for-purpose of the data source for the analysis at hand. On the other hand, data quality assessment considers the reputation or believability of the data source in question, as most data quality assessment methodologies do. This includes aspects of privacy and data access for scientific purposes.

Our methodology concentrates on the quality of the language classification models employed to extract information from the text of Facebook posts and comments. This is a significant data quality bottleneck when working with social media data, given the lack of large labeled datasets and the high entropy of language data in social media. The methodology addresses these constraints by providing a framework for more accurate and flexible language modeling that can also be used to generate large labeled datasets more affordably and quickly. It is worth noting that some of the activities of our methodology follow the guidelines of the CRISP-DM process () and its newer updates ().

Data collection

The primary data collection method is data scraping, a technique based on automated browsing that simulates a user’s behavior and collects the data rendered on screen (). The tool proposed as part of this methodology is a Facebook automation bot that collects posts and comments only from public Facebook pages and profiles. The bot is coded in the Python programming language and uses web browser automation software to browse Facebook. The profile pages selected for data collection were chosen based on their relevance to the political environment in Colombia and curated manually to represent diverse viewpoints and backgrounds.
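The bot’s code is not reproduced here, but a minimal sketch of this kind of browser-automation collector, assuming Selenium as the automation layer and using a purely hypothetical CSS selector (Facebook’s markup changes often), would look as follows:

```python
# Minimal sketch of a browser-automation collector. Selenium is assumed;
# the CSS selector is a hypothetical placeholder, not the real bot's.
# Only public pages are visited, in line with the privacy constraints above.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

PUBLIC_PAGES = ["https://www.facebook.com/example_public_page"]  # curated list

driver = webdriver.Firefox()  # any WebDriver-compatible browser works
records = []
for url in PUBLIC_PAGES:
    driver.get(url)
    time.sleep(5)  # allow dynamic content to load; a real bot would poll
    # Hypothetical selector for elements holding post/comment text:
    for element in driver.find_elements(By.CSS_SELECTOR,
                                        "div[data-ad-preview='message']"):
        records.append({"page": url, "text": element.text})
driver.quit()
```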

It is important to note that no data from private profiles were collected, following the ethical and privacy considerations associated with the use of citizen-generated data for academic and statistical production. This aspect was also analyzed against the Colombian legal framework, especially the 1993 Statistical Act and Decree 2404 of 2019, which define the concept of alternative data sources for statistical production, including social media, and establish the conditions for their use ().

Strategy

Annotated Spanish-language datasets for the relevant types of discrimination were not available at the time of the experiment. Furthermore, the cost of building a large custom dataset to train neural networks for discrimination detection was prohibitive in the context of the exercise. Therefore, a pre-trained large language model was used for text classification, in a technique called zero-shot text classification. The pre-trained model was a version of the popular BERT neural network, already trained on massive quantities of Spanish-language text; it reaches an accuracy of 79.9% on textual entailment (determining whether two statements are contradictory, entailed, or neutral with respect to one another) and topic classification, measured on the popular XNLI-es dataset. A small subset of the data was then sub-sampled and annotated manually to measure the performance of the zero-shot approach.

In order to minimize the potential impact of inaccurate predictions using the pre-trained model, outlier analysis and benchmarking are carried out using the confidence scores for each prediction made by the model. An adequate confidence threshold is determined to ensure model confidence for discrimination classification, as explained in detail in the section “Outlier analysis.”

For the labelled comments approach, a random sample from the original dataset was extracted and manually annotated by DANE’s experts. The labelling took place in three iterations, each by a different annotator, to ensure appropriate agreement between annotators, assessed with Cohen’s kappa score as explained in detail in the section “Labeling.”

Zero-shot model

Zero-shot learning (ZSL) is a machine learning paradigm whereby a pre-trained model predicts labels that it did not explicitly see during training. The zero-shot classification model () extends inference to new categories without prior explicit semantic information. This methodology used a version of the BERT neural network fine-tuned on the Spanish portion of the XNLI dataset.
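As a minimal sketch of this technique with the Hugging Face pipeline API, the following applies a publicly available Spanish BERT fine-tuned on XNLI-es; this checkpoint is our illustrative assumption, standing in for the model described in the text:

```python
# Minimal zero-shot classification sketch using the Hugging Face pipeline.
# The checkpoint below is a publicly available Spanish BERT fine-tuned on
# XNLI-es; it is an illustrative stand-in, not necessarily the exact model
# used in the project.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="Recognai/bert-base-spanish-wwm-cased-xnli",
)

labels = ["discriminación política", "discriminación racial",
          "discriminación no evidenciada"]
result = classifier(
    "No deberían dejar votar a esa gente.",          # illustrative comment
    candidate_labels=labels,
    hypothesis_template="Este ejemplo es {}.",       # Spanish entailment template
)
print(result["labels"][0], result["scores"][0])      # top label and confidence
```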

Outlier analysis

The purpose of this analysis is to obtain a confidence coefficient for filtering and selecting the classifications the model is most certain about. To obtain this coefficient, values more than three standard deviations above the mean of the classification probability scores are identified. Once these values (outliers) have been detected and isolated, the median of this set is calculated and selected as the candidate confidence threshold for ensuring that a sample classified by the model belongs to that category.
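A minimal sketch of this rule, using NumPy and synthetic scores for illustration:

```python
# Confidence-threshold rule sketched above: flag prediction scores more than
# three standard deviations above the mean, then take the median of those
# outliers as the candidate confidence threshold.
import numpy as np

def confidence_threshold(scores: np.ndarray) -> float:
    cutoff = scores.mean() + 3 * scores.std()
    outliers = scores[scores > cutoff]
    if outliers.size == 0:  # no extreme scores: no threshold can be derived
        raise ValueError("no scores above three standard deviations")
    return float(np.median(outliers))

rng = np.random.default_rng(0)
scores = np.clip(rng.normal(0.3, 0.1, 10_000), 0, 1)  # synthetic score mix
print(confidence_threshold(scores))
```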

Labeling

For both the 16.b.1 and 16.7.2 SDG indicators, the same tagging strategy was defined: randomly extract n samples from the comments dataset and have the same comments tagged by three different annotators until acceptable inter-annotator agreement is reached. The agreement level is measured with Cohen’s kappa coefficient, a statistic used to measure inter-rater (and intra-rater) reliability for qualitative (categorical) items.
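A minimal sketch of this agreement check, using scikit-learn’s implementation of Cohen’s kappa and illustrative annotations:

```python
# Pairwise Cohen's kappa between annotators over the same sampled comments.
# The annotation values below are illustrative stand-ins.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotations = {
    "annotator_1": ["política", "racial", "no evidenciada"],
    "annotator_2": ["política", "racial", "racial"],
    "annotator_3": ["política", "no evidenciada", "racial"],
}

for a, b in combinations(annotations, 2):
    kappa = cohen_kappa_score(annotations[a], annotations[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```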

The Label Box platform was used for collaborative annotation. Three independent projects per annotation set, each containing the same samples, were created to ensure independence between annotators and minimize tagging bias.

Dataset

Scraping

The scraped dataset contains 771,502 records of public Facebook comments from different users, mainly from Colombia. The dataset presents wide variability, seeking to mitigate the latent bias inherent in the very nature of the data source. To achieve this, 66 profiles of public figures were considered, categorized as follows: artists, economy, government, mayors, news, politics, public opinion, public order, sports, and others. From these profiles, posts published between June and October 2021 were collected, along with comments on those posts made between July and December 2021. These collection windows have no special significance; they simply correspond to when data collection took place during the research.

In addition, from this main set, a sample of 1,000 random comments was filtered to obtain 541 testing samples used as “ground truth” for evaluation purposes. Finally, within the anonymized version of the dataset, variables such as the date on which the post was made, the text of the post, and the user’s comment were included.

Preprocessing

A standard preprocessing strategy was used, consisting of special character removal, Unicode symbol removal, and lowercasing. This preprocessing was performed both on the text coming from the post and the text corresponding to the users’ comments.
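A minimal sketch of such a preprocessing step (the project’s exact cleaning rules are not reproduced here):

```python
# Standard preprocessing sketched above: remove Unicode symbols and special
# characters, normalize whitespace, and lowercase. Accented Spanish letters
# are preserved.
import re
import unicodedata

def preprocess(text: str) -> str:
    # Drop Unicode symbol/control characters (emoji and similar)
    text = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(("S", "C"))
    )
    text = re.sub(r"[^\w\s]", " ", text)        # special characters
    return re.sub(r"\s+", " ", text).strip().lower()

print(preprocess("¡NO nos ESCUCHAN! 😡😡"))      # -> "no nos escuchan"
```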

Experimental Evaluation

Experimental setup

Since a pre-trained zero-shot model was used to generate predictions, its default values with a raw representation of the information were used to set up a classification baseline. Two exercises were carried out, corresponding to the two SDG indicators under study: perception of discrimination and representativeness, respectively. Each exercise comprises two sub-exercises, whose target labels were as follows (a sketch of how a label set is applied follows the list):

  • – Discrimination:
    • – A: “discriminación económica,” “discriminación política,” “discriminación racial,” “discriminación por ser migrante,” “discriminación por discapacidad,” “discriminación por orientación sexual,” “discriminación por ser mujer,” and “discriminación no evidenciada.”
    • – B: “discriminación económica,” “discriminación política,” “discriminación racial,” “discriminación por ser migrante,” “discriminación por discapacidad,” “discriminación por orientación sexual,” “discriminación por ser mujer,” “discriminación por sexo,” “discriminación por edad,” “discriminación por estado de salud,” “discriminación por rasgos físicos de su cuerpo,” “discriminación por lugar de residencia,” “discriminación por credo,” “discriminación por estado civil o condición familiar,” “discriminación por identidad y pertinencia cultural,” and “discriminación no evidenciada.”
  • – Representativeness:
    • – A: “esto es inclusividad política” and “esto es receptividad política.”
    • – B: “tengo algo que decir sobre el gobierno” and “los políticos escuchan lo que tengo que decir.”
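To make the setup concrete, the following is a minimal sketch (ours, not the project’s published code) of how one exercise applies a label set at scale, reusing the `classifier` pipeline sketched in the zero-shot section; the comments are illustrative:

```python
# Run one exercise: score every comment against the Exercise A discrimination
# labels, keeping the top label and its confidence, which later feed the
# outlier analysis. `classifier` is the zero-shot pipeline sketched earlier.
LABELS_DISCRIMINATION_A = [
    "discriminación económica", "discriminación política",
    "discriminación racial", "discriminación por ser migrante",
    "discriminación por discapacidad", "discriminación por orientación sexual",
    "discriminación por ser mujer", "discriminación no evidenciada",
]

comments = ["no deberían dejar votar a esa gente",       # illustrative data
            "los ricos siempre se salen con la suya"]

predictions = [
    {"text": out["sequence"],
     "label": out["labels"][0],
     "score": out["scores"][0]}
    for out in classifier(comments,
                          candidate_labels=LABELS_DISCRIMINATION_A,
                          hypothesis_template="Este ejemplo es {}.")
]
```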

The predictions were generated on an Nvidia GeForce RTX 3080 card, with an approximate execution time of 16.03 hours per exercise. Of the 771,502 records available, a total of 503,553 users were identified, but only 8,177 users (about 1% of total records and 2% of identified users) were selected for the analysis, based on the outlier analysis described above (see the section “Outlier analysis”). This is due to the performance of the model, whose metrics for discrimination are low, as shown below (see the section “Model performance”).

For indicator SDG 16.7.2, a total of 405,693 users were identified, and 219,372 (54% of users) were retained once the outlier analysis was applied.

For the second discrimination exercise, as well as for the two representativeness exercises, the following metrics were calculated: accuracy, precision, recall, and F1-score. For precision, recall, and F1-score, the micro, macro, and weighted variants were also calculated, since the problem is not binary classification. In addition, the confusion matrix was obtained to visualize the classification distribution in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
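These computations can be sketched with scikit-learn as follows, using illustrative stand-ins for the expert annotations and model outputs:

```python
# Accuracy, micro/macro/weighted precision-recall-F1, and confusion matrix.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Illustrative stand-ins for expert annotations (y_true) and predictions.
y_true = ["política", "racial", "no evidenciada", "credo", "política"]
y_pred = ["política", "credo", "política", "credo", "racial"]

print("accuracy:", accuracy_score(y_true, y_pred))
for avg in ("micro", "macro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")

print(confusion_matrix(y_true, y_pred))  # rows: true labels; cols: predicted
```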

Model performance

The full evaluation set reported in the section “Scraping” was used for each of the proposed exercises. Predictions for the perception of discrimination and political representativeness were generated and compared against the annotations made by the experts. Figure 1 shows the confusion matrix obtained for the 16-label discrimination exercise.

Figure 1 

Discrimination confusion matrix with 16 labels. True labels correspond to ground-truth values; predicted labels correspond to the model’s predictions.

As can be seen, most of the information is originally classified as non-evidenced discrimination, which the model confuses mainly with political discrimination, discrimination based on cultural identity and belonging, and discrimination based on creed (15.8%, 15.71%, and 14.66%, respectively). The classification results obtained for this exercise are summarized in Table 1.

Table 1

Metric results for discrimination performance.

METRIC       MACRO     MICRO     WEIGHTED
Accuracy               0.2047
Precision    0.0873    0.2047    0.7822
Recall       0.6958    0.2047    0.2047
F1 Score     0.0485    0.2047    0.2625

These results confirm that the existing class imbalance confounds the base model, evidencing the need to extend the exercise with a larger number of balanced samples and to retrain the model so that it specializes in the information domain under study.

For the external political efficacy case, Figure 2 shows the confusion matrices obtained for the two exercises. The exercises differed only in the set of labels given to the zero-shot model: “esto es inclusividad política” / “esto es receptividad política” (Exercise A) and “tengo algo que decir sobre el gobierno” / “los políticos escuchan lo que tengo que decir” (Exercise B); in each pair, the first label is associated with political inclusiveness and the second with political responsiveness.

Figure 2 

External political efficacy confusion matrices for both label sets. True labels correspond to ground-truth values; predicted labels correspond to the model’s predictions. Exercise A (left), Exercise B (right).

The classification results obtained for these exercises are summarized in Table 2.

Table 2

Metric results for political external efficacy performance for both exercises.

EXERCISE   METRIC      MACRO     MICRO     WEIGHTED
A          Accuracy              0.4586
B          Accuracy              0.8947
A          Precision   0.5180    0.4586    0.9478
B          Precision   0.4798    0.8947    0.9235
A          Recall      0.6226    0.4586    0.4586
B          Recall      0.4648    0.8947    0.8947

From this, it can be concluded that the second set of labels allows the base model to better classify political representativeness. However, as in the discrimination case, the remaining metrics indicate that the model needs to be specialized in this information domain with a considerable sample of examples in order to improve the results obtained so far.

Production of Indicators

Users were defined as all those who made a comment categorized under any of the types of discrimination. A user may have commented on more than one form of discrimination; each such instance is counted as a separate case, even when the author of the comments is the same. In this way, no associated information was lost, in line with the methodological recommendation: “The indicator should be a starting point for understanding patterns of discrimination” (). Table 3 shows the percentages of users whose comments were labelled by the model as discriminatory.

Table 3

Discrimination types (in percentage) disaggregated by users.

TYPE OF DISCRIMINATION    ABSOLUTE VALUES    PERCENTAGE
Religion                  650                7.95%
Disability                111                1.36%
Economic                  500                6.11%
Age                       306                3.74%
Civil status              15                 0.18%
Cultural identity         940                11.50%
Migrant condition         30                 0.37%
Women                     14                 0.17%
Not identified            432                5.28%
Political opinion         4,533              55.44%
Physical aspects          477                5.83%
Ethnicity                 12                 0.15%
Place of residence        50                 0.61%
Health condition          87                 1.06%
Sex                       20                 0.24%
Total users               8,177              100.0%
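For illustration, a table like Table 3 can be derived from the comment-level predictions roughly as follows; this is a sketch with made-up column names and data, not the project’s code:

```python
# Count distinct users per predicted discrimination type. A user who
# commented on several types is counted once per type, so no information
# about overlapping forms of discrimination is lost.
import pandas as pd

comments = pd.DataFrame({                       # illustrative data
    "user_id": [1, 1, 2, 3, 4, 4],
    "label": ["política", "credo", "política", "identidad cultural",
              "política", "política"],
})

by_type = comments.groupby("label")["user_id"].nunique()
share = (100 * by_type / by_type.sum()).round(2)
print(pd.concat({"users": by_type, "percentage": share}, axis=1))
```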

As presented in Figure 3, the highest proportion of users whose comments include discriminatory language corresponds to political discrimination (55.4%), which can be explained by the political context of the period, which included demonstrations, mobility restrictions in the main cities of the country, and the highest peak of Covid-19 deaths and infection rates. The next most prevalent types were cultural identity (11.50%) and religion (7.9%); the categories with the lowest shares were ethnicity (0.1%) and sex (0.2%).

Figure 3 

Discrimination comments by user and type.

To obtain a proxy value comparable with the results of DANE’s ECP, the proportion of users whose comments were associated with one of the recognized types of discrimination was calculated as the quotient between users whose comments were associated with some type of discrimination and the total number of users (see Equation 1). In this expression, Users_prob>0.5 is the number of users with discrimination-related comments whose probability was higher than 0.5, Users_total is the total number of users with discrimination-related comments, and Users_discri is the resulting proportion.

(1)
\[
\mathrm{Users}_{\mathrm{discri}} = \frac{\mathrm{Users}_{\mathrm{prob}>0.5}}{\mathrm{Users}_{\mathrm{total}}} \times 100
\]

In this case, the value is 1.9%, meaning that 1.9% of users made at least one comment with a high probability of containing discriminatory content.
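As a sketch, assuming a per-comment table with hypothetical user_id and score columns, the computation of Equation 1 reduces to:

```python
# Proxy indicator of Equation 1: share of users whose best-scoring
# discrimination-related comment exceeds the 0.5 confidence threshold.
import pandas as pd

df = pd.DataFrame({                              # illustrative data
    "user_id": [1, 1, 2, 3, 3, 4],
    "score":   [0.91, 0.42, 0.33, 0.55, 0.18, 0.27],
})

best = df.groupby("user_id")["score"].max()      # best score per user
users_discri = 100 * (best > 0.5).sum() / len(best)
print(f"{users_discri:.1f}%")                    # -> 50.0% in this toy case
```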

This excludes users in the non-evidenced category, whose comments the model could not assign to any of the forms of discrimination, although it cannot be affirmed that they contain no discriminatory content. The comparative results are presented in Figures 4 and 5.

Figure 4 

Proportion of people who felt discriminated against, by sex. Colombia’s Political Culture Survey.

Figure 5 

Supervised analysis. Proportion of types of discrimination.

The comparison with the proportion of users whose comments relate to discrimination confirms that the prevalence of the types of discrimination differs between the two sources: among users of social networks such as Facebook, the type of discrimination with the most comments was political, whereas in the ECP the most observed types were age-related and economic discrimination.

For SDG indicator 16.7.2, as shown in Figure 6, inclusive decision-making is markedly prevalent: 79.5% of users made comments that the model associated with it. The difference between men and women is small: 44.7% of comments made by men were labelled as inclusive, compared with 34.8% of comments made by women. A similar pattern is observed for responsive decision-making.

Figure 6 

Proportion of users by sex and types of political external efficacy.

As presented in Figure 7, official statistics for Colombia show a similar tendency: the ECP shows a higher percentage for inclusive decision-making among its respondents, with a very small difference between the sexes (21.0% of men consider decision-making inclusive, versus 20.1% of women). The same pattern holds for responsive decision-making.

Figure 7 

Colombia’s Political Culture Survey. Political external efficacy by sex.

Conclusions and Future Work

For the discrimination case, low classification accuracy is observed. A first look at the confusion matrix shows low variability in the actual labels. Second, the model identifies types of discrimination as political, cultural identity and belonging, creed, physical features, economic, racial, and age, as well as non-evidenced discrimination. From these results, it can be inferred that the model has a certain bias towards political and creedal discrimination.

Regarding indicator 16.7.2, although the results for Exercise A show low performance, better results are obtained for Exercise B, which are similar to those of the ECP. This suggests that, for external political efficacy with a zero-shot base model, the best way of presenting the reference labels corresponds to Exercise B. For both the discrimination case and the external political efficacy case, a fine-tuning process is necessary to obtain better domain adaptability.
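The project’s fine-tuning setup is not specified here, but a sketch of such a process with the Hugging Face Trainer might look as follows; the checkpoint, file name, and hyperparameters are illustrative assumptions:

```python
# Sketch of domain fine-tuning on a manually labeled sample. The checkpoint,
# CSV file, and hyperparameters are illustrative, not the project's.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "dccuchile/bert-base-spanish-wwm-cased"   # a public Spanish BERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# labeled_comments.csv: columns "text" and integer "label" (0-15)
dataset = load_dataset("csv", data_files="labeled_comments.csv")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length"),
    batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=16)  # the 16 discrimination categories
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="discrimination-model",
                           num_train_epochs=3),
    train_dataset=dataset["train"],
)
trainer.train()
```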

Based on the results of the proposed method, it is possible to estimate a proxy indicator of SDG 16.b.1, given the closeness between the obtained Users_discri value and the affirmative response percentages for the different types of discrimination analyzed in the ECP. However, a key difference was found in the most prevalent types of discrimination: the main divergence between the ECP and the exercises lies in discrimination by age, although discrimination related to economic situation and political opinion is among the more prevalent types in both measurements. The estimation should therefore be made with caution, noting that bias and the identified differences in types of discrimination could affect the results.

From the above, the conceptual differences in how the phenomenon is captured by official information versus this research play a prominent role in further research on the subject.

In the ECP, the types of discrimination of which people have been victims follow the guidelines of the United Nations indicator metadata, whereas in this analysis, discrimination-related comments are identified only in general terms, based on the semantic features found by the applied language models; more research on these models is required to identify victims of discrimination. Therefore, the ECP and a language-model approach should be understood as two different points of view describing different aspects of the discrimination phenomenon. This conclusion suggests that more studies are needed to ensure data representativity, particularly concerning the identification of victims of discrimination and the comparison of traditional and nontraditional data for producing statistical information.

Hence, these results should be considered contextual or complementary information that captures the specific dynamics of social media, in which people reveal their situations related to discrimination. Accordingly, the indicators generated in this research cannot be considered an official estimation of discrimination or of the SDG indicator. As with the 16.b.1 SDG indicator, the 16.7.2 results obtained with the proposed method are comparable and provide context to the ECP.

These findings are consistent with the literature on citizen science challenges, as shown in Pateman and West (): “Citizen science could make contributions in three types of process linked to the SDGs: defining national and subnational targets and metrics, monitoring progress and implementing action.”

In addition, a few methodological challenges were identified. First, the design of a robust methodology for stable data ingestion, including the use of bots or crawlers on Facebook. Second, the development of dummy accounts to increase the stability of Facebook data capture, considering that the longer an account is active and shows favorable behavior, the less likely it is to be closed. Finally, increasing the number of profiles, posts, and comments extracted, in order to add demographic indicators such as age and gender, guarantee data representativity, and improve the quality of the metrics.

These challenges posit a key question regarding citizen science and the use of citizen-generated data: How can we assure the quality of data? Based on these results, it is not enough to count on a well-designed and implemented methodology () to resolve these issues. A comprehensive analysis of data should be included in the methodology, there should be an audit process, and follow-up should be based on statistical standards (e.g., the Generic Statistical Business Process Model). As stated in Fritz et al. (), “The quality of data from citizen science can be evaluated using the same measures as any other official data… This includes measures such as positional and thematic accuracy, temporal currency of the data, completeness and representativeness over space and time, and whether the data are fit-for-purpose.” This must be accompanied by capacity-building in the institutions and a fluent dialogue with the Civil Society Organizations that work on the subject.

Therefore, it can be concluded that Facebook data is not a feasible official data source, given the various challenges presented and the estimated numbers for both indicators. Since no representativity can be assured, no further comparison with the current data can be made. This reduces the scope of this source to a contextual data source, not an official one, for NSOs.

Aspects associated with citizen participation, as understood in the main definitions of citizen science (), must also be considered. The labelers in the supervised analysis received training, over at least two different periods, in the technological tools as well as in the project’s main concepts. This improved the results for the responsiveness indicator, but no such improvement occurred for discrimination indicator 16.b.1, even though the labelers received the same training for both indicators. This kind of lesson argues for being more open to different types of collaboration involving citizen participation, while maintaining methodological rigor. According to Pateman and West (), “when a study is well designed and implemented, the quality of citizen-collected data is, in fact, comparable to that collected by professional scientists.”

Our research shows that the data collection process from social networks also raises ethical concerns in two respects: the use of citizen-generated data from social media as a relevant data source in scientific research, and the use of “black box” models and their biases (). The issues of opacity and bias in machine learning models have brought to light the need for more transparency in the design of algorithms and in the data used for training, to prevent or mitigate adverse effects. According to Franzen et al. (), a black box is a system “in which we can observe the inputs and outputs but not the internal process. Machine learning algorithms like neural networks and deep learning are so intrinsically complex that it is virtually unworkable to get to the bottom of their operations and internal decision-making processes.” One reason for using models such as zero-shot and BERT models in this project was that they are well known and their technical details have been widely studied by different researchers. In addition, their use is transparent in the sense Franzen refers to: “The idea behind explainable AI radiates from the implementation of algorithms that are understandable to a human expert who can discern the internal mechanisms and understand what is happening” (). In this spirit, scripts, notebooks, and a manual were written and disseminated to explain how this exercise was carried out.

However, some semantic and linguistic aspects of the BERT model could not be studied, in particular the accuracy of the semantic relations between comments and types of discrimination, and how to mitigate the bias behind them. This is an open question, and more studies are required. In this project, supervised analysis, in which labelers reviewed comments, was the main strategy employed to reduce the possible bias of these models. This subject, too, remains open, and it could be a promising research line because of its impact on the production of official statistical information.

However, using data from scraping raises serious concerns about the data privacy of social media users, hence the discussion of this topic in the development of the project. It was also considered that social media data and administrative records present similarities: both are created for a specific purpose and not necessarily for statistical production. Decree 2404 of 2019 states, for the case of administrative records: “The data protection and information security conditions of the microdata custodian shall be prioritized. The parties involved in the exchange shall guarantee that the information shall not be used for purposes other than statistical and shall maintain confidentiality” (). The same criteria could be applied to social media data, since both share features such as unstructured formats, automated data collection, high velocity and volume, and, in some cases, variability. There are also differences: people provide administrative data because it is mandatory (as with tax or health registers), whereas social media data are shared voluntarily on the network. In both cases, data privacy and confidentiality are required to guarantee that the statistical information delivered to the public is trustworthy.

Given this confluence, DANE considers social media a potential alternative data source for statistical information and fosters its usage, whether in a mandatory fashion () or a voluntary one, as suggested in the National Code of Good Practices ().

Based on the latter, the project also pursues another goal for the use of social media: creating technical capacities to use deep learning models for improving statistical processes. In that sense, data scraping of social media should also be understood as a technical device for collecting data, as stated in some paradigmatic legal cases. One example is Sandvig v. Sessions in the United States District Court for the District of Columbia (), in which scraping was considered in that sense. According to Mancosu and Vegetti (), “[s]craping is merely a technological advance that makes information collection easier; it is not meaningfully different from using a tape recorder instead of taking written notes, or using the panorama function on a smartphone instead of taking a series of photos from different positions.” This consideration was also addressed in the project, based on Facebook’s Terms of Service and the current Colombian legal framework.

It is worth noting that new projects related to citizen-generated data are being developed in DANE in which citizens have an active role in data collection; hence, a comparison could be made to evaluate the best approach to working with citizen-generated data and with citizen science in general. A more active, participatory approach could be more fruitful, based on the experiences of other countries ().

Therefore, the deep learning approach using transformer models such as the zero-shot model represents a starting point for studying different SDG indicators associated with perception or information retrieval, from both the citizen-generated data and citizen science perspectives.

Further research could proceed in three directions. First, retrain the model for domain specialization in both discrimination and representativeness, to address and possibly improve the results of the analysis carried out here; this entails adapting the estimation formula and the proposed method. Second, broaden the data source, as alternative sources such as Twitter might be more feasible to tap into. Finally, conduct a model complexity analysis to evaluate model alternatives for discrimination classification.

In terms of public policy, the feasibility of these kinds of alternative sources should be explored for other SDG indicators. Going forward, researchers should assess data quality and the strength of the methodological design, taking into consideration the role of citizens in statistical production for the 2030 Agenda. To this end, guidelines should be developed for using social media data and citizen-generated data for statistical production in general, fostering the participation of civil society. This is the necessary next step to broaden the scope of citizen science in Colombia.