Methodology, Data and Definitions of the CivicLytics Platform

CivicLytics allows you to understand complex civic issues by listening, in real time, to the perceptions and concerns of millions of citizens in Latin America and the Caribbean. We process and analyze open civic data that citizens express on the Internet (social networks, blogs, chatbox). At the IDB Group, we contribute this information to support governments, citizens, and the private sector in driving inclusive solutions for local and regional recovery.

Here you can read more about how the data is collected, how it is processed, what to consider when using it, and the definitions of the terms used.

The platform is powered by Citibeats, a text analytics platform specialized in social understanding. More information can be found at citibeats.com.


Data collection

Data sources

Data is collected daily from online conversations in publicly available sources, including Twitter, online forums, news comments, and blogs – in English, French, Spanish, Portuguese and Dutch for 26 countries.

We are continuously adding data sources. If you have a suggestion of a data source to add to the initiative, please let us know here.

Normalization and sampling

Opinion data from publicly available sources must be normalized, sampled and cleaned before it is usable, and even then its limitations must be kept in mind.

Since countries differ in population size, level of internet access and participation in sharing opinions online, we need to make them comparable. We normalize the data by ensuring that whenever countries are compared, the comparison is always by relative proportion of the captured conversation per country.
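
As an illustration of this normalization, the following minimal Python sketch converts raw opinion counts into proportions per country, so that a high-volume and a low-volume country can be compared. The function name and the sample data are assumptions made for this example and are not taken from the platform.

```python
from collections import Counter


def category_proportions(opinion_categories):
    """Return each category's share (%) of one country's captured conversation."""
    counts = Counter(opinion_categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up data: country A has ten times the volume of country B.
country_a = ["Health system"] * 600 + ["Governance"] * 400
country_b = ["Health system"] * 30 + ["Governance"] * 70

shares_a = category_proportions(country_a)
shares_b = category_proportions(country_b)

# Compare shares, not raw counts.
print(shares_a["Health system"], shares_b["Health system"])
```

In raw counts, country A has far more 'Health system' opinions than country B in both categories, but the proportions show how much of each country's own conversation the category represents.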

Sampling is primarily used to control the amount of data we process (rather than for comparability purposes, which is covered by normalization). Sampling is determined by the data ‘query’ we use to define which opinions are collected from the data sources. In this case, the query contains broad COVID-19 keywords, adapted for every language. This means that an opinion that is implicitly related to COVID-19, but does not explicitly mention COVID-19 (or a closely related keyword), will not be included in our sample. Any significant changes to sampling will be recorded in the change log, and users of the API will be automatically notified.
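
To make the effect of a keyword query concrete, here is a minimal Python sketch of keyword-based sampling. The keyword lists, the function name and the sample texts are hypothetical; the platform's actual query is not reproduced here.

```python
import re

# Hypothetical keyword lists per language (assumption for illustration only).
QUERY_KEYWORDS = {
    "en": ["covid", "coronavirus", "pandemic"],
    "es": ["covid", "coronavirus", "pandemia"],
}


def matches_query(text, language):
    """Return True if the text explicitly mentions a keyword for its language."""
    keywords = QUERY_KEYWORDS.get(language, [])
    if not keywords:
        return False
    # Match any keyword at the start of a word, ignoring case.
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


opinions = [
    ("The covid-19 testing queue was three hours long", "en"),
    ("Schools are closed again and nobody explains why", "en"),  # implicit only
    ("La pandemia nos dejó sin trabajo", "es"),
]

sample = [text for text, lang in opinions if matches_query(text, lang)]
print(len(sample))
```

The second opinion is plausibly about the pandemic but contains no keyword, so it is left out of the sample, which is the limitation described above.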

Representativity

The data is representative of populations that use online sources to share opinions about COVID-19. It is difficult to give precise representativity information, since opinions do not contain demographic information (which can only be inferred) and utilization of data sources per demographic varies significantly by country. It should be noted that, as a global summary of our data, women are under-represented (this information is disaggregated and differences made visible in Gender Gap), as are elderly populations and low-income populations.

This social listening platform is intended as a sensing tool and early warning system, but this representativity limitation must be kept in mind.

Country level statistics about internet penetration, platform use and demographics can be found at: www.datareportal.com.

Privacy

All data is presented anonymously and aggregated. Anonymous means no names of the authors of opinions are shared. Aggregated means we show summary statistics, rather than individual opinions, so no raw text is shared publicly. Moreover, it should be noted that this opinion data comes from publicly available data and not sensitive private data.

Geographic attribution of data

Country attribution of the data depends on the data source.

In some cases (such as Twitter), this is self-reported in the profile information of users. In others, this is inferred from country level top-level domains (e.g. ‘.co.uk’ for the UK), or from local references mentioned in the text. This should be noted as a limitation of the precision of the data.
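
As a rough illustration of the domain-based approach, the Python sketch below infers a country from the last label of a URL's hostname. The mapping and the function name are assumptions for this example; the platform's actual attribution logic is not documented here.

```python
from urllib.parse import urlparse

# Hypothetical, partial mapping from country-code top-level domains to countries.
CCTLD_TO_COUNTRY = {
    "uk": "United Kingdom",
    "br": "Brazil",
    "mx": "Mexico",
    "co": "Colombia",
}


def country_from_url(url):
    """Infer a country from the last label of the URL's hostname, or None."""
    hostname = urlparse(url).hostname
    if not hostname:
        return None
    tld = hostname.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


print(country_from_url("https://news.example.co.uk/article"))
print(country_from_url("https://forum.example.com/thread"))
```

Generic domains such as '.com' carry no country signal and return None here, which is one reason geographic attribution remains imprecise.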


Data processing

Categorization

Once the data is collected, it is categorized, or classified, into one of the defined categories.

The categories have been defined as topics of interest by health information experts, as well as through a bottom-up analysis of the data. Categories may be adjusted, added or removed during the initiative; any such changes will be announced via the API portal and GitHub.

Data is categorized automatically, with human quality controls. This is achieved through semi-supervised machine learning. This means that from initial human-inputted examples defining a category, the system learns and infers which opinions belong to that category. Regular human review ensures quality control.

The categorization system learns and infers in each local context (in this case, in each country), to adapt to terminology and references made in each country, accounting for differences in language use and social context.

Gender gap

Female and Male data is estimated using aggregated and anonymized profiling (e.g. from names, bios). More can be read about Citibeats' estimated gender disaggregation here.

Intent detection

It is possible to filter results by ‘intent’. Intent refers to opinions shared with a particular purpose - in this case, we are monitoring ‘questions’ and ‘complaints’. Intents are automatically detected by the Citibeats system, based on machine-learning models and fine-tuned to the context of COVID-19.


How should this data be used?

Intended use

The CivicLytics Platform has been designed with public policy professionals in mind, who need regular (typically weekly) snapshots of the public conversation.

Change log

Please be aware that depending on how the conversation evolves, category definitions may be changed, or new categories added (thereby changing relative proportions of the conversation), for the analysis to stay relevant.

Please note that such changes break the consistency of the analysis. For example, if we started with 14 categories in month 1 and added 2 new categories in month 3, then, because we work with proportions of the conversation, it would not be entirely consistent to compare month 3 proportions with month 1 proportions, even for a category that was present in every month. Any such changes will be documented on GitHub.
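
The Python sketch below works through a smaller version of this situation, with two categories in month 1 and one category added in month 3. All counts are made up, and the final renormalization over shared categories is an assumed workaround for illustration, not a documented feature of the platform.

```python
def shares(counts):
    """Convert raw opinion counts per category into proportions (%)."""
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up counts. 'Mental health' is added as a new category in month 3.
month_1 = {"Governance": 300, "Health system": 700}
month_3 = {"Governance": 300, "Health system": 700, "Mental health": 250}

# Same raw count for 'Governance', but its share drops from 30.0 to 24.0.
print(shares(month_1)["Governance"])
print(shares(month_3)["Governance"])

# Assumed workaround: renormalize over the categories present in both months.
common = month_1.keys() & month_3.keys()
month_3_common = shares({c: month_3[c] for c in common})
print(month_3_common["Governance"])
```

The drop in the 'Governance' share in month 3 comes entirely from the larger denominator, not from any change in how much people talked about governance.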

Frequency of data updates

Data will be updated daily, at 6:00 am UTC.

Differences between opinions data and objective health data

Since the CivicLytics Platform is intended for use by public policy professionals, it is important to reflect on the differences of this type of data compared to typical data types analyzed by the community.

Most importantly, opinions data are just that - subjective opinions. If the top category for a given country in the platform is ‘Category X’, it does not necessarily mean that ‘Category X’ is the topic that health professionals should consider the top priority. Category X may be the most mentioned by people, but not necessarily be the most important to them; furthermore, if Category X is the most important in the minds of the general public, it may not necessarily be the most important to the public health community. These are signals that information professionals should use within the context and their knowledge of the current situation.

Differences between public big data and surveys

Whereas in a survey the questions asked and answers collected are generally structured, people’s opinions drawn from public big data are unstructured. The benefits of analyzing social big data are that it is real-time and has large geographic and topic coverage. This should be kept in mind: this approach is suited as a ‘sensing’ or ‘early warning’ system, rather than a precise measurement tool.

Level of depth of information

The CivicLytics Platform is intended as a straightforward resource for public policy professionals. For deeper analysis of public big data for your country, you may consider setting up your own social listening platform.


Definitions

Category definitions

01 Governance

Citizen reactions, doubts and criticisms of the measures to contain the pandemic and the reactivation of the economy. It reflects the responsibility of institutions to implement them and anticipate exit opportunities from the crisis.

02 Food Safety

Perceptions regarding the risk of shortages of food and basic supplies (water, energy), as well as the risk of famine as a result of the loss of income among the most vulnerable.

03 Household economies

Monitoring the impact of the crisis on the domestic economy, reflecting the risk of poverty for those most affected. It includes testimonials from employees, unemployed, self-employed and citizens. It also shows the perception of the gender gap in the workplace.

04 Markets and SMEs

Requests for economic and fiscal policy based on the perception of the impact of the crisis on the markets, reactivation of SMEs and larger companies.

05 Health system

Perception of the capacity of the health system (including personnel, health infrastructure and authorities), to face the pandemic. Includes comments on the lack of medical supplies and access to treatments and vaccines.

06 Citizen security and risk

Effectiveness of the protocols adopted by spaces, companies and groups to avoid contagion. Citizen security during urban travel in the "new normal".

07 Consumption options

Consumption habits, limits and changes: non-basic goods, mobility, tourism and leisure in the new normal.

08 Employment and inequality

Perceptions about the future of work: digitization of companies, consolidation of teleworking and barriers to it. In addition, it measures labor problems such as unemployment, irregularity, precariousness and gender inequality at work, as well as new opportunities.

09 Access to education

Impact of the crisis on education, from childhood to university, as well as the loss of educational level due to the digital divide and isolation, especially among rural populations and women.

10 Mental health

Cases and concerns about changes in mental health due to the crisis, uncertainty and future prospects. Attention to differences in impact by gender.

11 Vulnerable groups

Groups or individuals in vulnerable situations; obstacles to their economic and social reintegration: women, immigrants, groups of indigenous peoples, Afro-descendants.

12 Climate change impact

Impact of economic decline and displacement caused by environmental emergencies (floods, natural catastrophes, fires, deforestation) that aggravate the crisis for vulnerable groups, families and companies.

13 Reactivation and innovation

Civic and business initiatives and innovations that respond to civic needs in the new normal, generation of new jobs and opportunities.

14 Fake news

Dissemination of false health or political information that manipulates public opinion to induce panic or chaos, or to discriminate against minorities.

‘Opinions’

An ‘opinion’ is considered to be a unique contribution. We are not including social interactions (e.g. retweets, likes, shares) in our analysis.

‘Top category’

‘Top category’ shows which category contains the most opinions, compared to other categories in that country. Values are proportions (%) of the conversation, where all the categories sum to 100% for each country.

‘Rising’

Shows in which country the selected category is a ‘rising priority’. ‘Priority’ refers to the proportion of the conversation for that category, compared to the other categories in that country. ‘Rising’ means the change in priority, comparing the last 7 days with the 7 days prior to that.

It is important to note that ‘rising’ here is relative to the other categories in that country. For example, if Country A doubled the number of opinions in each and every category from Week 1 to Week 2, ‘rising’ would not show any increase.

This definition of ‘rising’ is used to enable comparability between countries. If you are interested in the absolute (rather than relative) rising, this is viewable on the Country Report page under ‘Trends’.
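
The relative definition of ‘rising’ can be sketched in Python as follows. This is a minimal illustration that assumes ‘change in priority’ means the difference in percentage points between the two 7-day windows; the function names and counts are made up for the example.

```python
def shares(counts):
    """Convert raw opinion counts per category into proportions (%)."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


def rising(previous_7_days, last_7_days):
    """Change in each category's share, in percentage points (assumed metric)."""
    before = shares(previous_7_days)
    after = shares(last_7_days)
    return {c: after[c] - before.get(c, 0.0) for c in after}


week_1 = {"Governance": 100, "Mental health": 50, "Fake news": 50}

# Every category doubles: absolute volume rises, relative priority does not.
doubled = {category: 2 * n for category, n in week_1.items()}
print(rising(week_1, doubled))

# Only 'Mental health' grows: its share rises, the others fall.
shifted = {"Governance": 100, "Mental health": 100, "Fake news": 50}
print(rising(week_1, shifted))
```

In the first case every value is zero, matching the Country A example above. In the second, ‘Mental health’ rises by 15 percentage points while the other two categories fall, even though their raw counts are unchanged.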

‘Gender gap’

Shows which categories are talked about more by women than men (brown), and more by men than women (blue), as a proportion of the conversation of that gender.

Values are the difference between female and male proportions (%) of the conversation per category. So all female category %s sum to 100%, all male category %s sum to 100%, and we show the difference between these numbers.
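
The calculation described above can be sketched in Python as follows. The counts are made up, and the sign convention (female share minus male share) is an assumption for this example.

```python
def shares(counts):
    """Convert raw opinion counts per category into proportions (%)."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: 100.0 * n / total for category, n in counts.items()}


def gender_gap(female_counts, male_counts):
    """Female share minus male share per category, in percentage points."""
    female = shares(female_counts)
    male = shares(male_counts)
    categories = set(female) | set(male)
    return {c: female.get(c, 0.0) - male.get(c, 0.0) for c in categories}


# Made-up counts: far fewer opinions are attributed to women than to men.
female_counts = {"Mental health": 60, "Governance": 40}
male_counts = {"Mental health": 90, "Governance": 210}

gap = gender_gap(female_counts, male_counts)
print(gap["Mental health"], gap["Governance"])
```

Although men contribute more ‘Mental health’ opinions in raw counts (90 versus 60), the category makes up a larger share of the female conversation, so its gap is positive under this convention. Because each gender's shares sum to 100%, the gaps across all categories sum to zero.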

Female and Male data is estimated using aggregated and anonymized profiling (e.g. from names, bios). Learn more.

Citibeats recognizes that female and male are not the only genders.

‘Intents’

Filters only the opinions which are questions, and highlights where the outlier countries are, according to proportion of the conversation per category. Values are proportions (%) of the conversation, where all the categories sum to 100% for each country.

‘Questions’ are defined here as phrases expressed to elicit information, including expressing one's doubts about something or checking its validity or accuracy.

‘Complaints’ are defined here as statements that something is unsatisfactory or unacceptable, and which have some potential to be actionable.