An Exploration of How Twitter Reflects Public Opinion and Behavior Affecting the Spread of COVID
Measures to slow the spread of the COVID-19 pandemic have reshaped nearly every facet of life in the United States. Government and health officials are utilizing models such as the COVID-19 scenarios from the University of Washington Institute for Health Metrics and Evaluation (IHME) to understand the potential trajectories of the pandemic’s spread and how changing guidance on mobility and mask wearing could affect outcomes, especially the burden on hospitals and the number of deaths. This research found that sentiment analysis of tweets linked to the keyword “mask” did correlate with surveys on mask use and with metrics for confirmed and estimated COVID-19 cases as well as deaths. While this research was limited in timeframe and in the level of access to Twitter data, the results appear promising and could justify further research into this approach to measuring opinions, and the link to inferred behaviors, for modeling purposes.
Measures to slow the spread of COVID-19 in the United States have reshaped daily routines for work, education, socializing, exercise, and even access to routine health care. The public, government officials, and health officials are all interested in anticipating the trajectory of COVID-19 cases to determine what continued and new health precautions are necessary, especially as the implementation of government directives to wear masks and reduce mobility has become politically controversial. The University of Washington Institute for Health Metrics and Evaluation (IHME) model for COVID-19, which is widely cited in the media and created to support policy development, explicitly bases its scenarios on public behavior, including use of masks and mobility restrictions. The model draws upon survey and technical data to understand how the public has behaved historically and forecasts future spread based on potential changes in those behaviors. Twitter may also provide unique insights into how opinion drives behavior, which may be more significant than government directives in shaping actual behavioral change in the current politically charged climate in the United States. As a preliminary analysis to explore the utility of Twitter data, our team collected 1,000 tweets daily for each of the search words “mask”, “quarantine”, “COVID”, and “coronavirus”. I utilized the TextBlob library to generate a basic sentiment score and explored how different permutations of tweet metrics could generate different overall mean sentiment scores to compare to data for mask use, mobility reduction, and COVID-19 cases. While this analysis had several limitations, including limited data and partially mismatched geographies across the datasets, I found that only the search word “mask” showed promising correlations to any of the measures. More targeted search words may be necessary to model mobility data and general social-distancing behaviors. The results are promising enough to warrant additional research into how models of public opinion and behavior could be developed with social media sentiment analysis.
In April 2020, Twitter created a dedicated API for researchers exploring the COVID-19 pandemic. One study found that text mining of Twitter could predict COVID-19 outbreaks based on tweet volume. A pre-print literature review of machine-learning research focused on COVID-19 identified multiple projects exploring social media sentiment analysis for COVID-19 case identification, including tweet classification, symptom identification, and test result reporting.
This research seeks to determine whether sentiment analysis of Twitter data can supplement or replace existing datasets used to model public behavior and project COVID-19 spread in the United States. I explored this question in two ways. First, I tested whether sentiment analysis of tweets with the keywords “mask” and “quarantine” could produce scores that correlate with survey and technical data on mask use and mobility reduction, respectively. Second, I tested whether sentiment analysis of tweets with the keywords “mask”, “quarantine”, “COVID”, or “coronavirus” could produce scores that correlate with case counts for confirmed infections, estimated infections, and/or deaths at a one to four-week lag.
I downloaded IHME’s weekly COVID-19 datasets throughout the period of this research to use as a baseline. From these datasets I used US and global data for confirmed and estimated infections, deaths, and mobility reduction. While using estimated infections alongside confirmed infections and deaths provides a better idea of the scale of the pandemic, this approach also has limitations, as the ratio of deaths to infections varies with factors like the availability of medical resources and differences in the underlying health of the population. The IHME mobility reduction measure is based on anonymized technological data and is, in essence, a survey of devices instead of people. These technical mobility measures are unable to capture social-distancing practices like staying six feet apart and maintaining specific “bubbles” of contacts. IHME does not include data on mask use in its download packages. This analysis is based on data published on 3 December 2020 by IHME and 28 November 2020 by YouGov. Figure 1 below visualizes YouGov mask use survey results and IHME mobility reduction data against IHME data for confirmed infections, estimated infections, and deaths.
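As a minimal sketch of this loading step, the snippet below reads and subsets the two baseline datasets with Pandas. The file and column names are assumptions for illustration only; the actual IHME and YouGov releases use schemas that vary by publication date.

```python
import pandas as pd

# Hypothetical file and column names; IHME and YouGov publish CSV downloads,
# but their actual schemas vary by release.
ihme = pd.read_csv("ihme_covid19_2020-12-03.csv", parse_dates=["date"])
yougov = pd.read_csv("yougov_mask_survey_2020-11-28.csv", parse_dates=["date"])

# Keep only the measures used as baselines in this analysis.
ihme = ihme[["date", "location_name", "confirmed_infections",
             "estimated_infections", "deaths", "mobility_reduction"]]
yougov = yougov[["date", "country", "mask_use"]]
```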
I called Twitter’s v2 “recent search” API daily between 12 October and 3 December 2020, collecting 1,000 tweets for each of the words “COVID”, “mask”, “quarantine”, and “coronavirus” between 1100 and 1300 hours EST (with the exception of three days). With the exception of the data after 27 November 2020, all tweets were collected about six days after they were posted, allowing the influence metrics (likes, retweets, quotes, and replies) to accrue. While the data for the period were collected consistently, the collection window was too short to achieve statistically significant findings (p < 0.05) for all tests. In addition, because Twitter does not give basic student accounts access that allows pre-filtering tweets by geography or language, we could not create a dataset geographically comparable to the IHME and YouGov data. This disconnect prevents me from drawing more than preliminary conclusions about whether this approach has enough utility to justify further research. We also noted that Twitter’s API returns not only tweets containing the specific search word but also any tweets with the search word in any of the annotated materials, including AI-assessed topics. The result was the return of a significant number of tweets assessed by Twitter to be about COVID or coronavirus but lacking the requested search term.
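A minimal sketch of this daily collection step appears below, assuming a developer bearer token. The endpoint, the query parameters, and the public_metrics field follow Twitter’s v2 recent search documentation; the pagination helper and its names are our own illustration.

```python
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # issued with a Twitter developer account
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

def fetch_tweets(keyword, total=1000):
    """Page through the v2 recent search endpoint until `total` tweets are collected."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    params = {
        "query": keyword,
        "max_results": 100,  # the endpoint returns at most 100 tweets per page
        # public_metrics holds the like, retweet, quote, and reply counts
        "tweet.fields": "created_at,lang,public_metrics",
    }
    tweets = []
    while len(tweets) < total:
        resp = requests.get(SEARCH_URL, headers=headers, params=params)
        resp.raise_for_status()
        data = resp.json()
        tweets.extend(data.get("data", []))
        next_token = data.get("meta", {}).get("next_token")
        if not next_token:  # no more pages available
            break
        params["next_token"] = next_token
    return tweets[:total]

mask_tweets = fetch_tweets("mask")
```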
The timeline and quantity of data collected were determined by the basic access Twitter provides to students. Twitter’s basic student research accounts only provide access to seven days of historical data, limiting our ability to conduct analysis retrospectively. Students are also limited to collecting 500,000 tweets per month. I explored the GetOldTweets and GetOldTweets3 Python libraries to conduct more extensive historical collection, but these tools depended on a Twitter access page that has been decommissioned.
Because Twitter is ubiquitous in the United States and the TextBlob library only works on English-language data, we wanted to focus our analysis on the United States. Given the limitations noted above for student developer accounts, we had no way to limit the scope of our collection to create a precisely comparable geography for the IHME and YouGov data. To address this issue, I created three “test groups” for comparison, but our analysis found that comparison to the United States alone provided the most strongly correlated and statistically significant results (see Tables 2 and 3).
I established three geographical test groups focused on English-speaking countries: the United States; the United States, United Kingdom, and Canada (US-UK-CAN); and the United States, United Kingdom, Canada, India, Malaysia, and the Philippines (US-PLUSFIVE). I selected these countries based on October 2020 data on the countries with the most Twitter users, combined with fragmentary data about the number of English speakers globally. For each country I multiplied the number of Twitter users by the percentage of the population that speaks English. I then calculated a weight for each country’s data in each group based on its proportion of the group’s total users. This approach gave me a preliminary set of data for analysis.
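The weighting calculation itself is straightforward; the sketch below illustrates it for the US-UK-CAN group. The user counts (in millions) and English-speaking shares are placeholders for illustration, not the October 2020 figures used in this study.

```python
# Illustrative only: these counts and shares are placeholders, not study data.
group = {
    "US":  {"twitter_users": 68.7, "english_share": 0.95},
    "UK":  {"twitter_users": 17.6, "english_share": 0.98},
    "CAN": {"twitter_users": 9.5,  "english_share": 0.86},
}

# Estimated English-speaking Twitter users per country.
english_users = {c: v["twitter_users"] * v["english_share"] for c, v in group.items()}

# Each country's weight is its share of the group's English-speaking users.
total = sum(english_users.values())
weights = {c: n / total for c, n in english_users.items()}
print(weights)
```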
Creating test groups with weighted statistics instead of a simple mean of all the scores was important because of the uneven impact of the pandemic and the dramatically different use of mitigation measures. The utility of mask use (Figure 2) and of mobility reductions (Figure 3) differs from country to country because additional factors, like population density, degree of domestic and foreign mobility, and health care resources, also play a role in COVID-19 spread.
For the IHME and YouGov data, I used the Python Pandas library’s groupby function to create means of the variables of interest by week. YouGov’s mask use survey data typically included one survey per country per week, but there were some gaps. For those gaps, I utilized the interpolate function to create sufficient data points for analysis. I noted the IHME data for the United Kingdom did not include confirmed infections, so we relied on the values for estimated infections for the United Kingdom when building the test group data. Given that estimated and confirmed infections typically differ in scale, this substitution likely introduced some distortion, but comparison of the visualizations suggests the trends are similar across confirmed infections, estimated infections, and deaths. I used the Scipy library to test the correlation of the mobility reduction and mask use data to confirmed infections, estimated infections, and confirmed deaths at a one to four-week lag. I used the Kendall’s Tau test because not all of the countries in the groupings had normally distributed data and our sample size was small.
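The sketch below illustrates this preparation and lagged-correlation step, continuing from the loading sketch above. The column names and the single-country United States filter are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import kendalltau

# `ihme` and `yougov` are the frames from the loading sketch above; all column
# names remain assumptions about the published datasets.
ihme["week"] = ihme["date"].dt.isocalendar().week
yougov["week"] = yougov["date"].dt.isocalendar().week

# Weekly means of the variables of interest.
weekly = (ihme.groupby(["location_name", "week"])
              .mean(numeric_only=True)
              .reset_index())

# Fill gaps in the weekly mask-use survey series by linear interpolation.
yougov = yougov.sort_values("week")
yougov["mask_use"] = yougov["mask_use"].interpolate()

# Kendall's Tau between mask use and deaths at a one to four-week lag,
# illustrated for the United States series only.
us = (weekly[weekly["location_name"] == "United States"]
      .merge(yougov, on="week")
      .sort_values("week"))
for lag in range(1, 5):
    tau, p = kendalltau(us["mask_use"].iloc[:-lag], us["deaths"].iloc[lag:])
    print(f"lag={lag} weeks: tau={tau:.2f}, p={p:.3f}")
```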
I intended these results to provide preliminary statistical targets for how the sentiment data would need to perform to be viable as a variable in a COVID-19 spread model. These baseline tests also focused our sentiment tests on the death data, establishing that a four-week lag may be insufficient for changes in death rates to manifest, although the change in deaths from week to week may show early indications of “flattening the curve”. The positive correlation between mask use and infections was unexpected (Table 2). This pattern could reflect a greater willingness to wear masks when infections are increasing. The negative correlation between masks and deaths at a three to four-week lag does suggest mask use may reduce infections (given that data for deaths are the most reliable of the case measures). At a three to four-week lag, mobility reductions did have a negative correlation to infections, as we expected (Table 3).
I utilized the TextBlob library, which is pre-trained on IMDB movie reviews, to calculate sentiment scores (polarity and subjectivity) for all the tweets. With the Pandas library, we utilized the sentiment scores and tweet metrics to calculate six different sentiment metrics for each term each day (see Appendix for formulas): Polarity (1), Polarity*Subjectivity (2), Polarity+Metric (3), Polarity+Metric*Subjectivity (4), Polarity*Metric (5), Polarity*Metric*Subjectivity (6). Finally, I calculated mean scores for each search word for each sentiment metric type by week.
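A minimal sketch of this scoring step follows, assuming that “Metric” denotes the sum of a tweet’s public engagement counts (likes + retweets + quotes + replies) and that formula (4) scales Polarity+Metric by Subjectivity; the exact formulas are given in the Appendix.

```python
from textblob import TextBlob

def tweet_sentiments(text, metric):
    """Compute the six sentiment variants for one tweet.

    `metric` is assumed to be the sum of the tweet's public engagement counts
    (likes + retweets + quotes + replies); see the Appendix for the exact
    formulas used in the study.
    """
    scores = TextBlob(text).sentiment
    pol, subj = scores.polarity, scores.subjectivity  # pol in [-1, 1], subj in [0, 1]
    return {
        "polarity": pol,                                     # (1)
        "polarity_subjectivity": pol * subj,                 # (2)
        "polarity_plus_metric": pol + metric,                # (3)
        "polarity_plus_metric_subj": (pol + metric) * subj,  # (4)
        "polarity_metric": pol * metric,                     # (5)
        "polarity_metric_subj": pol * metric * subj,         # (6)
    }

# Example: a tweet with 12 likes, 3 retweets, 1 quote, and 2 replies.
print(tweet_sentiments("Masks are annoying but they keep people safe.", 12 + 3 + 1 + 2))
```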
Making use of Pandas DataFrame merge and pivot capabilities, I merged the datasets and again used the Scipy library to conduct statistical tests. Normality tests indicated that the distributions of multiple variables across our datasets were not normal; this finding was not unexpected given the limited sample sizes. As a result, I used the Kendall’s Tau test for correlation and a Loess transformation to visualize the regression line. I utilized the Altair library to create visualizations of the data.
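For the visualization step, the sketch below layers a Loess-smoothed trend line over a scatterplot using Altair’s transform_loess. The small weekly frame and its values are invented placeholders for illustration, not study results.

```python
import altair as alt
import pandas as pd

# Illustrative weekly data: mean "mask" sentiment vs. mask-use survey results.
merged = pd.DataFrame({
    "sentiment": [0.08, 0.05, 0.11, 0.02, -0.01, 0.04, 0.09],
    "mask_use":  [71, 73, 70, 76, 78, 75, 72],
})

base = alt.Chart(merged).encode(
    x=alt.X("sentiment:Q", title="Mean weekly 'mask' sentiment"),
    y=alt.Y("mask_use:Q", title="Mask use (% of respondents)"),
)

# Layer the raw points under a Loess-smoothed regression line.
chart = base.mark_circle() + base.transform_loess("sentiment", "mask_use").mark_line()
chart.save("mask_sentiment_vs_use.html")
```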
Based on this research, Twitter sentiment scores could have potential for inclusion in models of public opinion and behavior. I found Twitter sentiment scores for “mask” and “quarantine” cannot be used as a direct replacement for the survey/technical data used to measure mask use and mobility reduction. Correlations (τ) were sufficiently high to support exploring the construction of models, but word choices must be narrow. When comparing Twitter sentiment scores to confirmed infections, estimated infections, and deaths at a one to four-week lag, we found only the term “mask” showed any indication of potential for building a model of future spread. The terms “quarantine”, “COVID”, and “coronavirus” had highly inconsistent correlations across and within metric types. We suspect these terms are too broad and capture too many tweet topics and threads for consistency.
Using our baseline understanding of the level of correlation (τ) and the degree of significance demonstrated by the mask use and mobility data against case statistics in the IHME model (see above, Tables 2 and 3), I examined the correlation of “mask” sentiments to both the mask use survey data and to confirmed cases, estimated cases, and deaths. I plotted the “mask” sentiment curves in the same field as the mask use data to visually compare their shapes (Figure 4). I also created scatterplots to explicitly correlate the data (Figure 5).
When comparing “mask” sentiment metrics to the survey data on masks, the Polarity*Metric score performed the best, τ(7) = -0.69, p < 0.05 (see Table 4). The importance of including the tweet metrics data to achieve a correlation with survey data is not surprising, as many of these metrics presumably reflect positive engagement and shared opinions, potentially creating a larger and more accurate sample. While the causal relationship is unclear, as people wear masks more, they may also complain more about it.
When comparing the sentiment metrics to case data, I created visualizations to allow comparison of the curves at a one to four-week lag (Figure 6).
Rather than create scatterplots for each of the sentiment metrics versus each case metric, we shifted directly to statistical tests, given the number of variables we were analyzing. The best statistical match was for the Polarity*Subjectivity metric at a two to four-week lag: τ2weeks(7) = 1, p = 0.02; τ3weeks(6) = 1, p = 0.08; τ4weeks(5) = 1, p = 0.33 (see Table 5). As we iterated our tests with more data available, the p-values consistently declined. This positive correlation indicates that positive sentiments about masks are associated with rises in cases and deaths, not the negative correlation I would have expected if sentiment were a measure of behavior.
Because our baseline IHME data performed the best for the mobility reduction data, I hoped the “quarantine” sentiment data would also be useful. None of the p-values were significant (and they were consistently higher than for the “mask” sentiment data), and the τ values were also not as strong. The scatterplot of sentiment against the mobility reduction numbers shows diffuse data and generally weak relationships. We suspect the term “quarantine” may not have been the best match for mobility data; it may be too narrow a reference, with a more medical context. I had initially considered “social distancing” as a keyword, but “distancing” would have required stemming to capture the various forms of the verb. Social distancing is also more a description of interpersonal behavior than a measure of community and general movement patterns, which is what the mobility score captures.
I also conducted correlation tests for the “COVID” and “coronavirus” sentiments. The results indicated these terms are not useful for modeling future COVID-19 spread. The data for these terms were especially muddled by Twitter’s use of AI to add topics to tweets, and the datasets partially overlapped as a result.
While this project was challenged by a small dataset and imprecise geographic coverage, the results for “mask” sentiments are sufficiently promising to justify additional research. If the selection of key terms and the approach to processing could be refined, this could be a quick and inexpensive way for researchers to gather data about specific public behaviors relevant to predictive models. While our research focused on COVID-19, this approach could be applied to any model based on public opinion and behaviors. While sentiment analysis is already gaining popularity for brand management, it is also used for more focused research on emerging topics for political research and market opportunity analysis. For these applications to extend beyond English-speaking countries, the training datasets for sentiment analysis tools need to be expanded. Moving forward with our specific research questions is entirely dependent on expanded access to Twitter data, including historical and geolocated data, to build more comparable datasets. Additionally, research to determine whether retweets, quotes, and replies should be counted as positive metrics for a tweet might further refine the role of tweet metrics in creating overall sentiment scores. Overall, our research was a compelling opportunity to explore how the rapidly expanding qualitative data available via social media can be structured and quantified for analysis.
IHME’s model of COVID-19 spread is a fascinating effort in its creation of a hybrid disease spread model. It was central to inspiring and executing our project. YouGov, a commercial survey platform, is making many COVID-19 datasets freely available, some of which might be able to provide additional nuance to the technically focused mobility data. We look forward to further exploring Twitter’s 2.0 APIs and the promised expanded access for researchers; these advancements represent a commendable commitment by Twitter to the academic research community as Twitter works to secure its long-term profitability.