Large-scale language models for innovation and technology intelligence: sentiment analysis on news articles

Matthias Plaue   ·   May 8, 2023

Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes. [Liu 2020]

Introduction: sentiment analysis in business intelligence

As an instrument for business intelligence, sentiment analysis of product reviews can provide decision support [Dellarocas et al. 2007, Chen & Xie 2008] and has been recognized in the industry as a powerful tool to mine for marketing insights [Wertz 2018, Newberry 2022]. Gauging public sentiment can even be a component of predicting the stock market and thus informing investment decisions [Bollen et al. 2011].

In the context of innovation and technology intelligence, sentiment analysis can provide a variety of actionable insights, for example:

  • public attitudes and opinions towards an emerging technology that could impact the speed of its adoption, e.g., driverless cars [Kwarteng et al. 2020],
  • the impact of adopting a new technology on the adopters’ reputation [Caviggioli et al. 2020],
  • consumer sentiment towards high-tech products [Kauffmann et al. 2020], or technology and environmental policy, e.g., plastic pollution control [Sun et al. 2022],
  • factors that are generally associated with startup business success [Saura et al. 2019].

That is why MAPEGY’s innovation and technology intelligence tool MAPEGY.SCOUT features the Sentiments panel, which allows the user to gauge the general public attitude towards a technology, trend topic, or organization:

Sentiment analysis of Tesla; positive vs. negative news

The goal of this report is the evaluation of different algorithmic approaches to sentiment analysis in terms of accuracy, implementation effort, and performance.

Setup and scope of evaluation

Advanced sentiment analysis can extract the emotional state expressed by the author of a text, such as one or more of the primary emotional states of fear, anger, sadness, or joy [Kemper 1987].

More basic sentiment analysis aims at determining the polarity of a text: whether it conveys a negative or positive sentiment. Numerically, polarity may be gauged by a polarity score that takes values in some continuous range, typically the interval [-1, 1]. For the purpose of this evaluation, we will treat sentiment analysis as a classification task and assign each text to one of the categories “negative,” “neutral,” or “positive.”

Sentiment analysis can be performed on document level, sentence level, or aspect level. Aspect-level sentiment analysis aims to determine polarity towards a specific opinion target. For the purpose of this evaluation, we will determine the sentiment on document level.

Typical data mined for opinion or sentiment include product reviews, social media posts, and news articles. Several labeled datasets in the general domain of news, business, and finance are freely available on Kaggle. For the purpose of this evaluation, we will use a custom dataset.

Algorithmically, the problem of sentiment analysis can be solved either by rule-based approaches, using dictionaries like Lexicoder or VADER, or by applying machine learning [Dang et al. 2020].
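For illustration, a dictionary-based polarity score can be obtained with a few lines of Python; the snippet below uses the vaderSentiment package (our choice of package; VADER also ships with NLTK):

!pip install vaderSentiment --quiet

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# "compound" is the normalized, signed polarity score in [-1, 1];
# "neg", "neu", and "pos" are the proportions of the text rated
# negative, neutral, and positive.
print(analyzer.polarity_scores(
    "Sydney is flooded, again, as climate crisis becomes new normal"))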

Specifically, we compare the performance of the following methods:

  • VADER,
  • a large-scale language model (LLM) finetuned on an open, non-domain-specific dataset (“pretrained”),
  • an LLM finetuned on a dataset labeled by MAPEGY.

Preparation of training and test data

As of May 2023, the MAPEGY Innovation Graph includes more than 70 million news articles — collected since 2016 — from feeds relevant to the domain of innovation and technology. This evaluation is based on 1100 manually labeled articles taken from this collection.

Sentiment can be an expression of the author’s attitude or emotion towards the reported event or the subject matter.

However, most news articles are neither opinion pieces nor advertising. Keeping in mind the goal of presenting an analysis relevant to the domain of innovation and technology, the articles have also been labeled according to criteria that are tied to the event or fact being reported, irrespective of the manner in which it is reported.

Here are a few typical examples from the (test) dataset:

Negative:

  • Tesla’s Next-Gen Roadster Seems To Break Down During Test Ride By Chief Designer [accident; unreliable technology]
  • China slowdown weighs on revenue growth at internet giant Alibaba [economic decline]
  • Sydney is flooded, again, as climate crisis becomes new normal for Australia’s most populous state [natural disaster]

Neutral:

  • Smart metering major component of Tauron’s updated research agenda [factual; no clear up-/downside]
  • Wireless Power Transmission Market Trends Analysis, Top Manufacturers, Shares, Growth Opportunities and Forecast to 2026 [neutral market report]
  • VIDEO: Pros And Cons Of Condition Monitoring Services [balanced opinion]

Positive:

  • Vertex Global Services wins Multiple Brand Opus Awards for innovation, technology and business [company winning award]
  • How air quality and weather data help cities boost resilience [improvement; pro-sustainability action]
  • Gene Therapy Sees Encouraging Success In Child With Duchenne Muscular Dystrophy [success; improvement through technology]

Implementation and used technologies

The evaluation was run in Google Colab, with the training and test data stored in Google Sheets.
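In Colab, the labeled data can be pulled from a sheet with a few lines; the following is a minimal sketch using the gspread package (our choice of client; the sheet name is a placeholder):

from google.colab import auth
from google.auth import default
import gspread
import pandas as pd

# Authenticate the Colab user and authorize the Sheets client.
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# "sentiment_labels" is a hypothetical sheet name.
worksheet = gc.open("sentiment_labels").sheet1
df = pd.DataFrame(worksheet.get_all_records())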

The Python packages numpy (scientific computing), pandas (shaping data), sklearn (machine learning), and matplotlib (data visualization) are well-known tools for data analysis. Additionally, the Hugging Face transformers library and an implementation of VADER were used.

The MAPEGY dataset was randomly split into 300 test examples and 800 training examples for finetuning.
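A sketch of the split, assuming the labeled examples live in a pandas DataFrame df with columns text and label (both names are our convention):

from sklearn.model_selection import train_test_split

# 800 training examples for finetuning, 300 held-out test examples;
# the random seed is an arbitrary choice for illustration.
train_df, test_df = train_test_split(df, test_size=300, random_state=42)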

The VADER polarity scores p were trichotomized: negative if p < -0.37, positive if p > 0.37, and neutral otherwise. The threshold was determined by a simple grid search maximizing prediction accuracy on the training dataset.
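In code, the trichotomization and the grid search might look as follows, assuming the labels are encoded as -1 (negative), 0 (neutral), and 1 (positive), and a grid step of 0.01 (our assumption):

import numpy as np

def trichotomize(p, threshold=0.37):
    # Map a polarity score p in [-1, 1] to -1, 0, or 1.
    return -1 if p < -threshold else (1 if p > threshold else 0)

def best_threshold(scores, labels, grid=np.arange(0.01, 1.0, 0.01)):
    # Pick the threshold that maximizes accuracy on the training data.
    accuracy = lambda t: np.mean(
        [trichotomize(p, t) == y for p, y in zip(scores, labels)])
    return max(grid, key=accuracy)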

The pretrained model was selected by virtue of being well documented and one of the most downloaded models on the Hugging Face hub: cardiffnlp/twitter-roberta-base-sentiment-latest [Barbieri et al. 2020].

Other models that include a “neutral” category were initially considered as well.

For finetuning, the distilroberta-base model was used, since it promises fast inference while maintaining good accuracy [Sanh et al. 2019].
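A minimal finetuning sketch using the Keras interface of transformers, assuming train_texts and train_labels (encoded 0 = negative, 1 = neutral, 2 = positive) have been extracted from the training split; learning rate and batch size are illustrative choices:

import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=3)

# Tokenize with the same limits used at inference time.
tokens = dict(tokenizer(list(train_texts), padding=True, truncation=True,
                        max_length=512, return_tensors="tf"))

# Without an explicit loss, the model's built-in classification loss is used.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5))
model.fit(tokens, np.array(train_labels), epochs=4, batch_size=16)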

In general, all tasks could be implemented with only a few lines of code. For example, the following code installs transformers and runs the pretrained model on one short example text:

!pip install transformers --quiet

from transformers import pipeline

# Text classification pipeline; top_k=None returns scores for all classes.
# padding/truncation/max_length are passed through to the tokenizer.
sa = pipeline("sentiment-analysis",
              model="cardiffnlp/twitter-roberta-base-sentiment-latest",
              top_k=None, device=0, padding=True, truncation=True,
              max_length=512, verbose=False)

# Each prediction is a list of {"label", "score"} dicts, one per class.
print(sa("China slowdown weighs on revenue growth at internet giant Alibaba"))

Aside from straightforward programming, most of the effort went into labeling the dataset and debugging errors caused by feeding unexpected data types or token counts to the models.

Evaluation results

The following confusion matrices have been normalized by the number of predicted labels, i.e., the columns sum to one. The baseline accuracy given by a class majority vote (always guessing “neutral”) is 46%.
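This normalization corresponds to sklearn's normalize="pred" option; a sketch, assuming y_true and y_pred hold the test labels and predictions as strings:

from sklearn.metrics import accuracy_score, confusion_matrix

labels = ["negative", "neutral", "positive"]
# normalize="pred" makes each column (predicted label) sum to one.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="pred")
print(cm)
print("accuracy:", accuracy_score(y_true, y_pred))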

Inference speeds were measured on the “standard GPU” setting of Google Colab. Total runtime of finetuning (4 epochs): 223 seconds.

VADER — 50% accuracy; inference speed: ~1900 test examples/sec.

Pretrained LLM — 67% accuracy; inference speed: ~80 test examples/sec.

Finetuned LLM — 80% accuracy; inference speed: ~40 test examples/sec.

On a CPU, LLM inference speed drops to about 3–6 test examples/sec.

Ensuring reproducibility of the finetuning results appears to be difficult even when fixing both the numpy and tensorflow random seeds: model accuracy may vary in a range of about 75–80%.

The examples listed earlier were selected before test time, solely based on the subjective assessment that they were “typical.” Incidentally, the finetuned model delivers 100% accuracy on those examples.

VADER delivers a signed polarity score out of the box. Even though the LLMs have been used for classification, they also provide the logits associated with each class. This information allows for the computation of a polarity score for these models as well: first, the logits are mapped onto a probability distribution p(-1), p(0), p(1) by applying a softmax function; the polarity score s is then the expected value of the class labels, s = (-1) ⋅ p(-1) + 0 ⋅ p(0) + 1 ⋅ p(1) = p(1) - p(-1).
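A sketch of this computation, assuming the logits are ordered as (negative, neutral, positive):

import numpy as np

def polarity_from_logits(logits):
    # Numerically stable softmax over the three class logits.
    z = np.exp(logits - np.max(logits))
    p = z / z.sum()
    # s = (-1)*p(-1) + 0*p(0) + 1*p(1)
    return p[2] - p[0]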

This allows us to compare all methods by means of the polarity scores each produces. The following is the correlation matrix based on Spearman’s ρ:
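Assuming the per-method scores have been collected into equally long lists (variable names are ours), the matrix can be computed directly with pandas:

import pandas as pd

scores = pd.DataFrame({
    "VADER": vader_scores,            # signed scores, one per test example
    "pretrained": pretrained_scores,
    "finetuned": finetuned_scores,
})
print(scores.corr(method="spearman"))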

Conclusion

  • All models work as well as can be expected: they rarely confuse negative with positive sentiment in news articles.
  • While fast, VADER delivers results barely above the baseline.
  • Finetuning, even with a very small training dataset (only a few hundred training examples), yields a considerable boost in accuracy.
  • On a “standard” GPU, determining the sentiment of 70 million news articles can be expected to take 10–20 days of computation on a single machine.
  • Without further optimization, running inference on a CPU for large document collections cannot be recommended: computing polarity for 70 million news articles can be expected to take up to 280 days of computation time (see the back-of-the-envelope check below).
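The runtime estimates follow directly from the measured throughput; as a back-of-the-envelope check:

SECONDS_PER_DAY = 86_400
ARTICLES = 70_000_000

# Examples/sec: pretrained (GPU), finetuned (GPU), worst-case CPU.
for rate in (80, 40, 3):
    print(f"{rate:>3}/sec -> {ARTICLES / rate / SECONDS_PER_DAY:.0f} days")
# ~10 days, ~20 days, ~270 days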

Acknowledgment

This blog post has been produced in the context of the research project KI4BoardNET funded by the Federal Ministry of Education and Research (Germany).

References

[Liu 2020] Bing Liu. “Sentiment Analysis.” 2nd ed. Studies in Natural Language Processing. Cambridge University Press (Oct. 2020).

[Dellarocas et al. 2007] Chrysanthos Dellarocas, Xiaoquan (Michael) Zhang and Neveen F. Awad. “Exploring the value of online product reviews in forecasting sales: The case of motion pictures.” Journal of Interactive Marketing 21, no. 4 (2007): 23–45.

[Chen & Xie 2008] Yubo Chen and Jinhong Xie. “Online Consumer Review: Word-of-Mouth as a New Element of Marketing Communication Mix.” Management Science 54, no. 3 (2008): 477–91.

[Wertz 2018] Jia Wertz. “Why Sentiment Analysis Could Be Your Best Kept Marketing Secret.” Forbes (Nov. 2018).

[Newberry 2022] Christina Newberry. “Social Media Sentiment Analysis: Tools and Tips for 2023.” Hootsuite Blog (Sep. 2022).

[Bollen et al. 2011] Johan Bollen, Huina Mao and Xiaojun Zeng. “Twitter mood predicts the stock market.” Journal of Computational Science 2, no. 1 (2011): 1–8.

[Caviggioli et al. 2020] F. Caviggioli, L. Lamberti, P. Landoni and P. Meola. “Technology adoption news and corporate reputation: sentiment analysis about the introduction of Bitcoin.” Journal of Product & Brand Management 29, no. 7 (2020): 877–897.

[Kauffmann et al. 2020] Erick Kauffmann, Jesús Peral, David Gil, Antonio Ferrández, Ricardo Sellers and Higinio Mora. “A framework for big data analytics in commercial social networks: A case study on sentiment analysis and fake review detection for marketing decision-making.” Industrial Marketing Management 90 (2020): 523–537.

[Sun et al. 2022] Ying Sun, Deyun Wang, Xiaoshui Li, Yiqing Chen and Haixiang Guo. “Public attitudes toward the whole life cycle management of plastics: A text-mining study in China.” Science of The Total Environment 859, no. 1 (2022).

[Saura et al. 2019] Jose Ramon Saura, Pedro Palos-Sanchez and Antonio Grilo. “Detecting Indicators for Startup Business Success: Sentiment Analysis Using Text Data Mining.” Sustainability 11, no. 3 (2019): 917.

[Kwarteng et al. 2020] M.A. Kwarteng, A. Ntsiful, R.K. Botchway, M. Pilik and Z.K. Oplatková. “Consumer Insight on Driverless Automobile Technology Adoption via Twitter Data: A Sentiment Analytic Approach.” In: S.K. Sharma, Y.K. Dwivedi, B. Metri, N.P. Rana (eds): Re-imagining Diffusion and Adoption of Information Technology and Systems: A Continuing Conversation. TDIT 2020. IFIP Advances in Information and Communication Technology 617 (2020). Springer, Cham.

[Lexicoder] Lori Young and Stuart Soroka. “Affective News: The Automated Coding of Sentiment in Political Texts.” Political Communication 29, no. 2 (2012): 205–231.

[VADER] C. Hutto and Eric Gilbert. “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text.” Proceedings of the International AAAI Conference on Web and Social Media 8, no. 1 (May 2014): 216–225.

[Dang et al. 2020] Nhan Cach Dang, María N. Moreno-García and Fernando De la Prieta. “Sentiment Analysis Based on Deep Learning: A Comparative Study.” Electronics 9, no. 3 (2020): 483.

[Kemper 1987] Theodore D. Kemper. “How Many Emotions Are There? Wedding the Social and the Autonomic Components.” American Journal of Sociology 93, no. 2 (1987): 263–289.

[Barbieri et al. 2020] Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa Anke and Leonardo Neves. “TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification.” Findings of the Association for Computational Linguistics: EMNLP 2020 (2020): 1644–1650.

[Sanh et al. 2019] Victor Sanh, Lysandre Debut, Julien Chaumond and Thomas Wolf. “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.” ArXiv abs/1910.01108 (2019).