How is the Reported Sentiment Score Calculated?
The reported sentiment score on Buzzlytix is a single number between -1 and 1 that summarizes the overall mood of news coverage across all analyzed articles and sources. Here's how we calculate it:
1. Article Collection & Preprocessing
- We gather articles from a variety of news sources daily.
- Each article is processed to extract its main content and title.
2. Sentiment Analysis
- For each article, we use a 5-class sentiment model (very positive, positive, neutral, negative, very negative) to predict the sentiment of the article's content.
- If the content is missing, a fallback 2-class model is used on the title.
- Each article's sentiment is represented as a probability distribution across the five classes.
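As a rough illustration of this step, the sketch below picks the model input for an article and turns raw per-class scores into a probability distribution. The function names, the article dictionary layout, and the normalization step are assumptions made for this example, not the Buzzlytix implementation. How the 2-class fallback result is expressed across five classes is not described above, so it is left out.

```python
# Class labels as listed in step 2.
CLASSES = ("very negative", "negative", "neutral", "positive", "very positive")


def select_model_input(article):
    """Choose which model and text to use for one article (hypothetical helper).

    Returns ("5-class", content) when content is present,
    otherwise ("2-class", title) as the fallback.
    """
    content = (article.get("content") or "").strip()
    if content:
        return "5-class", content
    return "2-class", (article.get("title") or "").strip()


def as_distribution(raw_scores):
    """Normalize non-negative per-class scores so they sum to 1."""
    missing = [c for c in CLASSES if c not in raw_scores]
    if missing:
        raise ValueError(f"missing classes: {missing}")
    total = sum(raw_scores[c] for c in CLASSES)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {c: raw_scores[c] / total for c in CLASSES}
```

An article with an empty or missing content field falls through to the title, which mirrors the fallback rule above.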
3. Keyword Extraction & Consolidation
- We extract key entities (people, organizations, places, etc.) from each article's title using spaCy's named entity recognition.
- Keywords are consolidated to group similar or related terms together, ensuring accurate aggregation.
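The consolidation rules are not spelled out here, so the following is only a minimal sketch of one plausible approach: normalize case and whitespace, drop a trailing possessive, and map known variants to a canonical form. The alias table and both function names are hypothetical.

```python
import re

# Hypothetical alias table; the real grouping rules are not described above.
ALIASES = {
    "u.s.": "united states",
    "usa": "united states",
}


def canonical_keyword(raw):
    """Map a raw entity string to a canonical, lowercased keyword."""
    key = re.sub(r"\s+", " ", raw).strip().casefold()
    key = re.sub(r"'s$", "", key)  # "Tesla's" -> "tesla"
    return ALIASES.get(key, key)


def consolidate(keywords):
    """Group raw keywords under their canonical form, preserving input order."""
    groups = {}
    for kw in keywords:
        groups.setdefault(canonical_keyword(kw), []).append(kw)
    return groups
```

With this sketch, "USA" and "U.S." end up in the same group, so their articles would be aggregated together in the next step.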
4. Aggregating Sentiment by Keyword
- For each keyword, we aggregate the sentiment distributions from all articles mentioning it.
- The per-class probabilities (soft counts) are summed, and only keywords mentioned in at least 5 articles are included in the final statistics.
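The aggregation above can be sketched as follows. The article layout (a "keywords" list and a "distribution" dictionary) and the function name are assumptions for illustration. Only the summing of probabilities and the 5-article threshold come from the description.

```python
from collections import defaultdict

CLASSES = ("very negative", "negative", "neutral", "positive", "very positive")
MIN_ARTICLES = 5  # threshold stated in step 4


def aggregate_by_keyword(articles, min_articles=MIN_ARTICLES):
    """Sum per-class probabilities for each keyword and drop rare keywords."""
    soft_counts = defaultdict(lambda: dict.fromkeys(CLASSES, 0.0))
    article_counts = defaultdict(int)
    for article in articles:
        # set() so a keyword repeated within one article is counted once
        for keyword in set(article["keywords"]):
            article_counts[keyword] += 1
            for cls in CLASSES:
                soft_counts[keyword][cls] += article["distribution"].get(cls, 0.0)
    return {
        kw: dict(counts)
        for kw, counts in soft_counts.items()
        if article_counts[kw] >= min_articles
    }
```

Summing probabilities rather than hard labels means an article the model is unsure about contributes a little to several classes instead of a full count to one.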
5. Calculating the Overall Sentiment Score
- For each keyword, we compute a sentiment score by assigning a value to each class: very negative = -2, negative = -1, neutral = 0, positive = 1, very positive = 2.
- The score is normalized to the range [-1, 1] and damped by the proportion of neutral sentiment (to reduce the impact of uncertainty).
- The final reported sentiment score is the average of these damped scores across all keywords.
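A minimal sketch of this calculation follows. The class values and the final averaging come from the steps above. The exact normalization and damping formulas are not stated, so this example assumes the weighted average is divided by 2 to reach [-1, 1] and then multiplied by (1 - neutral share). The function names are hypothetical.

```python
CLASS_VALUES = {
    "very negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "very positive": 2,
}


def keyword_score(soft_counts):
    """Damped sentiment score in [-1, 1] for one keyword's soft counts."""
    total = sum(soft_counts.get(c, 0.0) for c in CLASS_VALUES)
    if total <= 0:
        return 0.0
    probs = {c: soft_counts.get(c, 0.0) / total for c in CLASS_VALUES}
    expected = sum(CLASS_VALUES[c] * p for c, p in probs.items())  # in [-2, 2]
    normalized = expected / 2.0  # assumed normalization to [-1, 1]
    return normalized * (1.0 - probs["neutral"])  # assumed damping rule


def reported_score(per_keyword_counts):
    """Average the damped scores across all keywords."""
    scores = [keyword_score(c) for c in per_keyword_counts.values()]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a keyword whose soft counts are split evenly between positive and neutral has a weighted average of 0.5, a normalized score of 0.25, and a damped score of 0.125 under these assumptions.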
Interpretation: A score near 1 means coverage is overwhelmingly positive, a score near -1 means it is overwhelmingly negative, and a score near 0 means neutral or mixed coverage.
Technical Details
- Sentiment models: OpenAI API (gpt-4.1-nano, 5-class prompt-based classification).
- Keyword extraction: spaCy large English model.
- Data is updated daily and statistics are recalculated for each new batch of articles.