Sentiment analysis for stocks in the S&P 500


In my last post, I detailed my valuation of Tesla and concluded that mood and momentum are in the driver’s seat in Tesla’s incredible ride to a price of USD 400 per share (USD 2,000 per share before the stock split). This underscores the importance of taking such factors into consideration even in value investing, since investors should want them to work in their favor.

Getting started — Data

I scraped news titles and their release times from a website that aggregates news from a variety of sources (Bloomberg, Motley Fool, Reuters, etc.), which reduces selection bias in the collected data. I include the release time because I time-weight the news when calculating each stock’s sentiment score (the more recent a news item, the larger its effect on the stock’s overall score). Overall, 102 of the 500 stocks have sufficient data for further sentiment analysis.

Sentiment Analysis — Methodology and algorithm

For each stock, I use a combination of natural language processing (NLP) packages: spacy (for lemmatization), nltk (for stop-word removal), and vader (for sentiment analysis).
Lemmatization is a common data-cleaning step in NLP. It reduces each word to its base form (its lemma) so that the text is easier for sentiment analysis algorithms to process. For example, in the sentence “he goes to the park and exercises.”, the word “goes” is transformed into its base form, “go”, while “he” is mapped to a generic “Pronoun” placeholder.
Stop-words in English are common words such as “the”, “to”, “a”, etc. These words serve grammatical purposes and appear very often but do not convey much meaning. As a result, they are removed before the whole sentence is analyzed.
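As a rough illustration of stop-word removal, here is a minimal Python sketch. It is not the code used for this project: the STOP_WORDS set is a tiny hand-picked stand-in for nltk's much longer English list, and the function name remove_stop_words is made up for this example. The regular expression also discards punctuation as a side effect of tokenizing.

```python
import re
from typing import List

# Tiny illustrative stop-word set (an assumption for this sketch);
# nltk's English stop-word list is much longer.
STOP_WORDS = {"the", "to", "a", "an", "and", "of", "in", "is"}


def remove_stop_words(title: str) -> List[str]:
    """Lowercase a title, split it into word tokens, and drop stop-words."""
    # Keep runs of letters, digits and apostrophes; punctuation is discarded.
    tokens = re.findall(r"[a-z0-9']+", title.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stop_words("He goes to the park and exercises."))
# → ['he', 'goes', 'park', 'exercises']
```

One thing to watch when using a full stop-word list: nltk's English list includes negations such as "not" and "no", so removing them can change what a headline means.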
After punctuation marks are also removed, the sentence is ready for sentiment analysis. The Vader algorithm scores each word; the word scores are then added up and weighted to give an overall score for the whole sentence (the news title).
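To make the "score each word, then combine" idea concrete, here is a toy Python sketch. It is not Vader itself: TOY_LEXICON and its values are invented for illustration (Vader ships its own lexicon), and the sketch leaves out Vader's extra rules for things like negation and intensifiers. The one piece borrowed from Vader is the final normalization, which squashes the summed score x into the range (-1, 1) as x / sqrt(x^2 + 15).

```python
import math
from typing import Dict, List

# Hypothetical word valences, for illustration only.
TOY_LEXICON: Dict[str, float] = {
    "surge": 2.0,
    "beat": 1.5,
    "gain": 1.5,
    "miss": -1.5,
    "lawsuit": -2.0,
    "plunge": -2.5,
}


def title_score(tokens: List[str], alpha: float = 15.0) -> float:
    """Sum per-word valences and normalize the sum into (-1, 1)."""
    total = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    # Words missing from the lexicon contribute 0, so an unknown title scores 0.
    return total / math.sqrt(total * total + alpha)


print(title_score(["shares", "surge", "earnings", "beat"]))   # positive
print(title_score(["shares", "plunge", "lawsuit"]))           # negative
```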
All news items from a given day are weighted equally: news from today has a weight of 1, while older news has a weight of less than 1 (news from n days ago has a weight of 0.95^n).
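The time-weighting can be sketched in a few lines of Python. The 0.95 decay per day comes from the description above; everything else is an assumption made for this example, including the function names time_weight and stock_sentiment, the sample dates and scores, and the choice to combine titles as a weighted average rather than a weighted sum.

```python
from datetime import date
from typing import List, Tuple

DECAY = 0.95  # per-day decay factor from the text


def time_weight(published: date, today: date, decay: float = DECAY) -> float:
    """Weight of a news item published n days before `today`: decay ** n."""
    age_days = (today - published).days
    if age_days < 0:
        raise ValueError("news item is dated in the future")
    return decay ** age_days


def stock_sentiment(scored_titles: List[Tuple[date, float]], today: date) -> float:
    """Time-weighted average of title scores (averaging is an assumption)."""
    if not scored_titles:
        return 0.0
    weights = [time_weight(published, today) for published, _ in scored_titles]
    weighted_sum = sum(w * score for w, (_, score) in zip(weights, scored_titles))
    return weighted_sum / sum(weights)


# Made-up example: three titles with sentiment scores already computed.
today = date(2020, 9, 10)
titles = [
    (date(2020, 9, 10), 0.5),   # today, weight 1
    (date(2020, 9, 9), -0.2),   # 1 day old, weight 0.95
    (date(2020, 8, 31), 0.8),   # 10 days old, weight 0.95 ** 10
]
print(stock_sentiment(titles, today))
```

Because every title from the same day gets the same weight, titles within a day are treated equally, as described above.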


The following graph shows the distribution of sentiment scores produced by the Vader algorithm, as well as the five companies with the most positive (>0) and the five with the most negative (<0) sentiment scores:

The ways forward

Sentiment analysis has a variety of applications. It is especially useful in pricing and explaining multiples. As opposed to intrinsic valuation, the essence of multiples is to understand how investors perceive other similar assets, which is why sentiment plays a large part in such calculations and should be incorporated when analyzing multiples.


Here is the link to my data and to my code for this project:


