Methodology

1. We built our program on an Amazon Web Services (AWS) server, partly for the cost efficiency of a cloud server, but also so that the program can run at all hours, even when our local computers are off.

2. We started with one Reuters RSS feed and then scaled to include feeds from MarketWatch. For each RSS feed, we first scraped the XML content using the Python package feedparser and saved critical information, such as the date, hashed document ID, and URL, in our MySQL database. We then encoded the XML as ASCII using Python's encode function, grabbed the article URL from the ASCII text, and opened that URL with the Python package requests. Next, we used the BeautifulSoup package to extract all of the text from the article. Once we had the text, we ran a function we wrote to determine which company the article is about: it simply counts the number of times each Fortune 500 company is mentioned. This information was also stored in the database. (A condensed sketch of this pipeline appears after this list.)

3. If the article was new (i.e., its hashed document ID was not already in the database), we ran a naive Bayes classifier on the article text; a sketch of this uniqueness check also appears after the list. See the sentiment analysis section below for a more in-depth discussion.

4. When we used the naive Bayes classifier, we actually used the probability of the classification rather than the classification itself, in order to get a continuous scoring scale: we generated the probability that the document is positive, rescaled it linearly onto a -1 to 1 scale, and stored that value in the database (see the rescaling sketch after the list).
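A condensed sketch of the scraping pipeline from step 2 is below. The feed URL, the truncated company list, and the helper name are illustrative placeholders, not our production values:

    import hashlib

    import feedparser
    import requests
    from bs4 import BeautifulSoup

    FEED_URL = "http://feeds.reuters.com/reuters/businessNews"  # placeholder feed
    FORTUNE_500 = ["Apple", "Walmart", "Exxon Mobil"]  # truncated illustrative list

    def company_mentions(text):
        # Count how many times each Fortune 500 company is mentioned
        return {name: text.count(name) for name in FORTUNE_500}

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        url = entry.link
        doc_id = hashlib.md5(url.encode("ascii", "ignore")).hexdigest()  # hashed document ID
        published = entry.get("published", "")

        # Fetch the article page and pull all of its text
        html = requests.get(url).text
        soup = BeautifulSoup(html, "html.parser")
        text = " ".join(p.get_text() for p in soup.find_all("p"))

        mentions = company_mentions(text)
        # published, doc_id, url, and mentions are then written to the MySQL database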
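The uniqueness check from step 3 can be a simple primary-key lookup on the hashed ID. A minimal sketch, assuming a MySQL table named articles with doc_id as its primary key (both names are hypothetical):

    def is_new_article(cursor, doc_id):
        # True if this hashed document ID is not yet in the database
        cursor.execute("SELECT 1 FROM articles WHERE doc_id = %s", (doc_id,))
        return cursor.fetchone() is None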
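The rescaling from step 4 is linear. A sketch using NLTK's prob_classify, assuming the trained classifier and a feature extractor already exist:

    def sentiment_score(classifier, features):
        # Scale P(positive) from [0, 1] onto a continuous [-1, 1] score
        p_pos = classifier.prob_classify(features).prob("pos")
        return 2 * p_pos - 1  # 0.5 maps to 0 (neutral); 1.0 to +1; 0.0 to -1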


Sentiment Analysis and Pertinent Results

Naïve Bayes classifier: Bayes' formula is:

$$P(C \mid X) = \frac{P(X \mid C)\,P(C)}{P(X)}$$

For our purposes, this means that the probability of an article being positive or negative given its features (words) is proportional to the probability of those features occurring given that the article is positive or negative, times the prior probability of the article being positive or negative. The naïve assumption is that the features occur independently, so we can say:

$$P(C \mid x_1, \ldots, x_n) \propto P(C)\prod_{i=1}^{n} P(x_i \mid C)$$

Naïve Bayes is a commonly used classifier for sentiment analysis, not only because of its interpretability and algorithmic simplicity but also because of its accuracy in testing. When run on thousands of strictly positive or negative reviews, the algorithm trains on the positive and negative words and produces a powerful classifier. Below are the testing results and the scores of the most heavily weighted words: we trained on 1,600 instances, tested on 400, and achieved a testing accuracy of 73.5%. The output also lists the most heavily weighted words. For example, "outstanding" shows up in positive reviews 13.9 times as often as it shows up in negative reviews, while "insulting" shows up in negative reviews 13.7 times as often as it appears in positive ones.


(Screenshot: the training and testing code with its output, including the testing accuracy and the list of most informative features.)
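The screenshot showed a training run along the lines of the sketch below. We assume NLTK's movie_reviews corpus and presence/absence word features, which match the 2,000-review total and 1,600/400 split above; the exact feature extractor is an assumption:

    import random

    import nltk
    from nltk.corpus import movie_reviews

    def word_features(words):
        # Boolean presence/absence features, one per word
        return {word: True for word in words}

    # Label each review by the folder it lives in ('pos' or 'neg')
    documents = [(word_features(movie_reviews.words(fid)), category)
                 for category in movie_reviews.categories()
                 for fid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    train_set, test_set = documents[:1600], documents[1600:]  # 1600 train / 400 test
    classifier = nltk.NaiveBayesClassifier.train(train_set)

    print(nltk.classify.accuracy(classifier, test_set))
    classifier.show_most_informative_features(10)  # e.g. "outstanding", "insulting"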


The bag-of-words method is also a commonly used implementation for running sentiment analysis on texts. It keeps a count of the frequency of each word in a document, unlike our naive Bayes features, which only record whether or not a word appears. This set of word counts is kept as a vector X = (x_1, ..., x_n) and is considered the feature set for the naive Bayes classifier. The frequencies are used to estimate the probability of a given word w_i appearing in the document given that the article belongs to class C_j (positive or negative), and these probabilities in turn determine the probability of an article being positive given the feature vector X.

Before a document can be used for analysis in the bag-of-words method, it is necessary to clean the data. This involves removing common stop words such as "the," "and," "a," etc. In addition, the words in the document need to be parsed into a list so that an interpreter (we used Python) can process them. We used the NLTK module nltk.corpus to aid in preprocessing: we imported its set of stopwords, which contains commonly used stop words, and iterated through each word in the article, removing it if it was a stop word. Last, all of the words were transformed to lower case. While a more in-depth sentiment analysis might use case to detect heightened sentiment on a subject, we chose not to do so for this application. A sketch of this preprocessing appears below.
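This is a minimal version of that cleaning step, assuming NLTK's English stopword list and tokenizer are installed (nltk.download('stopwords') and nltk.download('punkt') may be needed first); the sample sentence is illustrative:

    from collections import Counter

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    STOPWORDS = set(stopwords.words("english"))

    def clean(text):
        # Lowercase, tokenize, and drop stop words like "the", "and", "a"
        tokens = word_tokenize(text.lower())
        return [w for w in tokens if w.isalpha() and w not in STOPWORDS]

    # Word frequencies form the bag-of-words feature vector
    bag_of_words = Counter(clean("Shares of the company rose sharply after the earnings report"))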