Text Mining the Enron Email Corpus for Stock Price Prediction

Doherty, Conor (2014) Text Mining the Enron Email Corpus for Stock Price Prediction. Diploma thesis, Dublin, National College of Ireland.



Enron was an American energy trading corporation that went bankrupt in December 2001 after manipulating markets through financial chicanery involving offshore companies and off-balance-sheet operations. Following its collapse, investigators made public a 10 GB corpus of 500k emails from 160 executives. This dissertation principally examined the novel possibility that email service providers or spying agencies could profitably text mine an email corpus, using aggregate monthly email sentiment or aggregate monthly email volume to predict stock price movements for the purposes of quasi insider trading. First, the email in the “sent” folders of Enron executives was pre-processed by a Python script in the Enthought Canopy environment to generate a text file containing 125,000 email records with from, to, date, and negative sentiment fields. Negative sentiment was predicted by a Naive Bayes classifier trained on a movie reviews sentiment corpus. Then Datameer, a big data analytics tool that runs on Hadoop, was used to read the pre-processed email data, aggregate it through a high-level spreadsheet interface, and visualise the results as infographics. The visual results indicated little or no relationship between either aggregate email sentiment or aggregate email volume and stock price movements. It was concluded that email service providers or security agencies cannot derive insider trading information from simple text mining of corporate email. This should be reassuring for corporate email users. Finally, based on a learnability and usability analysis, Datameer was positively evaluated as a promising tool with a market niche among business analysts proficient in Excel who wish to extend their skill sets to heterogeneous big data analytics without having to learn low-level Hadoop programming.
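
The pre-processing and aggregation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's script, which this record does not reproduce. It assumes a hand-rolled Naive Bayes classifier with add-one smoothing, a two-document toy training set standing in for the movie reviews corpus, and plain Python dictionaries in place of the Datameer aggregation. All names (`NaiveBayes`, `email_record`, `aggregate_monthly`) and all sample emails are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict
from email import message_from_string
from email.utils import parsedate_to_datetime


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Two-class (neg/pos) multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"neg": Counter(), "pos": Counter()}
        self.doc_counts = Counter()

    def train(self, labelled_docs):
        for text, label in labelled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokenize(text))

    def prob_negative(self, text):
        vocab = set(self.word_counts["neg"]) | set(self.word_counts["pos"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("neg", "pos"):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                if word in vocab:  # words unseen in training are ignored
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert log scores to a normalised probability of the "neg" class.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["neg"] / (weights["neg"] + weights["pos"])


def email_record(raw, classifier):
    """Reduce one raw email to from, to, month, and negative sentiment."""
    msg = message_from_string(raw)
    sent = parsedate_to_datetime(msg["Date"])
    return {
        "from": msg["From"],
        "to": msg["To"],
        "month": sent.strftime("%Y-%m"),
        "negative": classifier.prob_negative(msg.get_payload()),
    }


def aggregate_monthly(records):
    """Compute email volume and mean negative sentiment per month."""
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["month"]].append(rec["negative"])
    return {
        month: {"volume": len(vals), "mean_negative": sum(vals) / len(vals)}
        for month, vals in by_month.items()
    }


# Toy training data (assumption): stands in for a movie reviews corpus.
classifier = NaiveBayes()
classifier.train([
    ("terrible awful loss bad", "neg"),
    ("great good profit excellent", "pos"),
])

# Hypothetical sample emails, not taken from the Enron corpus.
raw_emails = [
    "From: a@example.com\nTo: b@example.com\n"
    "Date: Mon, 15 Oct 2001 09:00:00 -0700\n\nThis quarter was a terrible loss",
    "From: a@example.com\nTo: c@example.com\n"
    "Date: Tue, 16 Oct 2001 10:30:00 -0700\n\nGreat profit and excellent results",
    "From: b@example.com\nTo: a@example.com\n"
    "Date: Mon, 05 Nov 2001 08:15:00 -0800\n\nAwful bad news",
]

records = [email_record(raw, classifier) for raw in raw_emails]
monthly = aggregate_monthly(records)
for month in sorted(monthly):
    print(month, monthly[month])
```

The resulting monthly volume and mean negative sentiment series are the kind of aggregates that would then be compared against monthly stock price movements.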

Item Type: Thesis (Diploma)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Divisions: School of Computing > Higher Diploma in Science in Data Analytics
Date Deposited: 12 Dec 2014 16:39
Last Modified: 12 Dec 2014 16:39
