NORMA eResearch @NCI Library

A Study of Different Pre-Processing Approaches of Text Categorization

Narke, Sharan Prabhulinga (2017) A Study of Different Pre-Processing Approaches of Text Categorization. Masters thesis, Dublin, National College of Ireland.


Abstract

Text pre-processing is the process of converting raw text data into a corpus (bag of words), which is then fed into different classifiers for text categorization. This paper presents the results of an experimental study of several text pre-processing techniques used with various classification algorithms. The main aim is to identify the pre-processing technique that yields the best classifier performance. In particular, text pre-processing techniques such as Document-Term Matrix (DTM), Term-Document Matrix (TDM) and Term Frequency-Inverse Document Frequency (TF-IDF) were used with 10 different classifiers on the BBC News dataset. A comparative performance analysis of the classifiers was conducted using evaluation metrics such as Accuracy, Precision, Recall and F-score. The results indicate that TF-IDF is the better pre-processing method, giving better classifier performance than DTM and TDM.
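
To make the three representations named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the thesis's code: the three-document toy corpus, the whitespace tokenizer and the idf variant ln(N / df) are all assumptions chosen for illustration, and the abstract does not state which weighting scheme or tooling the author used.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; NOT the BBC News dataset used in the thesis).
docs = [
    "stocks rise as markets rally",
    "team wins cup final",
    "markets fall as stocks slide",
]

# Assumed tokenizer: split on whitespace only.
tokenized = [d.split() for d in docs]
vocab = sorted({term for doc in tokenized for term in doc})
counts = [Counter(doc) for doc in tokenized]

# Document-term matrix (DTM): one row per document, one column per term.
dtm = [[c[term] for term in vocab] for c in counts]

# Term-document matrix (TDM): the transpose of the DTM.
tdm = [list(column) for column in zip(*dtm)]

# TF-IDF with raw term frequency and idf = ln(N / df).
# This idf variant is an assumption; other smoothed variants are common.
n_docs = len(docs)
df = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / d) for d in df]
tfidf = [[row[j] * idf[j] for j in range(len(vocab))] for row in dtm]
```

In this sketch the DTM and TDM hold the same counts in transposed layouts, while TF-IDF reweights those counts so that terms occurring in fewer documents receive larger values.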

Item Type: Thesis (Masters)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science
Q Science > QA Mathematics > Computer software
T Technology > T Technology (General) > Information Technology > Computer software
Divisions: School of Computing > Master of Science in Data Analytics
Depositing User: Caoimhe Ní Mhaicín
Date Deposited: 28 Aug 2018 11:15
Last Modified: 28 Aug 2018 11:15
URI: https://norma.ncirl.ie/id/eprint/3084
