TRAP@NCI

Technical Report: Open Lake

Mohamed, Idriss (2018) Technical Report: Open Lake. Undergraduate thesis, Dublin, National College of Ireland.


Abstract

This document gives an overview of a data lake system. The report was prepared in partial fulfilment of the requirements for the BSc (Honours) in Computing programme at the National College of Ireland.

The system is intended for companies that need large data sets that are clean, complete and relevant to their business. The growing volume of publicly available open data brings significant problems. This report illustrates the benefits of extracting data from public open data sites and storing it in a data lake. Data lake storage is cheap to build and can hold a large amount of data.

Massive open public datasets are available in Ireland and Europe. The system extracts as much data as possible from such sites and stores it in the data lake. Data transformations are required to standardize it. Machine learning techniques can then be employed to extract useful patterns that several business units could use. The technical part of the report focuses on the process of building the data lake system.

The data extracted from the different websites came in different formats, such as .doc, .xls, .pdf, .xml, .json, .px and various others. A web crawler written in Python exhaustively searches through all the child links of a website for data. The extraction queries were written in PostgreSQL using the pgAdmin package. MS Excel was used to store the list of target URLs, which served as the input document for the extraction program. The extracted files were stored in HDFS, and Hive was used to query the Hadoop file system. The entire project was hosted on GitHub.
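
The crawling step described above can be illustrated with a minimal sketch. This is not the thesis code: it assumes a breadth-first search restricted to the starting host, uses only the Python standard library, and treats a link as a data file when its path ends in one of the extensions listed in the abstract. The names `crawl`, `fetch_html`, `LinkParser`, `DATA_EXTENSIONS` and the `max_pages` limit are hypothetical and chosen for illustration only.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Extensions named in the abstract; using them as a filter is an assumption.
DATA_EXTENSIONS = (".doc", ".xls", ".pdf", ".xml", ".json", ".px")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=1000):
    """Breadth-first search over same-host child links.

    Returns a sorted list of URLs whose path ends in a data extension.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    data_files = set()
    pages_visited = 0

    while queue and pages_visited < max_pages:
        page_url = queue.popleft()
        pages_visited += 1
        try:
            html = fetch(page_url)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _fragment = urldefrag(urljoin(page_url, href))
            if link in seen:
                continue
            seen.add(link)
            parsed = urlparse(link)
            if parsed.path.lower().endswith(DATA_EXTENSIONS):
                data_files.add(link)  # record the file, do not parse it
            elif parsed.scheme in ("http", "https") and parsed.netloc == host:
                queue.append(link)  # child page on the same site
    return sorted(data_files)
```

In a setup like the one the abstract describes, the start URLs would be read from the spreadsheet of target sites and `crawl` called once per site; the `fetch` parameter exists so the traversal can be exercised against in-memory pages without network access.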

Item Type: Thesis (Undergraduate)
Subjects: Q Science > QA Mathematics > Electronic computers. Computer science
T Technology > T Technology (General) > Information Technology > Electronic computers. Computer science

Q Science > QA Mathematics > Computer software
T Technology > T Technology (General) > Information Technology > Computer software
Divisions: School of Computing > Bachelor of Science (Honours) in Computing
Depositing User: CAOIMHE NI MHAICIN
Date Deposited: 07 Nov 2018 09:38
Last Modified: 07 Nov 2018 09:38
URI: http://trap.ncirl.ie/id/eprint/3467
