Performance Study on Indexing and Accessing of Small File in Hadoop Distributed File System

Rodrigues, Anisha P; Fernandes, Roshan; Vijaya, P.; Chander, Satish

Anisha P Rodrigues, Roshan Fernandes, P. Vijaya and Satish Chander
Additional contact information
Anisha P Rodrigues: Department of Computer Science and Engineering, NMAM Institute of Technology, Nitte, India
Roshan Fernandes: Department of Computer Science and Engineering, NMAM Institute of Technology, Nitte, India
P. Vijaya: Department of Mathematics & Computer Science, Modern College of Business and Sciences, Bowshar, Sultanate of Oman
Satish Chander: Department of Computer Science and Engineering, Birla Institute of Technology, Ranchi, India

Journal of Information & Knowledge Management (JIKM), 2021, vol. 20, issue 04, 1-21

Abstract: The Hadoop Distributed File System (HDFS) is designed to efficiently store and handle vast quantities of files in a distributed environment over a cluster of computers. A Hadoop cluster is built from commodity hardware, which is inexpensive and easily available. Storing a large number of small files in HDFS consumes more memory and degrades performance, because small files place a heavy load on the NameNode. Several techniques improve the efficiency of indexing and accessing small files on HDFS: archive files, New Hadoop Archive (New HAR), CombineFileInputFormat (CFIF), and sequence file generation. The archive file combines the small files into single blocks. The New HAR file combines the smaller files into a single large file. The CFIF module merges multiple files into a single split using the NameNode, and the sequence file combines all the small files into a single sequence. The indexing and accessing of small files in HDFS are evaluated using performance metrics such as processing time and memory usage. The experiments show that the sequence file generation approach is more efficient than the other approaches, with a file access time of 1.5 s, memory usage of 20 KB in a multi-node setup, and a processing time of 0.1 s.
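
The idea behind combining small files into a single sequence can be illustrated with a minimal Python sketch. It is not the Hadoop SequenceFile format and not the authors' implementation: the record layout (two 4-byte big-endian lengths, then the file name as key and the file contents as value) and the helper names pack_small_files and read_file are assumptions made only for this illustration. The sketch shows how many small files can become one container with an offset index, so that a lookup touches one large object instead of many small ones.

```python
import io
import struct

# Hypothetical record layout (an assumption for this sketch, not Hadoop's format):
#   [key length: 4 bytes][value length: 4 bytes][key bytes][value bytes]
HEADER = ">II"                        # two big-endian unsigned 32-bit integers
HEADER_SIZE = struct.calcsize(HEADER)


def pack_small_files(files):
    """Merge {name: bytes} into one blob and return (blob, index).

    The index maps each file name to the byte offset of its record.
    """
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        key = name.encode("utf-8")
        index[name] = buf.tell()      # remember where this record starts
        buf.write(struct.pack(HEADER, len(key), len(data)))
        buf.write(key)
        buf.write(data)
    return buf.getvalue(), index


def read_file(blob, index, name):
    """Return the contents of one small file using the offset index."""
    offset = index[name]
    key_len, value_len = struct.unpack_from(HEADER, blob, offset)
    start = offset + HEADER_SIZE + key_len
    return blob[start:start + value_len]


if __name__ == "__main__":
    small_files = {"a.txt": b"hello", "b.txt": b"world!"}
    blob, index = pack_small_files(small_files)
    print(read_file(blob, index, "b.txt"))
```

In this sketch the index is held in memory alongside the container. In HDFS, the benefit claimed for such merging is that the NameNode tracks one large file rather than an entry per small file.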

Keywords: Hadoop Distributed File System; MapReduce; Hadoop Archive; CombineFileInputFormat; sequence file
Date: 2021

Downloads: (external link)
http://www.worldscientific.com/doi/abs/10.1142/S0219649221500519
Access to full text is restricted to subscribers

Persistent link: https://EconPapers.repec.org/RePEc:wsi:jikmxx:v:20:y:2021:i:04:n:s0219649221500519

DOI: 10.1142/S0219649221500519

Journal of Information & Knowledge Management (JIKM) is currently edited by Professor Suliman Hawamdeh

More articles in Journal of Information & Knowledge Management (JIKM) from World Scientific Publishing Co. Pte. Ltd.
Bibliographic data for series maintained by Tai Tone Lim.