Performance Study on Indexing and Accessing of Small File in Hadoop Distributed File System
Anisha P Rodrigues,
Roshan Fernandes,
P. Vijaya and
Satish Chander
Additional contact information
Anisha P Rodrigues: Department of Computer Science and Engineering, NMAM Institute of Technology, Nitte, India
Roshan Fernandes: Department of Computer Science and Engineering, NMAM Institute of Technology, Nitte, India
P. Vijaya: Department of Mathematics & Computer Science, Modern College of Business and Sciences, Bowshar, Sultanate of Oman
Satish Chander: Department of Computer Science and Engineering, Birla Institute of Technology, Ranchi, India
Journal of Information & Knowledge Management (JIKM), 2021, vol. 20, issue 04, 1-21
Abstract:
The Hadoop Distributed File System (HDFS) was developed to efficiently store and handle vast quantities of files in a distributed environment over a cluster of computers. A Hadoop cluster is built from commodity hardware, which is inexpensive and easily available. A large number of small files stored in HDFS consumes more memory and degrades performance, because small files place a heavy load on the NameNode. The efficiency of indexing and accessing small files on HDFS is therefore improved by several techniques, such as archive files, New Hadoop Archive (New HAR), CombineFileInputFormat (CFIF), and sequence file generation. The archive file combines the small files into single blocks. The New HAR file combines the smaller files into a single large file. The CFIF module merges multiple files into a single split using the NameNode, and the sequence file combines all the small files into a single sequence. The indexing and accessing of small files in HDFS are evaluated using performance metrics such as processing time and memory usage. The experiments show that the sequence file generation approach is more efficient than the other approaches, with a file access time of 1.5 s, memory usage of 20 KB in multi-node, and a processing time of 0.1 s.
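The idea shared by the merging techniques named in the abstract, packing many small files into one container and keeping an index for direct access, can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation and not the actual Hadoop SequenceFile or HAR on-disk format; the record layout and the function names `pack_small_files`, `read_file`, and `iter_records` are assumptions made for the example.

```python
import io
import struct


def pack_small_files(files):
    """Pack {name: bytes} into one blob of length-prefixed key/value records.

    Returns (blob, index), where index maps each name to the byte offset
    of its record inside the blob. The layout is hypothetical, chosen only
    to illustrate merging small files into a single container.
    """
    buf = io.BytesIO()
    index = {}
    for name, data in files.items():
        key = name.encode("utf-8")
        index[name] = buf.tell()
        # Record layout: 4-byte key length, 4-byte value length, key, value.
        buf.write(struct.pack(">II", len(key), len(data)))
        buf.write(key)
        buf.write(data)
    return buf.getvalue(), index


def read_file(blob, index, name):
    """Fetch one file's content directly through the index."""
    offset = index[name]
    key_len, val_len = struct.unpack_from(">II", blob, offset)
    start = offset + 8 + key_len
    return blob[start:start + val_len]


def iter_records(blob):
    """Scan the container sequentially, yielding (name, content) pairs."""
    pos = 0
    while pos < len(blob):
        key_len, val_len = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + key_len].decode("utf-8")
        pos += key_len
        yield key, blob[pos:pos + val_len]
        pos += val_len
```

With this layout, the file system would track one large object instead of one entry per small file, while the index still allows a single file to be read without scanning the whole container.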
Keywords: Hadoop Distributed File System; MapReduce; Hadoop Archive; CombineFileInputFormat; sequence file
Date: 2021
Downloads:
http://www.worldscientific.com/doi/abs/10.1142/S0219649221500519
Access to full text is restricted to subscribers
Persistent link: https://EconPapers.repec.org/RePEc:wsi:jikmxx:v:20:y:2021:i:04:n:s0219649221500519
DOI: 10.1142/S0219649221500519
Journal of Information & Knowledge Management (JIKM) is currently edited by Professor Suliman Hawamdeh
More articles in Journal of Information & Knowledge Management (JIKM) from World Scientific Publishing Co. Pte. Ltd.
Bibliographic data for series maintained by Tai Tone Lim.