An Optimal Data Placement Strategy for Improving System Performance of Massive Data Applications Using Graph Clustering
S. Vengadeswaran and
S. R. Balasundaram
Additional contact information
S. Vengadeswaran: National Institute of Technology, Tiruchirappalli, India
S. R. Balasundaram: National Institute of Technology, Tiruchirappalli, India
International Journal of Ambient Computing and Intelligence (IJACI), 2018, vol. 9, issue 3, 15-30
Abstract:
This article describes how the time taken to execute a query and return the results, increase exponentially as the data size increases, leading to more waiting times of the user. Hadoop with its distributed processing capability is considered as an efficient solution for processing such large data. Hadoop's Default Data Placement Strategy (HDDPS) allocates the data blocks randomly across the cluster of nodes without considering any of the execution parameters. This result in non-availability of the blocks required for execution in local machine so that the data has to be transferred across the network for execution, leading to data locality issue. Also, it is commonly observed that most of the data intensive applications show grouping semantics. Hence during query execution, only a part of the Big-Data set is utilized. Since such execution parameters and grouping behavior are not considered, the default placement does not perform well resulting in several lacunas such as decreased local map task execution, increased query execution time, query latency, etc. In order to overcome such issues, an Optimal Data Placement Strategy (ODPS) based on grouping semantics is proposed. Initially, user history log is dynamically analyzed for identifying access pattern which is depicted as a graph. Markov clustering, a Graph clustering algorithm is applied to identify groupings among the dataset. Then, an Optimal Data Placement Algorithm (ODPA) is proposed based on the statistical measures estimated from the clustered graph. This in turn re-organizes the default data layouts in HDFS to achieve improved performance for Big-Data sets in heterogeneous distributed environment. Our proposed strategy is tested in a 15 node cluster placed in a single rack topology. The result has proved to be more efficient for massive datasets, reducing query execution time by 26% and significantly improves the data locality by 38% compared to HDDPS.
Date: 2018
References: Add references at CitEc
Citations:
Downloads: (external link)
http://services.igi-global.com/resolvedoi/resolve. ... 018/IJACI.2018070102 (application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:igg:jaci00:v:9:y:2018:i:3:p:15-30
Access Statistics for this article
International Journal of Ambient Computing and Intelligence (IJACI) is currently edited by Nilanjan Dey
More articles in International Journal of Ambient Computing and Intelligence (IJACI) from IGI Global
Bibliographic data for series maintained by Journal Editor ().