A Performance Analysis of Hybrid and Columnar Cloud Databases for Efficient Schema Design in Distributed Data Warehouse as a Service
Fred Eduardo Revoredo Rabelo Ferreira and
Robson do Nascimento Fidalgo
Additional contact information
Fred Eduardo Revoredo Rabelo Ferreira: Center of Informatics, Federal University of Pernambuco (UFPE), Recife 50670-901, PE, Brazil
Robson do Nascimento Fidalgo: Center of Informatics, Federal University of Pernambuco (UFPE), Recife 50670-901, PE, Brazil
Data, 2024, vol. 9, issue 8, 1-24
Abstract:
A Data Warehouse (DW) is a centralized database that stores large volumes of historical data for analysis and reporting. As enterprise data grows exponentially, new architectures are being investigated to overcome the deficiencies of traditional Database Management Systems (DBMSs), driving a shift towards cloud-based solutions that provide distributed processing, columnar storage, and horizontal scalability without the overhead of physical hardware management, i.e., Database as a Service (DBaaS). Choosing the appropriate class of DBMS is a critical decision for organizations, because differences in architecture, data model, and storage affect both data volume and query performance when supporting analytics in a distributed cloud environment. We therefore carry out an experimental evaluation of several DBaaS offerings and of the impact of data modeling, specifically a partially normalized Star Schema versus a fully denormalized Flat Table Schema, to understand how they behave under different configurations of data schema, storage format, memory availability, and cluster size. The analysis uses two data volumes generated by a well-established benchmark and compares the DWs in terms of average execution time, memory usage, data volume, and loading time. Our results provide guidelines for efficient DW design, showing, for example, that denormalizing the schema does not guarantee improved performance, as the solutions performed differently depending on their architecture. We also show that a Hybrid Transactional/Analytical Processing (HTAP) NewSQL solution can outperform solutions that support only Online Analytical Processing (OLAP) in terms of overall execution time, but that the performance of each query is strongly influenced by its selectivity and by the number of join operations it contains.
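As an illustration of the two schema designs the abstract contrasts, the following is a minimal sketch, not the authors' benchmark or any of the evaluated systems. It uses Python's built-in `sqlite3` module only because it runs anywhere; the table names (`dim_customer`, `fact_sales`, `flat_sales`), columns, and sample rows are hypothetical. It shows that the same aggregate needs a join under a Star Schema and none under a Flat Table Schema, which is why the number of joins is a factor in query performance.

```python
import sqlite3

# Hypothetical toy data; not taken from the paper's benchmark.
CUSTOMERS = [(1, "AMERICA"), (2, "ASIA"), (3, "AMERICA")]
SALES = [(10, 1, 100), (11, 2, 250), (12, 3, 50), (13, 2, 75)]

# Star Schema query: the fact table is joined to a dimension table.
STAR_QUERY = """
    SELECT d.region, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.region
    ORDER BY d.region
"""

# Flat Table query: the dimension attribute is already stored on each row.
FLAT_QUERY = """
    SELECT region, SUM(revenue)
    FROM flat_sales
    GROUP BY region
    ORDER BY region
"""


def build_demo():
    """Create an in-memory database holding both schema variants."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Star Schema: a fact table that references a dimension table.
    cur.execute(
        "CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT)"
    )
    cur.execute(
        "CREATE TABLE fact_sales ("
        "sale_id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES dim_customer(customer_id), "
        "revenue INTEGER)"
    )
    # Flat Table Schema: one fully denormalized table.
    cur.execute(
        "CREATE TABLE flat_sales (sale_id INTEGER PRIMARY KEY, region TEXT, revenue INTEGER)"
    )
    cur.executemany("INSERT INTO dim_customer VALUES (?, ?)", CUSTOMERS)
    cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", SALES)
    # Denormalize once at load time by materializing the join.
    cur.execute(
        "INSERT INTO flat_sales "
        "SELECT f.sale_id, d.region, f.revenue "
        "FROM fact_sales AS f "
        "JOIN dim_customer AS d ON d.customer_id = f.customer_id"
    )
    conn.commit()
    return conn


def revenue_by_region(conn, query):
    """Run one of the two queries and return (region, total_revenue) rows."""
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = build_demo()
    print(revenue_by_region(conn, STAR_QUERY))
    print(revenue_by_region(conn, FLAT_QUERY))
```

Both queries return the same rows. The flat table trades extra storage (the `region` value is repeated on every sale) for the absence of a join at query time; whether that trade pays off depends, as the abstract notes, on the architecture of the system in question.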
Keywords: data warehouse; NewSQL; distributed databases; columnar databases; OLAP; HTAP; data modeling; performance analysis
JEL-codes: C8 C80 C81 C82 C83
Date: 2024
Downloads:
https://www.mdpi.com/2306-5729/9/8/99/pdf (application/pdf)
https://www.mdpi.com/2306-5729/9/8/99/ (text/html)
Persistent link: https://EconPapers.repec.org/RePEc:gam:jdataj:v:9:y:2024:i:8:p:99-:d:1449846
Data is currently edited by Ms. Becky Zhang
Bibliographic data for series maintained by MDPI Indexing Manager.