Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications
A Practical Guide to PySpark for Data Engineers: Common Functions and Application Examples
Moussa Keita
MPRA Paper from University Library of Munich, Germany
Abstract:
The field of Big Data is commonly characterized by data volumes so large that it is impossible to store and process them on a single machine. Instead, data are stored across a group of machines called a "cluster". IT engineers therefore had to devise new technological solutions to process and exploit data distributed across a cluster. Apache Spark is one of these solutions: a framework for applying parallel computations to data stored on several cluster nodes. PySpark is the implementation of the Spark framework in the Python programming language. The purpose of this document is to review the common parallel processing functions used by Big Data engineers working with PySpark.
Keywords: RDD; Dataframe; Big Data; PySpark; Hive; HDFS; csv; kafka
JEL-codes: C8
Date: 2022-06
New Economics Papers: this item is included in nep-dem
Downloads: https://mpra.ub.uni-muenchen.de/113562/1/MPRA_paper_113562.pdf (original version, application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Persistent link: https://EconPapers.repec.org/RePEc:pra:mprapa:113562
More papers in MPRA Paper from University Library of Munich, Germany Ludwigstraße 33, D-80539 Munich, Germany. Contact information at EDIRC.
Bibliographic data for series maintained by Joachim Winter.