EconPapers
Guide Pratique de PySpark pour Data Engineer: Fonctions Usuelles et Exemples d’Applications

Practical Guide of PySpark for Data Engineer: Common Functions and Application Examples

Moussa Keita

MPRA Paper from University Library of Munich, Germany

Abstract: The area of Big Data is commonly characterized by situations where data volumes are too large to store and process on a single machine. Instead, the data are stored across a group of machines called a "cluster". New technological solutions therefore had to be devised by IT engineers to process and exploit data distributed across a cluster, and Apache Spark is one of them. Spark is a framework for applying parallel computations to data stored on several cluster nodes; PySpark is the implementation of the Spark framework in the Python programming language. The purpose of this document is to review the common parallel processing functions used by Big Data engineers working with PySpark.

Keywords: RDD; Dataframe; Big Data; PySpark; Hive; HDFS; csv; kafka
JEL-codes: C8
Date: 2022-06
New Economics Papers: this item is included in nep-dem

Downloads: (external link)
https://mpra.ub.uni-muenchen.de/113562/1/MPRA_paper_113562.pdf original version (application/pdf)

Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.


Persistent link: https://EconPapers.repec.org/RePEc:pra:mprapa:113562


More papers in MPRA Paper from University Library of Munich, Germany Ludwigstraße 33, D-80539 Munich, Germany. Contact information at EDIRC.
Bibliographic data for series maintained by Joachim Winter.

Page updated 2025-03-19
Handle: RePEc:pra:mprapa:113562