GitHub API based QuantNet Mining infrastructure in R
Lukas Borke and
Wolfgang Härdle
No 2017-008, SFB 649 Discussion Papers from Humboldt University Berlin, Collaborative Research Center 649: Economic Risk
Abstract:
QuantNet being an online GitHub based organization is an integrated environment consisting of different types of statistics-related documents and program codes called Quantlets. The QuantNet Style Guide and the yamldebugger package allow a standardized audit and validation of YAML annotated software repositories within this organization. The behavior statistics of QuantNet users are measured with Web Metrics from Google Analytics. We show how the search queries obtained from Google's metrics can be used in the test collections in order to calibrate and evaluate the information retrieval (IR) performance of QuantNet's search engine called QuantNetXploRer. For that purpose, different text mining (TM) models will be examined by means of the new TManalyzer package. Further, we introduce the Validation Pipeline (Vali-PP) and apply it on the YAML data. Vali-PP is a functional multi-staged instrument for clustering analysis, providing multivariate statistical analysis of the co-occurrence distribution of driving factors of the pipeline. The new package rgithubS, which enables a GitHub wide search for code and repositories using the GitHub Search API and which is an essential element of the QuantNet Mining infrastructure, is briefly presented. The TManalyzer results show that for all considered single term queries the number of true positives is maximal in a latent semantic analysis model configuration (LSA50). The Vali-PP analysis indicates that the optimality of the combination LSA50 and hierarchical clustering (HC) applies to 70 ? 90% of the cluster sizes for most of the considered quality indices. Further, we can infer that more accurate and comprehensive metadata increases the clustering quality. Subsequently, the findings of our experimental design are implemented into the QuantNetXploRer. The GitHub API driven QuantNetXploRer can be found and mined under http://www.quantlet.de
Keywords: Code Search; Software Repositories; Text Mining; Information Retrieval; Smart Data; YAML; GitHub Search API; Google Analytics; Web Metrics; LSA; GVSM; Cluster Validation; Validation Pipeline (search for similar items in EconPapers)
Date: 2017
References: View references in EconPapers View complete reference list from CitEc
Citations:
Downloads: (external link)
https://www.econstor.eu/bitstream/10419/162509/1/882061216.pdf (application/pdf)
Related works:
This item may be available elsewhere in EconPapers: Search for items with the same title.
Export reference: BibTeX
RIS (EndNote, ProCite, RefMan)
HTML/Text
Persistent link: https://EconPapers.repec.org/RePEc:zbw:sfb649:sfb649dp2017-008
Access Statistics for this paper
More papers in SFB 649 Discussion Papers from Humboldt University Berlin, Collaborative Research Center 649: Economic Risk Contact information at EDIRC.
Bibliographic data for series maintained by ZBW - Leibniz Information Centre for Economics ().