Big Data Software Analytics with Apache Spark

by Gousios, Georgios

You can get a pre-print version from here.
You can view the publisher's page here.
See the paper's associated code repository: gousiosg/icse-tb

Abstract

At the beginning of every research effort, researchers in empirical software engineering have to go through the processes of extracting data from raw data sources and transforming it into what their tools expect as inputs. This step is time-consuming and error-prone, while the produced artifacts (code, intermediate datasets) are usually not of scientific value. In recent years, Apache Spark has emerged as a solid foundation for data science and has taken the big data analytics domain by storm. We believe that the primitives exposed by Apache Spark can help software engineering researchers create and share reproducible, high-performance data analysis pipelines. In our technical briefing, we discuss how researchers can profit from Apache Spark, through a hands-on case study.

Bibtex record

@inproceedings{G18,
  author = {Gousios, Georgios},
  title = {Big Data Software Analytics with Apache Spark},
  booktitle = {Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings},
  series = {ICSE '18},
  year = {2018},
  isbn = {978-1-4503-5663-3},
  location = {Gothenburg, Sweden},
  pages = {542--543},
  numpages = {2},
  doi = {10.1145/3183440.3183458},
  acmid = {3183458},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {Apache Spark, big data, data analytics},
  speakerdeck = {845a7da19b28426b9c96530dfacaa56f},
  github = {gousiosg/icse-tb},
  url = {/pub/sw-analytics-spark.pdf}
}
