R and Spark: Tools for Data Science Workflows. A new NISS course

NISS is introducing a two-day "R and Spark: Tools for Data Science Workflows" course to enable statisticians to work with Big Data. The course will be offered on September 14-15, 2017, at the American Statistical Association offices in Alexandria, Va., and on September 30 - October 1, 2017, at the University of California, Riverside.

"This course enables statisticians to expand their data analysis skills, tools, and workflows in a natural way to those required for Big Data," says E. James Harner, Professor Emeritus of Statistics and Adjunct Professor of Management Information Systems at West Virginia University.

Developed in the early 1990s as an open-source alternative to Bell Labs’ S statistical language, R is a flexible, extensible statistical computing environment, but it is limited to single-core execution. Spark is a relatively new distributed computing environment that extends R, a first-class programming language, to multiple processors.
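
As a concrete illustration, here is a minimal sketch of attaching R to Spark, assuming the sparklyr package (one of several R front ends to Spark; SparkR is another, and the course's exact toolchain is not specified here):

    # Minimal connection sketch, assuming the sparklyr front end.
    library(sparklyr)

    # "local" runs Spark on the current machine; on a real cluster,
    # master would instead point to a resource manager such as YARN.
    sc <- spark_connect(master = "local")
    spark_version(sc)      # confirm the connection is live
    spark_disconnect(sc)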

Spark has increased the effectiveness and efficiency with which Big Data is analyzed. It is used by major search engine organizations such as Google and Yahoo, and companies such as LinkedIn and Amazon use it to match advertisements to users in smart ways.

Another advantage of Spark and R is that both are open source and available without cost. They can therefore be included in derivative products and packages built for novel applications that benefit from the efficiency of Spark.

The course is useful to graduate students and data analysts who work with Big Data, and to federal employees who want to gear up for the emerging role of Big Data in government. "Big Data is coming on strong in many fields and we encourage people interested in Big Data to be a part of this R & Spark course taught by Harner, who is an expert in computational statistics and statistical machine learning and the chairman of the Interface Society," says David Banks, NISS Assistant Director.

Course Outline

The course covers the initial steps in the data science process (a brief workflow sketch in R follows the outline):

  • extracting data from source systems
  • transforming data into a tidy form
  • loading data into distributed file systems, distributed data warehouses, and NoSQL databases (with the two steps above, the ETL process)
  • importing data into Spark for transformation and modeling workflows
  • using supervised learning to build and evaluate models
  • using unsupervised learning to structure data
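
To give a flavor of these steps in practice, here is a minimal end-to-end sketch, again assuming the sparklyr interface; the dataset and variable names are illustrative, not taken from the course materials:

    # Illustrative Spark workflow in R via sparklyr.
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")

    # Import: bring data into Spark as a distributed DataFrame.
    cars <- copy_to(sc, mtcars, "cars", overwrite = TRUE)

    # Transform: dplyr verbs are translated to Spark SQL and executed
    # on the cluster, yielding a tidy table for modeling.
    tidy <- cars %>%
      filter(!is.na(mpg)) %>%
      mutate(tons = wt / 2)    # wt is recorded in 1000-lb units

    # Supervised learning: fit a linear model with Spark MLlib and
    # evaluate it on held-out data.
    splits <- sdf_random_split(tidy, train = 0.75, test = 0.25, seed = 42)
    fit <- ml_linear_regression(splits$train, mpg ~ tons + cyl)
    ml_evaluate(fit, splits$test)

    # Unsupervised learning: k-means clustering to structure the data.
    ml_kmeans(tidy, ~ tons + mpg, k = 3)

    spark_disconnect(sc)

The same pattern scales from a laptop to a cluster: only the master argument of spark_connect() changes, while the dplyr transformations and MLlib calls stay the same.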
Wednesday, September 6, 2017 by Mearl Colaco