A New Paradigm: R and Spark Tools for Data Science Workflows

The two-day "R and Spark: Tools for Data Science Workflows" course was introduced by NISS in September 2017. The course was delivered by data science expert E. James Harner, Professor Emeritus of Statistics and Adjunct Professor of Management Information Systems at West Virginia University. The first course was held at the American Statistical Association in Alexandria, Virginia, and the second at the University of California, Riverside campus. Upcoming R and Spark courses are scheduled at the H2O Headquarters in Mountain View, California, on February 22-23, 2018; in Toronto, Canada, on April 12-13, 2018; and in Washington, DC, on May 30-31, 2018.

Both R and Spark work primarily on rectangular data structures called data frames. R typically runs analyses on a single computer core, whereas Spark distributes its computation over hundreds or thousands of nodes. "As a result, Spark’s computational load and storage capacity are both very large, but the applicable algorithms are limited. On the other hand, R has a huge number of useful algorithms, but its storage capacity and computing resources are limited," says Harner, explaining how the R and Spark frameworks complement each other.

The R and Spark course held at the University of California, Riverside campus was open to the public, but many of the attendees were students, because the course was offered at a discounted rate to make it affordable for them. In addition, UCR's faculty partly supported the registration fees for UCR students. "The course is beneficial to multiple audiences," says Dan Jeske, Professor of Statistics at the University of California, Riverside. He adds, "The course is valuable to statisticians and data scientists trying to understand the distinction between statistics and data science; to students about to enter the workforce; and to employers who want to encourage their in-house data analysts toward a valuable form of continuing education." The course is also valuable to people from a non-statistics background, especially those working in a substantive area that uses statistics. The only prerequisite is that attendees should have some statistical training and a basic knowledge of R.

Big Data and Data Science in statistics
There is a lot of emphasis on, and talk about, Big Data and Data Science in the statistical world today. While Big Data is a catchphrase for any type of data (sensors, images, videos, natural language), Data Science is the study of Big Data workflows. Explaining the life cycle of extracted or captured data, Harner says, "First data is extracted from a source. It is then cleaned and transformed into a structured form (perhaps with preliminary analyses) before various machine learning algorithms are applied. The workflow proceeds to graphics and reproducible documents and perhaps to the creation of a data product before cycling back to the source for more data."

Takeaway from the R and Spark course
The course presents a new paradigm of thinking about the analysis of data. The emphasis is on building workflows from data sources to data products with the use of prediction metrics rather than p-values (probability values) as the measure of success.
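
To make the contrast with p-values concrete, the following is a minimal sketch, not taken from the course materials, of judging a model by its prediction error on held-out data. It assumes a simple least-squares line and root-mean-square error (RMSE) as the prediction metric; the function names (fit_line, rmse, holdout_rmse) and the synthetic data are hypothetical and chosen only for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(actual, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def holdout_rmse(xs, ys, train_fraction=0.8, seed=0):
    """Fit on a random training split and score on the held-out rest."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    rng.shuffle(indices)
    cut = int(len(indices) * train_fraction)
    train, test = indices[:cut], indices[cut:]
    slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
    predictions = [slope * xs[i] + intercept for i in test]
    return rmse([ys[i] for i in test], predictions)


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): y = 2x + 1 plus noise.
    rng = random.Random(42)
    xs = [float(i) for i in range(100)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
    print("held-out RMSE:", round(holdout_rmse(xs, ys), 3))
```

Here success is measured by how well the fitted line predicts observations it has not seen, rather than by a significance test on its coefficients.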

"The course helps participants, especially students, to make a shift in the way they think about data," says Harner. As an example, he says that participants can start to move from a procedural to a functional style of programming, from a single file system to distributed storage models, and to a distributed computational model in which the computation is brought to the data rather than big data being moved to the computation.
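
The shift from procedural to functional style described above can be sketched in a few lines. This is an illustrative example only, written in plain Python rather than in R or Spark, and all function names are hypothetical. The last function mimics, within a single process, the idea of computing partial results where each partition of the data lives and combining only the small partial results.

```python
from functools import reduce


def sum_of_even_squares_procedural(values):
    """Procedural style: an explicit loop that mutates an accumulator."""
    total = 0
    for v in values:
        if v % 2 == 0:
            total += v * v
    return total


def sum_of_even_squares_functional(values):
    """Functional style: compose filter, map, and reduce without mutation."""
    evens = filter(lambda v: v % 2 == 0, values)
    squares = map(lambda v: v * v, evens)
    return reduce(lambda a, b: a + b, squares, 0)


def sum_of_even_squares_partitioned(partitions):
    """Apply the same function to each partition, then combine the partials.

    In a distributed system each partition would sit on a different node;
    here the partitions are just in-memory lists (an assumption made to keep
    the sketch self-contained).
    """
    partials = [sum_of_even_squares_functional(p) for p in partitions]
    return reduce(lambda a, b: a + b, partials, 0)


if __name__ == "__main__":
    data = list(range(1, 11))
    print(sum_of_even_squares_procedural(data))
    print(sum_of_even_squares_functional(data))
    print(sum_of_even_squares_partitioned([data[:3], data[3:6], data[6:]]))
```

Because the functional version has no shared mutable state, the same function can be applied to each partition independently, which is what makes the partitioned form straightforward.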

The course is taught in a virtual environment called rspark. The rspark environment contains R, Hadoop, Spark, a relational database and many other data technologies, and it can be deployed locally on a student’s computer or on the Amazon Web Services cloud, so that participants get hands-on training. "In fact, rspark was built to teach courses like this one," says Harner, adding that it gives students a learning environment that operates identically to actual large-scale computer clusters.

Upcoming R and Spark courses:

  1. Mountain View, California, February 22-23, 2018
  2. Toronto, Canada, April 12-13, 2018
  3. Washington, DC, May 30-31, 2018

Wednesday, January 17, 2018 by Mearl Colaco