October 21, 2020
Lee Wilkinson (H2O and University of Illinois at Chicago) led the second of ten tutorials in the NISS Essential Data Science for Business series, which covers the top ten analytics topics used in business today. Students and faculty, please note: these are perhaps the ten most important and practical topics that may not be covered in your program of study. (Review the Overview Presentation about all 10 Sessions).
If there is just one person you would want to talk to about understanding data and the role that graphic visualization plays, Dr. Wilkinson is that person! This tutorial session included example after example from all sorts of contexts, in which Lee pointed out the pros and cons of each rendering based on its purpose and the understanding it provides readers. Through these examples Lee made it clear that, while many may disregard or not understand the small details that go into a representation, even small nuances can have big implications.
The Topics
Lee organized the tutorial session into three sections. In the first section he focused on Data. Understanding data, and how it is organized and processed, is clearly the foundation upon which so much is built, especially when it comes to really big data. With millions or billions of rows and millions of differing features, a clear understanding of where the data are coming from is critically important. Sketching, principal components, random projections and manifold learning are just some of the topics that he touched on.
In the second section Lee focused his remarks on Visualization. Is “a picture worth a thousand words?” Is it true that “less is more?” Neither is true for Lee, and he proceeded to demonstrate exactly why these statements are problematic. Color? 3D? When do these attributes become important? How should you handle examples with one variable, two continuous variables, mixed variable types, etc.? Does your data represent space or time? What works well in different circumstances? What doesn’t? Why? Mosaic plots, heatmaps, scatterplot matrices, scagnostics, sketch graphics … There are so many options and Lee provided insight into so many of these.
In the third section Lee split his focus between visualization for Models and visualization for Machine Learning. He reviewed how models are used in exploring data or, as Tukey’s book labeled it early on, Exploratory Data Analysis. Lee first reviewed various approaches to summaries, transformations and smoothing, and from there moved to alternatives for handling qualitative data analysis. This first step led naturally to talking about summarizing data and making inferences, and the role that graphs might play in each of these steps. Lee concluded the session by discussing machine learning methods, both supervised and unsupervised, that look for patterns that persist across large collections of data. In doing so he touched on topics such as dimensionality, kernels, bagging, boosting and ensembles, among many others, along with the principal scientists involved in this work.
Wow! This session certainly covered a wide range of topics and issues, but the constant thread throughout was the role of visualization of data.
Access to Materials
Once again, this NISS tutorial session was certainly filled to the brim with details. If you were not able to attend this live session you can still access a recording of the session along with links to the slides that Lee used during this session. Use the Registration Option "Post Session Access" on the event webpage, pay the $35 fee, and NISS will provide you with access to the materials for this session. Or register for the full series of ten tutorials, and NISS will provide all the links as well.
What’s Up Next?
The next NISS Essential Data Science for Business tutorial is scheduled to take place on Wednesday, November 4, 2020. Yanling Zuo (Minitab) will be the instructor for the next topic, “Predictive Analytics and Machine Learning.” Register today!
Further schedule dates/topics include:
November 18, 2020 - Victor Lo, Dominique Haughton & Jonathan Haughton: "Causal Inference and Uplift Modeling"
December 2, 2020 - Ming Li: "Deep Learning"