The Importance of Exploratory Data Analysis (EDA)
by Joel Branch, Ph.D., AI Architect
After years of claiming to be a great chef, I realized my cooking approach has been rather naïve. I’ve got some recipes memorized, but I’ve been “ignoring the food.” An expert chef has expertise in “preparing food,” not just following a recipe. Are the eggs fresh? Fresh eggs make “outstanding” cakes. The fire is right, but how does the steak sizzle? You don’t want the surface to cook much faster than the interior. A recipe may say one thing, but you may need to switch it up. The trick here is “knowing your food” before and during cooking.
Do you “know your data”? Exploratory data analysis (EDA) is essential before coding up an impressive deep learning algorithm. Unfortunately, EDA doesn’t get enough respect: people are impatient, and the end goal, deep learning, steals most of the spotlight. Thorough EDA helps you pick the right learning algorithm, sample the right data for training, and predict how often retraining might be needed, among other things. An exhaustive EDA discussion is beyond this blog’s scope, but let’s cover some general best practices.
Of foremost importance is understanding what data you have and don’t have. For example, you might have data for only select parts of the year or for only some user populations. You also need to be prepared to make a strong case for data that you need but that partners are hesitant or unwilling to release (they don’t teach you this in deep learning courses). Having an overall understanding of data availability will help determine to what extent a learning solution can even be applied because, guess what, bias is a killer: a model trained on data with gaps like these inherits those gaps.
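To make the "parts of the year" example concrete, here is a minimal sketch of a coverage check. It assumes each record carries a date; the function name `month_coverage` and the sample dates are hypothetical, made up purely for illustration, and the code sticks to Python's standard library.

```python
from collections import Counter
from datetime import date


def month_coverage(dates):
    """Count records per calendar month (1-12) and list months with no data."""
    counts = Counter(d.month for d in dates)
    # Looking up an absent key in a Counter returns 0 without inserting it.
    missing = [m for m in range(1, 13) if counts[m] == 0]
    return counts, missing


# Hypothetical sample: records exist only for January through March.
sample_dates = [
    date(2023, 1, 5),
    date(2023, 1, 20),
    date(2023, 2, 11),
    date(2023, 3, 2),
]

counts, missing = month_coverage(sample_dates)
print("records per month:", dict(sorted(counts.items())))
print("months with no data:", missing)
```

The same idea carries over to user populations: swap the month for a region, age band, or customer segment, and compare the counts against what you expected to see.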
You also need to determine the quality of data at your disposal. Are there missing attributes, noisy sensor readings, data duplications, etc.?
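A first pass at those three questions might look like the sketch below. The row layout, the field names (`sensor`, `temp_c`), and the threshold are assumptions for illustration only. Flagging readings by their distance from the median, measured in units of the median absolute deviation (MAD), is just one simple heuristic for noisy values, not the only option.

```python
from collections import Counter
from statistics import median


def quality_report(rows, numeric_field, mad_threshold=5.0):
    """Summarize missing values, exact duplicate rows, and outlying readings."""
    # Missing attributes: count None values per field.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None:
                missing[key] += 1

    # Duplicates: rows whose full contents have already been seen.
    seen = set()
    duplicates = 0
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)

    # Noisy readings: values far from the median, measured in MAD units.
    values = [r[numeric_field] for r in rows if r[numeric_field] is not None]
    outliers = []
    if values:
        med = median(values)
        mad = median(abs(v - med) for v in values)
        if mad > 0:
            outliers = [v for v in values if abs(v - med) / mad > mad_threshold]

    return {"missing": dict(missing), "duplicates": duplicates, "outliers": outliers}


# Hypothetical sensor rows, invented for this example.
rows = [
    {"sensor": "a", "temp_c": 21.0},
    {"sensor": "a", "temp_c": 21.5},
    {"sensor": "b", "temp_c": None},   # missing reading
    {"sensor": "b", "temp_c": 20.5},
    {"sensor": "a", "temp_c": 21.0},   # exact duplicate of the first row
    {"sensor": "c", "temp_c": 95.0},   # suspicious spike
    {"sensor": "c", "temp_c": 22.0},
]

report = quality_report(rows, "temp_c")
print(report)
```

A report like this only tells you where to look. Whether a spike is sensor noise or a real event is a question for whoever knows the system that produced it.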
Beyond developing effective, reusable data cleaning pipelines, you may need to take time to understand the data’s source. Large, well-established enterprises (many of which are anxious for AI transformations) usually have no shortage of esoteric systems generating the data intended to fuel AI solutions. Find out all you can about the systems and processes generating your data as soon as possible. A data source modernization effort might be needed before embarking on AI solutions.
Before proceeding much further, you also need to establish a clear data security solution. Today’s press is riddled with stories about unauthorized data access and analysis. Knowing and controlling who should have access to which features of a dataset is critical for enforcing data privacy controls and policies. Data-level security also sets expectations about what solutions can be realized by applying deep learning (e.g., how specific the output of a recommendation engine can be).
Finally, get dirty with the data. Understanding things like the relationships among data features and how categories are distributed in multi-dimensional space is essential for determining what types of learning models should be considered (or whether you need AI at all) and what further data cleaning is required. In the world of big data, understanding the tradeoffs between the efficacy and the performance of sampling techniques may be critical for early prototyping, especially when selecting data slices for training, testing, and validation. Become familiar with the capabilities of data visualization tools and frameworks. Depending on your problem domain, you might even end up developing a novel way to visualize your data.
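As one illustration of selecting data slices, the sketch below splits records into training, validation, and test sets while keeping each label's share roughly the same in every slice (stratified sampling). The function name `stratified_split`, the 60/20/20 fractions, and the toy "ok"/"fault" records are all assumptions made up for this example. It is one simple technique among many, not a recommendation for every dataset.

```python
import random
from collections import defaultdict


def stratified_split(records, label_of, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split records into train/validation/test with similar label proportions."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable

    # Group records by label so each group can be split on its own.
    by_label = defaultdict(list)
    for rec in records:
        by_label[label_of(rec)].append(rec)

    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Hypothetical imbalanced dataset: 80 "ok" records and 20 "fault" records.
records = [("ok", i) for i in range(80)] + [("fault", i) for i in range(20)]

train, val, test = stratified_split(records, label_of=lambda rec: rec[0])
for name, part in (("train", train), ("validation", val), ("test", test)):
    faults = sum(1 for rec in part if rec[0] == "fault")
    print(name, "size:", len(part), "faults:", faults)
```

A plain random split is cheaper and often fine for large, balanced data. With a rare class, though, it can leave a slice with too few examples of that class, which is the kind of tradeoff worth discovering during prototyping and not after training.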
So, today I’m a more confident chef. I’m also more confident in my approach to applying deep learning because I’ve learned to spend more resources on knowing my data and on convincing other stakeholders of the importance of this preliminary phase on the way to AI greatness.