How to Avoid the Pitfalls of Data Discovery
by Dr. David Bauer, CTO and Co-Founder, Lucd
What are the most common pitfalls of data discovery and classification, and how can you avoid them?
Data discovery is a user-driven, iterative process of finding patterns and outliers in data. It is not a 'tool', though tools can aid in discovering relationships and models within the data. One of the most common pitfalls in data discovery is tooling that is poorly suited to business experts. Traditional tools in this area may require coding, as with Python notebooks, or may overwhelm the user with many uncorrelated charts and graphs. Reporting dashboards today are overly focused on packing a set of graphs onto a single web page, leaving it to the user to determine whether there are any correlations in the data or between models. A term we coined for this in 2009 is "widget vomit".
A successful approach to data discovery relies on providing capabilities for data fusion and unification across multiple sources of both internal and external enterprise data. Data fusion relies on entity resolution analytics to compose objects that are pre-integrated, giving the user a more complete sense of the information available. Good entity resolution analytics can determine relationships between data objects and models across structured, unstructured and semi-structured data. Allowing the computer to precompile entities across these multiple data sources provides the business user with a much more complete view of the information available, while alleviating the enormous task of disambiguating details across hundreds or even thousands of data sources.
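As an illustration of the entity resolution step described above, the sketch below pairs records from two hypothetical sources (a CRM and a billing system, names and a similarity threshold chosen purely for this example) when their name fields are similar enough, and merges each matched pair into one fused entity. Production entity resolution is far richer than this; the point is only how fusion yields objects that carry fields from every source.

```python
from difflib import SequenceMatcher

# Hypothetical records describing the same company in two sources;
# each source knows details the other lacks.
crm = [{"name": "Acme Corp", "phone": "555-0100"}]
billing = [{"name": "ACME Corp Inc", "address": "1 Main St"}]

def similarity(a, b):
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(left, right, key="name", threshold=0.7):
    """Pair records whose key fields are similar enough, then merge
    each pair into a single fused entity (left's values win on conflict)."""
    fused = []
    for l in left:
        for r in right:
            if similarity(l[key], r[key]) >= threshold:
                merged = dict(l)
                for k, v in r.items():
                    merged.setdefault(k, v)
                fused.append(merged)
    return fused

entities = resolve(crm, billing)
# Each fused entity now carries both the phone and the address,
# even though neither source held both.
```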
Data classification is made more effective by robust, fused data objects. Sorting fused entities becomes simpler when all of the available features are present on a single, more complete data object. This eliminates the need to construct complex queries and categorizations across multiple, disparate data sets.
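To make the classification point concrete, here is a minimal sketch (the entity fields and rule thresholds are invented for illustration): because a fused entity already carries features that originally lived in separate systems, classifying it reduces to a simple predicate on one object rather than a join across data sets.

```python
# Hypothetical fused entities: each already carries fields that
# would otherwise live in separate systems (CRM, billing, support).
fused_entities = [
    {"name": "Acme Corp", "annual_spend": 250_000, "open_tickets": 1},
    {"name": "Beta LLC", "annual_spend": 8_000, "open_tickets": 7},
]

def classify(entity):
    """Classify a fused entity with plain rules; no cross-source
    queries are needed because every feature is already present."""
    if entity["annual_spend"] > 100_000 and entity["open_tickets"] < 3:
        return "strategic"
    if entity["open_tickets"] >= 5:
        return "at-risk"
    return "standard"

labels = {e["name"]: classify(e) for e in fused_entities}
```

Against unfused data, the same rule would require correlating spend and ticket counts across two systems before any label could be assigned.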
The majority of data management platforms today do not implement data fusion, and so miss the opportunity to provide the business user with a more complete, pre-constructed view of the data landscape -- which places more of the work on the analyst. The task of extraction, correlation, categorization and disambiguation across billions of records from thousands of sources is precisely what we have built at Lucd, in order to generate analytic models with much greater accuracy. Data fusion improves model accuracy by reducing how much data must be "filled" with synthetic values, aggregated away (losing resolution), or dropped because of empty values. In the Lucd platform, data fusion automatically creates more complete data objects, which leads to more accurate and complete analytic results.
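The claim that fusion reduces synthetic filling can be shown with a small sketch (the field names and values are hypothetical): two unfused records of the same customer each have a gap that would normally be imputed or cause the row to be dropped, while the fused object needs neither.

```python
# Hypothetical records of one customer in two unfused sources;
# each has a gap the other can fill.
source_a = {"id": 1, "age": 42, "income": None}
source_b = {"id": 1, "age": None, "income": 75_000}

def count_missing(record):
    """Number of fields that would need synthetic filling or dropping."""
    return sum(1 for v in record.values() if v is None)

def fuse(a, b):
    """Prefer a real value from either source over a null."""
    return {k: a.get(k) if a.get(k) is not None else b.get(k)
            for k in a.keys() | b.keys()}

fused = fuse(source_a, source_b)
# Each unfused record has one missing value; the fused object has none,
# so no synthetic imputation or row-dropping is required downstream.
```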