Overcoming Common Pitfalls to Data Discovery and Classification
18 DATA SCIENTISTS & SECURITY PROS REVEAL THE MOST COMMON PITFALLS TO DATA DISCOVERY AND CLASSIFICATION
Original article by: Ellen Zhang
A data classification process makes it easier to locate and retrieve data, and it's an important process for any data security program, as well as for risk management and compliance. By leveraging tools that can automatically locate and identify your sensitive data, companies can gain a deeper understanding of what data they possess, where it exists within the organization, and how sensitive it is, allowing them to apply the appropriate level of security to protect the company's most sensitive information.
Despite its importance, many companies struggle with the data discovery and classification process. To gain some insight into the most common pitfalls companies face when it comes to discovering and properly classifying data, we reached out to a panel of data scientists and security leaders and asked them to answer this question:
"WHAT ARE THE MOST COMMON PITFALLS TO DATA DISCOVERY AND CLASSIFICATION AND HOW CAN YOU AVOID THEM?"
DR. DAVID BAUER
Dr. David Bauer is the CTO at Lucd, an enterprise AI company, and has been a leader in Big Data and distributed computing in the U.S. intelligence community since 2005. He has pioneered code that has executed across two million processors and developed the first cloud and Big Data platform Certified & Accredited for use in the Federal Government for highly sensitive and classified data.
"Data discovery is a user-driven and iterative process of discovering patterns and outliers in data..."
It is not a 'tool,' though tools may aid in the discovery of relationships and models within the data. One of the most common pitfalls in data discovery is tools that are not well suited to business experts. Traditional tools in this area may require coding, as with Python notebooks, or overwhelm the user with many uncorrelated charts and graphs. Reporting dashboards today are overly focused on presenting a set of graphs on a single web page, leaving it up to the user to determine whether there are any correlations in the data or between models. A term we coined for this in 2009 is "widget vomit."
A successful approach to data discovery relies on providing capabilities for data fusion and unification across multiple sources of both internal and external enterprise data. Data fusion relies on entity resolution analytics to compose objects that are pre-integrated and give the user a more complete sense of the information available. Good entity resolution analytics can determine relationships between data objects and models in structured, unstructured, and semi-structured data. Allowing the computer to precompile entities across these multiple data sources gives the business user a much more complete view of the information available, while alleviating the enormous task of disambiguating details across hundreds or even thousands of data sources.
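To make the entity-resolution idea concrete, here is a minimal sketch that fuses records from two hypothetical sources (a CRM and a support system) into one pre-integrated object, matching on a normalized email key. The source names, field names, and matching rule are illustrative assumptions, not Lucd's actual analytics.

```python
from collections import defaultdict

def normalize_email(email):
    """Lower-case and strip whitespace so trivially different spellings match."""
    return email.strip().lower()

def fuse(sources):
    """Group records from all sources under one resolved entity key and
    merge their fields into a single pre-integrated object."""
    entities = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            key = normalize_email(record["email"])  # assumed join key
            entities[key].update(record)            # later sources fill in missing fields
            entities[key].setdefault("_sources", set()).add(source_name)
    return dict(entities)

# Two sources that each hold a partial view of the same person.
crm = [{"email": "Ada@Example.com", "name": "Ada Lovelace"}]
support = [{"email": "ada@example.com ", "ticket_count": 3}]

fused = fuse({"crm": crm, "support": support})
entity = fused["ada@example.com"]
# The fused object now carries fields from both sources at once.
```

Real entity resolution would use richer matching (names, addresses, fuzzy similarity) and conflict-resolution rules, but the shape is the same: disambiguation is done once, up front, instead of by every analyst.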
Data classification is made more effective with robust, fused data objects. Sorting fused entities becomes more effective when all of the available features are consolidated within more complete data objects. This eliminates the need to construct complex queries and categorizations across multiple, disparate data sets.
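To illustrate why classification gets simpler over fused objects, the following hypothetical classifier assigns a sensitivity label by scanning the fields of a single fused entity, rather than querying several disparate data sets. The labels and the SSN pattern are assumptions for illustration only.

```python
import re

# Illustrative sensitivity rule; real programs use far richer detectors.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def classify(entity):
    """Label one fused data object by inspecting all of its features at once."""
    text = " ".join(str(value) for value in entity.values())
    if SSN_RE.search(text):
        return "restricted"   # contains a Social Security number
    if "email" in entity:
        return "internal"     # contains personal contact data
    return "public"

# A fused entity carries every feature, so one pass is enough.
fused_entity = {"email": "ada@example.com", "notes": "SSN 123-45-6789 on file"}
label = classify(fused_entity)
```

Without fusion, the SSN and the email might sit in different systems, and this decision would require a cross-system join instead of a single scan.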
The majority of data management platforms today do not implement data fusion, and so miss the opportunity to give the business user a more complete, constructed view of the data landscape, which places more of the work on the analyst. The task of extraction, correlation, categorization, and disambiguation across billions of records from thousands of sources is precisely what we have built at Lucd, in order to generate analytic models with much greater accuracy. Data fusion improves model accuracy by reducing the amount of data that must be filled with synthetic values, aggregated away (losing resolution in the data), or dropped because of empty values. In the Lucd platform, data fusion automatically creates more complete data objects, which leads to more accurate and complete analytic results.
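The claim that fusion reduces synthetic fill and dropped records can be shown with a toy calculation: viewed separately, each source below is half-empty across the chosen fields, while the fused entity has no gaps left to impute or drop. The sources and fields are invented for illustration.

```python
def missing_fraction(records, fields):
    """Fraction of field values that are absent and would need synthetic fill
    (or force the record to be dropped) before model training."""
    total = len(records) * len(fields)
    missing = sum(1 for r in records for f in fields if r.get(f) is None)
    return missing / total

fields = ["name", "ticket_count"]

# Unfused: each source knows only half of the features for this entity.
crm = [{"name": "Ada Lovelace", "ticket_count": None}]
support = [{"name": None, "ticket_count": 3}]
separate = missing_fraction(crm + support, fields)      # half the values missing

# Fused: one complete object, nothing to impute or drop.
fused = [{"name": "Ada Lovelace", "ticket_count": 3}]
after_fusion = missing_fraction(fused, fields)          # no values missing
```

Less imputation and fewer dropped rows means the model trains on real values rather than synthetic ones, which is the mechanism behind the accuracy gain described above.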