TW: Data Completeness is a Farce
Data will never be complete as it is an inherent goal to obtain more, and better, information
It has been five years since The Economist published its iconic cover, "The world’s most valuable resource is no longer oil, but data." In hindsight, it captured an important juncture in the Digital Revolution and signaled a societal milestone. Not only would businesses be driven by data, but their very value would hinge on its application. Industry giants and businesses with aspirations of influence have since been wrangling, centralizing, and securing their data. The digital divide does not exist only at the level of individuals; it also cuts across businesses, industries, sectors, regions, and countries.
Savvier organizations have been finding ways to go beyond surfacing business insights. They dynamically automate routine decisions within the business’s operations by piping their data into machine learning (ML) engines. In 1980, Michael Porter described an initial set of three generic strategies for businesses, and these strategies remain relevant when applying ML:
Cost leadership: being the low-cost producer
Differentiation: being unique on dimensions that customers value such as quality, personalization, etc.
Focus: tailoring products to a narrow segment of customers
Business leaders in organizations that were not early adopters, and that are now “catching up” with data, often encounter a seemingly insurmountable wall.
Picture it:
You’ve been in or recently joined an organization. For some time now, the business has been improving its databases and has even been growing this thing called a data lake. After a recent re-org, you’re now tasked with finding a way to make use of the data lake in your department. So, you bring a business analyst onto your team to parse through and make sense of the data. At some point, the analyst tells you the team is going to need a data scientist to experiment with finding a use for the data beyond regurgitating analysis and insights from it. Before you know it, it’s been a year and some change. While you have more insights from the data and you’ve got some models in the queue, none of the work has made it into production. The reason given is that the data lacks completeness.
How devastating: so much effort, time, and resources poured into collecting, cleaning, scrubbing, and analyzing the data, only to be told it's incomplete! Data completeness is one of the primary excuses business leaders may hear for not adopting, releasing, or leveraging ML sooner. Yet business decisions must still be made, and the landscapes businesses play in are ever-changing. So is data completeness obtainable?
The important thing here is to remember the value of data obtained in the “now.” It is the context we do have, and it reflects how businesses understand the landscape in the present. No landscape is a stagnant rendering like the ones we see in museums. Rather, landscapes are full of dynamism, competition, and vastness. As we all experienced in the wake of the pandemic, businesses must dynamically contend with whatever “now” is, ever learning and adapting to the changes in the landscape.
Are there gaps in the manner in which businesses collect data or even in the data lake?
Certainly.
Are these gaps opportunities to get more contextual information?
Surely.
Does the business have time to wait yet another year to have models in production? NO!
The truth is that data will never be complete, because any business inherently aims to keep obtaining more information that provides better context as decisions continue to be made.