
Anyone who has read my blogs or listened to me talk knows that I am all about the problem. Well,
in this blog let me examine what is not the problem. It is commonly understood that data
scientists spend between 75 and 85 percent of their time preparing data for
evaluation. First let me say that this is your job, data scientist, so if you are a data scientist, stop
whining about it! If you are no good at data preparation, you are no good as a data scientist, full stop.

In my opinion, the validity of any analysis rests almost completely on the data preparation. The
algorithm you end up using is close to irrelevant. Complaining about data preparation is
like being a farmer who complains about having to do anything but harvest, and who wants
somebody else to deal with the pesky watering, fertilizing, weeding, etc.

That said, data preparation can be made difficult by the way raw data is collected.
Designing a system that collects data in a form that is useful, easily digestible and expresses all
the conditions and data attributes needed to examine the defined problem is a high art.
Giving data scientists full transparency into exactly how the data flows into the system is another.
It involves processes that consider sampling, data annotation, matching, etc. It does not include
things like replacing missing values or excessive normalization. Creating an effective data
environment for data scientists needs to involve data scientists and cannot be entirely owned by
Engineering. Data scientists and Engineering are often NOT able to spec such system
requirements in sufficient detail to allow for a clean handover, so they need to work on this
together rather than in isolation from each other.

But in the bigger picture, there are more important things to consider. The biggest issue I see in
many organisations is data science solving problems that are not worth solving! This is a
huge waste of time and energy, and I have discussed it in great detail in my articles about
Problem Derived Innovation Analytics (PDIA℠). The reason is typically that whoever has the
problem lacks the data science understanding to even express the issue, so data scientists end up
solving whatever they understood the problem to be, ultimately creating a solution that is not
really helpful (and often far too complicated). A typical category is ‘under-defined’ tasks: “Find
actionable insights in this dataset!” Well – most data scientists do not know which actions can be
taken. They also do not know which insights are trivial and which are interesting. So there is
really no point sending them on a wild goose chase. The problems that are worth solving should be
aligned with the business strategy, be achievable within the data maturity level of the
organisation and have a positive, measurable business outcome such as efficiency savings or
revenue generation.

The “solving the wrong problem” issue is pervasive in part because data science is not sufficiently
involved in the decision process. Now – not every data scientist can or should be expected to be
able to shape the problem as well as the solution, but at least one data scientist on the team
should be part of the problem examination loop (PDIA℠). The bigger issue, however, is not a
lack of ability or willingness on the data science side (although there are indeed plenty who just
like to solve a cute problem, no matter how relevant) but often a corporate culture where
analytics, IT, etc. are considered an ‘execution’ function: management decides what is needed and
everybody else goes and does it. Let’s stop this way of thinking.

On an individual level, and given a worthwhile problem, we could blame a lack of data
understanding, data intuition and, finally, scepticism as the most limiting factors to efficiency.
These factors cause inefficiency not because it takes longer to get to an answer (in
fact, lacking all three typically leads to results much more quickly) but because it takes longer
to get to an (almost) right answer.

Problem definition and execution should be a cross-business capability, with any individual in the
organisation able to submit a problem to a robust Lean methodology of problem examination
and approval before the final algorithm is applied to
the problem to gain the insight (PDIA℠)!