Understanding the problem

When solving machine learning problems, it's important to take time beforehand to analyze both the data and the likely amount of work involved. This preliminary step is flexible and less formal than the subsequent ones on this list.

From the definition of machine learning, we know that our final goal is to make the computer learn or generalize a certain behavior or model from a sample set of data. So, the first thing we should do is understand the new capabilities we want the computer to learn.

In an enterprise setting, this is the time for practical discussions and brainstorming. The main questions to ask ourselves during this phase include the following:

What is the real problem we are trying to solve?
What is the current information pipeline?
How can I streamline data acquisition?
Is the incoming data complete, or does it have gaps?
What additional data sources could we merge in order to have more variables to hand?
Is the data released periodically, or can it be acquired in real time?
What should be the minimal representative unit of time for this particular problem?
Does the behavior I am trying to characterize change in nature, or are its fundamentals more or less stable over time?
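
Some of these questions can be checked directly against a sample of the data. The following is a minimal sketch, using only the Python standard library, of how one might look for gaps in a daily series and try a coarser unit of time. The records, the function names find_missing_days and weekly_totals, and the choice of ISO weeks are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical daily sales records: (day, units sold). All values are invented.
daily_sales = [
    (date(2024, 1, 1), 12),
    (date(2024, 1, 2), 9),
    (date(2024, 1, 4), 15),  # 3 January is missing
    (date(2024, 1, 5), 11),
    (date(2024, 1, 8), 14),  # 6 and 7 January are missing
]


def find_missing_days(records):
    """Return each calendar day between the first and last record with no entry."""
    seen = {day for day, _ in records}
    start, end = min(seen), max(seen)
    candidates = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in candidates if day not in seen]


def weekly_totals(records):
    """Sum daily values per (ISO year, ISO week), a coarser candidate time unit."""
    totals = defaultdict(int)
    for day, units in records:
        iso = day.isocalendar()
        totals[(iso[0], iso[1])] += units
    return dict(totals)


missing = find_missing_days(daily_sales)
weekly = weekly_totals(daily_sales)
print("Missing days:", [d.isoformat() for d in missing])
print("Weekly totals:", weekly)
```

Whether a missing day means "no sales" or "no data" is itself a business question, so a check like this only tells us where to start asking.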

Understanding the problem means acquiring business knowledge and examining all the valuable sources of information that could influence the model. Once these sources are identified, the next task is to generate an organized and structured set of values, which will be the input to our model.

Let's now look at an example of an initial problem definition and the thought process behind the initial analysis.

Let's say firm A is a retail chain that wants to predict the demand for a certain product on certain dates. This could be a challenging task because it involves human behavior, which has some non-deterministic components.

What kind of input data would be needed to build such a model? Of course, we would want the transaction listings for that kind of item. But what if the item depends on a commodity? If its price depends on the price of soybean or flour, the current and past harvest quantities could enrich the model. If the product is a mid-range item, current inflation and salary changes could also correlate with the current earnings.
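
As a rough illustration of enriching the transaction listings with external sources, the sketch below sums sales per month and attaches two external indicators to each month. It uses only the Python standard library. The figures, the indicator names, the monthly granularity, and the build_rows helper are all hypothetical and chosen only to show the idea.

```python
from datetime import date

# Hypothetical transaction listing for one product: (sale date, units sold).
transactions = [
    (date(2024, 1, 15), 40),
    (date(2024, 1, 28), 35),
    (date(2024, 2, 10), 50),
    (date(2024, 3, 5), 45),
]

# Hypothetical external indicators keyed by (year, month); all figures are invented.
flour_price_index = {(2024, 1): 101.5, (2024, 2): 103.0}  # March not yet published
inflation_pct = {(2024, 1): 0.4, (2024, 2): 0.6, (2024, 3): 0.5}


def build_rows(transactions, **indicators):
    """Sum units per month and attach each indicator; None marks a missing value."""
    monthly_units = {}
    for day, units in transactions:
        key = (day.year, day.month)
        monthly_units[key] = monthly_units.get(key, 0) + units

    rows = []
    for key in sorted(monthly_units):
        row = {"month": key, "units": monthly_units[key]}
        for name, series in indicators.items():
            row[name] = series.get(key)  # None when the source has a gap
        rows.append(row)
    return rows


rows = build_rows(transactions, flour_price=flour_price_index, inflation=inflation_pct)
for row in rows:
    print(row)
```

Keeping the gap visible as None, rather than silently filling it, leaves the decision about how to handle incomplete sources to a later cleaning step.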

Understanding the problem involves some business knowledge and an effort to gather all the valuable sources of information that could influence the model. In some sense, it is more of an art form, but this does not diminish its importance in the least.

Let's assume, then, that the basics of the problem have been analyzed, and that the behavior and characteristics of the incoming data and the desired output are clearer. The next task is to generate an organized and structured set of values that will be the input to our model. This group of data, after a process of cleaning and adapting, will be called our dataset.