Data has always been an essential resource. Alongside human curiosity and perseverance, it is one of the driving forces behind humanity's everlasting progress. And today, its importance cannot be overstated: almost every event that occurs in the world gets tracked and recorded.
But for all that data to be truly useful, whether for data analysis, machine learning, or simply for providing valuable insight into specific events we are tracking, it needs to be prepared. This is where data preprocessing comes into play. It is the process of taking raw data and transforming it into a format more understandable for humans and machines. The options are limitless, and the real mission is to understand what needs to be done to achieve our goals. Once we figure out the requirements, the pipeline needs to be split into steps, and once the steps are set, execution is usually pretty straightforward.
I want to note, especially when diving into an unexplored area, that a lot of this is based on trial and error. There will (and in my opinion should) be a lot of 'back to the drawing board' moments. Do not fear those because, usually, the greatest breakthroughs come from them. As Edison would say: "I have not failed 10,000 times; I've successfully found 10,000 ways that will not work."
Since we said a lot depends on the use case, let's create a specific situation for us to investigate and solve.
Setting the stage
We are running a web store application and storing every order/purchase in our Mongo database. We would like to prepare this data for analysis and, eventually, to create a recommendation engine that could potentially increase the sales we make on our website.
Consider that one of the entries in our database looks like this:
{
    "order_id": uuid,
    "user": {
        "id": uuid,
        "device": string,
        "geo": {
            "name": string,
            "surname": string,
            "address": string,
            "city": string,
            "country": string
        }
    },
    "datetime": timestamp,
    "items": [
        {
            "id": uuid,
            "quantity": integer,
            "size": string,
            "color": string
        }, ...
    ]
}
Without getting into too much detail about the recommendation engine, we can simply say that user and purchased item data needs to be cleaned up and prepared for the next steps.
Data Flattening
If this were a relational database, storing data in tabular format, we would be in a more straightforward position to start our data preprocessing. However, NoSQL databases, which store data as key-value maps, are rising in popularity. Since most standard machine learning algorithms require data in tabular format, there's one extra step to consider moving forward: data flattening.
It's the process of parsing and dissecting complex data types into primitive ones. We can achieve this by creating a function that will accept the data point (dictionary, object) containing said complex data types and return a flattened version. In our case:
def process(data_point):
    user_data = data_point.get("user")
    geo_data = user_data.get("geo")
    common_values = {
        "datetime": data_point.get("datetime"),
        "user_id": user_data.get("id"),
        "user_device": user_data.get("device"),
        "user_address": geo_data.get("address"),
        "user_city": geo_data.get("city"),
        "user_country": geo_data.get("country"),
    }
    return [
        {
            **common_values,
            "item_order": i,
            "item_id": item_data.get("id"),
            "item_quantity": item_data.get("quantity"),
            "item_size": item_data.get("size"),
            "item_color": item_data.get("color"),
        }
        for i, item_data in enumerate(data_point.get("items"))
    ]
which will return a list of dictionaries like this:
{
    "datetime": timestamp,
    "user_id": uuid,
    "user_device": string,
    "user_address": string,
    "user_city": string,
    "user_country": string,
    "item_order": integer,
    "item_id": uuid,
    "item_quantity": integer,
    "item_size": string,
    "item_color": string
}
Let's go through this function together. When we look at the data we have in our data point, we can see that we have some data related to the purchase specifically (id and timestamp), some data related to the user in a dictionary that needs to be parsed, and some item data represented as a list of dictionaries.
Our goal is to transform this data into a dictionary without any nesting. That means we can copy the purchase-specific data and extract the necessary data from the user dictionary. However, when it comes to the item data, every dictionary inside the list can be treated as its own data point. This is why we grouped the purchase and user data we want to keep (common_values in the code snippet above) and append that data to every item element from the list. For a data point containing five items, for example, this creates five rows of data. If this seems like bloating our dataset and adding complexity, I urge you to have a bit of patience. Everything will become apparent soon.
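To see the flattening in action, here is a self-contained sketch: it reproduces the same flattening logic and runs it on a hypothetical two-item order (all field values are made up for illustration). The two resulting rows share the common purchase and user values.

```python
# Hypothetical two-item order; every value here is made up for illustration.
order = {
    "order_id": "o-1",
    "user": {
        "id": "u-1",
        "device": "mobile",
        "geo": {"address": "1 Main St", "city": "Zagreb", "country": "Croatia"},
    },
    "datetime": "2022-03-14T13:30:00",
    "items": [
        {"id": "i-1", "quantity": 2, "size": "M", "color": "red"},
        {"id": "i-2", "quantity": 1, "size": "L", "color": "blue"},
    ],
}

def process(data_point):
    user_data = data_point.get("user")
    geo_data = user_data.get("geo")
    # Purchase- and user-level data, shared by every row we emit.
    common_values = {
        "datetime": data_point.get("datetime"),
        "user_id": user_data.get("id"),
        "user_device": user_data.get("device"),
        "user_address": geo_data.get("address"),
        "user_city": geo_data.get("city"),
        "user_country": geo_data.get("country"),
    }
    # One flat row per item, with the common values repeated.
    return [
        {
            **common_values,
            "item_order": i,
            "item_id": item_data.get("id"),
            "item_quantity": item_data.get("quantity"),
            "item_size": item_data.get("size"),
            "item_color": item_data.get("color"),
        }
        for i, item_data in enumerate(data_point.get("items"))
    ]

rows = process(order)
print(len(rows))           # 2 rows: one per purchased item
print(rows[0]["user_id"])  # u-1 -- the user data is repeated on every row
```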
However, there are ways to reduce the amount of data we send over to the following processing step. If there is some information we're certain we won't be able to use, we can drop it at this step. For example, the purchase ID serves no purpose in data analysis, nor will it improve our machine learning model. One possible use for this field is backtracking, i.e., linking a specific data point to a particular purchase, but as our dataset gets larger and larger, it only shrinks in significance.
We can also discard very user-specific data, like names and surnames. The likelihood that all (or most, at least) Johns and Marys have the same shopping habits is very low. In our case, the address also seems too specific, but that's also up for discussion.
After transforming our collection of complex, nested data points into a list of flat, simple dictionaries, we convert it to a tabular format, where every element of the list becomes a row and every key of the resulting dictionaries becomes a column.
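One way to do that conversion is with pandas (an assumption on my part; any tabular library would work): pandas.DataFrame accepts a list of flat dictionaries directly, turning keys into columns and list elements into rows. A minimal sketch with made-up values:

```python
import pandas as pd

# Two flattened rows, as produced by the flattening step (values are made up).
flattened = [
    {"datetime": "2022-03-14T13:30:00", "user_id": "u-1", "user_city": "Zagreb",
     "item_order": 0, "item_id": "i-1", "item_quantity": 2},
    {"datetime": "2022-03-14T13:30:00", "user_id": "u-1", "user_city": "Zagreb",
     "item_order": 1, "item_id": "i-2", "item_quantity": 1},
]

df = pd.DataFrame(flattened)  # keys -> columns, list elements -> rows
print(df.shape)  # (2, 6): two rows, six columns
```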
Feature Selection
Next up is feature selection. It is the process of selecting a subset of relevant features from our dataset. In this step, we should choose and isolate the features that best describe our data, and some in-depth knowledge of the topic we are researching helps here. Proper feature selection improves model training, scoring, and performance.
For further analysis, we should split our dataset into user, item, and user-item datasets. Our current dataset can serve as our user-item dataset, and it would have these columns:
datetime
user_id
user_device
user_address
user_city
user_country
item_order
item_id
item_quantity
item_size
item_color
For our user dataset, we could select these columns:
datetime
user_id
user_device
user_address
user_city
user_country
and drop duplicate rows, since flattening repeated the purchase and user data on every item row.
This leaves us with the item dataset, which would contain these columns:
datetime
item_order
item_id
item_quantity
item_size
item_color
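Assuming the data already lives in a pandas DataFrame (an assumption; the column names below match the article, the values are made up), the split into user and item datasets is just column selection plus de-duplication:

```python
import pandas as pd

# A tiny user-item dataset: one purchase with two items (values made up).
user_item = pd.DataFrame([
    {"datetime": "2022-03-14T13:30:00", "user_id": "u-1", "user_device": "mobile",
     "user_city": "Zagreb", "user_country": "Croatia",
     "item_order": 0, "item_id": "i-1", "item_quantity": 2},
    {"datetime": "2022-03-14T13:30:00", "user_id": "u-1", "user_device": "mobile",
     "user_city": "Zagreb", "user_country": "Croatia",
     "item_order": 1, "item_id": "i-2", "item_quantity": 1},
])

user_cols = ["datetime", "user_id", "user_device", "user_city", "user_country"]
item_cols = ["datetime", "item_order", "item_id", "item_quantity"]

# Flattening repeated the user data once per item, so drop the duplicates here.
users = user_item[user_cols].drop_duplicates().reset_index(drop=True)
items = user_item[item_cols]

print(len(users))  # 1 -- the two item rows collapse to one user record
print(len(items))  # 2 -- one row per purchased item
```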
A feature is usually unnecessary when all or most records share the same value for it, when its importance is low, or when it makes no difference in the prediction outcome.
Reformatting
Next up is reformatting. It is the process of converting available features from one format to another, usually a more usable one. We can see this happening with dates, times, geolocations, and nested features.
For example, let's consider the following timestamp:
2022-03-14, 13:30:00
Out of that one piece of data, we can extract the following (ignoring the obvious, that is, year, month, date, etc.):
Day of week: Monday
Time of day: Afternoon
Those data bits can give us additional insight into our customers' shopping habits.
In case we're storing our timestamps in UNIX (epoch) format, the number of features we need to extract increases, since even the ones we considered evident in the previous example (year, month, day, etc.) must be derived explicitly from the raw number.
Remember that this step can be folded into the data flattening step.
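A sketch of the extraction using Python's standard datetime module; the time-of-day cut-offs are an arbitrary choice made for illustration:

```python
from datetime import datetime, timezone

ts = datetime(2022, 3, 14, 13, 30)  # 2022-03-14, 13:30:00

day_of_week = ts.strftime("%A")  # "Monday"

# Bucket the hour into a coarse time-of-day feature (cut-offs are arbitrary).
def time_of_day(hour):
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"

print(day_of_week, time_of_day(ts.hour))  # Monday Afternoon

# From a UNIX (epoch) timestamp, even the "obvious" parts must be derived:
epoch = 1647264600  # seconds since 1970-01-01 UTC
utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(utc.year, utc.month, utc.day)  # 2022 3 14
```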
Conclusion
With these steps in place, we should have three separate datasets of prepared data ready for future processing. But our data preprocessing journey has only started. There are still a couple of steps we need to take to get our data ready for analysis, and after that, to our machine learning algorithms. In the second part of this article, we will get into more detail for every dataset specifically, so keep an eye out.


