This year’s Valentine’s Day has long since passed. Even so, the Dashmote office continues to make matches! Data matching, that is. We’ve asked two of our data scientists, Matthieu and Laura, to explain how the process of matching data works.
Q: What’s the main challenge you face in doing your job?
LAURA: Definitely dealing with large sets of data. This presents our team with a variety of challenges, and one of those challenges is making sure these sets of data are clean and flawless.
MATTHIEU: I agree. At Dashmote, most of the data we work with is unstructured. As any data scientist knows, structured data is relatively reliable and comfortable to work with, whereas unstructured data is often messy. We see a lot of unstructured data in the real world, and it’s a whole different ball game when it comes to sorting out meaningful insights.
LAURA: Also, these data sets come from multiple sources, which makes it even more difficult to match and merge them. No two sources apply the same identification rules, so we can’t rely on exact matches. Merging under these conditions is what we call “fuzzy merging”.
When we have to match data from different sources, we look at the “distance” between words: we turn them into numbers and check for similarities. An example? Say the name of a dish is spelled slightly differently in English than in Italian. Only a few characters differ between the two words, so the distance between them is small and we can treat them as a match.
Clearly, “fuzzy merging” is challenging, because some words are similar in terms of characters but different in meaning. So you keep having to reconsider the threshold at which the fuzzy matching takes place, to avoid making silly mistakes.
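To make the threshold problem concrete, here is a minimal Python sketch. It is an illustration only, not Dashmote’s actual code: it uses the similarity ratio from Python’s standard `difflib` module as a stand-in for whatever distance measure the team uses, and the function names, example words and threshold values are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a character-based similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_fuzzy_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two strings as the same item when their score reaches the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return similarity(a, b) >= threshold


# Same dish, English vs. Italian spelling: only one character differs.
print(is_fuzzy_match("lasagna", "lasagne"))

# A pizza and a cocktail: similar characters, different meaning.
print(is_fuzzy_match("margherita", "margarita"))
```

With this measure, “lasagna” and “lasagne” score roughly 0.86, while “margherita” and “margarita” score roughly 0.84. A threshold of 0.8 would wrongly merge the pizza and the cocktail, which is why the threshold has to be revisited as new cases turn up.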
Q: So what’s the art of data matching?
MATTHIEU: Let’s take an example of outlet ‘A’ and outlet ‘B’. The associated data comes from two different sources, but we humans can easily conclude that it’s the same place. Same address, same phone number, and – although in a different order – also the same name!
For a machine, though, this isn’t quite as easy. Our machine isn’t a regular of ABC’s Pizzeria, and it doesn’t live in Amsterdam. So we need to help our machine figure out that these two outlets are one and the same.
LAURA: Yes, it’s a process made of different steps. First of all, we standardize the outlet information using a tool that finds common denominators.
In addition, we use ‘regex patterns’ to keep relevant information while removing unnecessary characters. A regex (regular expression) is a sequence of characters that defines a search pattern. Usually this pattern is used by string searching algorithms. As a result, we end up with two identical outlet names written in lowercase, ‘abc’ and ‘abc’.
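A rough sketch of such a name normalization step in Python, using the standard `re` module, could look like the following. The list of generic words and the exact patterns are assumptions made for illustration; the interview does not say which patterns are used in practice.

```python
import re

# Hypothetical list of generic words that carry no identifying information.
GENERIC_WORDS = {"pizzeria", "restaurant", "cafe", "bar"}


def normalize_name(name: str) -> str:
    """Reduce an outlet name to a lowercase, order-independent core."""
    name = name.lower()
    # Drop a possessive 's written with a straight or curly apostrophe.
    name = re.sub(r"['\u2019]s\b", "", name)
    # Replace anything that is not a letter, digit or whitespace with a space.
    name = re.sub(r"[^a-z0-9\s]", " ", name)
    # Remove generic words and sort the rest so that word order does not matter.
    tokens = [token for token in name.split() if token not in GENERIC_WORDS]
    return " ".join(sorted(tokens))


print(normalize_name("ABC's Pizzeria"))  # abc
print(normalize_name("Pizzeria ABC"))    # abc
```

Sorting the remaining words is one simple way to handle names that appear in a different order in each source.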
MATTHIEU: And this was the relatively easy part!
LAURA: Right, because the matching process of outlet addresses is a bit more complicated. An address can be written in a variety of ways. Every single country has its own way of writing down an address.
So, how do we fix it? Well, we developed our own API to structure it in a uniform manner. This structure doesn’t apply to all cities and countries, though. This is very important to take into consideration, because an address in New York will start with the number of the house, while in Tokyo, they use the block of a building instead of a street name to indicate a location. Our database should be able to hold both types of data without the potential to generate incorrect output.
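One way to hold both kinds of address in a single structure is to make the country-specific parts optional. The Python sketch below is only an assumption about what such a structure might look like; the field names are hypothetical and the example addresses are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Address:
    """A uniform address record; fields that do not apply stay None."""
    country: str
    city: str
    street: Optional[str] = None        # street-based systems, e.g. New York
    house_number: Optional[str] = None
    district: Optional[str] = None      # block-based systems, e.g. Tokyo
    block: Optional[str] = None
    building: Optional[str] = None

    def key(self) -> tuple:
        """Return a lowercase tuple that can be compared across sources."""
        parts = (self.country, self.city, self.street, self.house_number,
                 self.district, self.block, self.building)
        return tuple(part.strip().lower() if part else "" for part in parts)


new_york = Address("US", "New York", street="Example Avenue", house_number="123")
tokyo = Address("JP", "Tokyo", district="Example District", block="4", building="2")
print(new_york.key())
print(tokyo.key())
```

Because unused fields stay empty rather than being forced into a street-and-number shape, a Tokyo address never gets a made-up street name.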
MATTHIEU: Don’t forget the phone number. We need to take out anything that could mislead our machine when it checks whether both phone numbers are one and the same. Taking into account country-specific prefixes and other differences, we end up with the same phone number for both outlets.
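As a simplified illustration, phone normalization might be sketched in Python as follows. The default country code (31, the Netherlands) and the handling of the leading zero are assumptions that fit Dutch-style numbers; other countries follow different rules, and the numbers shown are made up.

```python
import re


def normalize_phone(raw: str, country_code: str = "31") -> str:
    """Convert a phone number to a bare international form such as +31201234567.

    Simplified sketch: assumes a national number starts with a single
    trunk zero and an international number starts with '+' or '00'.
    """
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if digits.startswith("00"):
        # International dialling prefix: 0031... becomes 31...
        digits = digits[2:]
    elif digits.startswith("0"):
        # National format: 020... becomes 3120...
        digits = country_code + digits[1:]
    # Drop a trunk zero written after the country code, as in +31 (0)20.
    prefix = country_code + "0"
    if digits.startswith(prefix):
        digits = country_code + digits[len(prefix):]
    return "+" + digits


print(normalize_phone("+31 (0)20 123 4567"))  # +31201234567
print(normalize_phone("020-1234567"))         # +31201234567
```

Once both numbers are in the same form, a plain equality check is enough to compare them.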
Q: So then we have a match?
MATTHIEU: So, we have normalized the name, address and phone number of the outlet. In order to see whether or not there’s a match, we then apply a statistical model to calculate the ‘distance’ between both outlets.
By distance, we mean the distance between the two addresses and the distance between both outlet names. What’s the distance between the names, you ask? Well, an example of this is the classic Levenshtein distance, which counts the single-character edits needed to turn one string into the other. The outlet names can also be translated into two vectors, which enables us to calculate the distance between both names. The figure below shows what this looks like in theory.
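For readers who want to see the Levenshtein distance spelled out, here is a standard dynamic-programming version in Python. The function names are our own, and the normalized similarity at the end is one common convention rather than something stated in the interview.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the insertions, deletions and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Scale the distance to a score between 0.0 (different) and 1.0 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("abc", "abc"))         # 0
```

A distance of zero means the normalized names are identical; the larger the distance relative to the name length, the less likely the two outlets are the same.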
In this case, the phone number is of less significance because outlets often use different phone numbers in different settings. Think of different branches or a small business that uses both the entrepreneur’s private and professional phone number.
Q: And then is it finally a match?
LAURA: Yes, then it’s a match! Our machine learning algorithm gives us a percentage indicating how likely it is that these two outlets are one and the same. Depending on the result, we either match the outlets or add them to our data set as two separate outlets. This way, unstructured data turns into structured data, and we keep on building our own uniform and clean database.
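The interview does not describe the model itself, so the following Python sketch is purely hypothetical. It uses a logistic function with hand-picked weights to show how name, address and phone signals could be combined into one percentage, with the phone number weighted lower as described above. In a real system the weights would be learned from labelled examples.

```python
import math
from dataclasses import dataclass


@dataclass
class PairFeatures:
    """Similarity signals for one pair of outlets, each already normalized."""
    name_similarity: float     # 0.0 to 1.0
    address_similarity: float  # 0.0 to 1.0
    phone_match: bool


# Hypothetical weights, chosen by hand for illustration only.
WEIGHTS = {"name": 4.0, "address": 4.0, "phone": 1.0}
BIAS = -5.0


def match_probability(features: PairFeatures) -> float:
    """Combine the signals into a probability with a logistic function."""
    score = (BIAS
             + WEIGHTS["name"] * features.name_similarity
             + WEIGHTS["address"] * features.address_similarity
             + WEIGHTS["phone"] * float(features.phone_match))
    return 1.0 / (1.0 + math.exp(-score))


def decide(features: PairFeatures, threshold: float = 0.5) -> str:
    """Merge the two outlets or keep them as separate entries."""
    return "merge" if match_probability(features) >= threshold else "add_separately"


same_place = PairFeatures(name_similarity=1.0, address_similarity=1.0, phone_match=False)
print(f"{match_probability(same_place):.0%}", decide(same_place))
```

Here two outlets with identical names and addresses are merged even though their phone numbers differ, which mirrors the point that a phone number alone should not block a match.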
MATTHIEU: So you see, here at Dashmote every day is Valentine’s Day!