Customer matching

Here are some general comments on matching as it applies to customers.

The advantage of recognizing the general pattern is that design decisions, database structures, sequential and parallel algorithms, trade-offs, etc., fall into place, as they have been well studied and documented in the past. Matching falls under the category of equivalence relations and classes. An equivalence relation is a relation that is reflexive, symmetric, and transitive.

1. The reflexive property is that A matches A.
2. The symmetric property is that if A matches B then B matches A.
3. The transitive property is that if A matches B and B matches C then A matches C.

An equivalence relation results in (i.e., induces) equivalence classes. There are known algorithms for the primary operations of equivalence classes, union and find. That is:

1. Find the class of an item.
2. Merge two classes into one class.
3. Find all the items in a class (related to #1).
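
The three operations above can be sketched with a small union-find (disjoint-set) structure. This is a minimal illustration, not a production design: the class name UnionFind, its method names, and the "kind:value" item strings in the usage note are all assumptions made for the example.

```python
class UnionFind:
    """Disjoint-set structure; each class stands for one equivalence class."""

    def __init__(self):
        self.parent = {}  # item -> parent item (a root is its own parent)
        self.size = {}    # root -> number of items in that root's class

    def add(self, item):
        """Register an item as its own one-member class if it is new."""
        if item not in self.parent:
            self.parent[item] = item
            self.size[item] = 1

    def find(self, item):
        """Operation 1: return the representative (root) of the item's class."""
        self.add(item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every item on the path directly at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a, b):
        """Operation 2: merge the classes of a and b; return the new root."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra  # attach the smaller class under the larger one
        self.parent[rb] = ra
        self.size[ra] += self.size.pop(rb)
        return ra

    def members(self, item):
        """Operation 3: list every item in the same class as the given item."""
        root = self.find(item)
        return sorted(x for x in list(self.parent) if self.find(x) == root)
```

For example, after uf.union("email:pat@example.com", "name:Pat Smith") and uf.union("name:Pat Smith", "phone:555-0100"), the call uf.find("phone:555-0100") returns the same representative as uf.find("email:pat@example.com"), which is transitivity at work. Note that members is a linear scan here; an implementation that lists classes often would keep a member set per root.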

In tracking people, the relation is a "belongs to" relation where each class is a unique person.

So a name "belongs to" the person, an email address "belongs to" a person, etc.

The find operation would be, "who does this email address belong to?".

A union operation would be the merging of two classes, as when what had previously been considered two separate people, each with a separate social network account, are at some point in time determined to be the same person.

This involves the transitivity property of the relation.

In group segmentation, the relation may be "belongs to an age group".

The find operation is "to what age group does a person belong" or "who are the members of this age group".

The union operation is to add a new person to an age group once that person's age is determined (e.g., from date of birth).
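
As a rough sketch of the segmentation case, the two find questions and the "add a person" step might look like the following. The age bands in AGE_GROUPS, the Segments class, and the function names are hypothetical choices for illustration only.

```python
from collections import defaultdict
from datetime import date

# Hypothetical age bands: (lowest age, highest age or None for open-ended, label).
AGE_GROUPS = [(0, 17, "under 18"), (18, 34, "18-34"), (35, 54, "35-54"), (55, None, "55+")]


def age_on(birth, today):
    """Age in whole years on the given day, computed from date of birth."""
    years = today.year - birth.year
    if (today.month, today.day) < (birth.month, birth.day):
        years -= 1  # birthday has not yet occurred this year
    return years


def age_group(age):
    """Map an age to the label of the band that contains it."""
    for low, high, label in AGE_GROUPS:
        if age >= low and (high is None or age <= high):
            return label
    raise ValueError(f"no age group for age {age}")


class Segments:
    def __init__(self):
        self.group_of = {}               # find: person -> age group
        self.members = defaultdict(set)  # find: age group -> set of people

    def add(self, person, birth, today):
        """Place a person in an age group once the date of birth is known."""
        label = age_group(age_on(birth, today))
        old = self.group_of.get(person)
        if old is not None:
            self.members[old].discard(person)  # person moved to a new band
        self.group_of[person] = label
        self.members[label].add(person)
        return label
```

Keeping both dictionaries in step answers "to what age group does a person belong" and "who are the members of this age group" without scanning everyone.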

In database modeling terms, the matching relationships in a system fall into the categories of "HAS-A" or "IS-A" relationships. Each is modeled/realized differently in a database implementation.

For example, a person "HAS-A" social security number, a person "has an" email address, etc.

Whenever the relationship, or association, is 1 to 1 (e.g., the way Social Security Numbers were designed, to identify a person), then in "A has a B", the B can be stored in the same table row as the A.

If the relationship is 1 to many (i.e., B functionally determines A), then a separate table (i.e., intersection table) is necessary with a pointer/link from the B to the A. If A to B is 1 to 0 or 1 (i.e., the info may be missing or optional), then either method can be used (i.e., a null in the table or an auxiliary table). Purists (e.g., relational database purists) advocate the additional table (to maintain strict normal form) while many prefer the null in the table cell.
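
The two storage choices can be sketched with Python's built-in sqlite3 module. The table and column names (person, email, ssn, person_id) and the sample rows are assumptions for illustration: the 1 to 1 value sits in the person row (null when unknown), while the 1 to many value gets its own table with a link back to the person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        ssn  TEXT UNIQUE          -- 1 to 1: kept in the person row, NULL if unknown
    );
    CREATE TABLE email (
        address   TEXT PRIMARY KEY,                       -- an address has one owner
        person_id INTEGER NOT NULL REFERENCES person(id)  -- link from the B to the A
    );
""")
conn.execute("INSERT INTO person (id, name, ssn) VALUES (1, 'Pat Smith', '000-00-0000')")
conn.execute("INSERT INTO person (id, name, ssn) VALUES (2, 'Lee Jones', NULL)")
conn.executemany(
    "INSERT INTO email (address, person_id) VALUES (?, ?)",
    [("pat@example.com", 1), ("pat.smith@example.org", 1), ("lee@example.com", 2)],
)


def owner_of(conn, address):
    """The find operation: who does this email address belong to?"""
    row = conn.execute(
        "SELECT p.name FROM email e JOIN person p ON p.id = e.person_id "
        "WHERE e.address = ?",
        (address,),
    ).fetchone()
    return row[0] if row else None
```

Here owner_of(conn, "pat.smith@example.org") follows the link from the address to the person row, and an unknown address yields None.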

The "IS-A" relationship is usually modeled as an object hierarchy since, for example, many people can be a member of a group (i.e., a person "is a" member of a group) and that group can be part of a larger group, etc.

Such relationships are many to 1 (perhaps with multiple levels) and are typically implemented/realized in a database using an auxiliary intersection table.

In more flexible implementations/situations, the relations are modeled as dynamic sets using meta-tables (e.g., name-value pair lists, sometimes called property lists) and not as fixed fields in separate tables in a database (which is machine efficient but inflexible for dynamic situations).
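
A name-value pair (property list) table might be sketched as follows, again with sqlite3. The single property table, its columns, and the helper functions are hypothetical; the point is only that a new kind of attribute needs no schema change.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE property ("
    " person_id INTEGER NOT NULL,"
    " name      TEXT NOT NULL,"   # e.g. 'email', 'twitter', 'phone'
    " value     TEXT NOT NULL)"
)


def add_property(person_id, name, value):
    """Attach any name-value pair to a person; no fixed column is required."""
    db.execute(
        "INSERT INTO property (person_id, name, value) VALUES (?, ?, ?)",
        (person_id, name, value),
    )


def find_people(name, value):
    """Find: which people have this value for this property name?"""
    rows = db.execute(
        "SELECT DISTINCT person_id FROM property WHERE name = ? AND value = ? "
        "ORDER BY person_id",
        (name, value),
    ).fetchall()
    return [r[0] for r in rows]


def properties_of(person_id):
    """List every (name, value) pair recorded for one person."""
    rows = db.execute(
        "SELECT name, value FROM property WHERE person_id = ? ORDER BY name, value",
        (person_id,),
    ).fetchall()
    return [(r[0], r[1]) for r in rows]
```

The trade-off is the one noted above: every value is stored as text in one generic table, so the database can enforce less and lookups lean on indexes rather than typed columns.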

The actual matching relation can be done in many ways. Here are some examples.

1. Exact textual matching
2. Pattern textual matching (e.g., using regular expressions)
3. Approximate matching (e.g., using aliases)
4. Matching to some normal form (e.g., standardized USPS addresses)
5. Semantic mapping matching (e.g., using WordNet)
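
The first four kinds of matching can be sketched with the Python standard library alone. The alias table, the suffix table, and all function names below are made up for illustration; in particular, normalize_address is a toy and not the USPS standard. Semantic matching (item 5) needs an external lexical resource and is left out.

```python
import re
from difflib import SequenceMatcher

# Hypothetical alias table for approximate name matching.
ALIASES = {"bob": "robert", "bill": "william", "peggy": "margaret"}

# Toy suffix table; real address standardization covers far more cases.
SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def exact_match(a, b):
    """1. Exact textual matching."""
    return a == b


def pattern_match(pattern, text):
    """2. Pattern matching: the whole text must fit the regular expression."""
    return re.fullmatch(pattern, text) is not None


def alias_match(a, b):
    """3. Approximate matching through a table of known aliases."""
    def canon(name):
        key = name.strip().lower()
        return ALIASES.get(key, key)
    return canon(a) == canon(b)


def similarity(a, b):
    """3. Approximate matching as a score from 0.0 to 1.0 (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalize_address(addr):
    """4. Reduce an address to a (toy) normal form before comparing."""
    words = re.sub(r"[^\w\s]", "", addr.upper()).split()
    return " ".join(SUFFIXES.get(w, w) for w in words)
```

With these, alias_match("Bob", "Robert") is true although the strings differ, and "123 Main Street." and "123 main st" compare equal once both have been passed through normalize_address.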

With incomplete information, a probability should be placed on a potential match. This is often done with some form of Bayesian analysis/inference. More involved probability matching can be done with Bayesian networks, where the probability calculations become more complex and all depend on the assumptions made in the model.
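
As a minimal sketch of the simplest Bayesian case, the probability of a match can be updated with Bayes' rule each time a field is seen to agree. The function names and all the numbers in the usage note are assumptions, and combine treats the pieces of evidence as independent, which is exactly the kind of modeling assumption the result depends on.

```python
def posterior_match(prior, p_agree_if_match, p_agree_if_nonmatch):
    """Bayes' rule: P(match | the field agrees).

    prior               -- P(match) before looking at this field
    p_agree_if_match    -- P(field agrees | records are the same person)
    p_agree_if_nonmatch -- P(field agrees | records are different people)
    """
    numerator = p_agree_if_match * prior
    denominator = numerator + p_agree_if_nonmatch * (1.0 - prior)
    return numerator / denominator


def combine(prior, evidence):
    """Apply the update once per agreeing field, assuming independent fields.

    evidence is a list of (p_agree_if_match, p_agree_if_nonmatch) pairs.
    """
    p = prior
    for p_match, p_nonmatch in evidence:
        p = posterior_match(p, p_match, p_nonmatch)
    return p
```

For instance, with an assumed prior of 0.01 and an email address that agrees 95% of the time for true matches but only 0.1% of the time for non-matches, posterior_match(0.01, 0.95, 0.001) gives roughly 0.91. Adding a second agreeing field through combine raises it further. A Bayesian network would replace the independence assumption with explicit dependencies between fields.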

In such situations, there is a need for some probability to be associated with each piece of information. In most database implementations, the implicit assumption is that the data is 100% accurate although it is well known that almost all databases contain errors (euphemistically called anomalies in database terminology).