In this post I will show how to cleanse address data using Microsoft Data Quality Services (DQS) and Melissa Data as a reference data service. First, you need to make sure you have a working installation of DQS. If you don’t, you can view my other post here that walks you through installing DQS. Once DQS has been installed, you will need to visit the Microsoft Azure Marketplace website and subscribe to the Melissa Data Address Check service. When I wrote this post, you could get 1,000 addresses cleansed per month for free. This is useful for testing or if you have a small set of data to be cleansed. Select the desired service level/amount from the list and click the Sign-Up button. You will now have a subscription added to your account.
Next, we need to tell DQS about our Azure Marketplace account. On the Microsoft Azure Marketplace website, click on ‘My Account’ and look for the Primary Account Key value.
Highlight and copy the Key value. Open the Data Quality Client tool and click on Configuration on the right-hand side. Paste your Key value into the DataMarket Account ID entry box and click on the Validate button.
Once your ID has been validated, you are ready to set up a knowledge base in DQS that will use Melissa Data. In the DQS client, click the Close button on the Configuration screen to get back to the main menu. On the top left, click New Knowledge Base. Add 4 new domains (Address, City, State, Zip), then add 1 composite domain (FullAddress) and include the 4 domains you just created.
With the FullAddress composite domain selected, click on the Reference Data tab. Click on Browse and check the Melissa Data service.
On the Schema mapping table, create the following mappings between the RDS Schema and Domains:
1. AddressLine (M) = Address
2. City = City
3. State = State
4. Zip = Zip
You can adjust several of the Provider Settings, such as Auto Correction Threshold, Suggested Candidates, and Min Confidence. Once finished, click the Finish button and select Publish.
Next, we will cleanse a set of data using our new knowledge base. On the main screen of the DQS client, click on New Data Quality Project. Give it a name, and be sure to select the knowledge base we just finished setting up that uses the Melissa Data reference data.
Click Next and select a data source that points to the address records you would like to cleanse. Map the 4 Source Columns to the appropriate Domains. If you have mapped all 4 correctly, you should be able to click on the ‘View/Select Composite Domains’ button, which confirms that the composite domain will now be used.
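If you don’t already have a data source handy, here is a minimal sketch of a SQL Server staging table you could point the project at. The table name, column names, and sample rows are hypothetical, with a few deliberate errors thrown in so the service has something to correct.

```sql
-- Hypothetical staging table to use as the cleansing project's data source.
-- Any table with four columns mapped to the Address, City, State, and Zip
-- domains will work; these names are examples only.
CREATE TABLE dbo.AddressStaging
(
    AddressID INT IDENTITY(1,1) PRIMARY KEY,
    [Address] NVARCHAR(100),
    City      NVARCHAR(50),
    [State]   NVARCHAR(2),
    Zip       NVARCHAR(10)
);

-- Made-up rows with deliberate problems (wrong street suffix, misspelled city,
-- bad zip) so the reference data service has something to cleanse.
INSERT INTO dbo.AddressStaging ([Address], City, [State], Zip)
VALUES
    (N'123 Main Street', N'Minneapolis', N'MN', N'55401'),
    (N'123 Main Road',   N'Mineapolis',  N'MN', N'55401'),
    (N'456 Oak Drive',   N'Saint Paul',  N'MN', N'99999');
```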
Click Next, and Start. DQS will analyze the data and provide a summary of results.
Click Next to see the cleansing suggestions and ultimately work through the corrections. In my example I am using 5 addresses, which are Melissa Data office locations I found on their website. I altered some aspects of the addresses to see how well the service could cleanse the data. Let’s take a look at the results.
As you can see here, DQS suggested 2 corrections (Road to Rd, Drive to Dr). Interestingly, neither of these corrections matches what is listed on their website (they have Road and Dr). Perhaps they need to cleanse their own addresses. 🙂 It also marked one entry as Invalid even though all I did was add a 1 to the address value. This record would need to be manually reviewed and corrected, or we could adjust the threshold settings to see if we get better results. DQS successfully corrected 2 records with a good confidence score. It wasn’t fooled when I changed the City name on one record and the Zip on another.
Work through the results by making adjustments, approving or rejecting records, and finally exporting them to their final destination (SQL Server, CSV, or Excel).
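If you export to SQL Server and include the cleansing info, you can review the output with a simple query. This is a rough sketch: the table name is hypothetical, and the per-domain _Source/_Output/_Reason/_Confidence/_Status column names assume the naming DQS generates when exporting data and cleansing info, so verify them against your exported table.

```sql
-- Rough sketch: review the exported cleansing results in SQL Server.
-- Assumes the project was exported with cleansing info to a table named
-- dbo.AddressCleansed (hypothetical name).
SELECT
    Address_Source,      -- value as it came in
    Address_Output,      -- value after cleansing
    Address_Reason,      -- why DQS corrected, suggested, or invalidated it
    Address_Confidence,  -- confidence score returned for the correction
    Address_Status       -- e.g. Correct, Corrected, Suggested, New, Invalid
FROM dbo.AddressCleansed
WHERE Address_Status <> N'Correct';  -- focus on anything changed or flagged
```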
Address cleansing can be an important yet difficult task. However, with DQS we now have the ability to tap into external reference data services such as Melissa Data to help. By leveraging reference data within DQS, we can quickly and effectively cleanse address information.
Jeff, when you export the results from a Data Quality project that utilizes the Melissa Data Address Check service, the result set includes many extra fields that enrich the original composite domain schema, like latitude, longitude, etc. However, the field names in the result set are “Co_1”, “Co_2”, “Co_3”, etc., so it is hard to know what the result set actually contains without proper metadata, or a cross-reference map of what these generic field names mean. I have reached out to Melissa Data; they indicate this is proprietary to Microsoft’s version of the result set. My initial contact at Microsoft didn’t have the answer, leaving me with opening a ticket with Microsoft as the next step. Let me know if you have found such metadata.