Cleansing Data Through SSIS with DQS

While Data Quality Services (DQS) takes some time to build up an effective knowledge base through manual data review, there are times when it sure would be handy to use SSIS. This is only recommended when you need quick results and are feeling pretty good about your knowledge base or are perhaps using an external knowledge base through reference data. Should you decide to cleanse data through SSIS using DQS, it’s important to note that it can export some quality control metrics such as Status and Confidence score with the final data which can later be used to review the results and make final decisions. So let’s take a look at using the DQS Cleansing transform in SSIS.

In this example, I created a basic SSIS package with one Data Flow. I selected an Excel Source and pointed it to a file that has some data to be cleansed. In my example I am going to send in some bad addresses and attempt to cleanse them using Melissa Data. To learn more about the Melissa Data Address Check add-in and reference data in general, view my other blog post here.

Create a new package, add a Data Flow Task, and then add a DQS Cleansing transform object.

dqs cleansing transform setup

Click on the New button next to the Data quality connection manager to setup a new connection to an instance of DQS and then select a knowledge base.

dqs cleansing transform setup with connection

Click on the mappings tab. Here you will map input columns from your data source to the domains on the knowledge base.

dqs cleansing transform with mappings

If you would like to see additional columns on the final results such as the Confidence score and the Reason then click on the Advanced tab.

dql cleansing transform advanced tab

When finished, click OK and test it out. In my example I am just going to a Union All object as my destination for demo purposes but I can still see the resulting cleansed data through the Data Viewer.

dqs cleansing results from transform

As you can see from the results, two addresses were corrected, two had suggestions, and one failed to be cleansed. We could get fancy and split up the results based on the Status values. Corrected ones may be considered ready to ship. Auto Suggest records may want to be reviewed, and the New ones would need to be reviewed since it was not found in the knowledge base.

Using the DQS Cleansing transform in SSIS is useful but doesn’t eliminate the need to review the final cleansed results. Cleansing data through SSIS is dependent on the quality of the underlying knowledge base but does provide a quick and efficient way to do an initial data cleanse. The resulting cleansed data can then be evaluated using the Status, Reason and Confidence score values. Once the data has been processed and reviewed, only then should it be considered cleansed and ‘ready for use’.

Advertisement

Cleansing Address Data With Data Quality Services (DQS) and Melissa Data

In this post I will show how to cleanse address data using Microsoft Data Quality Services (DQS) and Melissa Data as a reference data service. First, you need to make sure you have a working installation of DQS. If you don’t, you can view my other post here that walks you through installing DQS. Once DQS has been installed, you will need to visit the Microsoft Azure Marketplace website and subscribe to the Melissa Data Address Check service. When I wrote this post, you could get 1,000 addresses cleansed per month for free. This is useful for testing or if you have a small set of data to be cleansed. Select the desired service level/amount from the list and click the Sign-Up button. You will now have a subscription added to your account.

melissa data service on the azure marketplace

Next, we need to tell DQS about our Azure Marketplace account. On the Microsoft Azure Marketplace website, click on ‘My Account’ and look for the Primary Account Key value.

azure marketplace ID

Highlight and copy the Key value. Open the Data Quality Client tool and click on Configuration on the right-hand side. Paste your Key value into the DataMarket Account ID entry box and click on the validate button.

dqs validate account ID

Once your ID has been validated you are then ready to setup a knowledge base in DQS that will use Melissa Data. On the DQS client, click on the Close button on the Configuration screen to get back to the main menu. On the top left, click on New Knowledge Base. Add 4 new domains (Address, City, State, Zip). Next, add 1 composite domain (FullAddress). Add the first 4 domains you just created (Address, City, State, Zip).

setting up the knowledge base domains

With the FullAddress composite domain selected, click on the Reference Data tab. Click on Browse and check the Melissa Data service.

On the Schema mapping table, create the following mappings between the RDS Schema and Domains:
1. AddressLine (M) = Address
2. City = City
3. State = State
4. Zip = Zip

adding melissa data as reference data

You can adjust several of the Providers Settings such as Auto Correction Threshold, Suggested Candidates, and Min Confidence. Once finished, click the Finish button and select Publish.

save knowledge base with reference data

Next, we will cleanse a set of data using our new knowledge base. On the main screen of the DQS client, click on New Data Quality Project. Give it a name, and be sure to select the knowledge base we just finished setting up that uses the Melissa Data reference data.

new data quality project

Click Next and select a data source that points to address records you would like to cleanse. Map the 4 Source Columns to the appropriate Domains. If you have mapped all 4 correctly, you should be able to click on the ‘View/Select Composite Domains’ button which tells you that the composite domain will now be used.

data quality project - map domains

Click Next, and Start. DQS will analyze the data and provide a summary of results.

cleansing profile results

Click Next to see the cleansing suggestions and to ultimately work through corrections. In my example I am using 5 addresses which are Melissa Data office locations that I found on their website. I altered some aspects of the addresses to see how well it can cleanse the data. Let’s take a look at the results.

record results before adjustments

As you can see here, DQS suggested 2 corrections (Road to Rd, Drive to Dr). The interesting thing is that both of these corrections don’t match what is on their website. (They have Road and Dr.) Perhaps they need to cleanse their own addresses. 🙂 It also listed one entry as Invalid even though all I did was add a 1 to the address value. This record would need to be manually reviewed/corrected or we could adjust the threshold settings to see if we get better results. DQS successfully corrected 2 records with a good confidence score. It wasn’t fooled by me changing the City name on one and Zip on the other.

approved results

Work through the results by making adjustments, approving or rejecting and then finally exporting the records to it’s final destination (SQL Server, CSV, or Excel).

Address cleansing can be an important yet difficult task. However, with DQS we now have the ability to tap into external Knowledge Bases such as Melissa Data to help. By leveraging reference data within DQS we can quickly and effective cleanse address information.

Installing Data Quality Services on SQL Server 2012

Data Quality Services is the new data cleansing tool provided in SQL Server 2012 Enterprise and BI Editions. DQS is a knowledge-driven solution that enables you to do data cleansing and matching through the DQS client, or automated cleansing through the DQS Cleansing transform in SSIS.

In this blog post I will show how to install DQS. When you first install SQL Server, there are two DQS options to select if you would like to use DQS. As shown here, the top option (Instance Features\Data Quality Services) will install the DQS Server files. The second option (Shared Features\Data Quality Client) will install the DQS client.

DQS Install File Options

Once you have the DQS files available from the SQL Server install, the next step is to run the DQS Installer. This can be found under the Microsoft SQL Server 2012 menu under Data Quality Services. Click on the Data Quality Server Installer.

DQS Menu Installer

This will launch a command prompt window and start the install. As seen here, you will need to enter in a Database Master Key password.

DQS Install command prompt

The install will continue and take on average about 7 minutes to finish. Once it finishes, you will be presented with a final message stating that DQS Installer finished sucessfully, press any key to continue. Simply press any key and the window will close.

DQS Installer command prompt finished

Congratulations, you now have the DQS server installed!

After doing a refresh on the database list in management studio, we can see there are now three new DQS databases that have been installed.

DQS_MAIN – Includes DQS stored procedures, the DQS engine, and published knowledge bases.
DQS_PROJECTS – Includes data that is required for knowledge base management and DQS project activities.
DQS_STAGING_DATA – Similar to tempdb.

DQS Installed Databases

While knowing about these databases is helpful, all of the DQS work will primarlily be done using the DQS client tool. This tool can be found in the same menu location where the DQS server install was located. When you launch the Data Quality Client tool you will be presented with a prompt to connect to the DQS server.

DQS Connect to server

Enter the instance location where DQS was installed and click Connect. You are now logged into the DQS client and will see the main screen.

DQS Client

On the left side is where you work with Knowledge Bases. The middle section is for creating projects to do data cleansing and data matching. The right side is for Administration tasks. These will each be discussed further in future posts, but you should now at least have DQS installed and ready to go.

Importing Cleansed Values into a Data Quality Services Knowledge Base

Data Quality Services is a new tool in SQL Server 2012 that helps cleanse data. At the heart of this product are knowledge bases. Knowledge bases are a repository of managed information with correct data values. Once you have a good working knowledge base in place, you can then run ‘dirty’ data against the knowledge base. Fuzzy logic is applied to try and correct, or cleanse, the unclean data and you are presented with suggested clean data which you can approve or reject.

After cleansing data against a knowledge base, the question usually arises. If there are new values in the ‘dirty’ dataset which I manually corrected/approved and are not in the knowledge base, does it automatically get added to the knowledge base? The short answer is no, it doesn’t. While it sounds like a great idea to just keep automatically adding these values to the knowledge base, in reality it’s supposed to be more of a managed process.

Even though new values aren’t automatically added to the knowledge base, DQS does provide the ability to import newly cleansed values into the knowledge base afterwards. Let’s take a look at how this is done. First, select Domain Management on the Knowledge Base.

DQS Domain Management

Select the desired Domain and on the left side. Now, near the top right grab the drop down on the ‘Import Values’ button. You are presented with two choices; Import project values, and Import valid values from Excel. The first option allows you to import values from a previous cleansing project. As stated, the second option allows data to be imported from Excel.

DQS Import Project Values

In our example, we will import values from a previously ran cleansing project. Simply select the desired project from the list.

DQS Import Values From Project

The next screen will present new values which may include Correct values, Errors, and Invalid values. All of which will continue to add value to the knowledge base. In this screenshot, we see that there will be two corrected values (Errors) and one new value (Correct) added.

DQS Imported Values

Click finish to close the import box and let’s take a look at what happened. After looking at the Domain values, we can see that our values have indeed been added and therefore helped improve the knowledge base.

DQS Imported Values Complete

Knowledge bases are very important when cleansing data in DQS. In this example we saw how to continually improve knowledge bases by importing cleansed data. As your knowledge bases continue to evolve, you should also see the quality of cleansed data improve over time.

SQL Saturday #175 – Fargo, North Dakota

SQL Saturday #175

Saturday, April 27th I will be presenting at SQL Saturday #175 in Fargo, North Dakota. This event will be held at the Microsoft Executive Briefing Center at 4550 42nd Street South, Fargo ND, 58104.
I will be speaking on the following topic:

Data Quality Services
As the old adage goes, ‘Garbage in, garbage out.’ Stop spending hours manually cleansing your data, and start using DQS! DQS is the new data quality tool provided with SQL Server 2012. By leveraging knowledge bases, you can correct invalid data and improve the quality of your data.

In this session we will cover creating a knowledge base, data cleansing, data matching, and using 3rd party knowledge bases (Reference Data). We will talk about installation, show how to use the DQS tool, and how to integrate DQS into your SSIS packages. After this session you should have a good understanding of what DQS is, how to use it, and how to get started with your own Data Quality initiative.

Check out the SQL Saturday #175 page for more information.

See you at the event!

MDS & DQS Posts Coming Soon

I’ve now given a couple of presentations on both Master Data Services (MDS), and Data Quality Services (DQS). However, I now realized I haven’t really blogged much about either. Well this is about to change! I have a whole series of posts planned which I will be posting in the near future. Here is a list of sub-topics which I will be posting on:

Data Quality Services

-What is it and what does it do?

-Installation

-Knowledge Bases

-Data Matching

-Data Cleansing

-Using DQS with SSIS

-Composite Domains

-Business Rules

-Reference Data Services

-Security

Master Data Services

-What is it and what does it do?

-Installation

-Creating & Managing Domains

-MDS Modeling

-Collections

-Excel Add-In

-Business Rules

-Using Versions

-Staging Data

-Hierarchies

-Importing & Export Models

-Views

-Security

More to come soon so stay tuned as I start to roll out some posts on both of these tools.

Integrating MDS with DQS

I’ve installed both DQS and MDS several times, but often had trouble trying to integrate Master Data Services (MDS) with Data Quality Services (DQS). By integrating MDS and DQS, you gain the ability to do data matching on your MDS data. Sounds cool enough, right? In this post I will show how to integrate these two tools and enable a couple of DQS options when using MDS.

First, let’s run through the typical worst case scenario. After installing MDS, the very last step is that long button which enables DQS Integration.

Clicking this button, the install program looks in the same SQL Server Instance which you just installed MDS into and searches for a database called DQS_MAIN. If you haven’t yet installed DQS, it will fail and you will get the following error message.

If you installed MDS before installing DQS, all hope is not lost. After installing MDS, simply install DQS and then come back into the MDS Configuration screen and enable DQS integration.
Now when you can click the ‘Enable DQS Integration’ button, you should get a success message.

Ok, so we have both installed and they have been integrated. Now let’s take a look at our new found powers. If we fire up Excel and click on the ‘Master Data’ tab, you can see that we now have access to the Data Quality tools as well!

With these tools, we can now utilize the matching rules which we can setup in DQS.

Without integrating DQS with MDS, you can see here in this ‘before’ screenshot that the DQS tools are not available.

(If you don’t have the MDS Add-in for excel, you can download it here: http://go.microsoft.com/fwlink/?LinkId=219530)

Here is a video tutorial on how to use the DQS tools with MDS:
http://msdn.microsoft.com/en-us/sqlserver/hh828790.aspx

Here is a link which describes using Data Quality Matching in the MDS Add-in for Excel:
http://msdn.microsoft.com/en-us/library/hh548681.aspx