Everybody talks about Big Data, like it’s either a monolithic thing like Ayres Rock in Australia that has been around since before there were people. You can almost imagine companies springing up around deposits of Big Data like gold miners along a stream. Could that be why so many people talk so much about data mining? I think not, because Big Data is almost an organic thing, like a little acorn that grows into a might oak, and like a seed it needs to be fed and nurtured.
While there are multiple methods of feeding a database, one of the more popular is SSIS (SQL Server Integration Services). If you’re unfamiliar with SSIS, it’s a data migration tool that is part of SQL Server Standard Edition or above, which, unfortunately, means that it isn’t part of the free Express Edition of SQL Server.
In this blog, I’ll walk you along the creation of a data project using SSIS. To play along at home, in addition to SSIS, you’ll need Microsoft’s Adventure Works Sample database and Microsoft Excel.
In the Windows start menu, SSIS can be found under Microsoft SQL Server 2012 as SQL Server Data Tools. Clicking it and selecting new project brings up the New Project dialog shown in Figure A.
Figure A
Click on figures to enlarge.
When a new project is created, you’re presented with the IDE shown in Figure B. Notice various tabs along the top, the SSIS Toolbox with its tasks on the left and the Connection Managers along the bottom.
Figure B
Looks kind of familiar, doesn’t it–like Visual Studio? That’s because it is Visual Studio. But don’t let that scare you, if you’re not really a .Net programmer. With the exception of Connection Managers, this IDE is basically drag and drop.
Now about those Connection Managers: they identify what kind of inputs and outputs are available for processing, as listed in Table A.
Table A
Type |
Description |
ADO | Connection manager for ADO connections |
ADO.NET | Connection manager for ADO.NET connections |
CACHE | Connection manager for cache |
DQS | Connection manager for DQS server |
EXCEL | Connection manager for Excel files |
FILE | Connection manager for files |
FLATFILE | Connection manager for flat files |
FTP | Connection manager for FTP connections |
HTTP | Connection manager for HTTP connections |
MSMQ | Connection manager for Message Queue task |
MSOLAP100 | Connection manager for Analysis Services connections |
MULTIFILE | Connection manager for multiple files |
MULITFLATFILE | Connection manager for multiple flat files |
ODBC | Connection manager for ODBC connections |
OLEDB | Connection manager for OLE DB connections |
SMOServer | Connection manager for SQL Server transfer tasks |
SMTP | Connection manager for SQL Server Compact connections |
WMI | Connection manager for WMI connections |
Using the list above, let’s say, rather than feed the beast, we’d like to drag some information like addresses and put them in an Excel spreadsheet. So, we’ll need a Database Connection Manager, like the OLE Connection Manager shown in Figure C. For the Excel spreadsheet, use an Excel Connection Manager, an example of which is displayed in Figure D.
Figure C
Figure D
There is one caveat: If you’re using a 64-Bit edition of Windows, you’ll need to make sure that you do not use the 64-Bit runtime. The package will execute badly. Here’s how to make sure this doesn’t happen. Open up the project’s property pages (it’s the last item under the Project tab). Next, click Configuration Properties and Debugging. Under Debugging Options, set Run64BitRuntime to False as shown in Figure E, and you’re good to go.
Figure E
There is one more item to remember. All of the record-level processing is done in Data Flow tasks, not in the Control Flow. The Control Flow is where you control the flow using the For Loop Container, the Foreach Loop Container, or the Sequence Container. But since our project is so simple, all we need to do is drag a Data Flow Task onto the Control Flow, giving us something that looks like Figure F.
Figure F
Now it’s time to click on the Data Flow Task or the Data Flow Tab, and open up the Data Flow. While it’s similar to the Control Flow, the SSIS toolbox offers some different tools. The two that we need are under Other Sources and Other Destinations, namely the OLE DB Source and the Excel Destination. Figure G shows what this looks like.
Figure G
Now drag an OLE DB Source onto the page and double-click on it. Up comes the OLE DB Source Editor. Here is where you use the OLE DB Connection Manager and select which table or stored procedure or SQL to use in order to retrieve data, like Figure H shows. This is also where you can pick the columns that you want to retrieve. Finally, here is where the error row handling can be defined.
Figure H
Once the source is defined drag an Excel Destination onto the Data Flow, then click on the OLE DB Source and drag the arrow to the Excel Destination, getting something that looks like Figure I.
Figure I
This last step is where the Excel Destination is defined using the Excel Connection Manager. It looks a lot like setting up the OLE DB Source as Figure J and Figure K show.
Figure J
Figure K
Now all that is left is to run the package by clicking the obligatory green triangle and, if you don’t see any red, looking at the result in Excel, Figure L.
Figure L
Now, this really isn’t a fair test of everything that is possible with SSIS; that would be somewhat longer than an article. It does show something of what is possible. Consider, for a moment, what if the input was Excel and the output was a database table, or a flat file and a table, or an FTP and a table, or what if both were tables?
The Data Flow doesn’t have to be a straight move; there are a bunch of tools that we never even looked at. You can sum, split, sort, fuzzy group, lookup, pivot, unpivot and even add your own code in C# or Visual Basic .Net.