微软BI 之SSIS 系列 - 如何使用 SSIS 同步指定目录下各种文档的状态(英文版)

浏览: 5423

开篇介绍

这篇文章是我在2012年5月份写在 Google Blog 上的一篇英文博客 http://simonlv.blogspot.com/2012/05/ssis-step-by-step.html ,当然这篇博客在国内其实是无法访问的。也忘记当时为什么要用英文写博客的原因了,大概齐当时可能正好在美国出差心里有那么点小九九,可能觉得把技术博客写牛一点以后到美国可能找个微软 BI 开发的工作好找一点。后来出差结束回国了,上网访问不太方便可能就没有继续再更新了。因为这些 Step by Step 博客陆陆续续有很多老外加我 Skype 或者发邮件给我,尤其三哥众多。前几天还有一个三哥问我为什么不更新博客了,我说我没有时间写,他说为什么不雇一个人继续写 ? (#¥#%¥@ 我去! 我是穷疯了吗?),我说我没有钱雇人.....三哥是不是觉得很不好笑。我看他头像年龄不小,觉得应该有将近40了吧,以为是哪家大公司的高管还窃窃自喜是不是有啥好事来了? 结果冒昧的问了一句 How old are you?  靠!22 !三哥..... 果然是少年老成,比我们要成熟一些!

2015-10-08_001541.jpg

英文水平就这么高,稍稍把以前的语句整理一下就将就看看吧。

Requirement & Preface

In this post, I'm going to introduce something about file processing. This demo could be a small project or a solution and I will show you many SSIS skills and some very tricky things.

The requirements are:

1. Here's specified folder, for example -

1.png

We need to read this directory and record the information of all files into a table. 

The information - File Name, Created Date, Modified Date, File Type, File Path, size etc.

2. If the file is updated, then sync the changes to database.

For example, if the file A has been modified, the modified date will be a new date. When we run package, the new changes will be updated in database with new file size, modified date etc. 

3. If the file is deleted, database will sync this information and won't show the file's information.

3. If a new file is added, database will sync this information and show the new file's information.

4. It also will record the historical information of files.

What's new in this post ?

1. How to get a batch of files in a script ?

2. How to get file's information - File Name, Created Date, Modified Date, File Type ?

3. How to use Look Up component to redirect Matched and NON-Matched records?

4. How to use Derived Column component to create a new column or replace some column value ?

5. How to design a simple documents-sync framework ? (Rome was not built in a day, the database design and SSIS package design will be updated until it meets all requirements, so the steps of this post will confuse you - how did you know this component should be here but not there ? This example is updated on and off for 5 times, I have to test it until it meet my requirements.)

Preparations

1. Make sure you have created these tables in demo database - SSISDemoDB.   

USE SSISDemoDB
GO

IF OBJECT_ID('DocumentTypeMapping') IS NOT NULL
DROP TABLE DocumentTypeMapping

CREATE TABLE DocumentTypeMapping
(
      ID INT IDENTITY(1,1) PRIMARY KEY,
      FileExtension NVARCHAR(50) NOT NULL,
      FileTypeDescription NVARCHAR(100) NOT NULL
)

INSERT INTO DocumentTypeMapping VALUES
('dtsx','Integration Services Package'),
('database','Analysis Services Database'),
('dtproj','Integration Services project file'),
('user','Visual Studio Project User Options file'),
('dtsx','Integration Services Package'),
('txt','Text Document'),
('xlsx','Microsoft Excel Worksheet'),
('xlsx','Microsoft Excel Worksheet') ,
('vsd','Microsoft Visio Drawing'),
('doc','Microsoft Word Document 2003-2007  ')

IF OBJECT_ID('DocumentsStaging') IS NOT NULL
DROP TABLE DocumentsStaging

CREATE TABLE DocumentsStaging
(
      ID INT IDENTITY(1,1) PRIMARY KEY,
      NAME NVARCHAR(500) NOT NULL,
      FullPath NVARCHAR(2000) NOT NULL,
      CreatedDate DATETIME NOT NULL,
      FirstModifiedDate DATETIME NOT NULL,
      NewModifiedDate     DATETIME NOT NULL,
      FileType NVARCHAR(200) NOT NULL,
      Size BIGINT NOT NULL,
      BatchInsertTime DATETIME DEFAULT GETDATE() NOT NULL
)

IF OBJECT_ID('Documents') IS NOT NULL
DROP TABLE Documents

CREATE TABLE Documents
(
      ID INT IDENTITY(1,1) PRIMARY KEY,
      NAME NVARCHAR(500) NOT NULL,
      FullPath NVARCHAR(2000) NOT NULL,
      CreatedDate DATETIME NOT NULL,
      ModifiedDate DATETIME NOT NULL,
      FileType NVARCHAR(200) NOT NULL,
      Size INT NOT NULL,
      Operation NVARCHAR(50) NOT NULL
)

IF OBJECT_ID('DocumentsErrorExtension') IS NOT NULL
DROP TABLE DocumentsErrorExtension

CREATE TABLE DocumentsErrorExtension
(
      ID INT IDENTITY(1,1) PRIMARY KEY,
      NAME NVARCHAR(500) NOT NULL,
      FullPath NVARCHAR(2000) NOT NULL,
      FileType NVARCHAR(200) NOT NULL, 
)

Description 

Table DocumentTypeMapping shows the description of specified file, usually we get a file with its extension like - .doc , .txt, .ppt. We won't show user such information because the user may not understand what does '.ppt' or 'prot' mean, so please configure this mapping before you composed your file processing package.

Table DocumentsStaging - It mostly like a historical table which records all documents information in historically loading files from specified directory. 

For example, the first time we load folder 'Test', there're A,B,C file in this folder, we record these information when we run the package at the first time. But at the second time when we run this package, we only find file B inside and a new file D.

Then we will record like this -

ID, File Name, BatchInsertTime

1   A          2011

2   B          2011

3   C          2011

4   B          2012

5   D          2012

Table Documents will only show the current existing files in specified folder.

So the first time when we load files, it will show - 

ID, File Name, BatchInsertTime

1   A          2011

2   B          2011

3   C          2011

The second time it will show - 

ID, File Name, BatchInsertTime

1   B          2012

2   D          2012

Because system will think the file A and C have been deleted by user or their name has been changed. 

So if user want a report to show files information, table DocumentsStaging can show the history and table Documents can show current information of files.

Table DocumentsErrorExtension will record all mismatched file type. For example, there's no description for file extension - '.mdb'. If package process a new file with file extension .mdb, the information the this file won't be recorded in table Documents, it will be recored in DocumentsErrorExtension to let system admin know this file doesn't has a file type description, we need to insert a description for this file first.

SSIS Framework

Control Flow -

2.png

Data Flow - 

3.png

Package Variables -

2015-10-08_004407.jpg

Steps

1. Create a package 'CH03_SyncDocuments' and defined package variables as above figure.

Package variable - Files. It will be used to store all file names from specified package variable - Directory.

Please assign your custom file folder path to 'Directory'

2. Drag a script component to Control Flow Pane, name it as 'SRC_GetFileNames', click to open it.

    ReadOnlyVariables - Directory (We have assigned a value to this variable in step 1)

    ReadWriteVariables - Files (All file names will be saved in this variable)

4.png

Click Edit Script and import System.IO in default class, add such codes.

public void Main()
        {
            //Get the directory from user defined variable
            string directory = Dts.Variables["User::Directory"].Value.ToString();

            //Get all file names in specified directory
            string[] fileNames = Directory.GetFiles(directory);

            //Save to user defined variable
            Dts.Variables["User::Files"].Value = fileNames;

            Dts.TaskResult = (int)ScriptResults.Success;
        }

Save these changes and run package to make sure if can you see a green status of component. If script component becomes green, that means we can load all file names correctly from specified folder.

We often do this test when we finished a configuration to each component. 

3. Drag a loop foreach container as figure shows (Files Loop):

5.png

6.png7.png

User::FilePath's value is from User::Files in a loop. 

4. Drag a script component with name 'SCR_LoadFileInformation' into loop container, edit this component 

    We use User::FilePath to load a file and read all properties of this file.    

    ReadWriteVariables are variables which will save all loaded properties from a file.

8.png

Edit script, import System.IO and try to understand such code inside -

public void Main()
        {
            string fileName = Dts.Variables["User::FilePath"].Value.ToString();

            FileInfo file = new FileInfo(fileName);

            Dts.Variables["User::Name"].Value = file.Name;
            Dts.Variables["User::CreatedDate"].Value = file.CreationTime;
            Dts.Variables["User::ModifiedDate"].Value = file.LastWriteTime;
            Dts.Variables["User::FileType"].Value = file.Extension.Replace('.',' ').Trim();
            Dts.Variables["User::Size"].Value = (int)(file.Length/1024);
            Dts.Variables["User::FilePath"].Value = file.FullName;

            Dts.TaskResult = (int)ScriptResults.Success;
        }

How to pass a package variables and how to assign value to package variables ? You should know the answers.

Run this package, you should make sure all components are green finally.

5. In loop container, add Execute SQL Task with name EST_InsertFileInLoop, edit it.

You should know how to build a OLE DB connection, I have mentioned it in previous posts.

For SQL Statement, we need to build a SQL to insert all document information to a staging table, the parameter will use ? to instead of the real value, we will configure parameters in 'Parameter Mapping'.

9.png

We need to know which values we should to insert, then map them like below figure shows:

CREATE TABLE DocumentsStaging
(
ID INT IDENTITY(1,1) PRIMARY KEY,
NAME NVARCHAR(500) NOT NULL,                           -- User::Name
FullPath NVARCHAR(2000) NOT NULL,                        -- User::FilePath
CreatedDate DATETIME NOT NULL,                              -- User::CreatedDate
FirstModifiedDate DATETIME NOT NULL,                    -- User::ModifiedDate
NewModifiedDate  DATETIME NOT NULL,                 -- User::ModifiedDate
FileType NVARCHAR(200) NOT NULL,                         -- User::FileType
Size BIGINT NOT NULL,                                                  -- User::Size  (KB)
BatchInsertTime DATETIME DEFAULT GETDATE() NOT NULL  -- System Variable: System::StartTime
)

10.png

FirstModifiedDate DATETIME NOT NULL,                    -- User::ModifiedDate
NewModifiedDate   DATETIME NOT NULL,                 -- User::ModifiedDate

  • In Windows OS, we use C# can easily get a file's properties like created date and modified date.   
  • For example, I created a file on 05/01/2011.  Then I updated this file with new contents on 06/01/2011.
  • So the created date property of this file is  - 05/01/2011, the modified date property is 06/01/2011.
  • But for this demo, we need to know which files are new added, which files are new updated, the basis is we judge if the created date is same as modified date. If created date = modified date, this file is new added, if created date <> modified date, we think this file has been updated.
  • The problem for this demo is, if we load this file on 07/01/2011 at the first time, what conclusion we should make ? Is is a updated file or a new file. Actually it should be a new file to us, it may be edited several times before our package load it.
  • So I design two columns in tables, both columns store the value of Last Access Time of the file. The FirstModifiedDate doesn't mean the real first update time (06/01/2011) but means the last modified time when loading this file to staging table.

BatchInsertTime DATETIME DEFAULT GETDATE() NOT NULL  -- System Variable: System::StartTime

  •     We need to record which files are loaded per every time we run this package. 

11.png

You should understand the above steps doing:

  • Get all file names and save to an object which is used in loop container.
  • Each loop, we can use a file name to load a file and get properties of this file. All properties of a file are save in different package variables.
  • Map these package variables and compose SQL with these variable to insert a file into staging table.
  • When the loop ends, the information of all files are recorded. 

6. Build Execute SQL Task to delete error extension information and add a data flow task to process the main sync logic.

Execute SQL Task - EST_DeleteErrorType

Data Flow Task - UpdateDocuments

12.png

Edit EST_DeleteErrorTypes with existing OLE DB connection.SQL Statement - 

DELETE FROM dbo.DocumentsErrorExtension

WHERE FileType IN(

      SELECT DISTINCT FileType

      FROM dbo.DocumentsStaging

)

13.png

The component is used to delete all error information from DocumentErrorExtension.  If the file type of a file doesn't map to mapping table, the information will be recorded in DocumentErrorExtension.

Go into data flow task - click 'UpdateDocuments', we 're going to design data flow task to sync documents.

7. Please see the data flow framework again.

14.png

The main logic is -

  • Get last batch documents from staging table to look up documents table.
  • NO Match - The current file doesn't exist in documents table, then insert new file information to documents (Also need to compare the file type, if the file type doesn't exist in file type mapping table, it will redirect to error extension table).
  • Match - The current file exists in documents table, need to update document information in document table.

8. Add OLE DB Source in data flow, new name is 'OLE_SRC_GetStaging', edit it.

    Getting the last batch of files from staging table as a source.

  SELECT *

  FROM dbo.DocumentsStaging

  WHERE BatchInsertTime = (SELECT MAX(BatchInsertTime) FROM dbo.DocumentsStaging)

15.png

 Pass all columns to next component.

16.png

9. Add a look up component - 'LKP_LookUpDocuments', edit this component.

    Choose 'Redirect rows to no match output'

17.png

Use 'Document' table as a look up table.

18.png

Available Input columns are from 'OLE_SRC_GetStaging', please see Step 8.

Available Lookup Columns are from table Documents.

Simply to say, there're two tables - DocumentsStaging and Documents. Just like the join in SQL, we use NAME and CreatedDate to see if files in staging exist in Documents.

There're two results here, Matched and NON-Matched, so two outputs are also generated in this component - Matched Output and NON-Matched Output, these outputs can be a source for next data flow component.

Drag 'Name' in input column to the 'Name' in lookup columns, same operation to CreatedDate.

19.png

10. For No Match Output, we're going to insert a new record. But we need to check if the file type exist in mapping table first.

20.png

Drag a lookup component 'LKP_Look Up File Extension' and tie to 'LKP_LookUpDocuments', choose 'Lookup No Match Output' when message box shows up.

Drag a OLE DB Components  CMD_UpdateDocuments' and tie to 'LKP_LookUpDocuments', choose 'Lookup Match Output' when message box shows up.

Need to understand the source of Matched and Non-Matched components.

For Non Matched component 'LKP_Look Up File Extension', its source columns are from the original source 'OLE_SRC_GetStaging'.

For Matched component 'CMD_UpdateDocuments', its source columns are from the join columns of 'OLE_SRC_GetStaging' and       'LKP_LookUpDocuments' base on 'Name' and 'CreatedDate'.

11. Edit LKP_Look Up File Extension

21.png

Choose look up table - DocumentTypeMapping

22.png

File Type in input columns maps to File Extension in look up columns.

23.png

12. If mapped, that means the file type from staging table is correct type, otherwise, it should be incorrect file type which haven't been recorded in file mapping table.

24.png

 13. If mapped, we get those records from staging table with the file type description and insert into table Documents.

      Drag a Derived Column component ''DC_InsertOperation' and tie to 'LKP_Look Up File Extension' by using 'Lookup Match Output', edit it.25.png

A new column 'Operation' will be built with a value 'NEW', this column will be added into up stream input columns and will be passed together to next component as a source.

14. Add an OLE DB Destination component 'DEST_InsertNewDocuments' to connect to ''DC_InsertOperation', edit it.

26.png

Do mapping.

27.png

15. Drag another component 'DEST_NoMatchedFileExtension', edit it.

28.png

29.png

16. CMD_UpdateDocuments

30.png 

OLE DB Connection

31.png

Edit SQL Command

UPDATE dbo.Documents 
SET FullPath = ?,
        ModifiedDate = ?,
        Size = ?,
        Operation = 'UPDATE'
WHERE NAME = ?
          AND     FileType = ?
          AND     CreatedDate = ?
          AND     UPPER(Operation) = 'NEW'
          AND     ModifiedDate <>  ?

32.png

33.png

 17. CMD_DeleteNONExistDoc

 DELETE FROM dbo.Documents
  WHERE Name NOT IN(
      SELECT NAME FROM
      dbo.DocumentsStaging
      WHERE BatchInsertTime =
      (
            SELECT MAX(BatchInsertTime)
            FROM dbo.DocumentsStaging 
      )
  )

34.png

18. Test

35.png

36.png

37.png

19. Add a new file 

38.png

39.png

20. Update contents.

40.png


41.png

FINISH!!!

推荐 2
本文由 BIWORK 创作,采用 知识共享署名-相同方式共享 3.0 中国大陆许可协议 进行许可。
转载、引用前需联系作者,并署名作者且注明文章出处。
本站文章版权归原作者及原出处所有 。内容为作者个人观点, 并不代表本站赞同其观点和对其真实性负责。本站是一个个人学习交流的平台,并不用于任何商业目的,如果有任何问题,请及时联系我们,我们将根据著作权人的要求,立即更正或者删除有关内容。本站拥有对此声明的最终解释权。

0 个评论

要回复文章请先登录注册