Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data IntegrationISBN: 978-0-470-63517-9
Paperback
720 pages
September 2010
This is a Print-on-Demand title. It will be printed specifically to fill your order. Please allow an additional 10-15 days delivery time. The book is not returnable.
|
Introduction xxxi
Part I Getting Started 1
Chapter 1 ETL Primer 3
OLTP versus Data Warehousing 3
What Is ETL? 5
The Evolution of ETL Solutions 5
ETL Building Blocks 7
ETL, ELT, and EII 8
ELT 9
EII: Virtual Data Integration 10
Data Integration Challenges 11
Methodology: Agile BI 12
ETL Design 14
Data Acquisition 14
Beware of Spreadsheets 15
Design for Failure 15
Change Data Capture 16
Data Quality 16
Data Profiling 16
Data Validation 17
ETL Tool Requirements 17
Connectivity 17
Platform Independence 18
Scalability 18
Design Flexibility 19
Reuse 19
Extensibility 19
Data Transformations 20
Testing and Debugging 21
Lineage and Impact Analysis 21
Logging and Auditing 22
Summary 22
Chapter 2 Kettle Concepts 23
Design Principles 23
The Building Blocks of Kettle Design 25
Transformations 25
Steps 26
Transformation Hops 26
Parallelism 27
Rows of Data 27
Data Conversion 29
Jobs 30
Job Entries 31
Job Hops 31
Multiple Paths and Backtracking 32
Parallel Execution 33
Job Entry Results 34
Transformation or Job Metadata 36
Database Connections 37
Special Options 38
The Power of the Relational Database 39
Connections and Transactions 39
Database Clustering 40
Tools and Utilities 41
Repositories 41
Virtual File Systems 42
Parameters and Variables 43
Defining Variables 43
Named Parameters 44
Using Variables 44
Visual Programming 45
Getting Started 46
Creating New Steps 47
Putting It All Together 49
Summary 51
Chapter 3 Installation and Configuration 53
Kettle Software Overview 53
Integrated Development Environment: Spoon 55
Command-Line Launchers: Kitchen and Pan 57
Job Server: Carte 57
Encr.bat and encr.sh 58
Installation 58
Java Environment 58
Installing Java Manually 58
Using Your Linux Package Management System 59
Installing Kettle 59
Versions and Releases 59
Archive Names and Formats 60
Downloading and Uncompressing 60
Running Kettle Programs 61
Creating a Shortcut Icon or Launcher for Spoon 62
Configuration 63
Configuration Files and the .kettle Directory 63
The Kettle Shell Scripts 69
General Structure of the Startup Scripts 70
Adding an Entry to the Classpath 70
Changing the Maximum Heap Size 71
Managing JDBC Drivers 72
Summary 72
Chapter 4 An Example ETL Solution—Sakila 73
Sakila 73
The Sakila Sample Database 74
DVD Rental Business Process 74
Sakila Database Schema Diagram 75
Sakila Database Subject Areas 75
General Design Considerations 77
Installing the Sakila Sample Database 77
The Rental Star Schema 78
Rental Star Schema Diagram 78
Rental Fact Table 79
Dimension Tables 79
Keys and Change Data Capture 80
Installing the Rental Star Schema 81
Prerequisites and Some Basic Spoon Skills 81
Setting Up the ETL Solution 82
Creating Database Accounts 82
Working with Spoon 82
Opening Transformation and Job Files 82
Opening the Step’s Configuration Dialog 83
Examining Streams 83
Running Jobs and Transformations 83
The Sample ETL Solution 84
Static, Generated Dimensions 84
Loading the dim_date Dimension Table 84
Loading the dim_time Dimension Table 86
Recurring Load 87
The load_rentals Job 88
The load_dim_staff Transformation 91
Database Connections 91
The load_dim_customer Transformation 95
The load_dim_store Transformation 98
The fetch_address Subtransformation 99
The load_dim_actor Transformation 101
The load_dim_film Transformation 102
The load_fact_rental Transformation 107
Summary 109
Part II ETL 111
Chapter 5 ETL Subsystems 113
Introduction to the 34 Subsystems 114
Extraction 114
Subsystems 1–3: Data Profiling, Change Data Capture, and
Extraction 115
Cleaning and Conforming Data 116
Subsystem 4: Data Cleaning and Quality Screen
Handler System 116
Subsystem 5: Error Event Handler 117
Subsystem 6: Audit Dimension Assembler 117
Subsystem 7: Deduplication System 117
Subsystem 8: Data Conformer 118
Data Delivery 118
Subsystem 9: Slowly Changing Dimension Processor 118
Subsystem 10: Surrogate Key Creation System 119
Subsystem 11: Hierarchy Dimension Builder 119
Subsystem 12: Special Dimension Builder 120
Subsystem 13: Fact Table Loader 121
Subsystem 14: Surrogate Key Pipeline 121
Subsystem 15: Multi-Valued Dimension Bridge Table Builder 121
Subsystem 16: Late-Arriving Data Handler 122
Subsystem 17: Dimension Manager System 122
Subsystem 18: Fact Table Provider System 122
Subsystem 19: Aggregate Builder 123
Subsystem 20: Multidimensional (OLAP) Cube Builder 123
Subsystem 21: Data Integration Manager 123
Managing the ETL Environment 123
Summary 126
Chapter 6 Data Extraction 127
Kettle Data Extraction Overview 128
File-Based Extraction 128
Working with Text Files 128
Working with XML files 133
Special File Types 134
Database-Based Extraction 134
Web-Based Extraction 137
Text-Based Web Extraction 137
HTTP Client 137
Using SOAP 138
Stream-Based and Real-Time Extraction 138
Working with ERP and CRM Systems 138
ERP Challenges 139
Kettle ERP Plugins 140
Working with SAP Data 140
ERP and CDC Issues 146
Data Profiling 146
Using eobjects.org DataCleaner 147
Adding Profile Tasks 149
Adding Database Connections 149
Doing an Initial Profile 151
Working with Regular Expressions 151
Profiling and Exploring Results 152
Validating and Comparing Data 153
Using a Dictionary for Column Dependency Checks 153
Alternative Solutions 154
Text Profiling with Kettle 154
CDC: Change Data Capture 154
Source Data–Based CDC 155
Trigger-Based CDC 157
Snapshot-Based CDC 158
Log-Based CDC 162
Which CDC Alternative Should You Choose? 163
Delivering Data 164
Summary 164
Chapter 7 Cleansing and Conforming 167
Data Cleansing 168
Data-Cleansing Steps 169
Using Reference Tables 172
Conforming Data Using Lookup Tables 172
Conforming Data Using Reference Tables 175
Data Validation 179
Applying Validation Rules 180
Validating Dependency Constraints 183
Error Handling 183
Handling Process Errors 184
Transformation Errors 186
Handling Data (Validation) Errors 187
Auditing Data and Process Quality 191
Deduplicating Data 192
Handling Exact Duplicates 193
The Problem of Non-Exact Duplicates 194
Building Deduplication Transforms 195
Step 1: Fuzzy Match 197
Step 2: Select Suspects 198
Step 3: Lookup Validation Value 198
Step 4: Filter Duplicates 199
Scripting 200
Formula 201
JavaScript 202
User-Defined Java Expressions 202
Regular Expressions 203
Summary 205
Chapter 8 Handling Dimension Tables 207
Managing Keys 208
Managing Business Keys 209
Keys in the Source System 209
Keys in the Data Warehouse 209
Business Keys 209
Storing Business Keys 210
Looking Up Keys with Kettle 210
Generating Surrogate Keys 210
The “Add sequence” Step 211
Working with auto_increment or IDENTITY Columns 217
Keys for Slowly Changing Dimensions 217
Loading Dimension Tables 218
Snowflaked Dimension Tables 218
Top-Down Level-Wise Loading 219
Sakila Snowflake Example 219
Sample Transformation 221
Database Lookup Configuration 222
Sample Job 225
Star Schema Dimension Tables 226
Denormalization 226
Denormalizing to 1NF with the “Database lookup” Step 226
Change Data Capture 227
Slowly Changing Dimensions 228
Types of Slowly Changing Dimensions 228
Type 1 Slowly Changing Dimensions 229
The Insert / Update Step 229
Type 2 Slowly Changing Dimensions 232
The “Dimension lookup / update” Step 232
Other Types of Slowly Changing Dimensions 237
Type 3 Slowly Changing Dimensions 237
Hybrid Slowly Changing Dimensions 238
More Dimensions 239
Generated Dimensions 239
Date and Time Dimensions 239
Generated Mini-Dimensions 239
Junk Dimensions 241
Recursive Hierarchies 242
Summary 243
Chapter 9 Loading Fact Tables 245
Loading in Bulk 246
STDIN and FIFO 247
Kettle Bulk Loaders 248
MySQL Bulk Loading 249
LucidDB Bulk Loader 249
Oracle Bulk Loader 249
PostgreSQL Bulk Loader 250
Table Output Step 250
General Bulk Load Considerations 250
Dimension Lookups 251
Maintaining Referential Integrity 251
The Surrogate Key Pipeline 252
Using In-Memory Lookups 253
Stream Lookups 253
Late-Arriving Data 255
Late-Arriving Facts 256
Late-Arriving Dimensions 256
Fact Table Handling 260
Periodic and Accumulating Snapshots 260
Introducing State-Oriented Fact Tables 261
Loading Periodic Snapshots 263
Loading Accumulating Snapshots 264
Loading State-Oriented Fact Tables 265
Loading Aggregate Tables 266
Summary 267
Chapter 10 Working with OLAP Data 269
OLAP Benefits and Challenges 270
OLAP Storage Types 272
Positioning OLAP 272
Kettle OLAP Options 273
Working with Mondrian 274
Working with XML/A Servers 277
Working with Palo 282
Setting Up the Palo Connection 283
Palo Architecture 284
Reading Palo Data 285
Writing Palo Data 289
Summary 291
Part III Management and Deployment 293
Chapter 11 ETL Development Lifecycle 295
Solution Design 295
Best and Bad Practices 296
Data Mapping 297
Naming and Commentary Conventions 298
Common Pitfalls 299
ETL Flow Design 300
Reusability and Maintainability 300
Agile Development 301
Testing and Debugging 306
Test Activities 307
ETL Testing 308
Test Data Requirements 308
Testing for Completeness 309
Testing Data Transformations 311
Test Automation and Continuous Integration 311
Upgrade Tests 312
Debugging 312
Documenting the Solution 315
Why Isn’t There Any Documentation? 316
Myth 1: My Software Is Self-Explanatory 316
Myth 2: Documentation Is Always Outdated 316
Myth 3: Who Reads Documentation Anyway? 317
Kettle Documentation Features 317
Generating Documentation 319
Summary 320
Chapter 12 Scheduling and Monitoring 321
Scheduling 321
Operating System–Level Scheduling 322
Executing Kettle Jobs and Transformations from
the Command Line 322
UNIX-Based Systems: cron 326
Windows: The at utility and the Task Scheduler 327
Using Pentaho’s Built-in Scheduler 327
Creating an Action Sequence to Run Kettle Jobs and
Transformations 328
Kettle Transformations in Action Sequences 329
Creating and Maintaining Schedules with the
Administration Console 330
Attaching an Action Sequence to a Schedule 333
Monitoring 333
Logging 333
Inspecting the Log 333
Logging Levels 335
Writing Custom Messages to the Log 336
E‑mail Notifications 336
Configuring the Mail Job Entry 337
Summary 340
Chapter 13 Versioning and Migration 341
Version Control Systems 341
File-Based Version Control Systems 342
Organization 342
Leading File-Based VCSs 343
Content Management Systems 344
Kettle Metadata 344
Kettle XML Metadata 345
Transformation XML 345
Job XML 346
Global Replace 347
Kettle Repository Metadata 348
The Kettle Database Repository Type 348
The Kettle File Repository Type 349
The Kettle Enterprise Repository Type 350
Managing Repositories 350
Exporting and Importing Repositories 350
Upgrading Your Repository 351
Version Migration System 352
Managing XML Files 352
Managing Repositories 352
Parameterizing Your Solution 353
Summary 356
Chapter 14 Lineage and Auditing 357
Batch-Level Lineage Extraction 358
Lineage 359
Lineage Information 359
Impact Analysis Information 361
Logging and Operational Metadata 363
Logging Basics 363
Logging Architecture 364
Setting a Maximum Buffer Size 365
Setting a Maximum Log Line Age 365
Log Channels 366
Log Text Capturing in a Job 366
Logging Tables 367
Transformation Logging Tables 367
Job Logging Tables 373
Summary 374
Part IV Performance and Scalability 375
Chapter 15 Performance Tuning 377
Transformation Performance: Finding the Weakest Link 377
Finding Bottlenecks by Simplifying 379
Finding Bottlenecks by Measuring 380
Copying Rows of Data 382
Improving Transformation Performance 384
Improving Performance in Reading Text Files 384
Using Lazy Conversion for Reading Text Files 385
Single-File Parallel Reading 385
Multi-File Parallel Reading 386
Configuring the NIO Block Size 386
Changing Disks and Reading Text Files 386
Improving Performance in Writing Text Files 387
Using Lazy Conversion for Writing Text Files 387
Parallel Files Writing 387
Changing Disks and Writing Text Files 387
Improving Database Performance 388
Avoiding Dynamic SQL 388
Handling Roundtrips 388
Handling Relational Databases 390
Sorting Data 392
Sorting on the Database 393
Sorting in Parallel 393
Reducing CPU Usage 394
Optimizing the Use of JavaScript 394
Launching Multiple Copies of a Step 396
Selecting and Removing Values 397
Managing Thread Priorities 397
Adding Static Data to Rows of Data 397
Limiting the Number of Step Copies 398
Avoiding Excessive Logging 398
Improving Job Performance 399
Loops in Jobs 399
Database Connection Pools 400
Summary 401
Chapter 16 Parallelization, Clustering, and Partitioning 403
Multi-Threading 403
Row Distribution 404
Row Merging 405
Row Redistribution 406
Data Pipelining 407
Consequences of Multi-Threading 408
Database Connections 408
Order of Execution 409
Parallel Execution in a Job 411
Using Carte as a Slave Server 411
The Configuration File 411
Defining Slave Servers 412
Remote Execution 413
Monitoring Slave Servers 413
Carte Security 414
Services 414
Clustering Transformations 417
Defining a Cluster Schema 417
Designing Clustered Transformations 418
Execution and Monitoring 420
Metadata Transformations 421
Rules 422
Data Pipelining 425
Partitioning 425
Defining a Partitioning Schema 425
Objectives of Partitioning 427
Implementing Partitioning 428
Internal Variables 428
Database Partitions 429
Partitioning in a Clustered Transformation 430
Summary 430
Chapter 17 Dynamic Clustering in the Cloud 433
Dynamic Clustering 433
Setting Up a Dynamic Cluster 434
Using the Dynamic Cluster 436
Cloud Computing 437
EC2 438
Getting Started with EC2 438
Costs 438
Customizing an AMI 439
Packaging a New AMI 442
Terminating an AMI 442
Running a Master 442
Running the Slaves 443
Using the EC2 Cluster 444
Monitoring 445
The Lightweight Principle and Persistence Options 446
Summary 447
Chapter 18 Real-Time Data Integration 449
Introduction to Real-Time ETL 449
Real-Time Challenges 450
Requirements 451
Transformation Streaming 452
A Practical Example of Transformation Streaming 454
Debugging 457
Third-Party Software and Real-Time Integration 458
Java Message Service 459
Creating a JMS Connection and Session 459
Consuming Messages 460
Producing Messages 460
Closing Shop 460
Summary 461
Part V Advanced Topics 463
Chapter 19 Data Vault Management 465
Introduction to Data Vault Modeling 466
Do You Need a Data Vault? 466
Data Vault Building Blocks 467
Hubs 467
Links 468
Satellites 469
Data Vault Characteristics 471
Building a Data Vault 471
Transforming Sakila to the Data Vault Model 472
Sakila Hubs 472
Sakila Links 473
Sakila Satellites 474
Loading the Data Vault: A Sample ETL Solution 477
Installing the Sakila Data Vault 477
Setting Up the ETL Solution 477
Creating a Database Account 477
The Sample ETL Data Vault Solution 478
Sample Hub: hub_actor 478
Sample Link: link_customer_store 480
Sample Satellite: sat_actor 483
Loading the Data Vault Tables 485
Updating a Data Mart from a Data Vault 486
The Sample ETL Solution 486
The dim_actor Transformation 486
The dim_customer Transformation 488
The dim_film Transformation 492
The dim_film_actor_bridge Transformation 492
The fact_rental Transformation 493
Loading the Star Schema Tables 495
Summary 495
Chapter 20 Handling Complex Data Formats 497
Non-Relational and Non-Tabular Data Formats 498
Non-Relational Tabular Formats 498
Handling Multi-Valued Attributes 498
Using the Split Field to Rows Step 499
Handling Repeating Groups 500
Using the Row Normaliser Step 500
Semi- and Unstructured Data 501
Kettle Regular Expression Example 503
Configuring the Regex Evaluation Step 504
Verifying the Match 507
Key/Value Pairs 508
Kettle Key/Value Pairs Example 509
Text File Input 509
Regex Evaluation 510
Grouping Lines into Records 511
Denormaliser: Turning Rows into Columns 512
Summary 513
Chapter 21 Web Services 515
Web Pages and Web Services 515
Kettle Web Features 516
General HTTP Steps 516
Simple Object Access Protocol 517
Really Simple Syndication 517
Apache Virtual File System Integration 517
Data Formats 517
XML 518
Kettle Steps for Working with XML 518
Kettle Job Entries for XML 519
HTML 520
JavaScript Object Notation 520
Syntax 521
JSON, Kettle, and ETL/DI 522
XML Examples 523
Example XML Document 523
XML Document Structure 523
Mapping to the Sakila Sample Database 524
Extracting Data from XML 525
Overall Design: The import_xml_into_db Transformation 526
Using the XSD Validator Step 528
Using the “Get Data from XML” Step 530
Generating XML Documents 537
Overall Design: The export_xml_from_db Transformation 537
Generating XML with the Add XML Step 538
Using the XML Join Step 541
SOAP Examples 544
Using the “Web services lookup” Step 544
Configuring the “Web services lookup” Step 544
Accessing SOAP Services Directly 546
JSON Example 549
The Freebase Project 549
Freebase Versus Wikipedia 549
Freebase Web Services 550
The Freebase Read Service 550
The Metaweb Query Language 551
Extracting Freebase Data with Kettle 553
Generate Rows 554
Issuing a Freebase Read Request 555
Processing the Freebase Result Envelope 556
Filtering Out the Original Row 557
Storing to File 558
RSS 558
RSS Structure 558
Channel 558
Item 559
RSS Support in Kettle 560
RSS Input 561
RSS Output 562
Summary 567
Chapter 22 Kettle Integration 569
The Kettle API 569
The LGPL License 569
The Kettle Java API 570
Source Code 570
Building Kettle 571
Building javadoc 571
Libraries and the Class Path 571
Executing Existing Transformations and Jobs 571
Executing a Transformation 572
Executing a Job 573
Embedding Kettle 574
Pentaho Reporting 574
Putting Data into a Transformation 576
Dynamic Transformations 580
Dynamic Template 583
Dynamic Jobs 584
Executing Dynamic ETL in Kettle 586
Result 587
Replacing Metadata 588
Direct Changes with the API 589
Using a Shared Objects File 589
OEM Versions and Forks 590
Creating an OEM Version of PDI 590
Forking Kettle 591
Summary 592
Chapter 23 Extending Kettle 593
Plugin Architecture Overview 593
Plugin Types 594
Architecture 595
Prerequisites 596
Kettle API Documentation 596
Libraries 596
Integrated Development Environment 596
Eclipse Project Setup 597
Examples 598
Transformation Step Plugins 599
StepMetaInterface 599
Value Metadata 605
Row Metadata 606
StepDataInterface 607
StepDialogInterface 607
Eclipse SWT 607
Form Layout 607
Kettle UI Elements 609
Hello World Example Dialog 609
StepInterface 614
Reading Rows from Specific Steps 616
Writing Rows to Specific Steps 616
Writing Rows to Error Handling 617
Identifying a Step Copy 617
Result Feedback 618
Variable Substitution 618
Apache VFS 619
Step Plugin Deployment 619
The User-Defined Java Class Step 620
Passing Metadata 620
Accessing Input and Fields 620
Snippets 620
Example 620
Job Entry Plugins 621
JobEntryInterface 622
JobEntryDialogInterface 624
Partitioning Method Plugins 624
Partitioner 625
Repository Type Plugins 626
Database Type Plugins 627
Summary 628
Appendix A The Kettle Ecosystem 629
Kettle Development and Versions 629
The Pentaho Community Wiki 631
Using the Forums 631
Jira 632
##pentaho 633
Appendix B Kettle Enterprise Edition Features 635
Appendix C Built-in Variables and Properties Reference 637
Internal Variables 637
Kettle Variables 640
Variables for Configuring VFS 641
Noteworthy JRE Variables 642
Index 643