Kamil Nešetřil

Environmental data management and analytics

User Tools

Site Tools


Pentaho Data Integration Cheat Sheet

This is a short guideline for Kettle: Pentaho Data Integration (PDI) – mainly with Spoon – the development environment. First read general information about Pentaho platform and PDI.

Since Hitachi acquired Pentaho, the development of the Pentaho platform has stagnated, and the platform's website is confusing. The original author of PDI (Matt Casters) created Apache Hop, a fork of PDI that finally has elegant architecture and is being intensively developed. PDI transformations can be imported into it.

How to start

  1. Install Java (64-bit)1).
  2. Unzip the file to the folder of your choice.
  3. Run Spoon.bat. In case of problems: video
  4. Go to the Design tab.
  5. Drag and drop items from the left bar to the canvas.

Some steps used in transformations

Symbol Name Description
Text file input Use for CSV also (not CSV file input that cannot process the whole folder).
Other steps for data input and output from/to databases, other sources (e-mail, local computer, FTP, HTTP), and files (MS Excel, MS Access, ESRI SHP, XML, JSON, YAML, RSS, dBase, ZIP, etc.)
Text file output Can set huge length and return an error. Solution: do not define length.
Table output
Microsoft Excel Writer
Filter rows For multiple options, use Switch-Case.
Formula More functions than Calculator.
Calculator Faster than Formula.
Group by
Select values
Sort rows Also an option: Only pass unique rows?
Replace in string
Split Fields
Stream lookup To join two streams (tables) without the need to sort them.
Row Normaliser Type field (name of the new column of categories)
Fieldname (input header)
Type (values of input categories)
New field (output header of values) – needs to be one value for all
Row denormaliser Key – input categories.
The key field (name of the input colums with categories),
Group field (what identifies the whole future row – e.g. filename),
Target fieldname = Key value (single categories),
Value fieldname (name of the input column with values)
More: Microsoft Power Query for Excel
Set Variables In other transformations, this variable can be used as a variable or as a parameter. The parameter can have a default value (taken into effect if the variable is not defined).
ETL Metadata Injection To control the transformations. Combine with Transformation Executor.
Best practices.
Matt Casters: Parse nasty XLS with dynamic ETL
At the end of the article is an example including source codes.
Alternative: run the transformation in job and check Execute every input rowvideo.
Transformation Executor Every row runs a new transformation.
Add Constants
Analytic Query To involve data from multiple rows. Aggregation.
Mail
Modified Java Script Value
User Defined Java Expression
Pentaho Reporting Output Feed and create reports designed in PRD.
Add sequence
Regex evaluation Regular expressions. My examples below.
Dummy (do nothing) Useful for merging streams or to see result of some step (e.g. Filter rows).

Other steps:

Often used job entries

Regular expressions

Table Selection of the input files (regex corresponds to the file name)

Description Reguar expression
.xlsx files .*\.xlsx
All files.*
Files starting with facts facts.*

Table Select part of a text string

Description Regular expression Input Output
Between brackets
.*\((.*)\).*
stanice: ČK (9 m n.m) 9 m n.m
Up to ,,; or .([^\s?:(?!;).]+).*
Up to ,, ;, . or similar([^\s]+).*

Tips and tricks

  • Empty rows in GUI dialogs cause errors.
  • Manage errors in separate streams. The first step in a transformation cannot deal with error rows because they do not exist in PDI yet.
  • Use ETL metadata injection step for more complex transformations (see above).
  • Use variables and parameters (see above)
  • Use relative paths (${Internal.Entry.Current.Directory})
  • Check Date Format Lenient or Lenient number conversion if data type is not resolved properly or returns error.
  • To export to e.g. SQLite, first create the table by SQL and then load data there (even in the same transformation because scripts are executed first).
  • Automatically source Metadata for ETL Metadata Injection (automated from different Excel spreadsheets)
  • kettle-cookbook Automated documentation
  • Best Practices – detailed pdf
  • Video history of PDI
    • When I start Spoon.bat in a Windows environment nothing happens. How can I solve it?
      • Edit the Spoon.bat file and:
        • Replace in the last line start javaw with only java.
        • Add a pause in the next line.
        • Save and try it again.
    • How to use JNDI?
      • If you look inside the PDI main directory you'll see a sub-directory called simple-jndi, which contains a file named jdbc.properties. You should change this file, so the JNDI information matches the one you use in your application server.

References

ROLDÁN, María Carina, 2017. Learning Pentaho Data Integration 8 CE : Third Edition. Packt Publishing. ISBN 978-1-78829-007-4.

1)
Open-source Java is available if needed.
If you cannot install it, use the portable version.
When Java gets updated then JAVA_HOME path in Windows needs to be updated accordingly.
Last modified: 2024-09-27