HgIS

Environmental data management and analysis


Pentaho Data Integration Cheat Sheet

This is a short guide to Pentaho Data Integration (PDI), mainly to Spoon, its development environment. First read the general information about the Pentaho platform and PDI.

How to start

  1. Install Java (64-bit; see footnote 1).
  2. Unzip the downloaded PDI archive to a folder of your choice.
  3. Run Spoon.bat.
  4. Go to the Design tab.
  5. Drag and drop items from the left bar to the canvas.

Some steps used in transformations

Name | Description
Text file input | Use it for CSV files as well (CSV file input cannot process a whole folder).
(Other input and output steps) | Steps for data input and output from/to databases, other sources (e-mail, local computer, FTP, HTTP) and files (MS Excel, MS Access, ESRI SHP, XML, JSON, YAML, RSS, dBase, ZIP etc.).
Text file output | May set a huge field length and return an error. Solution: do not define the length.
Table output |
Microsoft Excel Writer |
Filter rows | For multiple options use Switch-Case.
Formula | Offers more functions than Calculator.
Calculator | Faster than Formula.
Group by |
Select values |
Sort rows | Also offers the option "Only pass unique rows?".
Replace in string |
Split Fields |
Stream lookup | Joins two streams (tables) without the need to sort them.
Row Normaliser | Type field (name of the new column of categories); Fieldname (input header); Type (values of the input categories); New field (output header of the values), which must be the same value for all rows.
Row denormaliser | Key: the input categories. The key field (name of the input column with categories); Group field (what identifies the whole future row, e.g. filename); Target fieldname = Key value (single categories); Value fieldname (name of the input column with values). More: Microsoft Power Query for Excel.
Set Variables | In other transformations the variable can be used as a variable or as a parameter. A parameter can have a default value (which takes effect if the variable is not defined).
ETL Metadata Injection | Controls other transformations. Combine it with Transformation Executor. Best practices. Matt Casters: Parse nasty XLS with dynamic ETL (an example including source code is at the end of the article).
Transformation Executor | Runs a new transformation for every row.
Add Constants |
Analytic Query | Involves data from multiple rows. Aggregation.
Mail |
Modified Java Script Value |
User Defined Java Expression |
Pentaho Reporting Output | Feeds and creates reports designed in PRD.
Add sequence |
Regex evaluation | Regular expressions. My examples are below.
Dummy (do nothing) | Useful for merging streams or for inspecting the result of a step (e.g. Filter rows).
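
The Row Normaliser and Row denormaliser settings listed above can be easier to follow on a small data set. The following is a minimal Python sketch of the two reshaping operations, not PDI code. The helper functions normalise and denormalise and the column names (filename, parameter, value, temperature, rainfall) are hypothetical and serve only as an illustration.

```python
def normalise(rows, group_field, value_fields, type_field, new_field):
    """Wide -> long: one output row per (input row, value field)."""
    out = []
    for row in rows:
        for name in value_fields:
            out.append({
                group_field: row[group_field],
                type_field: name,       # category taken from the input header
                new_field: row[name],   # value that was stored under that header
            })
    return out


def denormalise(rows, group_field, key_field, value_field):
    """Long -> wide: one output row per group, one column per key value."""
    grouped = {}
    for row in rows:
        group = row[group_field]
        target = grouped.setdefault(group, {group_field: group})
        target[row[key_field]] = row[value_field]
    return list(grouped.values())


# Hypothetical input: one row per file, one column per measured parameter.
wide = [
    {"filename": "a.csv", "temperature": 12.5, "rainfall": 3.0},
    {"filename": "b.csv", "temperature": 9.1, "rainfall": 0.0},
]

long_rows = normalise(wide, "filename", ["temperature", "rainfall"],
                      "parameter", "value")
wide_again = denormalise(long_rows, "filename", "parameter", "value")
```

In this sketch "parameter" plays the role of the Type field (and later the key field), "value" is the New field (and later the Value fieldname), and "filename" is the Group field.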

Other steps are available in the Marketplace.

Often used job entries

Regular expressions

Table: Selection of the input files (the regex is matched against the file name)

Description | Regular expression
.xlsx files | .*\.xlsx
All files | .*
Files starting with facts | facts.*
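
The file-selection patterns can be tried outside Spoon before using them in a step. The sketch below uses Python's re module and assumes that the pattern has to match the whole file name; that behaviour, the helper select_files and the sample file names are assumptions made for illustration, not something taken from PDI itself.

```python
import re

# Patterns from the table of input file selections.
PATTERNS = {
    "xlsx": r".*\.xlsx",
    "all": r".*",
    "facts": r"facts.*",
}


def select_files(names, pattern):
    """Return the names that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


# Hypothetical folder content.
names = ["facts_2019.csv", "stations.xlsx", "notes.txt", "artifacts.xlsx"]

xlsx_files = select_files(names, PATTERNS["xlsx"])
all_files = select_files(names, PATTERNS["all"])
facts_files = select_files(names, PATTERNS["facts"])
```

With whole-name matching, facts.* does not select artifacts.xlsx, because the name has to begin with facts.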

Table: Selecting part of a text string

Description | Regular expression | Input | Output
Between brackets | .*\((.*)\).* | stanice: ČK (9 m n.m) | 9 m n.m
Up to the first whitespace, ?, :, (, !, ;, ) or . | ([^\s?:(?!;).]+).* | |
Up to the first whitespace | ([^\s]+).* | |
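
The extraction patterns above rely on a capture group: the part in parentheses is what ends up in the output field. The following Python sketch checks them with the re module. It assumes that the whole input string must match the pattern and that the first capture group is the result; the helper extract and all inputs except the bracket example are hypothetical.

```python
import re


def extract(pattern, text):
    """Return the first capture group if the whole text matches, else None."""
    match = re.fullmatch(pattern, text)
    return match.group(1) if match else None


# Example from the table: the text between the brackets.
between_brackets = extract(r".*\((.*)\).*", "stanice: ČK (9 m n.m)")

# Hypothetical inputs for the other two patterns.
up_to_delimiter = extract(r"([^\s?:(?!;).]+).*", "temperature;max")
up_to_whitespace = extract(r"([^\s]+).*", "Praha-Libuš 2019")

# No brackets in the input, so the pattern does not match at all.
no_match = extract(r".*\((.*)\).*", "stanice: ČK")
```

Inside a character class such as [^\s?:(?!;).] the characters ?, :, (, !, ) and . are literals, so the class simply lists the characters at which the captured part stops.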

Tips and tricks

  • Empty rows in GUI dialogs cause errors.
  • Handle errors in separate streams. The first step in a transformation cannot handle error rows because the rows do not exist in PDI yet.
  • Use the ETL Metadata Injection step for more complex transformations (see above).
  • Use variables and parameters (see above).
  • Use relative paths (${Internal.Entry.Current.Directory}).
  • Check Date Format Lenient or Lenient number conversion if a data type is not resolved properly or the conversion returns an error.
  • To export to e.g. SQLite, first create the table with SQL and then load the data into it (this works even within the same transformation, because scripts are executed first).
  • Automatically source Metadata for ETL Metadata Injection (automated from different Excel spreadsheets)
  • kettle-cookbook Automated documentation
  • Best Practices – detailed pdf
  • Video history of PDI
  • When I start Spoon.bat in a Windows environment nothing happens. How can I solve it?
    • Edit the Spoon.bat file:
      • In the last line, replace start javaw with just java.
      • Add a pause command on the next line.
      • Save the file and try again.
  • How to use JNDI?
    • The PDI main directory contains a sub-directory called simple-jndi with a file named jdbc.properties. Change this file so that the JNDI information matches the one used in your application server.

References

ROLDÁN, María Carina, 2017. Learning Pentaho Data Integration 8 CE : Third Edition. Packt Publishing. ISBN 978-1-78829-007-4.

1) 64-bit Java is required.
If you need open-source Java, use https://jdk.java.net/12/.
If you cannot install it, use a portable version.
en/cheatsheet.txt · Last modified: 2019-09-15