Learning Pentaho Data Integration 8 CE(Third Edition)
上QQ阅读APP看书,第一时间看更新

Putting the editing features in practice

You were just introduced to the basic editing features. Now we will create a new Transformation and put all that into practice. At the same time, you will learn to use new PDI steps.

The prerequisite for this exercise is to have a file with the source data. So, open your text editor, create a new file, and type the following:

project_name,start_date,end_date
Project A,2016-01-10,2016-01-25
Project B,2016-04-03,2016-07-21
Project C,2017-01-15,???
Project D,2015-09-03,2015-12-20
Project E,2016-05-11,2016-05-31
Project F,2011-12-01,2013-11-30

Save the file as projects.csv and proceed with the PDI work.

This new Transformation will read the list of projects from the file, and then it will calculate the time that it took to complete each project.

First of all, we will read the file so its content becomes our input dataset. Here are the instructions:

  1. Start Spoon.
  2. From the main menu, navigate to File | New | Transformation.
  3. Expand the Input branch of the Steps tree. Remember that the Steps tree is located in the Design tab to the left of the work area.
  4. Drag and drop the CSV file input icon on the canvas.

At any time, feel free to use a different method for adding steps or creating hops.

  1. Double-click on the CSV file input icon and enter projects in the Step name field.
  2. Under Filename, type the full path to your project's file, as, for example, D:/PentahoLearningSamples/files/projects.csv. Alternatively, click on Browse... and look for the file on your disk.
  3. Fill in the grid, as shown in the following screenshot:

Configuring the fields of a CSV file input step

In the previous chapter, you tried the Preview functionality for the first time. This feature allowed you to get an idea of how the data was being transformed. Inside an Input step, the functionality allows you to see how the input data looks. Let's try it now:

  1. Click on Preview, and in the small window that appears, click on OK to see the defined dataset. You should see a preview window with the six rows of data coming from the file:

Previewing input data

If you don't see the same results, go back to the instructions and verify that you configured the step properly.

  1. Close the window.

Now that we have some data to work with, we have to calculate the time that each project took. For doing that, let's introduce the Calculator step. This step allows you to create new fields by applying some calculations to the source data. In this case, we will use it for calculating the difference between two dates. In order to do that, we will have to provide the following:

  • A name for the new field: diff_dates 
  • The calculation to apply, which will be the difference between dates (in days)
  • The parameters for the calculation: start_date and end_date 
  • The type for the result: Integer

Proceed as follows:

  1. Expand the Transform branch of steps. Look for the Calculator step and drag and drop it to the work area.
  2. Create a hop from the CSV file input step towards the Calculator step by using any of the methods explained. A tiny menu will appear prompting you for the kind of hop. Among the options, select the Main output of step.

Make sure you create the hop as explained. If you don't do it, the fields will not be available in the next dialog window.

  1. Double-click on the Calculator step and fill the first row in the grid with the following information:

Configuring a Calculator step

  1. Click on OK to close the window.

Instead of using the Calculator step, it is also possible to do the math with the User Defined Java Expression step.

The main difference between a Calculator and a UDJE step is that while the Calculator has a list of predefined formulas, the UDJE allows you to write your own expressions using Java code.

Feel free to experiment and compare both alternatives: One using a Calculator and another using a UDJE.

For using a UDJE step in this exercise, you can enter the following Java expression that does the calculation:(dateB.getTime() - dateA.getTime())/ (1000 * 60 * 60 * 24)

Finally, we will evaluate the performance of the project:

  1. Add a new step, Number Range, and link the Calculator step to the Number range step with a new hop. Make sure that the arrow goes from the Calculator step toward the Number range step and not the other way.

If you have difficulty in finding a step, you can type the search criteria in the text box on top of the Steps tree. PDI will filter and show only the steps that match your search.

  1. With the Number ranges step, you will create a new field, performance, based on the value of an incoming field, diff_dates. Double-click on the step and fill in the grid as shown in the following screenshot. Then click on OK:

Configuring a Number ranges step

  1. Now from the Scripting branch, add a User Defined Java Expression step, and create a hop from the Number range step towards this new step. When you create the hop, you will be prompted for the kind of hop. Select Main output of step.

If you unintentionally select the wrong option, don't worry. Right-click on the hop and a contextual menu will appear. Select Delete hop and create the hop again.

  1. With the UDJE, you will create two informative messages: duration and message. As in the Calculator step, this step also allows you to create a new field per row. Double-click on the step and fill in the grid as shown in the following screenshot:

 Configuring a UDJE step

  1. Click on OK to close the window and save the Transformation. Your final Transformation should look like the following screenshot:

Transformation of projects