Wednesday, July 21, 2010

Setup Kettle debugging in 2 minutes

Sometimes it is very useful to be able to debug a piece of software to better understand how it works internally. In my experience this is the case with the majority of open source projects: sooner or later the need will arise.

Today it happened with Kettle, while I was trying to verify whether some assumptions of mine were correct.

So how can we easily debug Kettle? The answer is simple: use remote debugging. I'll explain everything in a minute. The development tool I'm using is IntelliJ IDEA, but it is fairly simple to set everything up with Eclipse or NetBeans as well.

Setup Kettle to enable remote debugging

1) If you're on Windows, open Spoon.bat and add the following line

 set JAVA_OPTS=-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n  

On Linux, open spoon.sh and add the following line

 export JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n"  

I've used port 8000 as the socket port the debugger attaches to, but feel free to use the one you prefer.

2) Go to the last line of Spoon.bat (spoon.sh on Linux) and update that last line so that the JAVA_OPTS variable is passed to the Java invocation.

Windows (Spoon.bat):

 start javaw %JAVA_OPTS% %OPT% -jar launcher\launcher.jar -lib %LIBSPATH% %_cmdline%  

Linux (spoon.sh): add $JAVA_OPTS to the java launch line in the same position.
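For reference, the corresponding launch line on Linux might look like the following. This is only a sketch: the exact variable names ($JAVAEXEC, $OPT, $LIBPATH) and the argument forwarding depend on your Kettle version's spoon.sh, so adapt it to the line you actually find there.

```shell
# Hypothetical spoon.sh launch line; variable names may differ in your Kettle version.
# The only required change is inserting $JAVA_OPTS after the java executable.
$JAVAEXEC $JAVA_OPTS $OPT -jar launcher/launcher.jar -lib $LIBPATH "$@"
```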

Configure IntelliJ IDE for remote debugging

1) Configure a project for the Kettle source tree
2) Choose Build > Edit Configurations from the menu
3) The Run/Debug Configurations dialog opens. Choose Add New Configuration (the + sign in the upper left corner) and select Remote from the Add New Configuration list
4) The Remote Configuration dialog opens

Set Transport to Socket and Debugger mode to Attach. Then set the host address (localhost in this case, or wherever you need) and the port (8000 in my case) and press OK.

And now let's dance...

Now start Kettle. As soon as Kettle has started, load your job or transformation and, before running it, launch the debugger from your IDE. Once the debugger has attached you're ready to launch your job/transformation and begin your investigation. The advantage of this approach is that you can also debug remotely running Kettle instances, which is sometimes useful when investigating problems.

Have fun playing with your Kettle and see you in the next article.

Monday, July 19, 2010

CDF internationalization early access release

Today I finally completed the first implementation of the CDF internationalization support. It is my first big contribution to CDF and I'm proud I had the opportunity to work on such an important missing feature.
This is a first release, a sort of early access release, and it is available after building the framework sources from the repository trunk. I hope everyone will share ideas or suggestions on this feature, helping to make it better.

How does it work
CDF internationalization uses a jQuery i18n plugin to handle internationalized messages. The plugin has been wrapped in the CDF framework so that the user never has to bother with jQuery specifics but can simply use the support CDF provides. Everything revolves around resource message files, as usual on the Java platform.

To create resource files we need to create a file for each locale our dashboard is going to support. These files are named <name>.properties, <name>_<language>.properties or <name>_<language>_<country>.properties. For instance, a resource bundle for the UK would be named <name>_en_GB.properties. If a message file is not present, the framework will display the default message string and you'll get a warning in the BI server message console (nothing to be worried about, but it is better to provide the missing resource file).
The <language> argument is a valid ISO language code: the lower-case, two-letter codes defined by ISO-639. The <country> argument is a valid ISO country code: the upper-case, two-letter codes defined by ISO-3166. Full lists of these codes are available at a number of sites.
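As an example, here is a minimal pair of message files. The base name "messages" and the message key are made up for illustration; they follow the standard Java resource bundle conventions described above.

```properties
# messages.properties — default bundle, used when no locale-specific file matches
dashboard.title=Sales overview
```

```properties
# messages_it.properties — Italian translation of the same key
dashboard.title=Panoramica vendite
```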

How resource bundles are managed in CDF
We have two levels of message files in CDF
  • Global message files: located in <biserver_home>/pentaho_solutions/system/pentaho_cdf/resources/languages and useful when used in dashboard global templates.
  • Dashboard-specific message files: located in <dashboard_home>, they can be distributed together with a single dashboard in a sort of "package". To enable dashboard-specific message file support we have to add a <messages> element to the xcdf configuration file whose value is the base name of the dashboard-specific message file.
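A minimal .xcdf file using the <messages> element might look like the following sketch (the title and template values are made up for illustration; the messages value is the bundle base name, so here it refers to messages.properties and its localized variants in the dashboard folder):

```xml
<cdf>
   <title>Sales Dashboard</title>
   <template>salesDashboard.html</template>
   <!-- base name of the dashboard-specific resource bundle (illustrative) -->
   <messages>messages</messages>
</cdf>
```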

The CDF.i18n tag: internationalization support in HTML dashboard templates
To use the internationalization support in dashboard HTML templates we implemented a tag that fetches the message string from the resource bundle so it can be used wherever required in the HTML template file. This is useful for descriptions, annotations, generic labels and the like.

We can use the CDF.i18n tag using the following syntax:


where <message_key> is the message key located in the message resource bundle files.
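As a sketch, the tag is written directly into the HTML template, along these lines (treat the exact form as an assumption and check it against your CDF version; the key name is made up):

```html
<h1>CDF.i18n("dashboard.title")</h1>
```

At render time the framework replaces the tag with the string bound to that key in the active locale's bundle.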

Internationalization support in CDF components
Because of the early preview stage of this first release, for the moment only the jFreeChart component supports internationalization. In this case the support applies to the chart's title attribute.
To support an internationalized title in a jFreeChart component we added a new titleKey metadata attribute whose value is the message key for the chart title, as found in the resource bundle. Complete internationalization support for all the CDF components (where needed) will come shortly, so be patient.
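For illustration, a chart component definition using titleKey might look like the following sketch. Apart from titleKey itself, the component name, query and htmlObject values are made-up examples, and the surrounding metadata follows the usual CDF chart component layout:

```javascript
var salesChart = {
    name: "salesChart",
    type: "jFreeChartComponent",
    listeners: [],
    chartDefinition: {
        query: "select STATUS, sum(TOTALPRICE) from orderfact group by STATUS",
        // titleKey replaces a literal title and is resolved against the message bundle
        titleKey: "chart.sales.title"
    },
    htmlObject: "salesChartDiv",
    executeAtStart: true
};
```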

And here we go...
For anyone who wants to have a first look at this, the bi-developer solution in the repository trunk has a fully working example you can look at and experiment with. Below are two screenshots of the sample dashboard I internationalized in Italian. I hope you enjoy playing with this first release, and stay tuned here for the next updates from the field.

Monday, July 12, 2010

Joining and appending output datasets with CompoundDataAccess

An interesting topic about CDA is the ability to apply join and union constructs to our incoming datasets using a CompoundDataAccess element. This element is a sort of extension of the DataAccess element and, through a type attribute, gives us the ability to join or append the outputs of two given DataAccess datasets, forming a new one.

How does CompoundDataAccess work internally

It is really interesting to have a look at the sources and see how things really work:
  • when you want to join the two DataAccess outputs, CDA builds a little Kettle transformation and uses the Merge Join step to do the required join operation
  • when you want to append the two DataAccess outputs, CDA simply appends the two datasets by itself, following the order you specified in the configuration file.

Remember that in either case these operations consider only two datasets at a time for each CompoundDataAccess element definition.

CompoundDataAccess basics

To define a new CompoundDataAccess element you must specify
  • an id attribute, to externally identify this DataAccess element
  • a type attribute that accepts two values, union or join, depending on the type of composition you're looking for.
Because CompoundDataAccess is only an extension of the DataAccess element, we can apply the Parameter, Columns, CalculatedColumns and Output elements following the same rules specified for DataAccess in my previous post here.

Moreover, because the CompoundDataAccess element works on the resulting DataAccess datasets, we have to define the two needed DataAccess elements first. It is not possible (but it would be a nice-to-have, and I may consider implementing it) to use DataAccess elements defined in external CDA files.

Joining datasets output
To join two datasets we have to define a CompoundDataAccess element with the type attribute set to join. The join type executes a FULL OUTER JOIN on the two input datasets. Required elements are:

  • Left element definition, to specify a dataset for the left part of the join,
  • Right element definition,  to specify a dataset for the right part of the join.
Each of the two elements accepts the following attributes:

  1. id, set to the id of the DataAccess element definition you are going to consider as the left/right dataset
  2. keys, a comma separated list of columns to be considered as the join key. Remember that the key columns are specified by their position in the DataAccess query, not by name.

Below is an example to fully illustrate this concept. Suppose we defined the following two DataAccess definitions in our CDA file

   <DataAccess id="1" connection="1" type="sql" access="private" cache="true" cacheDuration="300">  
        <Name>Sql Query on SampleData - Jdbc</Name>  
        <Query>  
             select o.YEAR_ID, o.STATUS as status, sum(o.TOTALPRICE) as totalprice from orderfact o  
             group by o.YEAR_ID, o.STATUS  
        </Query>  
   </DataAccess>  
   <DataAccess id="2" connection="1" type="sql" access="public" cache="true" cacheDuration="5">  
        <Name>Sql Query on SampleData</Name>  
        <Query>  
             select o.YEAR_ID, o.status, sum(o.TOTALPRICE * 3) as tripleprice from orderfact o  
             where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}  
             group by o.YEAR_ID, o.STATUS  
             order by o.YEAR_ID DESC, o.STATUS  
        </Query>  
   </DataAccess>  

and suppose in the same cda file we define this CompoundDataAccess element.

      <CompoundDataAccess id="3" type="join">  
           <Left id="1" keys="0,1"/>  
           <Right id="2" keys="0,1"/>  
           <Parameters>  
                <Parameter name="status" type="String" default="Shipped"/>  
                <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>  
           </Parameters>  
           <Output indexes="0,1,2,5,6"/>  
      </CompoundDataAccess>  

As you can see, the CompoundDataAccess element has two child elements, Left and Right, to indicate the two sides of the join. The left one uses the DataAccess with id 1 (the first one above) and the right one the DataAccess with id 2. Then you see the comma separated keys list, which gives the positions of the key columns in the related DataAccess query. In this case, looking at the related DataAccess queries, you can see that both elements use the YEAR_ID and STATUS columns.

Appending dataset outputs
To append two datasets we have to define a CompoundDataAccess element with the type attribute set to union. The union type appends two DataAccess output datasets, with their order given by two configuration elements:
  • Top, to specify the first dataset,
  • Bottom, to specify the second dataset.

Each of the two elements accepts the following attribute:
  1. id, set to the id of the DataAccess element definition you are going to consider as the Top/Bottom dataset
Below is a configuration sample using a union compound. The DataAccess elements considered for this sample are the same ones defined in the previous sample for the join type.

   <CompoundDataAccess id="3" type="union">  
        <Top id="2"/>  
        <Bottom id="1"/>  
        <Parameters>  
             <Parameter name="year" type="Numeric" default="2004"/>  
        </Parameters>  
   </CompoundDataAccess>  

Friday, July 9, 2010

CDA configuration files basics (part 2)

Last time we started talking about the basics of CDA configuration. Today we'll continue our series of articles on CDA by looking more closely at DataAccess element configuration.

A little recap

CDA is a BI Platform plugin that allows fetching data in various formats. It works by defining a configuration file where we declare connections to our available datasources and a number of data access objects. The data access objects define the strategy used to access the data through a specified connection. In our previous post we talked briefly about how to configure Connections and gave a brief overview of the DataAccess element. Today we will complete our introduction to the basics of CDA DataAccess configuration.

Adding parameters to our DataAccess element

The Parameters element, nested into the DataAccess element, lets you define the set of parameters required to get the data from your connection. Every parameter is represented by a Parameter element which accepts at least the following attributes:

  • name, to specify the parameter name.
  • type, to specify the parameter type. Allowed types are String, Date, Integer, Numeric.
  • pattern, to specify a parse pattern that helps CDA interpret the parameter's value as inputted by the user; typical uses are date or numeric parameters.
  • default, to specify a default value for the parameter, assigned automatically every time we don't supply a value.

Below is a sample DataAccess definition showing the Parameters, Columns and Output elements

   <DataAccess id="1" connection="1" type="sql" access="public" cache="true" cacheDuration="300">  
        <Query>  
             select o.YEAR_ID, o.STATUS, sum(o.TOTALPRICE) as price from orderfact o  
             where o.STATUS = ${status} and o.ORDERDATE &gt; ${orderDate}  
             group by o.YEAR_ID, o.STATUS  
        </Query>  
        <Parameters>  
             <Parameter name="status" type="String" default="Shipped"/>  
             <Parameter name="orderDate" type="Date" pattern="yyyy-MM-dd" default="2003-03-01"/>  
        </Parameters>  
        <Columns>  
             <Column idx="0"><Name>Year</Name></Column>  
             <!-- formula shown for illustration -->  
             <CalculatedColumn><Name>PriceInK</Name><Formula>=[price]/1000</Formula></CalculatedColumn>  
        </Columns>  
        <!-- Output controls what is outputted and in what order. Defaults to everything -->  
        <Output indexes="1,0,2,3"/>  
   </DataAccess>  

For your reference I've also shown the query definition. Look at the Columns element declaration: we rename the column with ordinal 0 from "YEAR_ID" to "Year", using a Column element where we set the new name.
Then, using the CalculatedColumn element, we add a new calculated column named PriceInK.
Consider that:
  • New calculated columns are always added at the end of the column set
  • CDA uses the Pentaho Reporting libraries to apply formulas, which means you identify stream columns in your formula using the square bracket notation, and you can use all the constructs found in those libraries.
Last but not least, we changed the positional layout of the output columns with an Output element. In the indexes attribute we set the column positions using column indexes. You have to consider two things here

    • The total column count is 4 because three columns come from the query and one is the calculated column appended at the end of the stream
    • The index of the first (leftmost) column is 0, NOT 1