Open Source ETL tools vs Commercial ETL tool


The ETL tools are evaluated on the following categories:

Infrastructure       | Functionality        | Usability
Platforms supported  | Debugging facilities | Data Quality / profiling
Performance          | Future prospects     | Reusability
Scalability          | Batch vs Real-time   | Native connectivity


Figure 1: Simple schematic for a data warehous...

Pentaho Kettle vs Talend


Pentaho
  1. Pentaho is a commercial open-source BI suite that has a product called Kettle for data integration.
  2. It uses an innovative metadata-driven approach and has a strong and very easy-to-use GUI.
  3. The company started around 2001 (2002 was when Kettle was integrated into it).
  4. It has a strong community of 13,500 registered users.
  5. It has a stand-alone Java engine that processes the jobs and tasks for moving data between many different databases and files.
  6. It can run tasks on a schedule, but it needs an external scheduler such as cron for that.
  7. It can run remote jobs on "slave servers" on other machines.
  8. It has data quality features: from its own GUI, or by writing more customised SQL queries, JavaScript and regular expressions.
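
To make the scheduling point concrete, here is a minimal sketch of how a cron entry for a Kettle job could be put together. It assumes Kettle's command-line job runner, Kitchen, is installed as `kitchen.sh` and is called with its `-file` and `-level` options. The install path, job file and log file below are hypothetical, and the helper functions are only for illustration; they are not part of Pentaho.

```python
import shlex


def kitchen_command(kitchen_path, job_file, log_level="Basic"):
    """Build the argument list that runs one Kettle job file with Kitchen."""
    return [kitchen_path, f"-file={job_file}", f"-level={log_level}"]


def crontab_line(schedule, command, log_file):
    """Format a crontab entry: five-field schedule, command, output appended to a log."""
    if len(schedule.split()) != 5:
        raise ValueError("cron schedule needs five fields: minute hour day month weekday")
    quoted = " ".join(shlex.quote(part) for part in command)
    return f"{schedule} {quoted} >> {shlex.quote(log_file)} 2>&1"


# Hypothetical paths: adjust to wherever Kettle and the job file actually live.
cmd = kitchen_command("/opt/kettle/kitchen.sh", "/etl/jobs/nightly_load.kjb")
line = crontab_line("0 2 * * *", cmd, "/var/log/etl/nightly_load.log")
print(line)
```

The printed line can be pasted into a crontab (`crontab -e`) to run the job every night at 02:00. Kettle itself only executes the job; cron decides when.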


Talend
  1. Talend is an open-source data integration tool (not a full BI suite).
  2. It uses a code-generating approach. It has a GUI, but one that runs within Eclipse RCP.
  3. It started around October 2006.
  4. It has a much smaller community than Pentaho but has 2 finance companies supporting it.
  5. It generates Java or Perl code which you later run on your server.
  6. It can run tasks on a schedule (also by using an external scheduler such as cron).
  7. It has data quality features: from its own GUI, writing more customised SQL queries and Java.


Comparison - (from my understanding)
  • Pentaho is faster than Talend (perhaps twice as fast).
  • Pentaho's GUI is easier to use than Talend's GUI and takes less time to learn.


My impression
Pentaho is easier to use because of its GUI.
Talend is more a tool for people who are already writing a Java program and want to save a lot of time by letting a tool generate code for them.



Assuming Pentaho made it to the next round....

Pentaho Kettle vs Informatica

Informatica
  1. Informatica is a very good commercial data integration suite.
  2. It was founded in 1993.
  3. It is the market share leader in data integration (Gartner Dataquest).
  4. It has 2600 customers. Among them are Fortune 100 companies, companies on the Dow Jones and government organizations.
  5. The company's sole focus is data integration.
  6. It offers quite a big package for enterprises to integrate their systems and cleanse their data, and it can connect to a vast number of current and legacy systems.
  7. It's very expensive, will require training some of your staff to use it and will probably require hiring consultants as well. (I hear Informatica consultants are well paid.)
  8. It's very fast and can scale for large systems. It has "Pushdown Optimization", an ELT approach that uses the source database to do the transforming, like Oracle Warehouse Builder.
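
The difference between the classic ETL approach and the ELT idea behind pushdown can be sketched in a few lines. This is only an illustration of the general technique, not Informatica's implementation: it uses Python's built-in `sqlite3`, assumes that source and target tables live in the same database, and all table and column names are made up.

```python
import sqlite3

SOURCE_ROWS = [(1, 1250, "nl"), (2, 990, "de"), (3, 0, "nl")]


def new_database():
    """Create an in-memory database with a hypothetical source and target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE src_orders (id INTEGER, amount_cents INTEGER, country TEXT)")
    conn.execute("CREATE TABLE dw_orders (id INTEGER, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO src_orders VALUES (?, ?, ?)", SOURCE_ROWS)
    return conn


def load_etl(conn):
    """ETL style: pull every row out, transform in the tool's engine, write back."""
    rows = conn.execute("SELECT id, amount_cents, country FROM src_orders").fetchall()
    transformed = [(i, cents / 100.0, c.upper()) for i, cents, c in rows if cents > 0]
    conn.executemany("INSERT INTO dw_orders VALUES (?, ?, ?)", transformed)


def load_pushdown(conn):
    """ELT style: express the same transformation as SQL and let the database run it."""
    conn.execute(
        "INSERT INTO dw_orders "
        "SELECT id, amount_cents / 100.0, UPPER(country) "
        "FROM src_orders WHERE amount_cents > 0"
    )


def target_rows(conn):
    return conn.execute("SELECT id, amount, country FROM dw_orders ORDER BY id").fetchall()


etl_conn, elt_conn = new_database(), new_database()
load_etl(etl_conn)
load_pushdown(elt_conn)
etl_result, pushdown_result = target_rows(etl_conn), target_rows(elt_conn)
print(etl_result == pushdown_result)
```

Both loads produce the same target rows. The difference is where the work happens: in the first, every row travels to the tool and back; in the second, no row leaves the database, which is why the approach tends to pay off on large tables.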


Comparison
  • Pentaho's JavaScript is very powerful when writing transformation tasks.
  • Informatica has many more enterprise features, for example, load balancing between database servers.
  • Pentaho's GUI requires less training than Informatica's.
  • Pentaho doesn't require huge upfront costs as Informatica does. (That part you saw coming, I'm sure.)
  • (edited) Informatica is faster than Pentaho. Informatica has Pushdown Optimization, but with some tweaking to Pentaho and some knowledge of the source database, you can improve the speed of Pentaho. (Also see the next point.)
  • (new) You can place Pentaho Kettle on many different servers (as many as you like, it's free) and use it as a cluster.
  • Informatica has much better monitoring tools than Pentaho.
