Talend Spark Kudu Components

Introduction

Three components have been created which allow Talend users to create Big Data Batch jobs with Spark using Kudu:

  1. tKuduConfiguration
  2. tKuduInput
  3. tKuduOutput

This component set lets you connect to Kudu and read from and write to Kudu tables.

tKuduConfiguration

This component lets you set up the connection details for the Kudu server. Currently, the only settings available are the servers and ports to connect to.

This component is then used internally by the two other components, tKuduInput and tKuduOutput, which cannot operate without it.
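
As a rough illustration of what "servers and ports" amounts to, the sketch below joins a list of servers into the comma-separated `host:port` form that Kudu clients commonly accept for master addresses. The class and method names are hypothetical, and how tKuduConfiguration stores or passes these values internally is not documented here. The fallback port 7051 is Kudu's default master RPC port.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Talend components or the Kudu client.
final class MasterAddressSketch {

    // Kudu's default master RPC port, used when an entry gives no port.
    static final int DEFAULT_MASTER_PORT = 7051;

    // Builds a comma-separated "host:port" list from the configured servers.
    // Assumption: entries are host names or IPv4 addresses (no IPv6 literals).
    static String join(List<String> servers) {
        List<String> addresses = new ArrayList<>();
        for (String server : servers) {
            String trimmed = server.trim();
            if (trimmed.isEmpty()) {
                continue; // ignore blank entries
            }
            addresses.add(trimmed.contains(":") ? trimmed : trimmed + ":" + DEFAULT_MASTER_PORT);
        }
        if (addresses.isEmpty()) {
            throw new IllegalArgumentException("At least one Kudu server is required");
        }
        return String.join(",", addresses);
    }

    public static void main(String[] args) {
        // The host names below are placeholders.
        System.out.println(join(Arrays.asList("kudu-master-1", "kudu-master-2:7151")));
    }
}
```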

tKuduInput

This component reads data from a specified Kudu table in the form of an RDD. It can either scan the whole table or filter its rows using Kudu predicates.

The supported predicates are:

1. EQUALS
2. GREATER
3. GREATER EQUALS
4. LESS
5. LESS EQUALS
6. IS NULL
7. IS NOT NULL

It also lets you limit the number of output rows.
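
To make the combination of predicates and a row limit concrete, here is a minimal in-memory sketch of a filtered, limited scan over a single integer column. It is not the Kudu client API and not the code the component generates. All names are hypothetical, and the handling of null values in comparisons is an assumption modelled on the usual SQL behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory model only; every name here is hypothetical.
final class ScanSketch {

    // The seven predicate types listed above.
    enum Op { EQUALS, GREATER, GREATER_EQUALS, LESS, LESS_EQUALS, IS_NULL, IS_NOT_NULL }

    // Returns true if a single column value satisfies the predicate.
    // The operand is ignored for IS_NULL and IS_NOT_NULL.
    static boolean matches(Integer value, Op op, Integer operand) {
        if (op == Op.IS_NULL) return value == null;
        if (op == Op.IS_NOT_NULL) return value != null;
        if (value == null) return false; // assumption: null never satisfies a comparison
        int c = value.compareTo(operand);
        switch (op) {
            case EQUALS:         return c == 0;
            case GREATER:        return c > 0;
            case GREATER_EQUALS: return c >= 0;
            case LESS:           return c < 0;
            case LESS_EQUALS:    return c <= 0;
            default:             throw new IllegalArgumentException("Unhandled predicate: " + op);
        }
    }

    // Keeps the values that satisfy the predicate and stops after `limit` matches.
    static List<Integer> scan(List<Integer> column, Op op, Integer operand, int limit) {
        List<Integer> out = new ArrayList<>();
        for (Integer value : column) {
            if (out.size() >= limit) break;
            if (matches(value, op, operand)) out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(17, 42, null, 30, 65);
        System.out.println(scan(ages, Op.GREATER_EQUALS, 30, 2)); // prints [42, 30]
        System.out.println(scan(ages, Op.IS_NULL, null, 10));     // prints [null]
    }
}
```

In this model the limit counts rows after filtering, which matches the description of a limit on the output rows.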

Simple Example Job

This job uses two configuration components: tHDFSConfiguration_1 sets up a connection to HDFS, and tKuduConfiguration_1 sets up a connection to the Kudu server. The tKuduInput_1 component then reads data from a Kudu table.

The output is then consumed by a tLogRow component, although any other Spark batch component could consume it instead.

tKuduOutput

This component supports the following functionality:

1. Create (or re-create) a table, with two types of partitions on the primary keys:
   a. Range partitions (currently only a single column is supported)
   b. Hash partitions
2. Perform write operations on a table:
   a. Insert – insert new records
   b. Delete – delete existing records
   c. Update – update existing records
   d. Upsert – update the record or, if it does not exist, insert a new one
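
The difference between the four write operations can be sketched with a small in-memory table keyed by primary key. This is only a model of the semantics described in the list above, not the Kudu client API. The class is hypothetical, and how tKuduOutput reports a rejected row (for example, an insert on an existing key) is not covered here.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory model of a table with an integer primary key; names are hypothetical.
final class WriteOpsSketch {

    private final Map<Integer, String> rows = new TreeMap<>();

    // Insert: adds a new record; rejected if the key already exists.
    boolean insert(int key, String value) {
        if (rows.containsKey(key)) return false;
        rows.put(key, value);
        return true;
    }

    // Delete: removes an existing record; rejected if the key is missing.
    boolean delete(int key) {
        if (!rows.containsKey(key)) return false;
        rows.remove(key);
        return true;
    }

    // Update: changes an existing record; rejected if the key is missing.
    boolean update(int key, String value) {
        if (!rows.containsKey(key)) return false;
        rows.put(key, value);
        return true;
    }

    // Upsert: updates the record if it exists, otherwise inserts it.
    void upsert(int key, String value) {
        rows.put(key, value);
    }

    // Runs one of each operation and returns the final table contents.
    static String demo() {
        WriteOpsSketch table = new WriteOpsSketch();
        table.insert(1, "alice");     // accepted: key 1 is new
        table.insert(1, "duplicate"); // rejected: key 1 already exists
        table.update(2, "bob");       // rejected: key 2 does not exist yet
        table.upsert(2, "bob");       // inserts key 2
        table.upsert(1, "alicia");    // updates key 1
        table.delete(3);              // rejected: key 3 does not exist
        return table.rows.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {1=alicia, 2=bob}
    }
}
```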

Limitations

The components have some known limitations. The most important one is the lack of support for the Date data type: Talend internally uses java.util.Date in its Avro records, and the components cannot convert this to the Kudu timestamp data type.
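
For context, a Kudu timestamp is a 64-bit count of microseconds since the Unix epoch, while java.util.Date holds milliseconds. The sketch below shows that conversion in both directions. It is a possible workaround under an assumption, not a feature of the components: the date would have to be carried as a Long column in the Talend schema and converted in a separate step before or after the Kudu components. The class and method names are hypothetical.

```java
import java.util.Date;

// Hypothetical helper, not part of the Talend components.
final class TimestampSketch {

    // Converts a java.util.Date (milliseconds since the epoch) to microseconds since the epoch.
    static long toUnixMicros(Date date) {
        return Math.multiplyExact(date.getTime(), 1000L);
    }

    // Converts microseconds since the epoch back to a Date.
    // Sub-millisecond precision is dropped, rounding toward negative infinity.
    static Date fromUnixMicros(long micros) {
        return new Date(Math.floorDiv(micros, 1000L));
    }

    public static void main(String[] args) {
        Date date = new Date(1500L); // 1.5 seconds after the epoch
        long micros = toUnixMicros(date);
        System.out.println(micros);                                // prints 1500000
        System.out.println(fromUnixMicros(micros).equals(date));   // prints true
    }
}
```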

Supported Versions

These components have been tested with Talend 6.4.1, but they should also work with Talend 6.5.1.

Installation

These components can be installed in Talend Studio in the following way:

1. Unpack the provided zip file into a local folder
2. Start Talend Studio
3. Go to Menu “Window” -> “Preferences”
4. Type “Component” in the search field at the top left of the Preferences dialogue

5. Click on “Components” in the tree on the left panel
6. Enter the path to the folder from step 1 in the field labelled “User component folder:”
7. Click on “Talend Component Designer” in the tree on the left panel
8. Enter the path to the folder from step 1 in the field labelled “Component project:”
9. Click the “OK” button