A review of the new Talend Data Stewardship Console 6.3.1
Talend’s Data Stewardship Console is a web-based application which is part of Talend’s commercial data quality product suite (see https://www.talend.com/products/data-quality/).
This is how Talend documentation describes this product:
Talend Data Stewardship is a comprehensive tool you can use to manage data assets. It organizes data interactions whenever human intervention is required, for example, to do arbitration, review, merge or cleanse data.
Three basic concepts are key to working with this product:
✓ Campaigns
✓ Tasks
✓ Data models
The user interface displays these three concepts nicely in its navigation:
The next chapters give a basic introduction to these concepts.
Tasks can be assigned to users and have a type and a state.
Table 1 shows the main task types supported by the Talend Stewardship Console.
|Table 1. Data Stewardship Tasks|
|Task type|Description|
|Arbitration|This type of task is based on a question and a data set. Typically, the data steward needs to filter records meeting a specific criterion. The steward classifies the data manually.|
|Resolution|The steward repairs the content of fields in a data set or adds missing data to empty fields.|
|Merging|The steward merges several potential duplicate records into one single record.|
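To make the vocabulary concrete, a task can be pictured as a small record carrying a type, a state and an optional assignee. The sketch below is illustrative only: the class names, the fields and the state names (`NEW`, `IN_PROGRESS`, `RESOLVED`) are assumptions made for this example, not Talend's actual data structures.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TaskType(Enum):
    # The three task types listed in Table 1.
    ARBITRATION = "arbitration"
    RESOLUTION = "resolution"
    MERGING = "merging"


class TaskState(Enum):
    # Hypothetical state names, chosen for this sketch only.
    NEW = "new"
    IN_PROGRESS = "in_progress"
    RESOLVED = "resolved"


@dataclass
class Task:
    task_type: TaskType
    record: dict
    state: TaskState = TaskState.NEW
    assignee: Optional[str] = None

    def assign(self, user: str) -> None:
        """Assign the task to a steward and mark it as being worked on."""
        self.assignee = user
        self.state = TaskState.IN_PROGRESS


def tasks_for(user: str, tasks: list) -> list:
    """Return the tasks currently assigned to the given user."""
    return [t for t in tasks if t.assignee == user]
```

Filtering with `tasks_for` mirrors what the console does when it shows a steward only the tasks assigned to them.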
In the following screenshot you can see merging tasks assigned to a user:
To execute the above-mentioned tasks, you first need to declare the structure of your data: the data model. The data model is a container of attributes with specific properties.
The user interface of the Talend Stewardship console lists the names of the attributes on the left side. On the right side of the interface you have the properties of each element.
Each element can be associated with an enumeration of values or a validation pattern.
Note: It appears that a user cannot create complex data types (data types which contain subtypes, as in XML schemas) which would support hierarchical data models. All models you can address with the Talend Stewardship console are flat.
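The idea of a flat model whose attributes carry either an enumeration or a validation pattern can be sketched as follows. This is a minimal illustration, not Talend's implementation: the `Attribute` class, the `invalid_fields` helper and the sample `customer_model` are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attribute:
    name: str
    allowed_values: Optional[frozenset] = None  # enumeration constraint, if any
    pattern: Optional[str] = None               # validation regex, if any

    def is_valid(self, value: str) -> bool:
        """Check the value against whichever constraints are set."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        return True


def invalid_fields(model: list, record: dict) -> list:
    """Return the names of attributes whose value is missing or fails its constraint."""
    return [
        a.name for a in model
        if a.name not in record or not a.is_valid(record[a.name])
    ]


# Hypothetical flat model: every attribute is a simple value, no nesting.
customer_model = [
    Attribute("name"),
    Attribute("country", allowed_values=frozenset({"DE", "FR", "UK"})),
    Attribute("zip", pattern=r"\d{5}"),
]
```

A record such as `{"name": "Bob", "country": "US", "zip": "1234"}` would be flagged on both `country` and `zip`, which is the kind of row a resolution task asks a steward to repair.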
Extending Data Types
You can extend the set of semantic data types: create a semantic type based on a regular expression in the Talend Dictionary Service, then add it to the list of recognized data types in Talend Data Stewardship.
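Conceptually, a regex-based semantic type is a name paired with a pattern that column values are tested against. The sketch below shows that idea only; the type names, the patterns and the 80% threshold are assumptions for illustration and say nothing about how the Talend Dictionary Service decides internally.

```python
import re

# Hypothetical semantic types: a name mapped to a regular expression.
SEMANTIC_TYPES = {
    "FR_POSTAL_CODE": re.compile(r"\d{5}"),
    "IBAN_LIKE": re.compile(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}"),
}


def detect_semantic_type(values, threshold=0.8):
    """Return the first type whose pattern fully matches at least
    `threshold` of the non-empty values, or None if no type qualifies."""
    values = [v for v in values if v]
    if not values:
        return None
    for name, pattern in SEMANTIC_TYPES.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            return name
    return None
```

Adding a new type in this sketch is a matter of adding one entry to `SEMANTIC_TYPES`, which is roughly the user-facing effect described above.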
All tasks are contained in campaigns, which you can think of as “data stewardship projects”. A campaign has one of the three types mentioned in Table 1.
A campaign contains a set of owners and stewards.
It is associated with a single data model and a workflow. In a workflow you can assign roles to each of the workflow steps.
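The relationships described here (a campaign with owners, stewards, one data model and a workflow whose steps carry roles) can be summarised in a short sketch. All names below, including `Campaign`, `WorkflowStep` and `can_act`, are hypothetical and only illustrate the structure, not the product's internal model.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowStep:
    name: str
    role: str  # role allowed to act on tasks in this step


@dataclass
class Campaign:
    name: str
    campaign_type: str            # "arbitration", "resolution" or "merging"
    data_model: str               # name of the single associated data model
    owners: set = field(default_factory=set)
    stewards: dict = field(default_factory=dict)   # user -> role
    workflow: list = field(default_factory=list)   # ordered WorkflowStep items

    def can_act(self, user: str, step_name: str) -> bool:
        """True if the user's role matches the role attached to the step."""
        step = next((s for s in self.workflow if s.name == step_name), None)
        return step is not None and self.stewards.get(user) == step.role
```

With a two-step workflow such as "review" followed by "validate", a user holding only the reviewer role could act on the first step but not on the second.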
So far we have seen that the Data Stewardship Console allows you to perform operations on imported data. Yet we have not seen any mechanism in the web application which allows users to import data. It appears that data can only be imported via Talend data integration jobs.
Talend has provided three brand new components in Talend Data Fabric 6.3.1 which allow users to send, retrieve and delete data in the Talend Data Stewardship database:
This component is used to create new tasks in an existing campaign.
This component is used to read the data rows in tasks of a campaign assigned to specific users with specific state and priority.
This component is used to delete tasks in a specific campaign.
Note: If you have old Talend DI jobs which use the old components tStewardshipTaskInput, tStewardshipTaskOutput and tStewardshipTaskDelete, they will not work with the new Stewardship console.
Example of a Talend DI job ingesting tasks
In order to import tasks for a merging campaign you use the tMatchGroup component together with the tDataStewardshipOutput component. The following job generates a set of rows, compares the rows to each other and outputs those that have a low matching score to the Stewardship Console:
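The job itself is assembled graphically in Talend Studio, but the logic it relies on (compare rows pairwise, keep the sure duplicates apart, hand the uncertain ones to a steward) can be approximated in a few lines of Python. This is a rough sketch under stated assumptions: `difflib.SequenceMatcher` stands in for the matching functions of tMatchGroup, and the two thresholds and the `name`/`id` fields are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; a stand-in for a real matching function."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def split_pairs(rows, match_threshold=0.6, confident_threshold=0.9):
    """Compare every pair of rows on the "name" field.

    Pairs scoring at or above `confident_threshold` are treated as sure
    duplicates; pairs between the two thresholds are suspects that a
    steward should review; everything else is considered distinct.
    """
    confident, suspects = [], []
    for left, right in combinations(rows, 2):
        score = similarity(left["name"], right["name"])
        if score >= confident_threshold:
            confident.append((left["id"], right["id"], score))
        elif score >= match_threshold:
            suspects.append((left["id"], right["id"], score))
    return confident, suspects
```

In this picture, only the `suspects` list would be sent to the Stewardship Console as merging tasks, since those are the pairs where a human decision adds value.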
Under the Hood
Talend’s Data Stewardship Console has changed a lot in version 6.3.1. It now uses MongoDB to store data, a REST interface defined with Swagger, and Kafka queues. The application also seems to have been written with Spring Boot on the server side and with the React framework on the client side; the application server used is Tomcat.
It also seems that the Stewardship Console uses the credentials created in TAC (Talend Administration Center).
It is a complete rewrite of the old Stewardship console, which used MySQL and SOAP interfaces.
The relevant data entities (schemas, tasks, campaigns) are stored in collections of the MongoDB database, a NoSQL document database.
The Talend Stewardship Console is a powerful tool which allows data stewards to correct, filter and merge records. In 6.3.1 Talend has created a new web application with a completely new UI and three new Studio components. The application was thoroughly revamped:
- The stewardship work is now much better organised, using campaigns and roles.
- The data model must now be defined before you can start your work.
On the technology side a lot has changed: the back-end store is now MongoDB, the UI is React-based and messaging is done via Kafka.
As far as we understand it, the Talend Stewardship Console does not support hierarchical schemas, which are needed for complex types, nor does it offer the option of working with dynamic schemas.