Data Training Made Easy With Cognigy.AI & Chatito

Raul Tiru
6 min read · Oct 5, 2022

Sufficient and correct training data is one of the most vital ingredients of every successful conversational AI venture. At the same time, the availability of real-world intent training data is one of the key challenges when developing conversational AI virtual agents. In this catch-22 situation, synthetic data (computer-generated data that mimics real-world phenomena) comes in extremely handy.

According to this report, “An early adopter of synthetic data, Google’s Waymo self-driving car AI, is said to complete over three million miles of driving in simulation each day using synthetic data.”

But how do we actually generate this synthetic data? In this post, I am going to demonstrate this using the "Chatito" package for the Cognigy platform.

Introduction to training data in Cognigy

When we create virtual agents in Cognigy, we create an intent and add training data in the form of user utterances. Let's explain this with the example of a food-ordering agent.

In the following example, "OrderFood" is the intent. The user has manually added various utterances and annotated them with the appropriate lexicon (1), in this case "fastfood".

1. Lexicon: List of keyphrases

Analysis of training data in Cognigy

After extracting the intent from the Cognigy console, this is what the .csv file looks like.

If you carefully analyse the training examples, you will observe the following format of user utterance:

<User Utterance template> [[lexicon_value|lexicon_type]]

User Utterance Variations:

  1. Can I order
  2. I would like to order
  3. Place an order for

Lexicon Values:

  1. Pizza
  2. Burger
  3. Noodles
  4. Pasta

Lexicon type is “fastfood”
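
Combining the three utterance variations with the four lexicon values by hand already means writing 3 × 4 = 12 example sentences. Here is a minimal Python sketch (purely illustrative, not part of the Cognigy tooling) that enumerates those combinations in the format above:

from itertools import product

# 3 utterance variations x 4 lexicon values = 12 training examples
variations = ["Can I order", "I would like to order", "Place an order for"]
food_items = ["Pizza", "Burger", "Noodles", "Pasta"]

for template, item in product(variations, food_items):
    print(f"{template} [[{item}|fastfood]]")
# e.g. "Can I order [[Pizza|fastfood]]"

This enumeration is exactly the kind of tedious work that template-based generators automate.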

While we need a significant amount of training data for a good conversational experience, manually creating all of it isn't the most efficient approach. So instead of writing training examples by hand, you can write one or several template files in a specific format, then run a data generator tool that parses the templates and outputs ready-to-use examples.

Training dataset adapter Chatito

One of the useful tools for generating synthetic training datasets for NLU models is Chatito. I will now walk you through the process of installation and data creation using Chatito. As a first step, install Node.js and Python. Chatito can be installed via npm:

npm install chatito

The input to Chatito is a template file in which we define sample templates of user utterances. We can also specify the number of training and testing examples to be generated by the Chatito adapter. Below is a sample template:

Template file:

%[orderFood]('training': '120', 'testing': '10')
    ~[order] @[foodItem]

~[order]
    I would like to order a
    Order me a
    Order
    I wanna order
    Can I order

@[foodItem]('lexicon': 'fastfood')
    Pizza
    Burger
    Noodles
    Chowmein
    Tacos

The Chatito module ingests this sample template and generates training data. The output is a JSON file containing the training examples along with additional information such as slots and lexicons.

In this example, we have added an intent "orderFood" and a sample user utterance. We can provide different variations of the user utterance, as shown above. Here, all @[foodItem] values are tagged with the "fastfood" lexicon.

The following command is executed to generate training examples using Chatito:

npx chatito <pathToTemplateFile> --format=default
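
For example, if the template above is saved as orderfood.chatito (the file name is illustrative), the command becomes:

npx chatito orderfood.chatito --format=default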

After executing the command, the training dataset is generated. Here is a sample of the training dataset:

{
    "orderFood": [
        [
            {
                "type": "Text",
                "value": "I would like to order a "
            },
            {
                "value": "Pizza",
                "type": "Slot",
                "args": {
                    "lexicon": "fastfood"
                },
                "slot": "foodItem"
            }
        ],
        ....
    ]
}


Chatito to Cognigy training data conversion script

The Chatito training data is then preprocessed into the format accepted by Cognigy via a Python script.

Here is the snippet for preprocessing the dataset:

chatito2cognigy.py

import json
import sys

import xlsxwriter


class Chatito2Cognigy:
    """Generates a Cognigy training dataset from a Chatito training dataset."""

    def __init__(self, input_path, output_path, intent):
        self._chatito_path = input_path
        self._cognigy_path = output_path
        self._intent = intent

    def _write_to_csv(self):
        # Note: xlsxwriter writes an Excel (.xlsx) workbook, not a plain-text CSV.
        workbook = xlsxwriter.Workbook(self._cognigy_path)
        worksheet = workbook.add_worksheet()
        with open(self._chatito_path) as f:
            data = json.load(f)
        intent_name = self._intent
        row = 0
        for example in data[intent_name]:
            training_ex = ''
            for token in example:
                # Wrap slot values in square brackets, e.g. "Pizza" -> "[Pizza]"
                if token['type'] == 'Slot':
                    token['value'] = '[{0}]'.format(token['value'])
                training_ex += token['value']
            worksheet.write(row, 0, intent_name)
            worksheet.write(row, 1, 'exampleSentence')
            worksheet.write(row, 2, training_ex)
            row += 1
        workbook.close()

    def write_intents(self):
        self._write_to_csv()


def main():
    input_template_path = sys.argv[1]
    output_path = sys.argv[2]
    intent = sys.argv[3]
    gen_train = Chatito2Cognigy(input_template_path, output_path, intent)
    gen_train.write_intents()


if __name__ == '__main__':
    main()


Command to run the Python script

python chatito2cognigy.py <pathToChatitoTrainingFile> <pathToOutputCsvFile> <intentName>
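
For example, assuming Chatito wrote its output to training_dataset.json (file names here are illustrative):

python chatito2cognigy.py training_dataset.json orderfood.csv orderFood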

Output: CSV file
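
Given the script above, each row holds the intent name, the marker "exampleSentence", and the generated utterance with slot values wrapped in square brackets, roughly like this:

orderFood, exampleSentence, I would like to order a [Pizza]
orderFood, exampleSentence, Can I order [Tacos]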

This CSV file can then be uploaded to the Intents section of Cognigy.

Hence, by running the script, one can sit back while the tedious work is done automatically.

Performance analysis

After training the model with synthetic data, we can determine the performance of the model.

Intent classification and slot-filling accuracy determine the quality of the model. In the early stages of virtual agent development, the confidence threshold for accepting an intent classification is kept high; this way, weak matches are rejected and we can iteratively improve the NLU model. A benchmark can be created by building a test dataset with a significant number of examples (e.g. >100) from real customer data.
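
As an illustration, here is a minimal benchmarking sketch in Python (the test set, the predict function, and the threshold value are all hypothetical; Cognigy's actual API response format differs):

THRESHOLD = 0.8  # keep the confidence threshold high early on

test_set = [
    ("Can I order a Pizza", "orderFood"),
    ("I wanna order Tacos", "orderFood"),
    # ... >100 real customer utterances with their expected intents
]

def intent_accuracy(predict, examples, threshold=THRESHOLD):
    """predict(text) -> (intent, confidence); returns intent accuracy."""
    correct = 0
    for text, expected in examples:
        intent, confidence = predict(text)
        if confidence < threshold:
            intent = None  # low-confidence results count as no match
        correct += intent == expected
    return correct / len(examples)

The same loop extends naturally to per-intent precision, recall, and F1.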

Here is a benchmark article comparing the F1 scores of different conversational AI providers using an open-source dataset on GitHub. The dataset comprises 2,400 queries for each of the 7 user intents they tested.

Natural language generation

We can enrich the templates by adding an NLG component on top of Chatito. The NLG process receives the information from the Chatito parser and generates a number of sentences with the same meaning (see the sketch after this list). The NLG component covers the following:

  • Word order
  • Word-level grammatical functions
  • Singular/plural forms
  • Questions
  • Word similarity and synonyms
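
For instance, here is a tiny synonym-substitution sketch in Python (the synonym table is hypothetical; a real NLG component would also handle word order, grammatical functions, and plural forms):

import itertools

# Hypothetical synonym table; in practice this could come from a
# thesaurus or word-embedding similarity.
SYNONYMS = {
    "order": ["order", "get", "buy"],
    "like": ["like", "love", "want"],
}

def expand(sentence):
    # Yield variants with each word swapped for each of its synonyms.
    options = [SYNONYMS.get(word, [word]) for word in sentence.split()]
    for combo in itertools.product(*options):
        yield " ".join(combo)

# "I would like to order a Pizza" -> 9 variants with the same meaning
for variant in expand("I would like to order a Pizza"):
    print(variant)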

Thus, we can generate rich conversational datasets for training. This approach automates the most resource-intensive task, so developers need not do repetitive manual work.

Conclusion

The lack of a flexible dataset often limits one's ability to explore the vast capabilities of NLU, but tools like Chatito can immensely help in generating rich synthetic datasets. Furthermore, we also discussed how the dataset generated by Chatito can be converted to be compatible with Cognigy's standard.

References:

https://scholar.google.co.uk/citations?user=Ns0YuP0AAAAJ

https://rodrigopivi.github.io/Chatito/

Originally published at https://www.cognigy.com.
