Salesforce AI Dataset Optimization Guide
Einstein needs great data to make great predictions.
Use this guide to create better Einstein.ai training datasets.
General Dataset Optimization Best Practices
Time for Datasets
Plan for and allocate adequate time in your Einstein project for dataset preparation. While the quantity of examples certainly matters, the quality of those examples is what will bring your AI Lab-infused projects to the next level.
If you produce datasets too quickly, you'll find:
- Too few examples usually make for inferior training data, as Einstein won't have enough data to learn from.
- Too many examples usually don't improve training data, as the lines between data classifications begin to blur.
Use Existing Data
Leverage data in your Salesforce org. Determine what data exists and best matches your use cases. Find and use data that was created by your customers and classified by your employees. If using Service Cloud, this data might be from email-to-case or web-to-case populated fields. If using Sales Cloud, look at the Lead fields populated by web forms. Any use case is supported and will work best when the source data used by the Einstein.ai training engines reflects the language you'll later ask Einstein to predict and classify.
Read On
Once you've identified the data, you're not done yet. Export it via Reports or your favorite data loader tool, then use the tips in the rest of this guide to refine its labels and examples into a higher-performing dataset, so your AI integration efforts get off to the best start possible.
Definitions
Datasets
Datasets are collections of text that contain example strings and labels to be used by Einstein.ai for Model creation and training.
Examples
Datasets are made up of strings of text (examples) that the Einstein.ai training engine learns from. As your org later sends Einstein requests with similar text, it can better predict which Label should be returned to you and how likely that Label is to be correct.
Sample Examples
Strings derived from Service Cloud might include:
- “My kitchen outlet is not working.”
- “The carpets need cleaning.”
Labels
When creating a dataset, each example provided must be associated with a label that describes it. Labels are also commonly referred to as classes or classifications, types, tags, groups, and nearly any other word synonymous with “label.”
Sample Labels
Service Cloud-oriented labels might be derived from the Case “Type” field:
- “Maintenance”
- “Plumbing”
Requirements
Einstein.ai API endpoints have certain requirements that should be adhered to when creating datasets.
Labels
- Each dataset needs at least 2 labels for dataset creation to be successful
- Each label should be no more than 180 characters
Examples
- Depending on the dataset type, a minimum number of examples must be met
  - Intent: 5 examples
  - Sentiment: 100 examples
- Avoid duplicate examples within a dataset, even across labels, as only the first loaded by Einstein.ai will be used
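The requirements above can be checked before a dataset is uploaded. The following is a minimal Python sketch, not part of any Einstein.ai tooling. The function name `check_dataset`, the `(example, label)` row layout, and the choice to compare the minimum against the dataset's total example count are all assumptions made for illustration. Duplicates are detected by exact text match after trimming whitespace, which is also an assumption.

```python
from collections import Counter

# Limits taken from the requirements listed in this guide.
MIN_LABELS = 2
MAX_LABEL_LENGTH = 180
MIN_EXAMPLES = {"intent": 5, "sentiment": 100}


def check_dataset(rows, dataset_type="intent"):
    """Return a list of human-readable problems found in `rows`.

    `rows` is a list of (example, label) pairs. This layout is an
    assumption for illustration, not an Einstein.ai file format.
    """
    problems = []

    labels = {label for _, label in rows}
    if len(labels) < MIN_LABELS:
        problems.append(f"only {len(labels)} label(s); need at least {MIN_LABELS}")

    for label in sorted(labels):
        if len(label) > MAX_LABEL_LENGTH:
            problems.append(f"label exceeds {MAX_LABEL_LENGTH} characters: {label[:30]}...")

    # Assumption: the minimum applies to the dataset's total example count.
    minimum = MIN_EXAMPLES[dataset_type]
    if len(rows) < minimum:
        problems.append(f"{len(rows)} example(s); {dataset_type} needs at least {minimum}")

    # Duplicates count even when the copies carry different labels.
    counts = Counter(example.strip() for example, _ in rows)
    for example, n in counts.items():
        if n > 1:
            problems.append(f"duplicate example ({n} copies): {example!r}")

    return problems
```

An empty list means none of these checks failed. It does not mean the dataset is well balanced or recent, which the rest of this guide covers.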
Get Better Prediction Probabilities
Even the cleanest dataset doesn't guarantee accuracy. Here are a few tricks that have proven helpful in achieving better prediction accuracy.
Data Recency
When you created your dataset, where was the data sourced from? That's a common consideration when starting out, but an often-missed question is when that data is from. Think about how your business changes from year to year, or perhaps even month to month. So does your data. Customers interact with you differently based on the services you provide, and over time these interactions produce content that constantly changes to fit your business' daily operations.
Datasets sourced from historic data, like old Case descriptions and Chatter, may simply no longer apply. When dated examples make their way into your datasets, they muddy the water and blur the lines of how Einstein.ai learns from and predicts against your most recent operations.
Take a look at your data and ensure the example data sourced is recent and mirrors your current business strategies, lines of business, and product channels.
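One way to act on this is to filter exported rows by creation date before building the dataset. The sketch below is illustrative only. The function name `recent_rows`, the `(example, label, created)` row layout, and the 365-day default are assumptions. This guide does not prescribe a cutoff, so choose one that matches how quickly your business changes.

```python
from datetime import date, timedelta


def recent_rows(rows, today, max_age_days=365):
    """Keep rows created within `max_age_days` of `today`.

    `rows` is a list of (example, label, created) tuples where `created`
    is a datetime.date. Both the layout and the default age limit are
    assumptions for illustration.
    """
    cutoff = today - timedelta(days=max_age_days)
    return [row for row in rows if row[2] >= cutoff]


rows = [
    ("My kitchen outlet is not working", "Maintenance", date(2024, 5, 1)),
    ("The carpets need cleaning", "Maintenance", date(2021, 1, 15)),
]
kept = recent_rows(rows, today=date(2024, 6, 1))
print(len(kept))  # 1: only the 2024 example survives the cutoff
```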
Uneven Example Distribution
Another common problem is a dataset that includes one or more labels whose example counts are dwarfed by those of other labels. This uneven distribution means that Einstein.ai is unable to learn as much about the smaller example sets as it does about the larger ones.
To check for this:
- Determine how many examples exist in the dataset (total examples)
- Determine how many examples exist for each label (total examples per label)
- Divide each label’s example count by the total number of examples in the dataset
Using this percentage, you can determine which labels need additional examples or, alternatively, should be dropped from the dataset. Start by looking at the labels whose percentages differ most from the rest.
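The per-label calculation can be sketched in a few lines of Python. The function name `label_shares` and the sample labels are hypothetical. The only input is the list of labels, one entry per example.

```python
from collections import Counter


def label_shares(labels):
    """Map each label to its share of all examples, as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Hypothetical dataset: one label entry per example.
labels = ["Maintenance"] * 6 + ["Plumbing"] * 3 + ["Electrical"] * 1

# Print the smallest shares first, since those labels need attention first.
for label, share in sorted(label_shares(labels).items(), key=lambda item: item[1]):
    print(f"{label}: {share:.1f}%")
# Electrical: 10.0%
# Plumbing: 30.0%
# Maintenance: 60.0%
```

In this made-up case, "Electrical" would be the first candidate for either more examples or removal.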
If dropping, follow this procedure to find the optimal dataset:
- Start by removing the label with the lowest percentage, along with its examples. Save this dataset file separately so it can be used in the future, if needed.
- Create a Prediction Flow with this revised dataset and allow it to train.
- Once training has completed, observe the Prediction Accuracy within the AI Lab Prediction Flow Card.
- Revise the dataset again to remove the label with the next-lowest percentage and its examples. Save this dataset file separately as well.
- Repeat the Prediction Flow creation step and observe any improvement.
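The dataset revisions in this procedure can be prepared ahead of time. The sketch below only produces the successive reduced datasets. Training and reading the Prediction Accuracy still happen in AI Lab. The function name `drop_smallest_labels` and the `(example, label)` row layout are assumptions. The loop stops once 2 labels remain, since a dataset needs at least 2 labels.

```python
from collections import Counter


def drop_smallest_labels(rows, min_labels=2):
    """Yield (dropped_label, remaining_rows) one round at a time.

    `rows` is a list of (example, label) pairs (an assumed layout). Each
    round removes the label with the fewest examples and stops before
    fewer than `min_labels` labels would remain.
    """
    remaining = list(rows)
    while True:
        counts = Counter(label for _, label in remaining)
        if len(counts) <= min_labels:
            return
        # Ties are broken alphabetically so the order is reproducible.
        smallest = min(counts, key=lambda label: (counts[label], label))
        remaining = [(ex, lab) for ex, lab in remaining if lab != smallest]
        yield smallest, list(remaining)
```

Each yielded dataset can be written to its own file, for example with Python's `csv` module, so earlier versions stay available if a later round lowers accuracy.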