Salesforce AI Dataset Optimization Guide

Einstein needs great data to make great predictions.

Use this guide to create better training datasets.

General Dataset Optimization Best Practices

Time for Datasets

Plan for and allocate adequate time in your Einstein project for dataset preparation. While the quantity of examples matters, the quality of those examples is what will bring your AI Lab-infused projects to the next level.


If you produce datasets too quickly, you'll find:

  • Too few examples usually make for inferior training data as Einstein won't have enough data to learn from.

  • Too many examples usually don't improve training data as the lines between data classifications begin to blur.

Use Existing Data

Leverage data in your Salesforce org. Determine what data exists and best matches your use cases. Find and use data that was created by your customers and classified by your employees. If using Service Cloud, this data might come from fields populated by email-to-case or web-to-case. If using Sales Cloud, look at the Lead fields populated by web forms. Any use case is supported, and each works best when the source data used by the training engines reflects the language you'll be asking Einstein to predict and classify for you later.

Read On

Once you've identified the data, you're not done yet. Export it via Reports or your favorite data loader tool, then use the tips in the rest of this guide to refine its labels and examples into a higher-performing dataset so your AI integration efforts get off to the best start possible.



Datasets are collections of text that contain example strings and labels to be used for model creation and training.


Datasets are made up of strings of text (examples) that the training engine learns from. Those examples are sent to Einstein so that, as future requests with similar text are made from your org, it can better predict which label should be returned to you and how likely that label is to be correct.

Sample Examples

Strings derived from Service Cloud might include:

  • “My kitchen outlet is not working”

  • “The carpets need cleaning.”


When creating a dataset, each example provided will need to be associated with a label. Labels are also commonly referred to as classes, classifications, types, tags, groups, and nearly any other word synonymous with “label.”

Sample Labels

Service Cloud oriented labels might be derived from the Case “Type” field:

  • Maintenance

  • Plumbing

Requirements

API endpoints have certain requirements that should be adhered to when creating datasets.


  1. Each dataset needs at least 2 labels for dataset creation to be successful.

  2. Each label should be no more than 180 characters.

  3. Depending on the dataset type, a minimum number of examples must be provided:

    • Intent:  5 examples
    • Sentiment:  100 examples

  4. Avoid duplicate examples within a dataset, even across labels, as only the first one loaded will be used.
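The requirements above can be checked before a dataset is uploaded. The following is a minimal sketch, not part of any Salesforce API: the function name `check_dataset` and the representation of a dataset as a list of (example, label) pairs are hypothetical. It also assumes that the minimum example count applies to each label and that duplicates are exact string matches; adjust both if your dataset type behaves differently.

```python
from collections import Counter

# Limits taken from the requirements list above.
# Assumption: the minimum example count is enforced per label.
MIN_EXAMPLES = {"intent": 5, "sentiment": 100}
MAX_LABEL_LENGTH = 180


def check_dataset(rows, dataset_type):
    """Return a list of problems found in rows, a list of (example, label) pairs."""
    problems = []
    labels = Counter(label for _, label in rows)

    # Requirement 1: at least 2 labels.
    if len(labels) < 2:
        problems.append("dataset needs at least 2 labels")

    # Requirement 2: labels of no more than 180 characters.
    for label in labels:
        if len(label) > MAX_LABEL_LENGTH:
            problems.append(f"label longer than {MAX_LABEL_LENGTH} characters: {label[:20]}...")

    # Requirement 3: minimum number of examples for the dataset type.
    minimum = MIN_EXAMPLES[dataset_type]
    for label, count in labels.items():
        if count < minimum:
            problems.append(f"label '{label}' has {count} examples, needs {minimum}")

    # Requirement 4: no duplicate examples, even across labels.
    seen = set()
    for example, _ in rows:
        if example in seen:
            problems.append(f"duplicate example: {example}")
        seen.add(example)

    return problems
```

An empty list means none of the four checks failed. The checks only mirror the list above and do not replace the validation performed when the dataset is created.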

Get Better Prediction Probabilities

Even the cleanest dataset doesn't guarantee accuracy.  Here are a few tricks that have proven helpful in achieving better prediction accuracy.

Data Recency

When you created your dataset, where was the data sourced from? That's a common consideration when starting your efforts, but an often-missed question is when that data is from. Think about how your business changes from year to year, or perhaps even month to month. So does your data. Customers interact with you differently based on the services you provide, and over time these interactions produce content that constantly shifts to fit your business's daily operations.

Datasets sourced from historic data, like old Case descriptions and Chatter posts, may simply no longer apply. When dated examples make their way into your datasets, they muddy the water and blur the lines of how Einstein learns from and predicts against your most recent operations.

Take a look at your data and ensure the example data sourced is recent and mirrors your current business strategies, lines of business, and product channels.
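One way to apply this is to filter exported examples by age before building the dataset. The sketch below assumes each exported row carries a creation timestamp as a third element (for example, taken from a date column in your export); the function name `recent_rows` and the row layout are hypothetical, and the right cutoff depends on how quickly your business changes.

```python
from datetime import datetime, timedelta


def recent_rows(rows, max_age_days, now=None):
    """Keep (example, label, created) rows that are no older than max_age_days."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    # created is assumed to be a naive datetime, like now.
    return [row for row in rows if row[2] >= cutoff]
```

For instance, `recent_rows(rows, max_age_days=365)` keeps only examples created within the last year.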

Uneven Example Distribution

Another common problem is a dataset that includes one or more labels whose example counts are dwarfed by those of other labels. This uneven distribution means that Einstein is unable to learn as much about the smaller example sets as it does about the larger ones.


You can measure the distribution as follows:

  1. Determine how many examples exist in the dataset (total examples).

  2. Determine how many examples exist for each label (total examples per label).

  3. Divide each label’s example count by the total number of examples in the dataset to get that label's percentage of the dataset.
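The three steps above can be sketched in a few lines. This is an illustration only: the function name `label_shares` and the representation of a dataset as a list of (example, label) pairs are assumptions, not part of AI Lab.

```python
from collections import Counter


def label_shares(rows):
    """Map each label to its percentage of all examples in rows."""
    counts = Counter(label for _, label in rows)  # examples per label
    total = sum(counts.values())                  # total examples
    return {label: 100 * count / total for label, count in counts.items()}
```

Sorting the result by percentage, for example with `sorted(label_shares(rows).items(), key=lambda item: item[1])`, lists the least-represented labels first.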

Using these percentages, you can determine which labels need additional examples or, alternatively, should be dropped from the dataset. Start by looking at the labels whose percentages differ most from the rest.

If dropping, follow this procedure to find the optimal dataset:

  1. Start by removing the label with the lowest percentage, along with its examples. Ensure that you save this dataset file separately so it can be used in the future, if needed.

  2. Create a Prediction Flow with this revised dataset and allow it to train.

  3. Once training has completed, observe the Prediction Accuracy within the AI Lab Prediction Flow Card.

  4. Revise the dataset again to remove the label with the next-lowest percentage and its examples. Ensure that you save this dataset file separately so it can be used in the future, if needed.

  5. Repeat the Prediction Flow creation step and observe any improvement.
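The dataset revisions in this procedure can be prepared ahead of time. The sketch below is hypothetical (the function names and the (example, label) row layout are assumptions): it produces each successive revision with the smallest remaining label removed, stopping at two labels because a dataset needs at least 2. Creating the Prediction Flow, training, and comparing accuracy still happen in AI Lab for each revision.

```python
from collections import Counter


def drop_smallest_label(rows):
    """Return (removed_label, remaining_rows) with the least-represented label removed."""
    counts = Counter(label for _, label in rows)
    # On a tie, the label that appears first in rows is removed.
    smallest = min(counts, key=counts.get)
    return smallest, [(example, label) for example, label in rows if label != smallest]


def candidate_datasets(rows, min_labels=2):
    """Yield successive revisions of rows, each missing one more small label."""
    current = list(rows)
    while len({label for _, label in current}) > min_labels:
        removed, current = drop_smallest_label(current)
        yield removed, current
```

Each yielded revision can be written to its own file so that earlier versions remain available, as the procedure recommends.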

Jump start your artificial intelligence journey today!

AI Lab provides easy access to the powerful Salesforce AI APIs.  

  • Build custom predictions.

  • Use data in your org.

  • Append post-prediction process.
