AI is an empirical discipline. It's impossible to know in advance which frontier model API, open-weight model, or combination of datasets and fine-tuning regimens will perform best for your application.
Model providers disclose little about the data they use for pretraining, which makes it harder to determine a model's actual capabilities.
The Open LLM Leaderboard was an attempt to establish a scoring system to help you decide which foundation model, out of more than a million options, would meet your requirements. However, it created perverse incentives: any fine-tune that didn't rank near the top was effectively invisible, which encouraged overfitting to the benchmarks. Ultimately, the leaderboard was retired.
Developers still need to limit the scope of models they consider for their application. The greatest simplification comes from focusing on base models created by a trusted authority: the research labs raising the bar in open science produce the base models everyone else relies on and compares against.
However, many practitioners find they can improve the performance of these top-tier models by curating better, more specialized data for additional fine-tuning. This emerging best practice is apparent in most AI whitepapers whenever a team demonstrates a new state of the art in a specific domain: researchers highlight a comparison built around the most domain-relevant metrics and models across a range of parameter sizes.
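As a rough sketch of what that curation step can look like in practice, the snippet below deduplicates and length-filters a raw domain corpus with the Hugging Face `datasets` library; the file names, the 50-word cutoff, and the `text` column are illustrative assumptions rather than a recommended recipe.

```python
import hashlib

from datasets import load_dataset  # Hugging Face `datasets`

# Illustrative assumption: a raw domain corpus stored as JSONL with a "text" column.
raw = load_dataset("json", data_files="domain_corpus.jsonl", split="train")

def fingerprint(example):
    # Hash normalized text so near-identical records collapse to one key.
    normalized = " ".join(example["text"].lower().split())
    return {"fingerprint": hashlib.sha256(normalized.encode()).hexdigest()}

raw = raw.map(fingerprint)

seen = set()

def keep(example):
    # Drop exact duplicates and documents too short to teach the model anything.
    if example["fingerprint"] in seen or len(example["text"].split()) < 50:
        return False
    seen.add(example["fingerprint"])
    return True

curated = raw.filter(keep)
curated.to_json("curated_corpus.jsonl")  # hand this off to your fine-tuning job
```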
The emergence of more fine-grained benchmarks, and even custom evaluations built with LLMs and synthetic data, has made it cheap and easy to assess your models. Practitioners are therefore left to decide which skills and capabilities are essential to the user experience of their AI app.
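For example, a lightweight LLM-as-judge evaluation can score responses against criteria you define. The sketch below assumes the OpenAI Python client, a hypothetical rubric, and an arbitrary choice of judge model.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the assistant response from 1 (poor) to 5 (excellent) on factual "
    "accuracy and helpfulness for the user's question. Reply with the number only."
)

def judge(question: str, response: str, judge_model: str = "gpt-4o-mini") -> int:
    """Ask a judge model to grade one (question, response) pair."""
    completion = client.chat.completions.create(
        model=judge_model,  # illustrative choice; swap in your preferred judge
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nResponse:\n{response}"},
        ],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())

# Average the judge's scores over a small (here, synthetic) eval set.
eval_set = [("What does HTTP 404 mean?", "It means the requested resource was not found.")]
scores = [judge(q, r) for q, r in eval_set]
print(sum(scores) / len(scores))
```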
You could evaluate your model with an evaluation harness that includes every task, but you run a greater risk of false positives if you don't establish clear criteria before reviewing the results. These are manifestations of the curse of dimensionality and content overload in the space of possible AI engineering experiments. As with the "multiple testing problem," given enough metrics, your model is likely to look like an outlier on some benchmark.
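A toy simulation makes the risk concrete: if a model change has no real effect, the chance that it still "wins" on at least one benchmark grows quickly with the number of metrics you inspect. The benchmark count and significance threshold below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_benchmarks = 30      # how many metrics you peek at
alpha = 0.05           # per-benchmark "significant win" threshold
n_trials = 10_000      # simulated experiments with NO true improvement

# Under the null hypothesis, each benchmark's p-value is uniform on [0, 1].
p_values = rng.uniform(size=(n_trials, n_benchmarks))

false_positive_rate = np.mean((p_values < alpha).any(axis=1))
print(f"P(at least one spurious win across {n_benchmarks} benchmarks): "
      f"{false_positive_rate:.2f}")   # roughly 1 - 0.95**30 ≈ 0.79

# A Bonferroni-style correction keeps the family-wise error rate near alpha.
corrected_rate = np.mean((p_values < alpha / n_benchmarks).any(axis=1))
print(f"Same rate with Bonferroni correction: {corrected_rate:.2f}")
```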
Even when an LLM scores well on benchmarks testing the skills and capabilities you value for your application, offline metrics like these are unreliable predictors of the metrics you actually want to optimize: user engagement and business value. However, offline evaluations help you reduce the search space and experiment more cheaply. Online evaluation with A/B testing takes time and carries the risk of showing your users a feature you'll later decide not to ship.
Engineers may prefer to see candidate AI models that dominate across various metrics. Robust performance like this is especially noteworthy when the skills assessed are unrelated. In this scenario, one can be confident that the treatment represents a real improvement over a baseline and is ready for review by real users.
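One simple way to encode that preference, sketched below with placeholder metric names, is a dominance check that only promotes a candidate if it matches or beats the baseline on every tracked metric and strictly beats it on at least one.

```python
from typing import Dict

def dominates(candidate: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """True if the candidate is at least as good everywhere and strictly better somewhere.

    Assumes every metric is oriented so that higher is better.
    """
    at_least_as_good = all(candidate[m] >= baseline[m] for m in baseline)
    strictly_better = any(candidate[m] > baseline[m] for m in baseline)
    return at_least_as_good and strictly_better

# Placeholder metrics for illustration only.
baseline = {"helpfulness": 4.1, "groundedness": 0.87, "latency_score": 0.90}
candidate = {"helpfulness": 4.3, "groundedness": 0.91, "latency_score": 0.90}

print(dominates(candidate, baseline))  # True: promote this candidate to online testing
```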
The best way to know how users will respond to your new AI app update is to deploy the proposed feature and compare it against the current production version in a randomized controlled experiment. A/B tests are the gold-standard evaluation for your software updates, and testing changes to the underlying foundation models that power your AI app is no exception. Ramping up to that launch, you prune the less promising candidate treatments using data gathered from offline evaluations before your experiment.
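Once the experiment concludes, the comparison often reduces to a standard two-proportion test on a conversion-style metric. The sketch below uses statsmodels with made-up counts purely for illustration.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: users who completed the target action in each arm.
conversions = [1_320, 1_255]   # [treatment (new model), control (current model)]
exposures = [25_000, 25_000]   # users randomized into each arm

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")

# Ship only if the lift is both statistically and practically significant.
lift = conversions[0] / exposures[0] - conversions[1] / exposures[1]
print(f"absolute lift = {lift:.4f}")
```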
Most "great ideas" in software engineering cannot be tied to a measurable improvement in business metrics or user satisfaction. The cost of adding a feature you can't justify with a successful A/B test is the continued maintenance of something that may not matter. Even worse, your AI program could be steered off course long after bad intel has been institutionalized.
Generative AI has compressed the knowledge of the entire internet into LLM weights. Thus, open weights have commodified both model and data. AI product defensibility lies in curating your own knowledge and combining these resources to create the most differentiated and delightful experience for your users. It's about discovering the methods and strategies that transcend today's technical minutiae to deliver enduring value.
Unfortunately, there is no silver bullet for AI engineering, and no single benchmark will help you align AI with your users. You can start with a clear definition of your north-star metrics, instrument your AI application, and invest in reducing your experiment iteration time so you can converge on the knowledge it takes to optimize your AI. You can build consensus around the next great experiment and socialize the findings to foster a culture of continuous experimentation in AI engineering.
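Instrumentation can start as simply as logging every model interaction with enough metadata to join it to your north-star metric later. The event schema and JSONL sink below are assumptions for illustration, not a standard.

```python
import json
import time
import uuid
from typing import Optional

def log_interaction(user_id: str, model_version: str, prompt: str,
                    response: str, feedback: Optional[str] = None) -> dict:
    """Append one structured interaction event to a local JSONL log.

    In production this would feed your analytics pipeline; the schema here
    is an illustrative assumption.
    """
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "model_version": model_version,  # ties outcomes back to the experiment arm
        "prompt": prompt,
        "response": response,
        "feedback": feedback,            # e.g., thumbs up/down, joinable to north-star metrics
    }
    with open("interactions.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")
    return event

log_interaction("u-123", "baseline-v1", "Summarize my meeting notes", "...", feedback="thumbs_up")
```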
Agile AI engineering with an integrated development and experiment platform.