Product
Apr 10, 2025

Always Be Evaluating

Compared to the previous generation of AI and ML algorithms, generative AI models, such as LLMs and multimodal models that produce open-ended responses over unstructured text, images, or video, have introduced new industry challenges for quality evaluation.

When considering an image classifier, determining "correctness" is a straightforward application of simple logic: did the model predict the ground-truth label? When evaluating an LLM, responses are more varied, evaluations are more subjective, and you're likely asking the model to balance several different capabilities in your application.
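
To make the contrast concrete, here is a minimal sketch: classifier correctness reduces to an exact match against a ground-truth label, while an open-ended LLM response needs some scoring function (a human rater, a heuristic, or a judge model) standing in for exact match. The `rubric` callable below is a placeholder, not a specific library API.

```python
# A minimal sketch of the two evaluation settings. Classifier "correctness"
# is an exact match against a ground-truth label; an open-ended LLM response
# needs a scoring function instead (the `rubric` callable is a placeholder).

from typing import Callable

def classifier_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Exact-match accuracy: did the model predict the ground-truth label?"""
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

def llm_response_score(response: str, rubric: Callable[[str], float]) -> float:
    """Open-ended responses are scored by a rubric: a human rater, a heuristic,
    or a judge model, rather than by exact match."""
    return rubric(response)

# Usage: the classifier case reduces to a single unambiguous number.
print(classifier_accuracy(["cat", "dog", "cat"], ["cat", "dog", "dog"]))  # ~0.67
```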

For reproducibility in research, the community organizes around open benchmarks designed to assess specific capabilities. As foundation models move from research labs into production software systems, the capabilities that address customer needs take priority.

Benchmarks are the easiest way to narrow the pool of candidate AI models from the roughly 1.5 million open-weight models on the Hugging Face Hub to something manageable for your next experiment.
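
As an illustration of that narrowing step, here is a sketch assuming the `huggingface_hub` client library; the task filter, sort key, and result limit are arbitrary choices for the example, and exact parameter names can vary between library versions.

```python
# A sketch of narrowing the candidate pool, assuming the `huggingface_hub`
# client. The task filter, sort key, and limit are illustrative choices.

from huggingface_hub import list_models

# Pull a short list of popular open-weight text-generation models instead of
# wading through the full catalog of model repositories.
candidates = list_models(
    task="text-generation",
    sort="downloads",
    direction=-1,
    limit=10,
)

for model in candidates:
    print(model.id)
```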

However, open research benchmarks are subject to "benchmark hacking," where unprincipled model providers release weights tuned to top the leaderboards and ride the resulting publicity. Even setting that aside, the AI research community these benchmarks are built for likely does not represent your user base.

One of the most reliable ways to use benchmarks is as a "guardrail metric" for your AI model treatments. After comparing the base model with your finetuning or inference optimizations, do you observe a significant degradation of a key skill or capability as indicated by a dip in a benchmark score?
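
A guardrail check like this can be a few lines of code once you have benchmark scores for both the base model and your treatment. The benchmark names, scores, and tolerance below are illustrative placeholders.

```python
# A minimal guardrail check, assuming you already have benchmark scores for
# the base model and your treatment (a finetune or an inference optimization).
# The benchmark names, scores, and tolerance are illustrative placeholders.

BASE_SCORES = {"mmlu": 0.71, "gsm8k": 0.82, "humaneval": 0.65}
TREATMENT_SCORES = {"mmlu": 0.70, "gsm8k": 0.76, "humaneval": 0.66}

MAX_REGRESSION = 0.03  # tolerate small noise, flag anything larger

def guardrail_report(base: dict, treatment: dict, tol: float) -> list[str]:
    """Return the benchmarks where the treatment degrades beyond tolerance."""
    return [
        name
        for name, base_score in base.items()
        if base_score - treatment.get(name, 0.0) > tol
    ]

regressions = guardrail_report(BASE_SCORES, TREATMENT_SCORES, MAX_REGRESSION)
if regressions:
    print("Guardrail tripped on:", ", ".join(regressions))  # here: gsm8k
else:
    print("No significant capability regression detected.")
```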

Recent flagship AI models like Grok 3, GPT-4.5, and now Llama 4 were released despite lackluster research benchmark performance. Each emphasizes "better vibes," conversational skills, and a neutralized moral/political bias.

What matters for users of your AI application?

It's never been easier to assess models for your own application: data synthesis and automatic evaluators have reduced custom evaluation to prompting a tool with a bit of context about your application and data.
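
For example, a minimal auto evaluator in the LLM-as-a-judge style might look like the sketch below, assuming the `openai` Python client; the judge model name, rubric wording, and application context are placeholders to adapt to your own product.

```python
# A sketch of an auto evaluator in the LLM-as-a-judge style, assuming the
# `openai` Python client. Model name, rubric, and context are placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

APP_CONTEXT = "You evaluate a support assistant for a billing product."
RUBRIC = (
    "Score the response from 1 to 5 for correctness and helpfulness given "
    "the user's question. Reply with only the integer score."
)

def judge(question: str, response: str) -> int:
    """Ask a judge model to grade one application response."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model; swap in your own
        messages=[
            {"role": "system", "content": f"{APP_CONTEXT}\n{RUBRIC}"},
            {"role": "user", "content": f"Question: {question}\nResponse: {response}"},
        ],
    )
    return int(completion.choices[0].message.content.strip())

print(judge("Why was I charged twice?", "Duplicate charges are usually temporary holds."))
```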

To consistently succeed with your AI initiatives, it's essential to recognize that any benchmark, even one customized with your application in mind, is still only a proxy for the metrics you actually want to optimize: customer satisfaction and business value. These offline metrics offer a skewed view of AI model fitness and are often poor predictors of what will delight your users.

So, how do you make sense of this to build great AI products?

I advocate for evaluating every step of the experiment! Review the reported benchmarks, check the model's Elo ratings on the chatbot arena, and see what developers report on r/localllama. Run your custom fine-grained skills assessments, compare the loss amongst your finetunes, and score and filter your training samples for quality. Measure latency and any other metrics that will help you to confidently deploy your best treatments for users to assess in an online experiment.
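
As one concrete item from that checklist, here is a small sketch for measuring response latency across candidate treatments before promoting one to an online experiment; `generate` is a stand-in for whatever inference call your application uses.

```python
# Measure response latency for a candidate treatment before promoting it to
# an online experiment. `generate` is a stand-in for the real inference call.

import statistics
import time

def measure_latency(generate, prompts: list[str]) -> dict:
    """Time each call and report p50/p95 latency in seconds."""
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    timings.sort()
    return {
        "p50": statistics.median(timings),
        "p95": timings[int(0.95 * (len(timings) - 1))],
    }

# Usage with a trivial stand-in for the real model call:
print(measure_latency(lambda prompt: prompt.upper(), ["hello world"] * 100))
```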

With millions of open-weight models and thousands of benchmarks, it's easy to find false positives in your search for the best AI. Don't just use metrics that support a narrative created after the fact; choose your metrics ahead of time with a hypothesis in mind. Your defense against institutionalizing "bad intel" into your AI initiative is to raise your scientific standards in AI evaluation.

When you discover AI models that excel across several unrelated evaluation methods, you can reason with more confidence and explainability and build consensus around your next AI experiment. The online, end-to-end evaluation helps you build knowledge about what delights the users of your AI. The final launch/no-launch decision for the next AI update should be data-driven, made only after the update is shown to positively impact meaningful business metrics.

So, what's your hypothesis?

Agile AI engineering with an integrated development and experiment platform.