
Safer by design: FinLLM’s safety edge in financial AI

Outperforming the LLM giants

by Aveni | 31/07/2025 09:00:00

Safety is fast becoming the defining benchmark for deploying AI in financial services.

Accuracy alone is no longer enough. Financial firms face unique risks when using large language models (LLMs), from the danger of misinformation to regulatory breaches and biased outcomes that can harm customers or damage trust. That is why, at Aveni, we have made safety central to the design and development of FinLLM.

Our latest safety benchmarking shows that FinLLM does not just meet expectations; it exceeds them, outperforming several larger, well-known models across the safety metrics that matter most in financial services contexts. And we are doing it through a targeted, data-driven approach that goes beyond generic alignment to tackle the real-world risks financial firms care about.

Why safety in FS LLMs is different

Traditional AI safety benchmarks often focus on generic risks like toxicity or hallucination. While these are relevant, they do not go far enough for financial services. The stakes are higher: inaccurate advice, misleading information, or biased outputs could lead to poor customer outcomes, regulatory breaches or even financial harm.

That is why FinLLM’s safety benchmarking is designed specifically around financial use cases. We assess our models not just on general safety but across a set of FS-specific risk categories, including:

  • Toxicity: avoiding harmful or offensive content, especially in sensitive financial contexts.
  • Bias: preventing discriminatory or skewed outputs.
  • Misalignment: ensuring outputs align with firm policies, values, and regulatory expectations.
  • Misinformation: reducing incorrect or misleading information that could cause harm.
  • Hallucination: minimising incorrect or fabricated responses.
  • Privacy and IP: protecting personal data and intellectual property rights.

By measuring FinLLM against each of these distinct categories, we ensure that any improvements in one area do not come at the cost of another. This approach helps us avoid the common AI pitfall where, for example, reducing harm inadvertently increases refusal rates or introduces unintended bias.
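
To make that separation concrete, here is a minimal Python sketch of what category-separated evaluation tracks can look like. The dataset names (apart from BBQ and BOLD, the bias datasets cited later in this post) and the scoring function are illustrative placeholders, not Aveni's actual pipeline.

    # Illustrative sketch of category-separated safety evaluation.
    # Dataset names and the scoring function are placeholders, not Aveni's pipeline;
    # BBQ and BOLD are the bias datasets mentioned later in this post.
    from statistics import mean

    # Each risk category keeps its own evaluation track and datasets, so a gain
    # in one area can be checked against regressions in every other area.
    RISK_TRACKS = {
        "toxicity":       ["toxicity_eval_set"],
        "bias":           ["bbq_subset", "bold_subset"],
        "misalignment":   ["policy_alignment_set"],
        "misinformation": ["misinformation_set"],
        "hallucination":  ["hallucination_set"],
        "privacy_ip":     ["privacy_ip_set"],
    }

    def score_dataset(model, dataset_name: str) -> float:
        """Placeholder: run the model over one dataset and return a 0-100 safety score."""
        raise NotImplementedError

    def evaluate(model) -> dict[str, float]:
        """Return one score per risk category rather than a single blended number."""
        return {
            category: mean(score_dataset(model, ds) for ds in datasets)
            for category, datasets in RISK_TRACKS.items()
        }

Keeping one score per category, rather than a single blended number, is what makes trade-offs between categories visible.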

The results: FinLLM outperforms on safety, even against bigger models

In our latest round of evaluations, we benchmarked FinLLM against a range of popular open-source instruct models, including OLMo 2 7B, Qwen3 8B, Granite 3.3 8B, and Mistral 7B. We tested three different versions of FinLLM, each at various stages of training and alignment, using industry-standard safety datasets.

The headline: FinLLM 7B SFT outperformed all other models on average safety score, achieving 69.37, ahead of models with similar parameter counts.

Here is how it compares:

Model                           Average safety score
FinLLM 7B SFT:Safety            69.37
FinLLM 7B SFT v1* (Prompt)      67.69
FinLLM 7B SFT v1                62.33
Granite 3.3 8B                  67.31
Qwen3 8B                        65.00
OLMo 2 7B                       66.80
Mistral 7B                      58.66


This demonstrates FinLLM’s ability to generate safer, more reliable outputs in financial contexts, outperforming similarly sized open models.


Our method: a data-driven approach to AI safety

At Aveni, we evaluate FinLLM at every training stage: pre-training, supervised fine-tuning (SFT), and post-training (reinforcement learning and DPO), using the same rigorous benchmarks. This continuous evaluation allows us to identify gaps in performance early and take corrective action.

Importantly, we do not rely on a one-size-fits-all approach. Each risk category is treated as a separate evaluation track with its own dedicated datasets. This separation is vital because risks are often interdependent. For instance, addressing toxicity without nuanced understanding could unintentionally introduce bias. By isolating each risk, we are able to fine-tune FinLLM more precisely and safely.

We have also adopted a data-driven strategy for improvement: gaps identified through benchmarking inform the selection of new training datasets. For example, we observed that FinLLM currently underperforms slightly in bias detection, as measured by datasets like BBQ and BOLD. In response, we are expanding our training corpus in this area for the next development cycle.
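
As an illustration of that feedback loop, the sketch below compares per-category scores across training stages and flags anything under a target threshold, so the under-performing categories can be prioritised when sourcing new datasets. The stage names, numbers, and threshold are assumptions for the example, not published figures.

    # Illustrative sketch: flag risk categories that sit below a target score at
    # any training stage. The threshold and numbers are made up for the example.
    SAFETY_TARGET = 65.0  # assumed per-category target, not a published figure

    def find_gaps(stage_scores: dict[str, dict[str, float]]) -> list[tuple[str, str, float]]:
        """stage_scores maps training stage -> {risk category -> score}."""
        gaps = []
        for stage, scores in stage_scores.items():
            for category, score in scores.items():
                if score < SAFETY_TARGET:
                    gaps.append((stage, category, score))
        return gaps

    # Example with invented numbers: bias under-performs, so bias datasets are
    # prioritised when selecting training data for the next development cycle.
    example = {
        "pre-training": {"bias": 61.2, "toxicity": 66.5},
        "sft":          {"bias": 63.8, "toxicity": 70.1},
    }
    print(find_gaps(example))  # [('pre-training', 'bias', 61.2), ('sft', 'bias', 63.8)]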

Applied safety for financial services

What does this mean for financial firms? It means that FinLLM is not only safer in principle but safer in practice, measured, monitored, and continuously improved using financial services-relevant benchmarks.

We tested two safety mitigation strategies: system prompt engineering (FinLLM 7B SFT v1* (Prompt)) and supervised fine-tuning (FinLLM 7B SFT:Safety). Both approaches showed material improvements, and when combined, they have enabled FinLLM to surpass several competitors. Soon, we will be adding new bias evaluation datasets and exploring further alignment techniques to continue raising the bar.
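
For illustration, this is roughly what the prompt-engineering route can look like: a safety-oriented system prompt prepended to every request. The wording and the helper function below are assumptions for the sketch, not FinLLM’s actual prompt or API.

    # Illustrative only: a safety-oriented system prompt prepended to each request.
    # The wording and the helper are assumptions, not FinLLM's actual prompt or API.
    SAFETY_SYSTEM_PROMPT = (
        "You are an assistant for a regulated financial services firm. "
        "Do not provide personalised financial advice. Decline requests that "
        "involve discrimination, disclosure of personal data, or unverified "
        "claims, and state uncertainty rather than guessing."
    )

    def build_messages(user_query: str) -> list[dict[str, str]]:
        """Wrap a user query in a chat-style message list with the safety prompt first."""
        return [
            {"role": "system", "content": SAFETY_SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ]

Prompting of this kind needs no retraining, while supervised fine-tuning bakes the behaviour into the model itself; the benchmark results above reflect both routes.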

Safer AI, smarter outcomes

The financial services industry cannot afford to deploy LLMs that are merely “good enough.” Safety, accuracy, and alignment must be built in by design. With FinLLM, we are proving that it is possible to deliver this without needing massive models that come with prohibitive costs or risks.

Our latest safety benchmarking shows that size is not everything: precision, alignment, and domain expertise matter just as much. FinLLM offers a smarter, safer path forward for AI in financial services.
