2025-09-10

Optimizing the AI Data Pipeline: From Bottleneck to Performance Driver

Insights from industry leaders on tackling hidden costs, compliance challenges, multilingual scaling, and pipeline optimization in AI data.

Why the Right AI Data Matters

Behind every AI breakthrough is one simple truth: without the right data and rigorous AI data analysis to fuel it, even the smartest models fail. Bad data doesn't just slow projects: it destroys trust, wastes investment, and puts entire strategies at risk. In a world where Artificial Intelligence is quickly moving from experimental to essential, understanding how to fix your data foundations is the difference between leading and falling behind.

To achieve true scale and make reliable data-driven decisions, organizations must address the quality, compliance, and multilingual complexity of their large datasets right at the source.

We invited Sam Shamsan (CEO of Blomega) and Agustín Da Fieno Delucchi (Director of Globalization AI and Data Science at Microsoft) to share their expert insights into how to effectively use data to power AI models, a discussion moderated by Acolad's AI Data Program Manager, Jennifer Nacinelli.

The discussion, “AI's Data Secret: Why Poor Data is Killing Your Models – and How to Fix It”, ranged from how poor data undermines AI investments to compliance, multilingual data, and the "75% rule": the idea that most AI effort goes into data preparation, not model building.

Key topics covered:

The hidden costs of poor data quality
Turning compliance requirements into competitive advantage
The importance of multilingual and culturally relevant data
Why 75% of AI success lies in data preparation

“The bad data dilemma… it misleads the AI model. It doesn’t only fail to guide it, but it pushes it into a direction that is not desirable at all.”

Sam Shamsan

CEO, Blomega

The Multiplier Effect on Large Datasets

We've all heard of the classic principle of garbage in, garbage out, which is particularly important when working with data for AI models. But beyond that, there's also a multiplier effect to take into account.

As Agustín Da Fieno Delucchi explained, mistakes in transformation stages don't just waste effort; they amplify risks and lead to systemic AI failures. For the Microsoft Director of Globalization AI and Data Science, the rule is clear: about 75% of the effort really goes into preparing the data, not into building models.

“In multilingual projects this is even more obvious: if the data is uneven across languages, performance drops quickly and the client feels it. I have had to step into projects where the budget was drained not by model training or complex pipelines, but by fixing the data after the fact. That is the real hidden cost.”

Jennifer Nacinelli

AI Data Program Manager, Acolad

Regulations as Market Gateways

A crucial factor for many organizations, particularly in regulated industries and certain geographies, is that AI compliance is not optional.

Regulations like the EU's Artificial Intelligence Act are now introducing new hurdles for many businesses hoping to harness the power of AI.

All this means that it's vitally important to build compliance into your processes right from the start, in how you gather, maintain, and refine your data. If you get it right, you can benefit from market access that rivals could miss out on.

Three Dimensions of Compliance

Sam identified three crucial dimensions of AI compliance:

Customer privacy – Users share deeply personal data with AI tools, requiring strict safeguards.
Training data sourcing – Early LLMs scraped the internet freely, but licensing is now becoming critical as regulations tighten in some regions.
Highly regulated industries – Healthcare and finance demand sector-specific oversight.

As Sam noted, “AI is not a product anymore, it’s an ecosystem.” Navigating compliance across these dimensions is a key differentiator for AI providers — essentially the foundation of strong AI data governance.

“Sooner rather than later, [regulations] are going to become entry points for a market. If you want to be really in the business, you’re gonna have to do it.”

Agustín Da Fieno Delucchi

Director of Globalization AI and Data Science, Microsoft

“Most of the world does not think or communicate in English. When the model ignores that, they miss not only opportunity, but context and culture and a huge chunk of people.”

Sam Shamsan

CEO, Blomega

Context is King

When it comes to using LLMs for language-related tasks, Agustín argued that their real strength lies in their ability to transfer meaning across contexts, not just languages.

He highlighted the opportunity for organizations to reshape translation memories and linguistic assets into richer, contextualized data for AI training.

The context adaptability of LLMs is also crucial for highly context-specific tasks beyond the written word, such as work with audio and video formats.

“One of the biggest misconceptions I deal with is the idea that English data is enough, and a translation of it will suffice. I manage projects every day where clients are rolling out AI solutions globally, and the results are very clear: a model trained in English might work fine in the US, but it fails when you put it in front of users in Germany, Brazil, or Korea.”

Jennifer Nacinelli

AI Data Program Manager, Acolad

“75% of [the effort] is going to be preparing the data, making sure that we minimize and reduce as much as possible the bias.”

Agustín Da Fieno Delucchi

Director of Globalization AI and Data Science, Microsoft

Optimizing Data Pipelines

Many businesses feel that their data pipeline is the bottleneck. Jennifer has felt that frustration in her own projects, but she points out that everything can change when the pipeline is redesigned with efficiency in mind.

For example, by improving speaker sourcing, annotation, validation, and throughput management, what is often the slowest part of delivery can become the part that actually drives speed. Our AI Data Program Manager gives a very clear example:

"In one case, a client expected us to take weeks to process multilingual audio, and we got it done in days because the pipeline was structured well. For me, the pipeline is not the back office of AI, it is the engine that keeps everything moving."

“On some of our healthcare and finance projects, it has been human reviewers who caught subtle but critical errors that the system would never have flagged. These are not ‘nice-to-have’ checks, they are the reason the outputs are trusted. Human-in-the-loop is what makes AI usable in the real world.”

Jennifer Nacinelli

AI Data Program Manager, Acolad

Precision at Source Equals Data-Driven Decisions and Performance at Scale

AI success depends on high-quality, well-prepared data from the start. Compliance, inclusivity, and optimization aren’t costs to be minimized—they are levers for performance and trust.

For leaders, the takeaway is clear: investing in your data pipeline is not just about reducing risk, it’s about enabling scale, market entry, and customer confidence.

Key Takeaways:

Audit your data – Poor quality silently undermines performance and trust.
Treat compliance as strategy – Regulations can become a market advantage.
Invest in multilingual data – English-only approaches limit growth.
Prioritize preparation – Remember the 75% rule: most effort is in data.
Build human-in-the-loop systems – Trust requires validation and oversight.
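
As a starting point for the first and third takeaways, a basic audit can be sketched in a few lines of Python. This is a minimal illustration, not a method described by the speakers: the record layout (`text` and `lang` fields), the function name `audit_records`, and the `min_share` threshold are all assumptions made for the example. It counts empty entries and exact duplicates, and flags languages whose share of the dataset falls below the chosen threshold.

```python
from collections import Counter


def audit_records(records, min_share=0.10):
    """Summarize basic quality signals for a list of {"text", "lang"} dicts.

    Hypothetical helper: the field names and the min_share default are
    assumptions for illustration only.
    """
    total = len(records)
    # Normalize text so copies differing only in case or spacing match.
    normalized = [(r.get("text") or "").strip().lower() for r in records]
    empty = sum(1 for t in normalized if not t)
    # Exact duplicates among non-empty texts: every copy beyond the first.
    text_counts = Counter(t for t in normalized if t)
    duplicates = sum(n - 1 for n in text_counts.values())
    # Share of records per language, to spot uneven coverage.
    lang_counts = Counter(r.get("lang") or "unknown" for r in records)
    shares = {lang: n / total for lang, n in lang_counts.items()} if total else {}
    underrepresented = sorted(l for l, s in shares.items() if s < min_share)
    return {
        "total": total,
        "empty": empty,
        "duplicates": duplicates,
        "language_shares": shares,
        "underrepresented": underrepresented,
    }


# Made-up sample data, purely for demonstration.
sample = [
    {"text": "Hello world", "lang": "en"},
    {"text": "hello world ", "lang": "en"},
    {"text": "Hallo Welt", "lang": "de"},
    {"text": "", "lang": "pt"},
    {"text": "Good morning", "lang": "en"},
]

if __name__ == "__main__":
    print(audit_records(sample, min_share=0.25))
```

A check like this only surfaces the most obvious issues. Judging whether data is culturally relevant or correctly annotated still calls for the human review discussed above.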
