2025-09-10

Optimizing the AI Data Pipeline: From Bottleneck to Performance Driver

Insights from industry leaders on the hidden costs, compliance challenges, multilingual scaling, and pipeline optimization of AI data.

Why the Right AI Data Matters

Behind every AI breakthrough is one simple truth: without the right data and rigorous AI data analysis to fuel it, even the smartest models fail. Bad data doesn't just slow projects: it destroys trust, wastes investment, and puts entire strategies at risk. In a world where Artificial Intelligence is quickly moving from experimental to essential, understanding how to fix your data foundations is the difference between leading and falling behind.

To achieve true scale and make reliable data-driven decisions, organizations must address the quality, compliance, and multilingual complexity of their large datasets right at the source.

We invited Sam Shamsan (CEO of Blomega) and Agustín Da Fieno Delucchi (Director of Globalization AI and Data Science at Microsoft) to share their expert insights into how to effectively use data to power AI models, a discussion moderated by Acolad's AI Data Program Manager, Jennifer Nacinelli.

The discussion, “AI's Data Secret: Why Poor Data is Killing Your Models – and How to Fix It”, ranged from how poor data undermines AI investments to compliance, multilingual data, and the "75% rule": that most AI effort goes into data preparation, not model building.

Key topics covered:

  • The hidden costs of poor data quality
  • Turning compliance requirements into competitive advantage
  • The importance of multilingual and culturally relevant data
  • Why 75% of AI success lies in data preparation

AI Data Quality: Hidden Costs and Business Risks

Bad Data Multiplies Downstream

Bad data doesn’t just produce wrong results; it misleads entire AI systems. Sam Shamsan explained how mislabeled data can spiral into major business risks. He gave the example of how a misclassified banking dispute can cause genuine fraud to be downplayed while harmless queries get flagged as fraud, damaging both customer trust and brand reputation.

“The bad data dilemma… it misleads the AI model. It doesn’t only fail to guide it, but it pushes it into a direction that is not desirable at all.”


Sam Shamsan 

CEO, Blomega

The Multiplier Effect on Large Datasets

We've all heard of the classic principle of garbage in, garbage out, which is particularly important when working with data for AI models. But beyond that, there's also a multiplier effect to take into account.

As Agustín Delucchi explained, mistakes in transformation stages don't just waste effort; they amplify risks and lead to systemic AI failures. For the Microsoft Director of Globalization AI and Data Science, the rule is clear: about 75% of the effort goes into the data, not into building models.
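The multiplier effect is easy to see with a little arithmetic. The sketch below is an illustration of the principle, not a figure from the discussion: it assumes a pipeline of independent stages, each with its own error rate, and shows how per-stage errors compound.

```python
# Illustrative sketch: how per-stage error rates compound across a
# multi-stage data pipeline (assumes stage errors are independent).

def end_to_end_accuracy(stage_accuracies):
    """Probability a record survives every stage uncorrupted."""
    acc = 1.0
    for a in stage_accuracies:
        acc *= a
    return acc

# Five stages (e.g. collection, labeling, cleaning, transformation,
# loading), each 95% accurate, leave far fewer than 95% of records
# fully correct end to end.
stages = [0.95] * 5
print(round(end_to_end_accuracy(stages), 3))  # 0.774
```

In other words, five stages that each look "good enough" in isolation can quietly corrupt nearly a quarter of the dataset.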

The Multilingual Impact of Bad Data

The cost of bad data can “sneak up on teams”, according to Jennifer Nacinelli. At the start, a dataset that looks “good enough” can get a project moving, but then the cracks show.

Suddenly, you are dealing with biased outputs, QA cycles that never end, or stakeholders losing confidence because the results do not reflect reality.

“In multilingual projects this is even more obvious: if the data is uneven across languages, performance drops quickly and the client feels it. I have had to step into projects where the budget was drained not by model training or complex pipelines, but by fixing the data after the fact. That is the real hidden cost.”


Jennifer Nacinelli 

AI Data Program Manager, Acolad

AI Data Compliance and Governance as Business Advantage

Reframing AI Data Compliance

AI data regulations are ever-evolving, especially given the different approaches taken in different regions of the world. Organizations often worry about regulations such as GDPR, HIPAA, or the EU AI Act slowing them down. But that conversation can be completely different if compliance is reframed as part of the value proposition.

Regulations as Market Gateways

A crucial factor for many organizations, particularly in regulated industries and certain geographies, is that AI compliance is not optional.

Regulations like the EU's Artificial Intelligence Act are introducing new requirements for many businesses hoping to harness the power of AI.

All this means it's vitally important to build compliance into those processes right from the start: how you gather, maintain, and refine your data. Get it right, and you can benefit from market access that rivals may miss out on.

Three Dimensions of Compliance

Sam identified three crucial dimensions of AI compliance:

  • Customer privacy – Users share deeply personal data with AI tools, requiring strict safeguards.
  • Training data sourcing – Early LLMs scraped the internet freely, but licensing is now becoming critical as regulations tighten in some regions.
  • Highly regulated industries – Healthcare and finance demand sector-specific oversight.

As Sam noted, “AI is not a product anymore, it’s an ecosystem.” Navigating compliance across these dimensions is a key differentiator for AI providers, and essentially the foundation of strong AI data governance.

“Sooner rather than later, [regulations] are going to become entry points for a market. If you want to be really in the business, you’re gonna have to do it.”


Agustín Da Fieno Delucchi 

Director of Globalization AI and Data Science, Microsoft


Beyond English-only: Scaling Globally with Diverse Data Sources

Market Reach and Cultural Relevance

AI built only in English leaves businesses at risk of missing global opportunities. Advances like transfer learning (a machine learning technique where a pre-trained model's knowledge is applied to a new, related task, rather than training a model from scratch) and multilingual model bridging can help improve AI performance on low-resource languages.
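As a rough illustration of the transfer-learning idea defined above, frozen pretrained knowledge plus a small trainable head, here is a toy sketch. The "encoder", its features, and the training data are all invented for the example; real systems use large pretrained language models in their place.

```python
# Toy illustration of transfer learning: a "pretrained" encoder is kept
# frozen, and only a small new head is trained for the target task.
# The encoder, features, and data are all invented for this sketch.
import math

# Frozen "pretrained encoder": fixed token features (never updated).
PRETRAINED = {"a": [1.0, 0.0], "b": [1.0, 0.0],
              "c": [0.0, 1.0], "d": [0.0, 1.0]}

def encode(text):
    feats = [PRETRAINED[t] for t in text]
    return [sum(col) / len(feats) for col in zip(*feats)]  # mean-pool

# New trainable head: logistic regression over the frozen features.
w = [0.0, 0.0]
data = [("ab", 1), ("cd", 0), ("ba", 1), ("dc", 0)]
for _ in range(100):
    for text, y in data:
        x = encode(text)
        p = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
        w = [wi + 0.5 * (y - p) * xi for wi, xi in zip(w, x)]

def predict(text):
    z = sum(wi * xi for wi, xi in zip(w, encode(text)))
    return 1 if z > 0 else 0

print(predict("ab"), predict("cd"))  # 1 0
```

The point of the sketch is the division of labor: the expensive pretrained representation is reused as-is, and only the small task-specific layer needs new (possibly low-resource) training data.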

“Most of the world does not think or communicate in English. When the model ignores that, they miss not only opportunity, but context and culture and a huge chunk of people.”


Sam Shamsan 

CEO, Blomega

Context is King

When it comes to using LLMs for language-related tasks, Agustín argued that their real strength lies in their ability to transfer meaning across contexts, not just languages.

He highlighted the opportunity for organizations to reshape translation memories and linguistic assets into richer, contextualized data for AI training.

The context-adaptability of LLMs is also crucial for highly context-specific tasks beyond the written word, for example their capabilities with audio and video formats.

Delivering Genuinely Multilingual Data

Language is not just translation; it is context, culture, and user behavior. If the training data does not reflect that, adoption stalls. Scalability is not just about more computing power; it is about making the data genuinely multilingual and locally relevant.

“One of the biggest misconceptions I deal with is the idea that English data is enough, and a translation of it will suffice. I manage projects every day where clients are rolling out AI solutions globally, and the results are very clear: a model trained in English might work fine in the US, but it fails when you put it in front of users in Germany, Brazil, or Korea.”


Jennifer Nacinelli 

AI Data Program Manager, Acolad
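The "uneven data across languages" problem Jennifer describes can be caught early with a simple audit. The hypothetical sketch below flags languages whose share of a dataset falls below a threshold; the language codes, data, and threshold are assumptions for illustration.

```python
# Hypothetical sketch: auditing a multilingual dataset for uneven
# coverage before training. Language codes, samples, and the 10%
# threshold are illustrative assumptions.
from collections import Counter

def language_coverage_gaps(samples, min_share=0.1):
    """Return languages whose share of the dataset falls below min_share."""
    counts = Counter(lang for lang, _ in samples)
    total = sum(counts.values())
    return sorted(lang for lang, n in counts.items() if n / total < min_share)

# An English-heavy dataset: 80% en, 15% de, 5% ko.
samples = [("en", "text")] * 80 + [("de", "text")] * 15 + [("ko", "text")] * 5
print(language_coverage_gaps(samples))  # ['ko']
```

A check like this is cheap to run on every dataset delivery, which is exactly when fixing coverage gaps is still cheap too.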

Optimizing Your AI Data Pipeline: The 75% Rule

Preparing Data is the Hard Part

A common failure in AI projects is underestimating the role of AI training data, as too little time, expertise, and project capacity are dedicated to preparation. This stage covers crucial techniques to maximize dataset quality and readiness, including compliance checks, removing personal information, ensuring inclusivity, and other actions where human expertise is essential. 
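One of the preparation steps mentioned above, removing personal information, can be sketched as follows. The patterns here are a minimal illustration only; production PII removal needs far broader pattern coverage plus the human review the discussion emphasizes.

```python
# Minimal sketch of one data-preparation step: masking obvious personal
# information before data reaches training. The two patterns below are
# illustrative; real pipelines cover many more PII types and pair
# automated scrubbing with human review.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"(?<!\w)\+?\d[\d\s().-]{7,}\d\b"),
}

def scrub_pii(text):
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub_pii("Contact jane.doe@example.com or +1 555 123 4567."))
# Contact [EMAIL] or [PHONE].
```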

“75% of [the effort] is going to be preparing the data, making sure that we minimize and reduce as much as possible the bias.”


Agustín Da Fieno Delucchi 

Director of Globalization AI and Data Science, Microsoft

Optimizing Data Pipelines

Many businesses feel that their data pipeline is the bottleneck. Jennifer says she has felt that frustration in projects herself, but points out that everything changes when the pipeline is redesigned with efficiency in mind.

For example, by improving speaker sourcing, annotation, validation, and throughput management, what is often the slowest part of delivery can become the part that actually drives speed. Our AI Data Program Manager gives a very clear example:

"In one case, a client expected us to take weeks to process multilingual audio, and we got it done in days because the pipeline was structured well. For me, the pipeline is not the back office of AI, it is the engine that keeps everything moving."
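One common structural change behind that kind of weeks-to-days improvement is running independent per-item stages concurrently instead of sequentially. The sketch below is hypothetical: `validate_clip` stands in for any I/O-bound per-clip stage (transcription QA, annotation checks), and all names are illustrative.

```python
# Hypothetical sketch: processing audio clips concurrently instead of
# one at a time. `validate_clip` stands in for any I/O-bound per-item
# pipeline stage; names and data are illustrative.
from concurrent.futures import ThreadPoolExecutor

def validate_clip(clip_id):
    # Placeholder for real per-clip work (network calls, QA checks).
    return clip_id, "ok"

clips = [f"clip_{i:03d}" for i in range(8)]

# Four workers process the batch in parallel; for I/O-bound stages,
# wall-clock time drops roughly in proportion to the worker count.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(validate_clip, clips))

print(len(results))  # 8
```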

The Importance of an AI Agent

To remain competitive, companies that deal with the demands of managing global content cannot afford to be left behind on AI-driven platforms. They need to develop their own AI agents and to intertwine AI with human expertise throughout the process.

Why the Human-in-the-Loop Still Matters

In the field of AI Data Services, there is a constant tension between wanting everything automated, as fast as possible, and realizing you cannot fully remove people from the process.

But, as Jennifer points out, the best outcomes can often come when human experts are involved early, and not as a last resort.

“On some of our healthcare and finance projects, it has been human reviewers who caught subtle but critical errors that the system would never have flagged. These are not 'nice-to-have' checks; they are the reason the outputs are trusted. Human-in-the-loop is what makes AI usable in the real world.”


Jennifer Nacinelli 

AI Data Program Manager, Acolad

Precision at Source Equals Data-Driven Decisions and Performance at Scale

AI success depends on high-quality, well-prepared data from the start. Compliance, inclusivity, and optimization aren't costs to be minimized; they are levers for performance and trust.

For leaders, the takeaway is clear: investing in your data pipeline is not just about reducing risk, it’s about enabling scale, market entry, and customer confidence.

Key Takeaways:

  • Audit your data – Poor quality silently undermines performance and trust.
  • Treat compliance as strategy – Regulations can become a market advantage.
  • Invest in multilingual data – English-only approaches limit growth.
  • Prioritize preparation – Remember the 75% rule: most effort is in data.
  • Build human-in-the-loop systems – Trust requires validation and oversight.

Ready to Build AI Efficiency With Quality Data?
