2025-09-10

Optimizing the AI Data Pipeline: From Bottleneck to Performance Driver

Insights from industry leaders on tackling the hidden costs, compliance challenges, multilingual scaling, and pipeline optimization with AI Data.


Why the Right AI Data Matters

Behind every AI breakthrough is one simple truth: without the right data and rigorous AI data analysis to fuel it, even the smartest models fail. Bad data doesn't just slow projects—it destroys trust, wastes investment, and puts entire strategies at risk. In a world where Artificial Intelligence is quickly moving from experimental to essential, understanding how to fix your data foundations is the difference between leading and falling behind.

To achieve true scale and make reliable data-driven decisions, organizations must address the quality, compliance, and multilingual complexity of their large datasets right at the source.

We invited Sam Shamsan (CEO of Blomega) and Agustín Da Fieno Delucchi (Director of Globalization AI and Data Science at Microsoft) to share their expert insights into how to use data effectively to power AI models, in a discussion moderated by Acolad's AI Data Program Manager, Jennifer Nacinelli.

During this fascinating discussion, “AI's Data Secret: Why Poor Data is Killing Your Models – and How to Fix It”, they tackled why poor data silently undermines AI investments, how compliance can become a business advantage, and why managing diverse data sources is critical to scale globally. They also explored the "75% rule"—why most of the effort in AI projects lies in preparing data, not building models.

Key topics covered:

  • The hidden costs of poor data quality
  • Turning compliance requirements into competitive advantage
  • The importance of multilingual and culturally relevant data
  • Why 75% of AI success lies in data preparation

AI Data Quality: Hidden Costs and Business Risks

Bad Data Multiplies Downstream

Bad data doesn’t just produce wrong results—it misleads entire AI systems. Sam Shamsan explained how mislabeled data can spiral into major business risks. He gave the example of how a misclassified banking dispute can cause genuine fraud to be downplayed while harmless queries get flagged as fraud, damaging both customer trust and brand reputation.

“The bad data dilemma… it misleads the AI model. It doesn’t only fail to guide it, but it pushes it into a direction that is not desirable at all.”

Sam Shamsan 

CEO, Blomega
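
One way to catch mislabeled examples like the banking dispute above before they reach training is to have several annotators label each item and flag the ones where they disagree. The sketch below is only an illustration of that idea: the function name, the data layout, and the 0.75 agreement threshold are assumptions, not something described by the speakers.

```python
from collections import Counter

def flag_disagreements(annotations, min_agreement=0.75):
    """Return {item_id: agreement} for items whose majority label
    is shared by fewer than min_agreement of the annotators."""
    flagged = {}
    for item_id, labels in annotations.items():
        # Count how many annotators chose the most common label.
        _, top_count = Counter(labels).most_common(1)[0]
        agreement = top_count / len(labels)
        if agreement < min_agreement:
            flagged[item_id] = agreement
    return flagged

# Hypothetical tickets, each labeled by several annotators.
annotations = {
    "t1": ["fraud", "fraud", "fraud"],
    "t2": ["fraud", "not_fraud", "not_fraud", "fraud"],
    "t3": ["not_fraud", "not_fraud", "fraud"],
}
flagged = flag_disagreements(annotations)
print(sorted(flagged))
```

Items that come back flagged would go to an expert for a second look rather than straight into the training set.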

The Multiplier Effect on Large Datasets

We've all heard of the classic principle of garbage in, garbage out, which is particularly important when working with data for AI models. But beyond that, there's also a multiplier effect to take into account.

As Agustín Delucchi explained, mistakes in transformation stages don't just waste effort, they amplify risks and lead to systemic AI failures.

“There is the multiplier effect that we have when we don’t process well. About 75% of the effort [is] really in building models.”

Agustín Da Fieno Delucchi 

Director of Globalization AI and Data Science, Microsoft

The Multilingual Impact of Bad Data

The cost of bad data can “sneak up on teams”, according to Jennifer Nacinelli. At the start, a dataset that looks “good enough” can get a project moving, but then the cracks show.

Suddenly, you are dealing with biased outputs, QA cycles that never end, or stakeholders losing confidence because the results do not reflect reality.

“In multilingual projects this is even more obvious: if the data is uneven across languages, performance drops quickly and the client feels it. I have had to step into projects where the budget was drained not by model training or complex pipelines, but by fixing the data after the fact. That is the real hidden cost.”

Jennifer Nacinelli 

AI Data Program Manager, Acolad

For success in the AI age, not just any data will do. Acolad delivers targeted, accurate, and reliable datasets to ensure the best possible AI and machine learning performance.

AI Data Compliance and Governance as Business Advantage

Reframing AI Data Compliance

AI data regulations are constantly evolving, and approaches differ across regions of the world. Organizations often worry that regulations such as GDPR, HIPAA, or the EU AI Act will slow them down. But that conversation can change completely if compliance is reframed as part of the value proposition.

“I have had enterprise clients realize that showing strong and transparent data practices is a selling point, not just a legal obligation. In regulated industries especially, being able to prove you are safe to work with becomes a differentiator. Compliance is not just about avoiding fines, it is about being the trusted partner in the room.”

Jennifer Nacinelli 

AI Data Program Manager, Acolad

Regulations as Market Gateways

A crucial factor for many organizations, particularly in regulated industries and certain geographies, is that AI tech compliance is not optional.

Regulations like the EU's Artificial Intelligence Act are now introducing new requirements for many businesses hoping to harness the power of AI.

All this means that it's vitally important to build compliance into your processes right from the start, in how you gather, maintain, and refine your data. If you get it right, you can benefit from market access that rivals could miss out on.

“Sooner rather than later, [regulations] are going to become entry points for a market… if you want to be really in the business, you’re gonna have to do it.”

Agustín Da Fieno Delucchi 

Director of Globalization AI and Data Science, Microsoft

Three Dimensions of Compliance

Sam identified three crucial dimensions of AI compliance:

  • Customer privacy – Users share deeply personal data with AI tools, requiring strict safeguards.
  • Training data sourcing – Early LLMs scraped the internet freely, but licensing is now becoming critical as regulations tighten in some regions.
  • Highly regulated industries – Healthcare and finance demand sector-specific oversight.

As Sam noted, “AI is not a product anymore, it’s an ecosystem.” Navigating compliance across these dimensions is a key differentiator for AI providers and the foundation of strong AI data governance.

Beyond English-only: Scaling Globally with Diverse Data Sources

Market Reach and Cultural Relevance

AI built only in English leaves businesses at risk of missing global opportunities. Advances such as transfer learning (a machine learning technique in which a pre-trained model's knowledge is applied to a new, related task rather than training a model from scratch) and multilingual model bridging can help improve AI performance in low-resource languages.

“Early on, most of the software companies, specifically in the Bay Area, they noticed that English only... It's just not going to cut it. Most of the world does not think or communicate in English… when the model ignored that, they miss not only opportunity, but context and culture and a huge chunk of people.”

Sam Shamsan 

CEO, Blomega

Context is King

When it comes to using LLMs for language-related tasks, Agustín argued that their real strength lies in their ability to transfer meaning across contexts, not just languages.

He highlighted the opportunity for organizations to reshape translation memories and linguistic assets into richer, contextualized data for AI training.

The context adaptability of LLMs is also crucial for highly context-specific tasks beyond the written word, such as working with audio and video formats.

“You have a multi-modality to it, you have the audio, you have the video. [It] can tap into accents, because even with the same language you can have many, many different accents. And with hyper-personalization, it's so good for me to speak not only in English, but in my accent type of English.”

Sam Shamsan 

CEO, Blomega

Delivering Genuinely Multilingual Data

Language is not just translation, it is context, culture, and user behavior. If the training data does not reflect that, adoption stalls. Scalability is not just about more computing power, it is about making the data genuinely multilingual and locally relevant.

“One of the biggest misconceptions I deal with is the idea that English data is enough, and a translation of it will suffice. I manage projects every day where clients are rolling out AI solutions globally, and the results are very clear: a model trained in English might work fine in the US, but it fails when you put it in front of users in Germany, Brazil, or Korea.”

Jennifer Nacinelli 

AI Data Program Manager, Acolad
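
A simple first check on whether a dataset is genuinely multilingual is to measure how its records are distributed across languages. The sketch below assumes each record carries a "lang" field and uses an arbitrary 15% minimum share; both the record layout and the threshold are illustrative assumptions, and the right threshold depends on the target markets.

```python
from collections import Counter

def language_shares(records):
    """Return each language's share of the dataset as a fraction of all records."""
    counts = Counter(r["lang"] for r in records)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}

def underrepresented(records, min_share=0.15):
    """List languages whose share falls below min_share (an assumed threshold)."""
    shares = language_shares(records)
    return sorted(lang for lang, share in shares.items() if share < min_share)

# Hypothetical dataset that is heavily skewed toward English.
records = (
    [{"lang": "en", "text": "sample"}] * 70
    + [{"lang": "de", "text": "sample"}] * 20
    + [{"lang": "pt", "text": "sample"}] * 7
    + [{"lang": "ko", "text": "sample"}] * 3
)
print(underrepresented(records))
```

Record counts are only a proxy: a language can be well represented by volume and still lack the local context and user behavior described above, which is where native reviewers come in.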

Optimizing Your AI Data Pipeline: The Importance of the 75% Rule

Preparing Data is the Hard Part

A common failure in AI projects is underestimating the role of AI training data: too little time, expertise, and project capacity are dedicated to preparation. This stage covers crucial techniques to maximize dataset quality and readiness, including compliance checks, removing personal information, ensuring inclusivity, and other actions where human expertise is essential.

“75% of [the effort] is going to be preparing the data… making sure that we minimize and reduce as much as possible the bias.”

Agustín Da Fieno Delucchi 

Director of Globalization AI and Data Science, Microsoft
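
As an illustration of one preparation step mentioned above, removing personal information, the sketch below replaces email addresses and phone-like numbers with placeholders. The two regular expressions are deliberately simple assumptions made for this example; real PII removal needs locale-aware rules, broader entity coverage, and human review.

```python
import re

# Illustrative patterns only; they are assumptions for this sketch
# and will miss many real-world formats.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

sample = "Contact jane.doe@example.com or call +1 415 555 0100 about the dispute."
print(redact_pii(sample))
```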

Optimizing Data Pipelines

Many businesses feel that their data pipeline is the bottleneck, and Jennifer says she has felt that frustration in projects herself. She points out, though, that everything can change when the pipeline is redesigned with efficiency in mind.

For example, improving speaker sourcing, annotation, validation, and throughput management can turn what is often the slowest part of delivery into the part that actually drives speed.

“In one case, a client expected us to take weeks to process multilingual audio, and we got it done in days because the pipeline was structured well. For me, the pipeline is not the back office of AI, it is the engine that keeps everything moving.”

Jennifer Nacinelli 

AI Data Program Manager, Acolad

The Importance of an AI Agent

To remain competitive, companies that deal with the demands of managing global content must not be left behind when it comes to AI-driven platforms. 

“You need your own AI agent developed. You need to intertwine AI with the human, as you go through. You can define what percentage, what role the human will play, but any company without a platform, they are doomed to be eaten for the next 2-3 years. The sauce will be what type of algorithm that you developed in the back end, that embedded with your own platform, that will make you stand out from everybody else.”

Sam Shamsan 

CEO, Blomega

Why the Human-in-the-Loop Still Matters

In the field of AI Data Services, there is a constant tension between wanting everything automated, and as fast as possible, and realizing you cannot fully remove people from the process.

But, as Jennifer points out, the best outcomes can often come when human experts are involved early, and not as a last resort.

“On some of our healthcare and finance projects, it has been human reviewers who caught subtle but critical errors that the system would never have flagged. These are not ‘nice-to-have’ checks, they are the reason the outputs are trusted. From my perspective, human-in-the-loop is what makes AI usable in the real world, because the real world is shaped by humans.”

Jennifer Nacinelli 

AI Data Program Manager, Acolad
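
A human-in-the-loop step can be sketched as a routing rule: low-confidence outputs go to a reviewer, and a regular sample of confident outputs is audited too, since the errors Jennifer describes are ones the system itself would not flag. Everything here (the Prediction class, the 0.9 threshold, auditing every tenth item) is a hypothetical illustration, not a description of any speaker's process.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's own score in [0, 1]

def route(predictions, threshold=0.9, audit_every=10):
    """Split predictions into auto-accepted items and items for human review.

    An item is reviewed if its confidence is below threshold, or if its
    position is a multiple of audit_every (a fixed-interval audit sample).
    """
    accepted, review = [], []
    for i, p in enumerate(predictions):
        if p.confidence < threshold or i % audit_every == 0:
            review.append(p)
        else:
            accepted.append(p)
    return accepted, review

preds = [
    Prediction("a", "fraud", 0.97),      # position 0: audited despite high confidence
    Prediction("b", "not_fraud", 0.55),  # low confidence: reviewed
    Prediction("c", "fraud", 0.93),      # accepted automatically
]
accepted, review = route(preds)
print([p.item_id for p in accepted], [p.item_id for p in review])
```

The audit interval is the lever for deciding, in Sam's words, what percentage and what role the human will play.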

Precision at Source Equals Data-Driven Decisions and Performance at Scale

AI success depends on high-quality, well-prepared data from the start. Compliance, inclusivity, and optimization aren’t costs to be minimized—they are levers for performance and trust.

For leaders, the takeaway is clear: investing in your data pipeline is not just about reducing risk, it’s about enabling scale, market entry, and customer confidence.

Key Takeaways:

  • Audit your data – Poor quality silently undermines performance and trust.
  • Treat compliance as strategy – Regulations can become a market advantage.
  • Invest in multilingual data – English-only approaches limit growth.
  • Prioritize preparation – Remember the 75% rule: most effort is in data.
  • Build human-in-the-loop systems – Trust requires validation and oversight.
