
Why datasets are an integral part of natural language processing & AI translation

AI translation is advancing every year, and not only in terms of features like the new interpreter mode in Google Assistant. The latest translation technology uses neural networks and machine learning to improve algorithmically over time, without any input from human beings.

This, combined with technologies like natural language processing (NLP), means that machine translation tools are constantly getting better at understanding the true meaning of content and capturing the same meaning in other languages.

Across the fields of AI, machine learning, NLP and machine translation, there are dozens of data-driven technologies working together, and none of them can function without access to the right datasets.

In this article, we illustrate how important these datasets are for everything in natural language processing and machine translation.

What is natural language processing?

Natural language processing (NLP) is an application of AI technology that trains algorithms to understand human speech or text. With NLP, there is a heavy focus on interpreting everyday language (hence the “natural”) so that people can speak or type, as they normally would, and the algorithm accurately interprets their meaning.

This allows platforms to translate speech in real-time, interpret voice recordings and transcribe natural speech as people talk.

Google Translate’s transcribe feature is surprisingly effective at recognising natural speech.

This is the same technology Google uses to understand the true meaning of your search queries. It’s also used by the latest translation tools for interpreting the input source language and then translating it into the target language(s).

When a user interacts with an NLP application, there are four basic functions taking place:

  1. Input: The user inputs data in the form of speech, text or datasets.
  2. Interpretation: The NLP’s data-driven algorithms aim to understand the input data.
  3. Action: The NLP application then performs its designated action, whether it is transcription, translation or whatever else.
  4. Output: The application then outputs the converted data in the required format (e.g. speech-to-text, English to Spanish, etc.).

The technology involved in steps #2 and #3 is particularly advanced and constantly improving. The key word is “data”: the accuracy of NLP systems is almost entirely dependent on the quantity and quality of the data they have access to.

In fact, the only ingredient that is possibly more important than data is processing power, but this is only because more power means systems can crunch larger volumes of data in less time.
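
To make those four steps concrete, here is a minimal sketch in Python. The function names and the word-for-word glossary are purely illustrative stand-ins for what are, in real systems, large trained models.

```python
# A minimal, illustrative sketch of the four-step flow described above.
# The function names (interpret, translate) are placeholders, not a real API.

def interpret(text: str) -> list[str]:
    """Step 2: a stand-in for NLP interpretation - here, naive tokenisation."""
    return text.lower().split()

def translate(tokens: list[str], glossary: dict[str, str]) -> list[str]:
    """Step 3: the designated action - a toy word-for-word 'translation'."""
    return [glossary.get(token, token) for token in tokens]

def run_pipeline(text: str, glossary: dict[str, str]) -> str:
    tokens = interpret(text)                   # Step 2: interpretation
    translated = translate(tokens, glossary)   # Step 3: action
    return " ".join(translated)                # Step 4: output in the required format

# Step 1: input (text in this case; real systems also accept speech or datasets)
print(run_pipeline("Hello world", {"hello": "hola", "world": "mundo"}))  # -> "hola mundo"
```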

Why are datasets important in NLP?

To understand the importance of datasets in natural language processing, you first need an understanding of how NLP uses data.

Natural language processing leans heavily on machine learning, which trains algorithms to make decisions using vast amounts of data. Algorithms are run through a series of test variations, and every correct outcome is marked as a success while every incorrect outcome is marked as a failure.

With enough input data, processing power and test runs, algorithms can learn from both successes and failures to deliver increasingly accurate outcomes without human intervention.

Scale is crucial to achieving successful outcomes: the more complex your NLP system is, the more data and testing are required.
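
As a rough illustration of the success/failure idea (and nothing more than that), the toy loop below checks a stand-in model's output against reference translations and counts the hits. Real systems optimise continuous error signals over millions of examples rather than exact string matches.

```python
# Illustrative only: a toy "evaluation loop" in the spirit described above,
# where each model output is checked against a reference and marked as a
# success or a failure.

reference_pairs = [
    ("good morning", "buenos dias"),
    ("thank you", "gracias"),
    ("see you tomorrow", "hasta manana"),
]

def toy_model(source: str) -> str:
    # Placeholder for a trained model's prediction.
    lookup = {"good morning": "buenos dias", "thank you": "gracias"}
    return lookup.get(source, "")

successes = 0
for source, reference in reference_pairs:
    prediction = toy_model(source)
    if prediction == reference:   # correct outcome -> success
        successes += 1
    # incorrect outcome -> failure; in a real system this error signal
    # is fed back to adjust the model's parameters

print(f"Accuracy: {successes}/{len(reference_pairs)}")
```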

For example, the NLP algorithms that power Google Translate use millions of pre-existing translation examples to verify their own outcomes. Google has used this neural network-based approach since 2016, which adds up to a huge amount of data over the past half decade, all of it feeding a self-learning NLP system that keeps improving on its own.

Original source: Google AI Blog. Google’s neural networks compare, group and match translations using a geometric map and colour system to determine the most accurate translations between multiple languages.

Machine translation is one of the most complex NLP functions, especially when we are talking about tools like Google Translate or Microsoft Translator, which can translate speech and text in real-time.

Before we get to that, we should recognise the challenges of a simpler (relatively speaking) NLP function, such as the processes search engines use to understand the meaning of queries.

In 2019, Google rolled out a new search algorithm called BERT (Bidirectional Encoder Representations from Transformers). The algorithm is an NLP system that uses neural networks to better understand the meaning of users’ search queries by analysing the entire text string.

At the time, Google said BERT applied to roughly 10% of all queries in English, mainly for complex search terms where simple keyword matching fails to return relevant results.

This highlights one of the biggest challenges Google faces as a search engine, but it also demonstrates how difficult it is for algorithms to understand natural language, a challenge every translation tool faces.

For the relatively simple act of interpreting search queries in one language, Google claims its BERT algorithm achieves state-of-the-art results for 11 different NLP tasks, using a range of open-source datasets including the Stanford Question Answering Dataset (SQuAD), MultiNLI and the GLUE benchmark.

These datasets help Google understand the meaning of full search queries with greater accuracy and analyse the web’s content in relation to its interpretation, allowing it to deliver more relevant results.

It can also determine with greater accuracy when there is no suitable answer for an individual query, which not only improves the experience of the user, but also provides important outcome data that can be fed back into the algorithm for further training.

None of this would be possible without the kind of datasets listed above.
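
To give a sense of what a SQuAD-style question answering model does in practice, here is a short sketch using the open-source Hugging Face transformers library; the specific model checkpoint is our own illustrative choice, not something Google's BERT deployment uses.

```python
# Sketch: question answering with a model fine-tuned on SQuAD, using the
# open-source Hugging Face `transformers` library. The checkpoint below is
# an illustrative choice, not one named in this article.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

result = qa(
    question="What does NLP stand for?",
    context="Natural language processing (NLP) is an application of AI that "
            "trains algorithms to understand human speech or text.",
)
print(result["answer"], result["score"])  # e.g. "Natural language processing" with a high confidence score
```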

Understanding the NLP challenges translation technology faces

While Google’s search engine is challenged with interpreting user queries and delivering relevant results (which is already difficult enough), NLP translation tools face a far more complex challenge.

Or, more accurately, they face several additional challenges beyond accurately interpreting input language.

The standout task for machine translation is converting speech or text from one language into another. This, in itself, requires algorithms to understand the meaning of the source language in a similar way to Google’s search algorithm. However, the complexity of this source language can vary greatly – anything from search queries and single sentences to entire news stories, documents and works of fiction.

Once the source material is interpreted, the translation tool then faces the task of converting it into the target language while capturing the same intended meaning and overcoming the linguistic disparities between both languages.
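
As a small example of this text-to-text step, the sketch below runs an open-source OPUS-MT model (trained on parallel text datasets) through the Hugging Face transformers library; the checkpoint name is our illustrative choice, not a tool named in this article.

```python
# Sketch: text-to-text machine translation with an open-source model from the
# Hugging Face hub. The Helsinki-NLP OPUS-MT models are trained on parallel
# text corpora; the specific checkpoint below is an illustrative choice.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")

source = "Machine translation depends on large amounts of parallel text."
print(translator(source)[0]["translation_text"])
```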

The difficulty of this second phase varies depending on a wide range of factors, from the linguistic disparities between the two languages to the length and complexity of the source content.

The most advanced translation tools available today are capable of translating speech in real-time, which combines a complex set of NLP tasks: speech recognition, speech analysis, parsing, natural language understanding (NLU), machine translation and natural language generation (NLG), to name a few.

Every single NLP function requires several specialist datasets in order to work, and the list of functions is far longer than most people appreciate.

Some of the most popular datasets used to train algorithms for translating news stories specifically, as listed on Wikipedia.

It is easy to take this for granted when anyone can open Google Translate on their phone, say a few words and watch the app provide an instant translation. Yet the technology put to work every single time this happens is immense.

To understand the scale of what today’s machine translation tools are doing, let’s look at some of the most important NLP tasks they perform.

Common NLP tasks performed by modern translation tools

Modern translation tools perform a wide set of actions outside of machine translation so that users can input language in the most convenient way (text, speech, etc.) and output their translated content in their chosen format.

So our first set of NLP tasks has nothing to do with translation itself; a brief code sketch of a few of these task categories follows the list below.

Text and speech processing
Morphological analysis
Syntactic analysis
Semantic analysis
High-level NLP tasks
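
As promised above, here is a brief sketch of a few of these task categories using the open-source spaCy library; the example sentence and model are our own choices, and the small English model must be downloaded separately.

```python
# Sketch: a few of the task categories above, illustrated with spaCy
# (requires: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google rolled out BERT to improve its search results in 2019.")

for token in doc:
    # Morphological analysis: the lemma; syntactic analysis: part of speech
    # and dependency relation within the sentence.
    print(token.text, token.lemma_, token.pos_, token.dep_)

# Semantic analysis: named entities recognised in the text.
print([(ent.text, ent.label_) for ent in doc.ents])
```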

We have only covered a fraction of the natural language processing tasks modern translation tools perform, and NLP itself is just one subset of the technologies used in AI translation. In other words, this represents a small percentage of the technology baked into every translation performed by today’s most advanced language tools.

Every one of these functions relies on vast amounts of data to train algorithms that can continue to learn by themselves.

Where does all this data come from?

Generally speaking, the more data machine learning algorithms have to work with, the more sophisticated and reliable their outcomes are. However, the quality of this data is also integral to both the training and long-term performance of algorithms.

In today’s data-driven world, there is no shortage of open-source datasets that anyone can access.

That said, like anything in the world of science, it is best to stick to datasets that are peer-reviewed and supported by academic journals and key figures within the data science field.

The range of datasets available in 2021 is already remarkable, and we can expect the volume and quality of data to increase drastically over time.

Earlier, we looked at specific datasets for training algorithms to answer questions, but the SQuAD dataset we referenced is quite broad. To build an NLP application, developers are likely to need dozens or even hundreds of datasets, many of which cater to very specific functions.
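
For illustration, this is roughly what accessing such an open-source dataset looks like with the Hugging Face datasets library, using SQuAD because it has already come up; many far more specialised datasets are loaded in the same way.

```python
# Sketch: loading an open-source dataset with the Hugging Face `datasets`
# library. SQuAD is used here only because the article references it.
from datasets import load_dataset

squad = load_dataset("squad")
example = squad["train"][0]
print(example["question"])
print(example["answers"]["text"])
```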

For example, let’s imagine a translation app, focus our attention on one language – Vietnamese – and look at some of the datasets a developer might need.

Vietnamese text, lexicon & language datasets
Vietnamese parallel text datasets
General datasets including Vietnamese

This small selection of datasets gives you an idea of how much data advanced translation tools need to perform specific functions, such as interpreting diacritical marks in the Vietnamese writing system, accurately recognising them in spoken language and using them correctly when translating from other languages.
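
As a tiny illustration of why diacritics alone demand care, the snippet below (our own example, using only the Python standard library) shows that the same Vietnamese word can be stored in two different Unicode forms, which will not match unless the text is normalised consistently across datasets.

```python
# Sketch: why Vietnamese diacritics need careful handling. The same visible
# word can be stored as composed or decomposed Unicode, and datasets mixing
# the two will not match unless text is normalised first.
import unicodedata

composed = "tiếng Việt"                                      # NFC (precomposed characters)
decomposed = unicodedata.normalize("NFD", composed)          # base letters + combining marks

print(composed == decomposed)                                # False: different code points
print(len(composed), len(decomposed))                        # decomposed form is longer
print(unicodedata.normalize("NFC", decomposed) == composed)  # True after normalisation
```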

In the wider context of today’s translation technology, this represents a fraction of the data required for one language – before users even start thinking about real-time interpretation, text-to-voice translation, image translation and the various other features provided beyond basic text-to-text translation.

For NLP and machine translation, data is everything

Without the right datasets, machine learning technologies like neural networks have nothing to work with. For natural language processing and machine translation, every function relies on the right data being fed into algorithms throughout the training and operational stages.

The exciting part is that datasets are only growing larger and their quality is only improving over time. Continued advances in computational technology allow algorithms to process more data in a shorter space of time, resulting in machines that learn faster than ever.
