Why datasets are an integral part of natural language processing & AI translation

Published on February 11th, 2021

AI translation is advancing every year, and not only in terms of features like the new interpreter mode in Google Assistant. The latest translation technology uses neural networks and machine learning to improve algorithmically over time, with minimal human intervention.

This, combined with technologies like natural language processing (NLP), means that machine translation tools are constantly getting better at understanding the true meaning of content and capturing the same meaning in other languages.

Within the fields of AI, machine learning, NLP and machine translation, there are dozens of data-driven technologies working together, and none of them can function without access to the right datasets.

In this article, we illustrate how important these datasets are for everything in natural language processing and machine translation.

What is natural language processing?

Natural language processing (NLP) is an application of AI technology that trains algorithms to understand human speech or text. With NLP, there is a heavy focus on interpreting everyday language (hence the “natural”) so that people can speak or type, as they normally would, and the algorithm accurately interprets their meaning.

This allows platforms to translate speech in real time, interpret voice recordings and transcribe natural speech as people talk.

Google Translate’s transcribe feature is surprisingly effective at recognising natural speech.

This is the same technology Google uses to understand the true meaning of your search queries. It’s also used by the latest translation tools for interpreting the input source language and then translating it into the target language(s).

When a user interacts with an NLP application, there are four basic functions taking place:

  1. Input: The user inputs data in the form of speech, text or datasets.
  2. Interpretation: The NLP’s data-driven algorithms aim to understand the input data.
  3. Action: The NLP application then performs its designated action, whether it is transcription, translation or whatever else.
  4. Output: The application then outputs the converted data in the required format (e.g. speech to text, English to Spanish, etc.)
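The four steps above can be sketched as a toy pipeline. This is purely illustrative – the function names are invented, and the "interpretation" and "action" stages stand in for the trained models a real system would use:

```python
# A toy sketch of the four-stage NLP pipeline described above.
# Every name here is illustrative; real systems replace each stage
# with trained statistical or neural models.

def interpret(text: str) -> list[str]:
    """Interpretation: turn raw input into a structured form (here, tokens)."""
    return text.lower().split()

def act(tokens: list[str], task: str) -> list[str]:
    """Action: perform the designated task on the interpreted data."""
    if task == "translate-en-es":
        # Stand-in for a real translation model: a tiny lookup table.
        glossary = {"hello": "hola", "world": "mundo"}
        return [glossary.get(t, t) for t in tokens]
    return tokens

def nlp_pipeline(text: str, task: str) -> str:
    tokens = interpret(text)    # 2. Interpretation
    result = act(tokens, task)  # 3. Action
    return " ".join(result)     # 4. Output

print(nlp_pipeline("Hello world", "translate-en-es"))  # hola mundo
```

The input stage (step 1) is simply the function argument here; in practice it might be a microphone stream or an uploaded document.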

The technology involved in steps #2 and #3 is particularly advanced and constantly improving. The key word is “data”: the accuracy of NLP systems depends almost entirely on the quantity and quality of the data they have access to.

In fact, the only ingredient that is possibly more important than data is processing power, but this is only because more power means systems can crunch larger volumes of data in less time.

Why are datasets important in NLP?

To understand the importance of datasets in natural language processing, you first need an understanding of how NLP uses data.

Natural language processing relies on machine learning, which trains algorithms to make decisions using vast amounts of data. During training, algorithms are run through a series of variations: every correct outcome is marked as a success and every incorrect outcome as a failure.

With enough input data, processing power and test runs, algorithms can learn from both successes and failures to deliver increasingly accurate outcomes without human intervention.
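To make the success/failure idea concrete, here is a deliberately tiny sketch in which the “model” is just a lookup of labelled word–language pairs. The corpus and labels are invented for illustration; the point is only that the same evaluation scores more successes when the model is trained on more data:

```python
# Toy illustration of success/failure training: a "model" that memorises
# labelled word -> language pairs. More training data means more of the
# test words are classified correctly. (Real systems generalise with
# statistical or neural models rather than memorising.)

CORPUS = {
    "hello": "en", "world": "en", "house": "en", "water": "en",
    "hola": "es", "mundo": "es", "casa": "es", "agua": "es",
}

def train(examples: dict[str, str]) -> dict[str, str]:
    return dict(examples)  # "training" = memorising labelled pairs

def evaluate(model: dict[str, str], test: dict[str, str]) -> int:
    # Each correct prediction counts as a success, each miss as a failure.
    return sum(1 for word, lang in test.items() if model.get(word) == lang)

small = train(dict(list(CORPUS.items())[:2]))  # trained on 2 examples
large = train(CORPUS)                          # trained on all 8

print(evaluate(small, CORPUS))  # 2 successes
print(evaluate(large, CORPUS))  # 8 successes
```

Scaling the same principle to millions of examples and billions of parameters is, in essence, what the systems discussed below do.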

Scale is crucial to achieving successful outcomes, and the more complex your NLP system is, the more data and testing it requires.

For example, the NLP algorithms that power Google Translate use millions of pre-existing translation examples to verify their own outcomes. Google has used this neural network-based system since 2016, which adds up to a lot of data over the past half decade – all of it feeding a self-learning NLP system that constantly improves by itself.

Original source: Google AI Blog. Google’s neural networks compare, group and match translations using a geometric map and colour system to determine the most accurate translations between multiple languages.

Machine translation is one of the most complex NLP functions, especially when we are talking about tools like Google Translate or Microsoft Translator, which can translate speech and text in real-time.

Before we get to that, we should recognise the challenges of a simpler (relatively speaking) NLP function, such as the processes search engines use to understand the meaning of queries.

In 2019, Google rolled out a new search algorithm, called BERT (Bidirectional Encoder Representations from Transformers). The algorithm is an NLP system that uses neural networks to help it better understand the meaning of users’ search queries by analysing the entire text string.

At the time, Google said BERT applied to roughly 10% of all queries in English, mainly for complex search terms where simple keyword matching fails to return relevant results.


This highlights one of the biggest challenges Google faces, as a search engine, but it also demonstrates the complexity of algorithms understanding natural language – a challenge every translation tool faces.

For the relatively simple act of interpreting search queries in one language, Google claims its BERT algorithm achieves state-of-the-art results for 11 different NLP tasks, using a range of open-source datasets, including:

  • GLUE: The General Language Understanding Evaluation is a benchmark that evaluates the performance of models across a diverse range of natural language understanding tasks.
  • MultiNLI: The Multi-Genre Natural Language Inference corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.
  • SQuAD: The Stanford Question Answering Dataset is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text (a span) from the corresponding passage – or the question may be unanswerable.

These datasets help Google understand the meaning of full search queries with greater accuracy and analyse the web’s content in relation to its interpretation, allowing it to deliver more relevant results.

It can also determine with greater accuracy when there is no answer or suitable question for an individual query, which not only improves the experience of the user, but also provides important outcome data that can be fed back into the algorithm for further training.

None of this would be possible without the kind of datasets listed above.

Understanding the NLP challenges translation technology faces

While Google’s search engine is challenged with interpreting user queries and delivering relevant results (which is already difficult enough), NLP translation tools face a far more complex challenge.

Or, more accurately, they face several additional challenges beyond accurately interpreting input language.

The standout task for machine translation is converting speech or text from one language into another. This, in itself, requires algorithms to understand the meaning of the source language in a similar way to Google’s search algorithm. However, the complexity of this source language can vary greatly – anything from search queries and single sentences to entire news stories, documents and works of fiction.

Once the source material is interpreted, the translation tool then faces the task of converting it into the target language while capturing the same intended meaning and overcoming the linguistic disparities between both languages.

The difficulty of this second phase varies, depending on a wide range of factors:

  • The input method (text, handwriting, image, speech, scraped data, etc.)
  • The quality of input (think spelling errors, strong accents, audio interference)
  • The complexity of source language
  • The accuracy of the initial interpretation
  • The source and target languages (English to Spanish is generally easier than English to Chinese)
  • The datasets used
  • The output method (text, digital speech, etc.)

The most advanced translation tools available today are capable of translating speech in real-time, which combines a complex set of NLP tasks related to speech recognition, speech analysis, parsing, natural language understanding (NLU), machine translation and natural language generation (NLG) to name a few.

Every single NLP function requires several specialist datasets in order to work and the list of functions is far longer than most people appreciate.

Some of the most popular datasets used to train algorithms for translating news stories specifically, as listed on Wikipedia.

It can be easy to take this for granted when online users can open up Google Translate on their phones, say a few words and watch the app provide an instant translation. Yet, the technology being put to work every time we do this (every time anyone does this) is immense.

To understand the scale of what today’s machine translation tools are doing, let’s look at some of the most important NLP tasks they perform.

Common NLP tasks performed by modern translation tools

Modern translation tools perform a wide set of actions outside of machine translation so that users can input language in the most convenient way (text, speech, etc.) and output their translated content in their chosen format.

So our first set of NLP tasks does not have anything to do with translation itself.

Text and speech processing
  • Optical character recognition (OCR): Detecting and extracting text within an image.
  • Speech recognition: Converting spoken words into machine-readable text, which can also be output as text for the user.
  • Speech segmentation: Separating clips of speech into individual words.
  • Text-to-speech: Converting input text into synthesised speech.
  • Word segmentation (tokenization): Separating chunks of text into individual words.
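Even for English, word segmentation is less trivial than it looks: a naive whitespace split leaves punctuation glued to words. A minimal regex-based sketch (real tokenizers also handle contractions, hyphenation and scripts such as Chinese that use no spaces at all):

```python
import re

def tokenize(text: str) -> list[str]:
    # Match runs of word characters, or any single character that is
    # neither a word character nor whitespace (i.e. punctuation).
    # A naive sketch for illustration only.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))
# ['Hello', ',', 'world', '!']
```

Note how the comma and exclamation mark become tokens of their own instead of clinging to the neighbouring words.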
Morphological analysis
  • Lemmatization: Removing inflectional endings to return the base dictionary form of a word (the lemma).
  • Morphological segmentation: Separating words into the smallest units of meaning (morphemes) – e.g. “cats” includes two morphemes, “cat” and “s” for plurality.
  • Part-of-speech tagging: Determining and tagging the part of speech (POS) of each word – e.g. noun, verb, adjective, etc.
  • Stemming: Reducing inflected or derived words to their root form – e.g. “open” as the root of “opened”, “opens”, “opening”, etc.
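The stemming example above can be sketched as a toy suffix-stripper. Production stemmers such as the Porter algorithm apply ordered rule sets with conditions on the remaining stem; this simplified version just strips a few common English suffixes:

```python
def naive_stem(word: str) -> str:
    # Strip the first matching suffix, but only if a reasonably long
    # stem (3+ characters) remains. Illustrative only; real stemmers
    # use far richer, ordered rule sets.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("opened", "opens", "opening", "open"):
    print(w, "->", naive_stem(w))  # all reduce to "open"
```

A rule this crude fails quickly (“running” becomes “runn”, not “run”), which is exactly why real stemmers need more rules – and why lemmatization, which consults a dictionary, often gives cleaner results.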
Syntactic analysis
  • Grammar induction: The algorithmic “learning” of the grammar systems within a language.
  • Sentence breaking: Detecting sentence boundaries within speech or text, usually marked by full stops, question marks and other terminal punctuation.
  • Parsing: Analysing the input language, applying it to grammatical and contextual models to interpret accurate meaning.
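Sentence breaking, too, can be sketched naively with a regex that splits after terminal punctuation. Real systems have to cope with abbreviations (“Dr.”), decimals (“3.14”) and quotations, which this sketch deliberately ignores:

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split wherever a full stop, exclamation or question mark is
    # followed by whitespace. Naive: abbreviations and decimals
    # would be split incorrectly.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("NLP is hard. Datasets help! Do they? Yes."))
# ['NLP is hard.', 'Datasets help!', 'Do they? ', ...]-style output:
# four separate sentences, each keeping its own punctuation.
```

The lookbehind `(?<=[.!?])` keeps the punctuation attached to its sentence rather than discarding it at the split point.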
Semantic analysis
  • Named entity recognition (NER): Identifying people, places and other named entities within the input language.
  • Relationship extraction: Identifying the relationships among named entities, such as who is related to whom.
  • Word sense disambiguation: Determining the meaning of individual words based on context.
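Word sense disambiguation is a good place to see why context data matters. A minimal Lesk-style sketch picks the sense whose dictionary gloss shares the most words with the surrounding sentence – the senses and glosses below are invented for illustration:

```python
# Minimal Lesk-style word sense disambiguation: choose the sense whose
# gloss overlaps most with the context words. Senses and glosses here
# are invented; real systems use sense inventories like WordNet.

SENSES = {
    "bank": {
        "finance": "an institution for money deposits and loans",
        "river": "the sloping land beside a body of water",
    }
}

def disambiguate(word: str, context: str) -> str:
    ctx = set(context.lower().split())
    best_sense, _ = max(
        SENSES[word].items(),
        key=lambda kv: len(ctx & set(kv[1].split())),
    )
    return best_sense

print(disambiguate("bank", "he swam from the river to the bank"))  # river
```

With only two hand-written glosses this is fragile; the real lesson is that accuracy grows with the size and quality of the sense-annotated data behind it.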
High-level NLP tasks
  • Automatic summarisation: Producing an automated summary of a large piece of text or speech.
  • Machine translation: Automatically translating text or speech from one language into another.
  • Natural language generation (NLG): Converting structured data into readable human language.
  • Natural language understanding (NLU): Accurately interpreting the meaning of complex text or speech.
  • Translation memory: Identifying repetition within a piece of content and automatically reapplying previous translations where the contextual meaning is the same.
  • Question answering: Identifying and answering human questions, as seen in search engines and AI chatbots.
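Translation memory is the easiest of these to sketch: store every segment you have already translated and reuse it when the identical segment reappears, so the translation engine (machine or human) is only consulted for new material. All names below are illustrative, and the “engine” is a stand-in:

```python
# Toy translation memory: exact-match reuse of previous translations.
# Real TM systems also do fuzzy matching and check context before reuse.

class TranslationMemory:
    def __init__(self):
        self.memory: dict[str, str] = {}

    def translate(self, segment: str, translate_fn) -> str:
        if segment not in self.memory:
            # Unseen segment: call the engine once and remember the result.
            self.memory[segment] = translate_fn(segment)
        return self.memory[segment]  # served from memory thereafter

calls = []
def fake_mt(segment: str) -> str:  # stand-in for a real MT engine
    calls.append(segment)
    return segment.upper()

tm = TranslationMemory()
tm.translate("hello world", fake_mt)
tm.translate("hello world", fake_mt)  # repeat: no second engine call
print(len(calls))  # 1
```

Exact matching is the simplification here: commercial tools also surface “fuzzy” matches above a similarity threshold and let the reused translation be reviewed in context.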

We have only covered a fraction of the natural language processing tasks modern translation tools perform, and NLP itself is only one subset of the technologies used in AI translation. All of this is baked into every translation performed by today’s most advanced language tools.

Every one of these functions relies on vast amounts of data to train algorithms that can continue to learn by themselves.

Where does all this data come from?

Generally speaking, the more data machine learning algorithms have to work with, the more sophisticated and reliable their outcomes are. However, the quality of this data is also integral to both the training and long-term performance of algorithms.

In today’s data-driven world, there is no shortage of open-source datasets that anyone can access.

That said, like anything in the world of science, it is best to stick to datasets that are peer-reviewed and supported by academic journals and key figures within the data science field.

The extent of datasets available in 2021 is already remarkable and we can expect the volume and quality of data to drastically increase over time.

Earlier, we looked at specific datasets for training algorithms to answer questions, but the SQuAD dataset we referenced is quite broad. To build an NLP application, developers are likely to need dozens or even hundreds of datasets, many of which cater to very specific functions.

For example, let’s imagine a translation app and focus our attention on one language – Vietnamese – and look at some of the datasets a user might need to use.

Vietnamese text, lexicon & language datasets
  • Vietnamese Dictionary for Model Transformation: An extensive dataset of Vietnamese lexicon, including more than 29,000 rows of word meanings, diacritical marks and letters without accents.
  • ViCon and ViSim-400: Two datasets, one comprising pairs of synonyms and antonyms and another featuring degrees of similarity across five semantic relations.
  • Na Meo, a Hmong-Mien Vietnamese Language: A dataset for the Vietnamese language, Na Meo, one of many languages besides the official Vietnamese language spoken by smaller communities and ethnic groups.
  • Vietnamese Song Corpus: An audio dataset containing Vietnamese songs for training voice recognition and matching tonal variations to written Vietnamese.
General datasets including Vietnamese
  • HC Corpora Newspapers: A large dataset of natural language text extracted from newspapers, blogs and social media – available in 67 languages, including Vietnamese.
  • Sentiment Lexicons for 81 Languages: A dataset featuring positive and negative sentiment lexicons in 81 languages (Vietnamese being one of them) for building sentiment analysis models.

This small selection of datasets gives you an idea of how much data advanced translation tools need to perform specific functions, such as interpreting diacritical marks in the Vietnamese writing system, accurately recognising them in spoken language and using them correctly when translating from other languages.

In the wider context of today’s translation technology, this represents a fraction of the data required for one language – before users even start thinking about real-time interpretation, text-to-voice translation, image translation and the various other features provided beyond basic text-to-text translation.

For NLP and machine translation, data is everything

Without the right datasets, machine learning technologies like neural networks have nothing to work with. For natural language processing and machine translation, every function relies on the right data being fed into algorithms throughout the training and operational stages.

The exciting part is that datasets are only growing larger and their quality is only improving over time. Continued advances in computational technology allow algorithms to process more data in a shorter space of time, resulting in machines that learn faster than ever.
