Creating order out of chaos: natural language as a strategic imperative

Language is central to how organisations do business, shaping processes and affecting how they communicate with their customers. To get ahead of the competition, businesses need to understand how to realise tangible value from the language data held in their unstructured inputs.

The challenge is to leverage massive amounts of data in a consistent and scalable format. Meeting this challenge can help businesses to power processes, deliver business insight and move forward at a quicker pace.

We hosted a roundtable that brought together a group of cloud architects, IT, software engineers, information security, intellectual property and process & innovation professionals to take a deeper look at:

  • How to gain value from unstructured data
  • How to capture and enrich domain knowledge to establish a competitive advantage
  • The most important aspect of understanding language data
  • Using an NL platform to improve the productivity and consistency of teams
  • Carrying out research into NLP
  • How to quit the ‘data experiment’ treadmill

Rela8 Group’s Technology Leaders Club roundtables are held under the Chatham House Rule. Names, organisations and some anecdotes have been withheld to protect privacy.

About is the premier artificial intelligence platform for language understanding. Its unique hybrid approach to NL combines symbolic human-like comprehension and machine learning to transform language-intensive processes into practical knowledge, providing the insight needed to improve decision-making throughout organisations.

By offering a full range of on-premise, private and public cloud offerings, augments business operations, accelerates and scales data science capabilities and simplifies AI adoption across a wide range of industries.

The challenges of unstructured data

Unstructured data can occur in many different formats: text, video, sound, images and so on. A density recognition provider will supply text or video analytics in multiple variations because the original data may come from different countries or localities. In addition, contextualising written text can be difficult, especially for an ‘out of the box’ solution, where there is uncertainty, variation in context or a need for a degree of domain knowledge.

ML and human judgement

In ML, the 80% rule is normally enough. In simple email management with a finite set of variations, it is possible to aim much higher, but for a domain-specific natural language solution, 80% is actually a high benchmark.

A hybrid approach can improve accuracy and accelerate the annotation of the data set. For example, if an organisation needs to pull information from a set of medical records, existing collateral will take it only so far (names, dates, times), but treatments and symptoms may require specific domain understanding. So, while an ‘out of the box’ solution can be used to accelerate the process, an understanding of the situation is required for the rest. With a pure machine learning approach, a hundred more samples of the same scenario would be required to categorise the data.

In other words, get the quick start and, once the ML model has been built, supplement it with strategic, targeted rules wherever pure ML was unable to extract or categorise properly. Then measure, monitor and adjust going forward (a feedback loop). The key point is that with automation, data science and machine learning, there is always going to be an element of human judgement or intuition. In summary, ML picks off the low-hanging fruit, creating efficiencies that free up experts for the harder thinking where domain knowledge is still needed.
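The hybrid flow described above can be sketched in a few lines. This is a minimal illustration only, not any vendor's implementation: `baseline_model`, `DOMAIN_RULES`, the labels and the 0.8 threshold are all hypothetical stand-ins, and the "model" is a hard-coded placeholder for a trained classifier.

```python
import re


def baseline_model(text):
    """Hypothetical stand-in for a trained ML classifier: returns (label, confidence)."""
    lowered = text.lower()
    if "invoice" in lowered:
        return "billing", 0.92
    if "appointment" in lowered:
        return "scheduling", 0.88
    return "other", 0.40


# Targeted rules a domain expert might add for cases the model handles poorly.
DOMAIN_RULES = [
    (re.compile(r"\b(chest pain|shortness of breath)\b", re.I), "symptom"),
    (re.compile(r"\b(prescribed|dosage)\b", re.I), "treatment"),
]


def classify(text, threshold=0.8):
    """Trust the model when it is confident, otherwise fall back to rules."""
    label, confidence = baseline_model(text)
    if confidence >= threshold:
        return label, "model"
    for pattern, rule_label in DOMAIN_RULES:
        if pattern.search(text):
            return rule_label, "rule"
    # Nothing matched: route to a human, whose decision can feed the next
    # round of rules or training data (the feedback loop).
    return label, "needs_review"


for sample in ["Please resend the invoice",
               "Patient reports chest pain",
               "Hello there"]:
    print(sample, "->", classify(sample))
```

Items that end up as `needs_review` are where human judgement comes in; reviewing them periodically is one way to decide which rule or training example to add next.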

Establishing competitive advantage

The main starting point is to set appropriate business goals according to the problem: rules, flags and priorities, what is and is not important, and how to embed that knowledge into the AI solution. The success criteria for the organisation should be clearly defined and the right tools chosen for the business problem.

Focusing on the business problem of extracting and categorising meaningful information from natural language provides insight into the data, which enables the building of a predictive model. Organisations should look for opportunities to embed subject matter understanding and knowledge into reusable assets, then label and annotate large training sets and build a reusable ML model. Being able to extract and categorise within a ‘configurable context’ (sentences, paragraphs or sections of documents), finding concepts within the context in which they occur and then moving them into data analytics tooling, allows organisations to build predictive models.

When tuning an algorithm, precision typically comes at the expense of recall, so what is important is the ability to tune several different linguistic models and compare the results against real training sets, quickly and effectively. This is something that a platform capability can do. There is also value in investing in some kind of pre-processing, normalising or standardising step that sits between the natural language and the linguistic model.
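The precision and recall trade-off can be made concrete with a small worked example. The scores and labels below are invented purely for illustration, and `precision_recall` is a hypothetical helper rather than part of any particular platform.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when items scoring >= threshold are accepted."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Invented model scores and ground-truth labels for six documents.
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.8):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.5: precision=0.75, recall=1.00
# threshold=0.8: precision=1.00, recall=0.67
```

Raising the threshold removes the one false positive but also drops a genuine match, which is the kind of comparison that is useful to run quickly across candidate models.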

Understanding language data

Language is ambiguous, and the only way around the problem of ambiguity is to understand the context in which the language occurs. Doing that with training data alone requires an enormous volume of examples covering the variations in language and context. Otherwise, a different approach is needed to identify the proper meaning in context.

Language data can sit anywhere on a spectrum, from completely unstructured, through semi-structured, to charts, tables and freeform paragraphs of text. For example, a document may be full of text but contain a table that runs over three pages. It is important to know what the column headings are, even when processing data on the later pages. It is not just about the language itself; it is about the context in which the table cells occur and knowing what each one maps back to. It is easily forgotten that the formatting and context matter too: meaning is carried not just by the words, but by everything around them as well.
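As a minimal sketch of the multi-page table problem, assume the pages have already been extracted into lists of rows (a simplification; real documents need a layout-aware extraction step first). The function name and sample data are hypothetical.

```python
def rows_with_headers(pages):
    """Attach the column headings from the first page to rows on every page."""
    headers = None
    records = []
    for page in pages:
        for row in page:
            if headers is None:
                headers = row  # the first row seen holds the column headings
                continue
            if row == headers:
                continue  # skip headings repeated on a continuation page
            records.append(dict(zip(headers, row)))
    return records


# Invented three-page table: page 2 has no headings, page 3 repeats them.
pages = [
    [["Drug", "Dose"], ["Aspirin", "75 mg"]],
    [["Ibuprofen", "200 mg"]],
    [["Drug", "Dose"], ["Paracetamol", "500 mg"]],
]

for record in rows_with_headers(pages):
    print(record)
```

Each cell on the later pages is now tied back to the heading it belongs to, which is the context a purely word-level model would otherwise lose.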

Also, while getting true positives can be a fairly quick exercise, getting rid of the exceptions can be very difficult and time-consuming. When testing for false positives and false negatives, an additional level of effort is required to eliminate the false positives, so this has to be part of the strategy.

Context is everything

In ML, taking a hybrid approach can help, with an ‘out of the box’ solution used to accelerate the process to around 80%, but some domain knowledge or understanding of the situation may be required to categorise the remaining 20%.

Once the ML model has been built, it can be supplemented with targeted rules where there is a variation in context, with a feedback loop put in place going forward.

Organisations should first focus on the business problem of providing insight into the data by extracting meaningful information in the natural language to allow the building of a predictive model, and embedding the subject matter, understanding and knowledge into reusable assets.

It is important to understand the context in which the language occurs because language itself is ambiguous. In addition, it is not just about understanding the words, but the formatting and everything else around them as well.
