Most companies run enterprise systems that use a relational database management system (RDBMS) to manage data in tables, rows and columns. However, there is also ‘unstructured’ content that does not fit the RDBMS model. Examples include a Tweet or Facebook post from the marketing department, an internal or external training video, or a presentation posted on a partner’s web site. Another example is content published on the company’s website, such as its clinical trial technology pages. New technology is needed to manage this kind of unstructured information.
- XML – Extensible Markup Language (not really new, as XML dates back to the late 1990s; XML 1.0 became a W3C Recommendation in 1998) is a markup language used to describe data for both human and machine consumption. XML can store and transport data, is self-descriptive (holds both data and metadata), and is designed to emphasize simplicity, generality and usability across the web.
- Semantic Web – also known as Web 3.0, the web of data, or linked data, the Semantic Web is designed to emphasize the meaning of information rather than its structure. It is supported by a metadata data model specification known as the Resource Description Framework (RDF), which is used to catalog data so that it can be presented and shared regardless of its format, original source, structure or container.
- NoSQL – ‘Non SQL’ or ‘Not Only SQL’ data stores are used for unstructured data, just as SQL-based RDBMS are used for structured data. NoSQL systems are set apart by scaling ‘horizontally’ across clusters of machines. They use data structures such as key-value pairs, graphs, documents, and triples.
- Predictive Analytics – the discipline that sets out to find meaning in the available information for the purpose of making decisions and projections. This can only be achieved when structured and unstructured data can be harmonized. The technology required for this work includes NoSQL and big data platforms (e.g. MongoDB, MarkLogic, Couchbase, Cloudera) and tools used in the Semantic Web (e.g. SPARQL, OWL).
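To make the list above a little more concrete, here is a minimal Python sketch of how one piece of unstructured content might look in several of the shapes mentioned: self-descriptive XML, a document, a key-value pair, and triples in the spirit of RDF. The sample post, its field names, and the helper functions are all hypothetical and chosen only for illustration; no particular NoSQL product or RDF library is implied.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: a social media post from a marketing department.
# The tags and attributes are the metadata; the text nodes are the data.
XML_POST = """
<post id="p1" channel="twitter">
  <author>marketing</author>
  <text>New training video is live</text>
</post>
"""


def xml_to_document(xml_text):
    """Parse the XML into a document-style dict (one nested record per item)."""
    root = ET.fromstring(xml_text.strip())
    doc = dict(root.attrib)          # attributes become fields
    for child in root:
        doc[child.tag] = child.text  # child elements become fields
    return doc


def document_to_key_value(doc):
    """Key-value view: the id is the key, the whole record is one opaque value."""
    return {doc["id"]: json.dumps(doc, sort_keys=True)}


def document_to_triples(doc):
    """Triple view: (subject, predicate, object) statements about the post."""
    subject = doc["id"]
    return [(subject, key, value) for key, value in doc.items() if key != "id"]


doc = xml_to_document(XML_POST)
kv = document_to_key_value(doc)
triples = document_to_triples(doc)

print(doc)
print(kv)
for triple in triples:
    print(triple)
```

The same content survives each conversion; what changes is how it can be queried. A key-value store can only fetch the record by its id, a document store can query individual fields, and triples can be linked to statements from other sources.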
There is more work to be done to address the gaps in unstructured content management. These include maintaining data quality (how clean and accurate is the unstructured data?), data categorization (how to make the most sense of your unstructured data), data harmonization (how best to merge the unstructured data into your structured data architecture), and data volume (how to store it all).
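As a rough illustration of the first three gaps, the sketch below cleans some raw posts (quality), labels them with simple keyword rules (categorization), and links them to an existing structured record (harmonization). The product table, the keyword rules, and the `P-<number>` id pattern are assumptions made up for this example; real pipelines would be considerably more involved.

```python
import re

# Hypothetical structured table keyed by product id (stand-in for an RDBMS table).
PRODUCTS = {"P-100": {"name": "Widget", "mentions": 0}}

# Hypothetical keyword rules for categorization.
CATEGORIES = {
    "complaint": ("broken", "refund"),
    "praise": ("love", "great"),
}


def clean(text):
    """Data quality: collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def categorize(text):
    """Data categorization: return the first label whose keyword appears."""
    for label, keywords in CATEGORIES.items():
        if any(word in text for word in keywords):
            return label
    return "other"


def harmonize(raw_posts, products):
    """Data harmonization: link each post to the structured record it mentions."""
    rows = []
    for raw in raw_posts:
        text = clean(raw)
        match = re.search(r"p-\d+", text)
        product_id = match.group(0).upper() if match else None
        if product_id in products:
            products[product_id]["mentions"] += 1
        rows.append(
            {"product_id": product_id, "category": categorize(text), "text": text}
        )
    return rows


rows = harmonize(["I  LOVE my P-100!", "Refund please,\n it arrived broken"], PRODUCTS)
for row in rows:
    print(row)
print(PRODUCTS)
```

Posts that cannot be matched to a structured record keep a `product_id` of `None` rather than being dropped, so they remain available for later review.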
All industries, whether life science, health care, communication, finance, social media, or retail, are grappling with managing and making sense of vast amounts of raw information (a.k.a. data lakes). Given the direction in which the web is heading, the amount of information will only keep increasing. Despite the challenges of unstructured data, industries are left with no choice but to determine the best way forward to manage it. To make progress with information, industry leaders need to adopt innovative technologies, including clinical trial technology.