Extract meaningful information from unstructured data
We interview Mike DeCesaris, vice president of data analytics at Cornerstone Research, about the challenges of working with unstructured data and how his team has developed customized processes to turn it into valuable information that clients can use.
What is unstructured data and how does it relate to litigation?
Traditional data analysis typically involves analyzing structured data, such as spreadsheets or relational databases. Unstructured data, on the other hand, is essentially any information that is not stored according to a predefined structure. Examples of unstructured data include text documents, emails, Adobe PDF files, image files, and more. Some believe that unstructured data makes up 80% or more of all data, and unstructured data sets are growing rapidly.
The information in unstructured documents can be crucial in supporting expert analysis, but locating and extracting relevant information can be difficult when there are large volumes of unstructured data. While structured data can be processed and analyzed using traditional database tools and data analysis programs, analyzing unstructured data requires either many hours of manual work or a significantly higher level of technical expertise and sophistication.
Given the large amount of unstructured data in our work, how is Cornerstone Research responding?
We have developed sophisticated tools that can be used together to create tailored approaches to transform unstructured data into structured data. The result can be used for quantitative analysis. This can eliminate large-scale manual review, greatly reducing processing time and costs. Perhaps more importantly, it can open up new possibilities for analysis that would otherwise have been impossible.
For example, we have:
developed a parallelized data processing pipeline to convert hundreds of thousands of pages (hundreds of gigabytes) of daily reports in multiple distinct text report formats into tables and to extract key information, enabling cost-effective analysis in multiple joint defense matters;
digitized a large number of image-based account statements from various counterparties and automated the creation of machine-readable transaction datasets;
identified PDF files of emails containing relevant business tables in a 250,000-page document dump, then programmatically extracted the data and aggregated it into a database; and
extracted and structured entries from consumer complaint forms into a comprehensive database.
In litigation matters, we often handle sensitive client information. That's why Cornerstone Research has invested heavily in secure infrastructure, including high-performance, high-throughput analytical servers and storage clusters. Our analytical infrastructure is on-premises, which means client data is never exposed to the web. We have also invested in a number of software tools and programming languages to add high-quality text layers to documents, quickly extract tabular data, and develop custom approaches to extract key information. Finally, we have invested in people: we have exceptional data scientists and practitioners with many years of experience across a large number of clients and projects.
What are some of the challenges of working with this kind of data?
Extracting meaningful information from unstructured data is nuanced for a number of reasons. Take documents stored in PDF (.pdf) file format as an example. The page content of many PDF files is stored as graphics (essentially an image of the page). Some PDF files also contain a text layer that is combined with the image to produce a searchable document, but not all do. For files that lack one, an interpreted text layer based on the underlying images must be added before any text extraction can begin.
The number of documents and the size of each document also pose processing time issues. Clients can easily deliver thousands or even millions of PDF documents, some with thousands of pages each. Without the right hardware, software, and coding capabilities, processing these documents manually would take years of person-hours and be prohibitively expensive.
Finally, the content of documents can vary considerably. A single document can contain information in several types of formats. This means that any attempt to extract meaningful data from files requires extremely high precision to distinguish different reports from each other, but at the same time must have the flexibility to capture key information expressed in different formats.
Can you tell us how Cornerstone Research typically approaches working with unstructured data?
We can use our sample PDF documents to show how we transform unstructured information into a structured format that can be used in analyses. The first step in any text mining exercise is to examine a sample of the documents and determine the key pieces of information essential for the analysis. This step is fundamental to understanding the structure of the content.
The next step in preprocessing PDF files is to make sure that they contain what is commonly referred to as a “text layer”. The text layer of each document is then separated from its original PDF and stored as a plain text file (.txt file extension), which lends itself to very efficient and flexible processing methods.
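As a rough illustration of this preprocessing step, the sketch below flags pages whose text layer looks empty and stores the rest as a plain text file. It assumes the per-page text has already been pulled out of the PDF by some other tool; the function names, the `MIN_CHARS` threshold, and the form-feed page separator are hypothetical choices for this example, not a description of Cornerstone Research's software.

```python
from pathlib import Path

MIN_CHARS = 20  # assumed threshold; tune it for the document set at hand


def pages_missing_text(page_texts, min_chars=MIN_CHARS):
    """Return indices of pages whose text layer looks empty or too thin."""
    missing = []
    for index, text in enumerate(page_texts):
        usable = sum(1 for ch in text if ch.isalnum())
        if usable < min_chars:
            missing.append(index)
    return missing


def write_text_file(page_texts, out_path):
    """Store a document's text layer as one .txt file, pages split by form feeds."""
    out_path = Path(out_path)
    out_path.write_text("\f".join(page_texts), encoding="utf-8")
    return out_path


# Hypothetical per-page text for a three-page document.
sample_pages = [
    "Daily Report 2020-01-02 Customer ID 10045 Balance 1,250.00",
    "",         # image-only page: no text layer yet
    "   \n  ",  # whitespace only
]
needs_ocr = pages_missing_text(sample_pages)
```

Pages listed in `needs_ocr` would be sent through a text recognition step before the document is written out with `write_text_file`.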
Once the documents are stored as plain text, we run them through proprietary software programs. Using complex conditional logic and text-matching language, programs discern relevant information, including different types and sections of reports, metadata such as dates and customer IDs, and tables containing records of interest.
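To give a sense of what conditional logic and text matching can look like, here is a minimal sketch that pulls a report date, a customer ID, and table rows out of plain text. The report layout, the regular expressions, and the `parse_report` function are all assumptions made for illustration; real reports vary widely and the actual programs are proprietary.

```python
import re

# Hypothetical layout: these patterns are assumptions for illustration only.
DATE_RE = re.compile(r"Report Date:\s*(\d{4}-\d{2}-\d{2})")
CUSTOMER_RE = re.compile(r"Customer ID:\s*(\d+)")
ROW_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?[\d,]+\.\d{2})$")


def parse_report(text):
    """Return (metadata, rows) extracted from one plain text report."""
    meta = {"report_date": None, "customer_id": None}
    rows = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        date_match = DATE_RE.search(line)
        customer_match = CUSTOMER_RE.search(line)
        row_match = ROW_RE.match(line)
        if date_match:
            meta["report_date"] = date_match.group(1)
        elif customer_match:
            meta["customer_id"] = customer_match.group(1)
        elif row_match:
            rows.append({
                "date": row_match.group(1),
                "description": row_match.group(2),
                "amount": float(row_match.group(3).replace(",", "")),
            })
        # Any other line (titles, column headers, page footers) is ignored.
    return meta, rows


SAMPLE = """\
DAILY ACTIVITY REPORT
Report Date: 2020-01-02
Customer ID: 10045
Date        Description          Amount
2020-01-02  Wire transfer in     1,250.00
2020-01-02  Service fee            -15.00
Page 1 of 1
"""

meta, rows = parse_report(SAMPLE)
```

In practice each report type would need its own set of patterns, plus a rule for deciding which set applies to a given section.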
To transform the extracted information into a format that can be analyzed, we load the now-structured text into a database. We take advantage of parallel processing to load multiple intermediate files at once, and data from all records is loaded into one or more tables.
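The loading step might be sketched as follows, assuming the intermediate files are CSVs with `date` and `amount` columns and using an in-memory SQLite database as a stand-in for whatever database is actually used. Files are read in parallel by a thread pool and inserted from the main thread; the file layout, the `records` table, and the function names are hypothetical.

```python
import csv
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_intermediate(path):
    """Read one intermediate CSV into a list of (source, date, amount) tuples."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [(Path(path).name, row["date"], float(row["amount"]))
                for row in csv.DictReader(handle)]


def load_all(paths, conn, workers=4):
    """Read files concurrently, insert into one table, return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (source TEXT, date TEXT, amount REAL)"
    )
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Reads run in worker threads; inserts happen here in the main thread.
        for batch in pool.map(read_intermediate, paths):
            conn.executemany("INSERT INTO records VALUES (?, ?, ?)", batch)
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def demo():
    """Build three small sample files, load them, and return (rows, total amount)."""
    with tempfile.TemporaryDirectory() as folder:
        paths = []
        for i in range(3):
            path = Path(folder) / f"part_{i}.csv"
            path.write_text(
                f"date,amount\n2020-01-0{i + 1},{i + 1}.50\n2020-01-0{i + 1},10.00\n",
                encoding="utf-8",
            )
            paths.append(path)
        conn = sqlite3.connect(":memory:")
        count = load_all(paths, conn)
        total = conn.execute("SELECT SUM(amount) FROM records").fetchone()[0]
        conn.close()
        return count, total
```

A thread pool suits file reading, which is mostly I/O; if parsing dominated the runtime, a process pool could be swapped in for the same pattern.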
The last step is to validate the quality of the extracted data. Our quality assurance processes include independently replicating the text extraction to verify results, calculating coverage statistics to ensure that there are no information gaps, and collaborating frequently with subject matter experts to check the quality of the output.
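Two of these checks can be illustrated with a short sketch: a coverage statistic that compares the keys we expected to find (here, report dates) with the keys actually extracted, and a comparison of two independent extractions. The function names and sample data are hypothetical, and the exact statistics used in practice are not described here.

```python
from datetime import date, timedelta


def coverage(expected_keys, extracted_keys):
    """Return (share of expected keys found, sorted list of missing keys)."""
    expected = set(expected_keys)
    found = expected & set(extracted_keys)
    missing = sorted(expected - found)
    ratio = len(found) / len(expected) if expected else 1.0
    return ratio, missing


def disagreements(extraction_a, extraction_b):
    """Keys whose values differ between two extractions, or appear in only one."""
    keys = set(extraction_a) | set(extraction_b)
    return sorted(k for k in keys if extraction_a.get(k) != extraction_b.get(k))


# Hypothetical check: one daily report is expected for each of five days.
start = date(2020, 1, 1)
expected_days = [start + timedelta(days=n) for n in range(5)]
extracted_days = [date(2020, 1, 1), date(2020, 1, 2),
                  date(2020, 1, 4), date(2020, 1, 5)]
ratio, missing_days = coverage(expected_days, extracted_days)

# Hypothetical results from two independently written extraction programs.
first_pass = {"r1": 100.0, "r2": 25.5, "r3": 7.0}
second_pass = {"r1": 100.0, "r2": 25.0}
to_review = disagreements(first_pass, second_pass)
```

Any missing day or disagreeing record would then be traced back to the source document for review.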
Briefly, what are some other examples of how Cornerstone Research works with unstructured data?
By far the most common type of unstructured data processing in our work resembles the example above, where we extract and organize unstructured data that is visually tabular in nature. Increasingly, however, we are dealing with more complex extractions and characterizations of text, images, and even audio and video documents. This work sometimes focuses on extracting concrete information from documents, such as critical references in free text, text transcriptions from video clips, and the detection of logos and products in images.
In other cases, we aim to quantify more abstract concepts, such as the sentiment associated with social media posts, the thematic composition of public news articles, and the characterization of multimedia marketing materials. This work typically uses AI, machine learning, and text analysis techniques to analyze unstructured data. We hope to cover these topics in more depth in future episodes of this series.
Unstructured data can provide windows into all facets of an organization and its processes, and the growth of unstructured data is expected to accelerate as machine-generated data and machine learning initiatives become more widely used. The data extracted through our process is reproducible and reliable, and it can be effectively leveraged to support expert analyses in litigation and regulatory matters.