Metadata Management in Unstructured Data
Our guest this week is Kirk Marple, the founder of Unstruk Data. He began his career in geospatial with his first job, making maps and performing various geospatial analyses. He later transitioned into software development in the media space, but still doubled back to geospatial from time to time. Currently, Kirk is focused on dealing with geospatial data at Unstruk Data.
What is Unstructured Data?
Unstructured data is everything from imagery to audio, 3D point clouds, documents, emails, and many other digital files. The term unstructured is a bit of a misnomer, since each of these file types has a known schema or file format. The term may refer more to how people view data: most people do not think about the bits on the disk, but are concerned with the contents.
Metadata in Unstructured Data
The metadata of unstructured data provides a starting point for working with it. It can be classified into three levels:
First Order Metadata is the data in the header of a file. It is the bare minimum of metadata that one can get out of a file. For example, you can read the EXIF data of an image, but if you are unable to read the image itself, you will not know what was actually captured.
Second Order Metadata is the data that helps in reading the file and identifying its contents. In the case of images, models are used to detect objects and identify what was captured. Bounding boxes and their tags, often used in training machine learning models, are perfect examples of second-order metadata in images.
Third Order Metadata is data pulled from making inferences across related data and linked databases. It provides a framework for contextualization that creates edges, as in a knowledge graph, connecting one thing to another. This can be thought of as a spider web that grows bigger as more inferences are drawn and more edges are created.
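As a concrete illustration, the three orders might look like this for a single drone image. All field names, values, and entity names below are invented for the sketch:

```python
# First order: what the file header itself says (EXIF-like fields).
first_order = {"width": 4000, "height": 3000, "gps": (47.61, -122.33)}

# Second order: what a model found inside the file, e.g. bounding
# boxes and their tags as used in ML training sets.
second_order = [
    {"tag": "building", "bbox": (120, 80, 900, 640)},
    {"tag": "truck", "bbox": (1500, 700, 1900, 950)},
]

# Third order: edges inferred by joining this asset with other data,
# e.g. linking the GPS point to a parcel record in another database.
third_order = [
    ("image_001", "located_in", "parcel_42"),
    ("parcel_42", "owned_by", "Acme Corp"),
]
```

Each level builds on the previous one: the header locates the image, detections describe its contents, and inferred edges connect it outward to other datasets.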
Is Geospatial Data Unstructured Data?
Geospatial data is less unstructured, largely because it has that extra bit of context: geographical location. It is common to have other information besides the location in the EXIF metadata. For a phone or drone image, you might get the speed, acceleration, camera angle, and much more, which gives geospatial data more structure. This makes it possible to get a tonne of information from a single file, just from the metadata.
Knowledge graphs follow the same principle as a relational database. But while a database has tables with keys that link related records from one table to another, knowledge graphs morph and pivot on edges created by inferencing across data. They are more dynamic: there is no need to constantly update a schema, as you would in a database, to accomplish the same thing. Knowledge graphs provide the flexibility to invent new edges on the go and to pivot on any entity in the system to find everything it relates to.
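A minimal sketch of that pivoting idea, using a plain Python edge list. The entity and relation names are made up; a real knowledge graph store would index and scale this, but the flexibility is the same: a new edge type can be appended at any time with no schema migration.

```python
# Schema-free edge list: (subject, relation, object) triples.
edges = [
    ("image_001", "located_in", "parcel_42"),
    ("parcel_42", "owned_by", "Acme Corp"),
    ("image_002", "located_in", "parcel_42"),
]

def pivot(entity):
    """Return every (relation, other) pair touching `entity`,
    regardless of which side of the edge it sits on."""
    out = []
    for s, rel, o in edges:
        if s == entity:
            out.append((rel, o))
        elif o == entity:
            out.append((rel, s))
    return out

# Pivoting on the parcel finds both images and the owner.
neighbours = pivot("parcel_42")
```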
How Far Out Should a Knowledge Graph Grow?
Knowledge graphs grow bigger and bigger when new connections are established from new data. The knowledge graph’s ‘spider web’ is theoretically infinite, since data enrichment is recursive. The spidering can continue on and on as long as more context can be found from an edge or link. There is a danger in having a never-ending spread of data.
The risk lies in the fact that data enrichment is almost boundless. Excessive resources might be tied up enriching data that the customer may never need. This makes the customer's needs the determining factor of how far out to grow a knowledge graph.
A good general rule is to start cutting off the spider when more data is coming in but no new changes are apparent on the graph.
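That cut-off rule can be expressed as a convergence check: keep enriching until a round produces no edges the graph does not already contain. A toy sketch, with a hypothetical `enrich` lookup standing in for real data sources:

```python
def enrich(edge):
    """Hypothetical enrichment step: look up further context for an
    edge's target. Returns new edges, or [] when nothing is found."""
    lookup = {
        "parcel_42": [("parcel_42", "owned_by", "Acme Corp")],
        "Acme Corp": [("Acme Corp", "registered_in", "Delaware")],
    }
    _, _, target = edge
    return lookup.get(target, [])

def grow_graph(seed_edges, max_rounds=10):
    graph = set(seed_edges)
    for _ in range(max_rounds):
        new = set()
        for e in graph:
            new.update(enrich(e))
        new -= graph          # keep only genuinely new edges
        if not new:           # more data came in, but the graph
            break             # stopped changing: cut off here
        graph |= new
    return graph

g = grow_graph([("image_001", "located_in", "parcel_42")])
```

Here the spidering stops after two rounds, once the recursion finds nothing it has not already seen.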
Dark Data
Dark data is data in the archives that is no longer being used. Unstructured data has a tendency to go dark fast. A growing stream of new unstructured data from drones, robots, and even mobile phones makes it easy to toss old data away in favour of the new data being supplied. A company that does aerial surveys may analyse a day's or a week's data, but when that data ages a little, it goes dark and becomes obsolete. Structuring this data in a knowledge graph provides a way to look across years of data and begin to see trends and commonalities. It bridges the gap between daily workflows and historical analytics.
Machine Learning Models
Models are an essential part of building knowledge graphs. They are the backbone for building the edges and links upon which further inferences can be made. To develop these models, a human has to be in the loop to train the model, review it, and then validate its results.
Model training is a continuous process that makes the model more accurate at detecting objects. If the model gets something wrong, the inferences made on that result will be inaccurate as well.
Practically, there is no one all-purpose model. Models are trained for specific applications and then used alongside each other in an ontology, each detecting the objects it was trained for.
Today, there are third-party vendors that provide generically trained models. These can be used as a starting point to filter results, but may not be useful for identifying more specific things.
For instance, if a generic model identifies a feature as a building, another model that is trained to differentiate buildings can be run to identify whether the building is a shed or garage. More specific models for features like windows or doors can further be strung together to produce even more specific results.
Layering models in this way, defined in a parent-child relationship, helps with cost management. For instance, instead of running a window model on a whole image where buildings cover only 10% of the area, filtering first removes the 90% of the area that has no buildings, leaving less data for the next, more intensive model to process. Carving out data this way optimizes cost and performance.
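A rough sketch of that cost saving, with stub functions standing in for the real models; the detectors, image size, and box coordinates are all invented for the example:

```python
def generic_model(image):
    """Stand-in for a third-party generic detector: returns coarse
    building bounding boxes as (x0, y0, x1, y1)."""
    return [(0, 0, 400, 300), (600, 0, 1000, 300)]

def window_model(region):
    """Stand-in for a specialised window detector run on a crop.
    Details elided; only the processed area matters here."""
    return ["window"]

def area(box):
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)

image_area = 2000 * 1500                      # full frame in pixels
buildings = generic_model(None)               # parent model: whole frame
cropped_area = sum(area(b) for b in buildings)

# The expensive child model only ever sees the building crops.
for box in buildings:
    window_model(box)

print(f"window model processed {cropped_area / image_area:.0%} of the frame")
```

In this sketch the child model touches under a tenth of the pixels the parent did, which is where the cost and performance win comes from.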
Taking Metadata Management to Edge Computing
Edge computing is the concept of pushing computation closer to the data. On satellite platforms, it is used to reduce the amount of data sent back to Earth by doing computations in space, removing the bad pieces, and sending back only the useful data.
In IoT, more and more data is being collected from a variety of sensors. Some companies are starting to use edge computing to reduce the bulk of data sent from these sensors at capture time, mainly by processing metadata at the source and moving from first-order to second-order metadata there.
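A sketch of that source-side processing, with invented frames and a stub detector; the field names and quality thresholds are assumptions, not from any specific device:

```python
def capture_stream():
    """Hypothetical raw frames with first-order metadata attached."""
    return [
        {"id": 1, "blur": 0.9, "gps": None},            # blurry, no GPS fix
        {"id": 2, "blur": 0.1, "gps": (47.6, -122.3)},
        {"id": 3, "blur": 0.2, "gps": (47.7, -122.4)},
    ]

def detect(frame):
    """Stand-in for on-device detection (the second-order step)."""
    return [{"tag": "vehicle"}]

def process_at_edge(frames):
    kept = []
    for f in frames:
        # First-order check: drop frames with no GPS fix or heavy blur.
        if f["gps"] is None or f["blur"] > 0.5:
            continue
        # Promote to second-order metadata before transmitting.
        f["detections"] = detect(f)
        kept.append(f)
    return kept

sent = process_at_edge(capture_stream())
```

Only the frames that pass the first-order checks are enriched and transmitted, which is how the bulk of sensor data gets trimmed at the source.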
There is a danger in shifting metadata creation to the source. If something is not correct right at the source, it might be impossible to come back to it, since the problem is now one of metadata management rather than file management, which is much more fickle. This is why continuous training of the models is invaluable for dialling in accuracy. A human review and approval process is a critical element of the loop.
New Possibilities with Unstructured Data
Machine learning models and knowledge graphs provide a gateway to new products built on processing unstructured data. Semantic search is one such possibility.
This can prove very useful for industries like oil and gas, or real estate. A real estate inspector can use photos of rental apartments to pull a tonne of information about the facilities, e.g., the results of the last inspection, crime data, any reports, and much more.
A searchable web catalogue service can help bring these capabilities to the public.
This might be an open, public API that serves as a crowdsourced catalogue for publishing data into a knowledge graph, which is then made accessible to the public. The catalogue can be a source of useful links that expose non-obvious analytics. By making it public, the knowledge itself becomes open and enables the next person down the line to make use of it.