What is Training Data?
Deep learning is a subset of machine learning, and both fall under artificial intelligence (AI). Ultimately, the goal of AI is to help humans solve problems more efficiently. Although a deep learning or machine learning model takes a hefty investment of time and resources to set up, it can replace a variety of manual processes and save many hours of work in the long run. If a model is trained well, it is not uncommon for it to surpass human accuracy in some situations.
Training data is a collection of annotated data (images, videos, vector data, etc.) that demonstrates to a deep learning or machine learning algorithm how to repeatedly and consistently extract the desired information. Basically, training data is created by human annotators and then fed into a model to “teach” it what to look for and to train it to mimic the human annotator’s decision process.
Training data can be very time-consuming to build. The more complicated the objective of your model, the more training data it will take to train it. Additionally, if you are hoping to pull multiple data points and add attributes from each training data source (e.g. this is a car, and this car has 4 doors), this will also add time and complexity, but the payoff in the end will potentially be better for what you need.
Types of Annotation for Machine and Deep Learning
Annotation is the human part of generating training data for a model. Annotators mark out the features that they want the model to learn to identify, and may add tags to help describe the image if needed. There are different types of annotation and labeling, and some are better matched to certain use cases than others.
The simplest form of annotation is classification of whole images. These are typically simple either/or decisions about an image. Is it a dog or a cat? Is it day or night? Is the target present or not? This kind of training data can be generated quickly, but it does not produce very detailed or informative outputs compared with other annotation methods.
The coarsest method of marked annotation is the bounding box. Annotators mark two opposite corners of the feature they are highlighting (for example, the lower left and upper right), and the resulting box represents the feature. It is even possible to create 3D bounding boxes, called cuboids, to annotate point clouds. This method is great for uses like object tracking, where there is a lot of movement that would be difficult to mark precisely and a finer level of granularity would not add much to the output.
A step up from the bounding box is the polygon, which is basically digitizing the target feature. Polygons allow the annotator to delineate the feature’s extent more precisely and to collect more specific information. For example, polygons of cars could help train the model to more accurately identify whether a car’s doors are open or closed. Of course, this method takes the annotator more time than a bounding box, but the additional information and improved accuracy may be important for your use case. Other vector geometries can be collected, such as points to track facial movements or lines to track and predict routes, but polygons are the most common of the big three data types used here.
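As a minimal sketch of how the vector annotations above might be stored, the Python below represents a polygon annotation with a class label and free-form attributes, and derives the coarser bounding box from it. The `Annotation` class, its field names, and the sample coordinates are hypothetical and chosen only for illustration; real annotation tools each use their own schema.

```python
from dataclasses import dataclass, field


# Hypothetical record for one annotated feature; not any tool's real schema.
@dataclass
class Annotation:
    label: str                                       # class name, e.g. "car"
    polygon: list                                    # (x, y) vertices in pixel coordinates
    attributes: dict = field(default_factory=dict)   # optional extra tags

    def bounding_box(self) -> tuple:
        """Return (min_x, min_y, max_x, max_y), the box enclosing the polygon."""
        xs = [x for x, _ in self.polygon]
        ys = [y for _, y in self.polygon]
        return (min(xs), min(ys), max(xs), max(ys))


# Made-up example: a four-door car with its doors closed.
car = Annotation(
    label="car",
    polygon=[(12, 40), (58, 38), (64, 61), (15, 66)],
    attributes={"doors": 4, "door_open": False},
)

print(car.bounding_box())  # (12, 38, 64, 66)
```

The polygon keeps the precise outline, while the bounding box can always be recomputed from it. The reverse is not true, which is one reason the more detailed annotation costs more to collect.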
For raster training data, we have semantic segmentation, instance segmentation, and panoptic segmentation to choose from. Semantic segmentation is the practice of marking all of the pixels of the desired object as the “correct answer”. Instance segmentation is very similar, but adds the extra step of assigning a unique identity to each feature. For example, with semantic segmentation we would have data tagged “car, car, car”, whereas with instance segmentation we would have “car 1, car 2, car 3”. Panoptic segmentation marks every pixel in the image as something. Instead of just giving an identity to the target feature, we would also tag the pixels of the background, sky, buildings, etc.
Raster annotation methods are generally the most resource-intensive to create, but they are known to produce the most accurate and precise results in the resulting model. Considering the potential payoff, this is definitely appealing, but there are some things to take into consideration if you want reliable and consistent results.
Creating and Maintaining Quality Training Data
Your algorithm will only ever be as good as your training data. This is why it is vital to understand what makes for high-quality training data before getting deeper into the process. The key to getting the expected results from your final model is to use training data that is as close as possible to the data the final model will be run on. This is known as ground truthing. If you will be using the model with 3-band rasters and 1x1 m cell sizes, your training data should be of the same type. Differences in resolution can cause issues because the model is now seeing the target in a different context from the one it was trained on, resulting in lower-quality results, if it works at all. Some other differences to consider are the type of sensor used, the angle the target is viewed from, and lighting and weather conditions.
Let’s say you have gone through the whole process of creating a machine learning model to identify cars in an image. You have annotated your training data, built and trained the model, and are getting the results you expected, but now your organization’s needs have changed and it needs to know what color each car is as well. Do you need to start the model building process over from scratch? Thankfully, no. If you have a working model, you can update its training data to adjust the model to your new use case. In our car example, the annotator can go through and tag each of the training samples with just the color, as the car has already been delineated. Another option is to use the outputs from the previously created model to train your new one. If your model is giving you image chips of cars, then you are already most of the way there, and can simply classify the car colors using those outputs and plug them into your next model.
A logical question is whether the human role in this process can be removed. That is unlikely, as introducing human logic and thinking into the system is necessary to make sure the end product is meaningful to humans. When computers are left to their own devices, they will take shortcuts and make interpretations that make no sense to people, which can render the output useless. Keeping humans in the loop encourages transparency throughout development of the model, and ultimately results in a better product.
Artificial intelligence is, of course, still a young field. Ten years from now the landscape will have changed, and some things we thought impossible may turn out to be things we simply failed to consider. As the technology finds its way into more markets, new use cases will develop, and with them, new ideas.
In Conversation
What iMerit Does
Daniel: Mallory, welcome to the podcast. Can you introduce yourself?
Mallory: iMerit is a data labelling services company for AI and ML use cases. We’re not building algorithms or providing databases — we’re a service that labels data for the purpose of training AI/ML models across a wide variety of use cases. I’m a Solutions Architect — I work with new incoming customers who have a use case and want to spin up a project. I dive in to understand what they’re trying to do short-term and long-term, then help them understand the requirements, how the project will work, and how we’ll execute it.
Training Data: Programming Through Humans
Daniel: What is training data?
Mallory: Training data is the data you use to teach your algorithm how to do something. You need it to recognise specific objects or make specific judgements you care about. That’s where the phrase “human in the loop” comes from. Your initial data set is raw data — purchased, or collected through whatever mechanism (drones, cameras, satellite imagery). Training data is the result of human annotators going through and labelling it according to the criteria your algorithm needs to identify. The humans do the exact same task the algorithm will — except manually, by looking at images and marking and classifying things. The output is a usable training data set you feed through your models to derive insights.
Daniel: So a training data set is essentially a set of correct answers?
Mallory: Yes. The quality of the training data directly impacts the quality of the output. An algorithm is only ever going to be as good as its training data. You want that data as accurate and close to ground truth as you can possibly make it.
Daniel: Take “find all the cars” as an example.
Mallory: First step: define what you mean by car. Does that include trucks? Vans? Semis? Motorised scooters? You may need additional attributes or classes — separating cars from trucks, sedans from hatchbacks. Then we need to know what type of annotation — bounding box? Down-to-the-pixel precision? Then you transfer data to my team, we work according to your guidelines, do internal quality checks with a different set of eyes, and send it back. You may want to do your own check to evaluate how well we understood the brief. At that point you have your training data.
Matching Training Data to Production Data
Daniel: So I can’t just give you a Google search of cars?
Mallory: No — the training imagery (or video, lidar, satellite) should be as similar to your long-term production data as possible. Your algorithm learns from the annotations and from the images themselves. Train on Google imagery but deploy on high-res images from your own vehicles and there’s a discrepancy that hurts accuracy. Plan to get a subset of the actual data you’ll use long-term.
Daniel: Is the discrepancy only about pixel resolution, or also about conditions and sensors?
Mallory: All of the above. Day vs night, sunny vs cloudy vs rainy — all affect imagery. You want variety in your training set so your algorithm can identify objects under varying conditions. Different satellite providers can have different offsets, so annotations might not line up exactly across data sets. Annotations are only as good as the resolution they’re done on — higher resolution and denser point clouds lead to more precise annotations and better algorithm output.
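One way to catch such mismatches early is to compare basic metadata of the training imagery against the imagery the model will see in production. The sketch below is a hypothetical Python check: the `ImageryProfile` fields, the example values, and the 10% resolution tolerance are assumptions made for illustration, not an established rule.

```python
from dataclasses import dataclass


# Hypothetical summary of an imagery source; the fields are illustrative only.
@dataclass(frozen=True)
class ImageryProfile:
    bands: int           # number of raster bands, e.g. 3 for RGB
    cell_size_m: float   # ground resolution in metres per pixel
    sensor: str          # e.g. "drone", "satellite", "camera"


def mismatches(train, prod, tolerance=0.10):
    """Return a list describing how training imagery differs from production."""
    problems = []
    if train.bands != prod.bands:
        problems.append(f"band count: {train.bands} vs {prod.bands}")
    # Flag resolution differences larger than the assumed relative tolerance.
    if abs(train.cell_size_m - prod.cell_size_m) > tolerance * prod.cell_size_m:
        problems.append(f"cell size: {train.cell_size_m} m vs {prod.cell_size_m} m")
    if train.sensor != prod.sensor:
        problems.append(f"sensor: {train.sensor} vs {prod.sensor}")
    return problems


# Made-up profiles: production satellite imagery vs images pulled from the web.
production = ImageryProfile(bands=3, cell_size_m=1.0, sensor="satellite")
web_images = ImageryProfile(bands=3, cell_size_m=0.3, sensor="camera")

for problem in mismatches(web_images, production):
    print(problem)
```

A check like this only covers properties that are recorded as metadata. Conditions such as lighting and weather still need a deliberate mix of samples in the training set.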
Continuous Learning and Model Iteration
Daniel: What about continuous learning?
Mallory: Increasingly common. Human-in-the-loop is continuous — not one and done. New data sources become available; needs change. Maybe a year in you need to know whether a car is moving or parked, or what colour it is. You can use the algorithm’s existing output as input: we add additional annotations on top of what your model already produced, or correct it (QC). Sometimes we’re not drawing from scratch — we’re augmenting and improving what the model has done.
Daniel: Can you remove data from a model?
Mallory: Honestly I haven’t seen anyone want to do that. You could in principle have human annotators remove annotations you no longer want and feed that back as negative signal — there are ways to assign importance — but it hasn’t come across my desk. Interesting thought.
Annotation Types: From Bounding Boxes to Panoptic Segmentation
Daniel: Walk me through annotation types.
Mallory: I use “annotation” and “labelling” interchangeably, though sometimes “labels” or “tags” specifically mean classification — applied to the whole image, not drawn on it.
Mallory: Bounding boxes are the simplest — a box around an object, defined by top-left and bottom-right coordinates. Mostly used for object detection and tracking. In lidar there’s 3D bounding boxes called cuboids. Then polygons — more precise contours, essentially digitisation. Points (often called keypoints) for things like facial recognition — corners of eyes, corners of mouth, points on the nose. Lines for roads, rivers, routes. Those are the main vector types.
Mallory: On the raster side there’s semantic segmentation — painting all pixels belonging to a class. Panoptic segmentation classifies every single pixel in the image as something. Instance segmentation is in between — marking semantic categories but with discrete instances (car 1, car 2, car 3) rather than just “car” as a blob.
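To make the difference between the three raster types concrete, here is a small Python sketch on a made-up 4x6 image containing two cars on a road under a sky. It follows the definitions given in the conversation. The grids, class names, and ID scheme are all invented for illustration; real masks would be image-sized arrays produced by an annotation tool.

```python
# Toy 4x6 scene, one character per pixel: S = sky, R = road, C = car (hypothetical).
scene = [
    "SSSSSS",
    "CCRRCC",
    "CCRRCC",
    "RRRRRR",
]

# Instance IDs for the same pixels: 0 = not a car, 1 and 2 = individual cars.
instance_ids = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 2, 2],
    [1, 1, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

# Semantic segmentation of the target class: every car pixel gets the same label.
semantic = [["car" if px == "C" else None for px in row] for row in scene]

# Instance segmentation: car pixels are labelled per object ("car 1", "car 2").
instance = [[f"car {i}" if i else None for i in row] for row in instance_ids]

# Panoptic segmentation: every pixel gets a class, and cars keep their instance ID.
names = {"S": "sky", "R": "road", "C": "car"}
panoptic = [
    [(names[px], i or None) for px, i in zip(row, ids)]
    for row, ids in zip(scene, instance_ids)
]

print(semantic[1])
print(instance[1])
print(panoptic[1])
```

In the semantic mask the two cars are indistinguishable, the instance mask separates them, and the panoptic mask additionally leaves no pixel unlabelled.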
Classification, Attributes, and the CAPTCHA Question
Daniel: The CAPTCHAs you see online — “click all the cars,” “click all the buses” — is that classification tagging?
Mallory: A good example. You’re not marking specific pixel locations — you’re marking whether the image contains the object. Classification tagging.
Daniel: Can you attach attributes to an annotation?
Mallory: Yes — to bounding boxes, polygons, points, lines, anything. Maybe each car gets the colour, whether it’s parked, whether there are passengers. You include attributes if they matter for your algorithm to identify them later. If you just want a count of cars in an image, additional attributes are wasted effort. Depends entirely on the use case.
Choosing the Right Annotation Type
Daniel: How do I know which annotation type to use?
Mallory: Raster annotations are more precise — important when your algorithm needs the exact contours of an object. Bounding boxes are less precise but great for object tracking in video where someone is running or playing basketball — you don’t want annotators drawing a polygon around every moving limb. You want to maximise precision and accuracy where you need it while also maximising volume and cost-efficiency. A simpler annotation lets you process much more data through the workflow.
Mallory: Think in stages — initial training data might be high-volume, less precise annotations focused on simple object detection. Then continuous training adds nuance with more precise annotation types: differentiating a person standing from a person sitting, a car with a door open from one with all doors closed. Building complexity through iterations.
Will Humans Always Be in the Loop?
Daniel: Can you imagine a time when humans aren’t needed in the loop?
Mallory: So far the answer seems to be no. Human context and input have always been needed to get an algorithm to do something meaningful to human users. Even after the initial training, ongoing improvements bring humans back in. There’s a famous story from a few years ago — engineers at Facebook were experimenting with AI for encryption. They had two competing algorithms trying to communicate without a third being able to decipher them. The competitive approach succeeded — but the two algorithms ended up creating their own encryption language that was completely unreadable to humans. Useful to computers, useless to us. They shut it down. That’s what happens without human context — AI does things that make sense to computers but are unusable by humans. The whole point of AI is to help us, so an unintelligible AI has no purpose. I think humans and AI will always be a partnership.
Daniel: Where are we on the hype curve? Do people show up with realistic expectations?
Mallory: Mostly yes. There have been a few cases of overkill — typically when use cases are very subjective. Subjectivity is hard to deal with in training data; you have to account for it across multiple annotators. Some clients are unsure of how much they need or where to start, but the use cases themselves are usually realistic.
Daniel: What use cases excite you?
Mallory: Utilities management. Doesn’t sound exciting, but utility companies and asset owners are realising they can use AI to evaluate public infrastructure — electrical towers, gas lines, roads, bridges. Fly a drone over your assets, run the imagery through AI, and get back condition assessments, damage classifications, debris detection, inventory. Instead of inspectors covering miles and miles of assets manually, you get continuous coverage. Medical AI is also a big growth area — pharmaceuticals, hospitals, medical research, surgery assistance, scanning X-rays and ultrasounds for tumours and cancer cells. I’m not as close to it personally, but it’s remarkable.
Daniel: I was sure you were going to say autonomous vehicles.
Mallory: Definitely an industry driver — it’s been a front-runner for the whole AI industry for a long time, and there’s still huge movement there.
Daniel: Where can people learn more?
Mallory: imerit.net for our verticals, use cases, and contact info. Our team is on LinkedIn. Pre-COVID we were at all kinds of AI conferences — hopefully we’ll get back to that soon. Reach out and we can guide you through the process.

