An Introduction To Artificial Intelligence
Daniel Whitenack is a data scientist for SIL International and he’s on a mission to demystify AI.
Before he was a data scientist and AI person, he started out his career in computational physics — studying the many-body theory of electrons and atoms and molecules scattering off of one another.
He developed a good amount of modeling, computer and math skills.
When Daniel went into the industry, it was the time of hype for data science — before the hype around AI. He got a job as a data scientist.
Over the next decade, he covered fraud detection, pricing optimization, and the analysis of comments on news posts.
Currently, he’s working on AI technology that benefits local language communities — speech recognition for local languages, machine translation, and different natural language processing techniques.
WHAT IS AI
Some kind of messaging needs to get out there because you hear a lot of hype about AI without really knowing what it is.
I think of AI in terms of functions.
A function is simply: you give me some input, I give you some output.
You give me a number and I add one to it. That’s an “add” function.
You give me an input; I do a transformation and get an output.
Functions have existed in software engineering since the beginning of software engineering. Still, not all of them have been AI functions. In software engineering, people “hand curate” the logic associated with those functions.
If I want an “add” function or a math function, I write it and specify precisely what happens to the input, how it’s transformed, and how I provide output.
That’s hand-curated by me, the human.
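As a minimal sketch of such a hand-curated function (the name `add_one` is only illustrative), every step of the transformation is written out by the programmer and nothing is left for the computer to figure out:

```python
def add_one(x: float) -> float:
    """Hand-curated logic: the programmer decides exactly how input becomes output."""
    return x + 1


print(add_one(41))  # 42
```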
With an AI, machine learning or deep learning function, I as the programmer don’t specify all the logic of that function. Instead, I create a function with internal parameters in it that I don’t set.
I leave those as parameters. They’re set by the computer through trial and error.
This is training in AI.
Let’s say you wanted to create an AI function or an AI model that recognizes cats in images.
You could write a function that says, “If I see a certain color in the image, then classify the image as a cat.” And you wouldn’t specify the exact color you’re looking for because you want the computer to set this parameter.
You’ve defined your function.
Now you need to train your model to fit that function or find that parameter. If you have several images, some of them may be cats, and some of them may not be cats.
You let the computer do trial and error to try different colors and see which one gives you the best results for classifying cats.
Obviously, that function wouldn’t work very well because it only has one parameter, and computer vision is quite complicated.
But ultimately, that’s what an AI or machine learning model looks like. The core of the technology is letting the computer do the work to run one function or one process that sets the parameters for a different function, which you use to do your task — computer vision or machine translation or something similar.
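The one-parameter cat example can be sketched in a few lines. This is a toy under loud assumptions: each "image" is reduced to the set of color names it contains, the data is made up, and the names `is_cat`, `accuracy` and `train` are hypothetical. The point is only that one process (`train`) sets the parameter of a different function (`is_cat`).

```python
# Made-up training data: (colors present in the image, is it a cat?)
LABELED_IMAGES = [
    ({"orange", "white", "green"}, True),
    ({"orange", "brown"}, True),
    ({"grey", "orange"}, True),
    ({"green", "blue"}, False),
    ({"brown", "white"}, False),
    ({"blue", "grey"}, False),
]


def is_cat(image_colors, color):
    """The function we defined; `color` is the parameter we did not set by hand."""
    return color in image_colors


def accuracy(color):
    """Fraction of labeled images classified correctly with this parameter value."""
    correct = sum(is_cat(img, color) == label for img, label in LABELED_IMAGES)
    return correct / len(LABELED_IMAGES)


def train():
    """Trial and error: try every color seen in the data and keep the best one."""
    candidates = sorted(set().union(*(img for img, _ in LABELED_IMAGES)))
    return max(candidates, key=accuracy)


best_color = train()
print(best_color, accuracy(best_color))  # orange 1.0
```

On real photos a single color would of course fail, which is why practical models have far more parameters.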
TRAINING THE ALGORITHM – WHAT DOES TRAINING DATA LOOK LIKE? WHAT IS GOOD TRAINING DATA? WHAT IS BAD TRAINING DATA?
Training data is having samples with the kinds of inputs you expect to see in actual production or actual usage of the model — but paired with the desired outcome you expect your model to produce.
For finding cats in images, you’d have many images, and then you’d have the correct label for those.
Some labels would say, “Not cat,” and some would say, “Cat.”
This combination of sample inputs and known labels for those inputs is what allows me to train the model.
This is logical. We do this all the time as humans. When a child wants to learn what a cat is, you can show them a picture of not a cat and say, “Not a cat,” and a picture of a cat and say, “It’s a cat.”
If you show them enough labeled examples, they’ll figure out which ones are the cats.
It’s the same process for training the algorithm. We have data labeled “Cat” or “Not cat.” Then we make a random choice of the parameters in our model — any initial choice of those parameters. Then we make predictions for all of those images based on our initial guess for what our parameters should be.
You say, “Algorithm. Remember all this inside of this AI function.”
We check how well we did. We know all the answers, so we can see which ones we should have got right, because we know which images are cats and which are not.
If we’ve got only 60% right, we need to change the parameters a bit and recalculate our predictions to see if we can do better.
The training process is doing it over and over and over and over. Thankfully, computers are great at repetitive tasks and doing them on a large scale.
There are ways to speed it up without relying on brute force — we can follow an optimal trajectory to the correct answer.
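The loop described above (random start, predict, check, adjust, repeat) can be sketched as follows. Everything here is an assumption for illustration: the data is invented, each image is reduced to one hypothetical numeric feature, and gradient descent on a logistic model is used as one common way of following a better trajectory than blind trial and error.

```python
import math
import random

# Invented data: (one made-up feature per image, label) with 1 = cat, 0 = not cat.
DATA = [(0.9, 1), (0.8, 1), (0.7, 1), (0.75, 1),
        (0.3, 0), (0.2, 0), (0.1, 0), (0.35, 0)]


def predict(w, b, x):
    """Probability that the image is a cat, given parameters w and b."""
    return 1 / (1 + math.exp(-(w * x + b)))


def accuracy(w, b):
    """Fraction of labeled examples the current parameters get right."""
    hits = sum((predict(w, b, x) >= 0.5) == bool(y) for x, y in DATA)
    return hits / len(DATA)


def train(steps=2000, lr=1.0, seed=0):
    rng = random.Random(seed)
    # Any initial choice of the parameters will do; here it is random.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(steps):
        # The gradient says in which direction to nudge each parameter.
        grad_w = sum((predict(w, b, x) - y) * x for x, y in DATA) / len(DATA)
        grad_b = sum((predict(w, b, x) - y) for x, y in DATA) / len(DATA)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


w, b = train()
print("accuracy after training:", accuracy(w, b))
```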
FOR A TRAINING DATA SET, IF WE HAVE ALREADY LABELED IMAGES OF CATS CAN WE THEN SAY, “HERE ARE THE CATS, ALGORITHM. YOU FIND THE PARAMETERS WHICH ARE COMMON TO ALL OF THEM, AND THEN RUN IT AGAINST A LARGER DATA SET?”
It’s not far off the truth, but I’d phrase it differently.
People expect AI to be magical. That the computer knows the answer as if by magic.
We have an unparameterized function — a data transformation — where we specified the structure of that. Still, we don’t know the correct parameters. We don’t know which ones we should use.
We need the computer to figure out the right or optimal parameters to transform input images into output labels of cats.
But we have to start somewhere. We could give the computer that structure and set the initial parameters as all zeros, random numbers, or some other scheme of setting those and let the computer iteratively go over the images and update the parameters.
People think it’s less structured than it is. Ultimately, the name of the game is parameter fitting, in most cases, except for a few unique AI research problems.
HOW IMPORTANT IS THE INITIAL PARAMETER? WILL IT TAKE LESS COMPUTE TO FINISH THE TASK?
It also depends on the complexity of the problem.
The bulk of AI, in the industry, uses the idea of transfer learning.
Suppose you’ve trained a model to recognize dogs in images. You already have your parameters for your dog model.
I want to recognize cats in images. Those two tasks are very similar. Why don’t I just start with the parameter set you found for your dog model rather than starting from some random seed?
It makes sense because my task is very similar to yours. I can take the knowledge you’ve already developed in your parameter set and tweak it a bit.
That’s going to be a lot more computationally favorable than always starting from scratch.
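A toy sketch of why the starting point matters: two similar tasks are stood in for by two nearby straight lines (an assumption purely for illustration; `dog_data` and `cat_data` are invented), and the same fitting routine is run on the second task from scratch and from the parameters found for the first.

```python
def make_data(w_true, b_true):
    """Noise-free points on the line y = w_true * x + b_true."""
    xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
    return [(x, w_true * x + b_true) for x in xs]


def loss(params, data):
    """Mean squared error of the line described by params on the data."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit(data, start, lr=0.1, tol=1e-8, max_steps=10_000):
    """Gradient descent from `start`; returns the parameters and steps used."""
    w, b = start
    steps = 0
    while loss((w, b), data) > tol and steps < max_steps:
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
        steps += 1
    return (w, b), steps


dog_data = make_data(2.0, 1.0)   # stand-in for the "dog" task
cat_data = make_data(2.2, 0.9)   # a similar, but not identical, "cat" task

dog_params, _ = fit(dog_data, start=(0.0, 0.0))
_, steps_from_scratch = fit(cat_data, start=(0.0, 0.0))
_, steps_fine_tuned = fit(cat_data, start=dog_params)

print("from scratch:", steps_from_scratch, "steps")
print("from dog parameters:", steps_fine_tuned, "steps")
```

The gap is modest in a two-parameter toy; with millions of parameters and real images the saving is what makes transfer learning attractive.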
IS THIS WHAT WE REFER TO SOMETIMES AS PRE-TRAINED MODELS OR PRE-TRAINED ALGORITHMS?
That’s correct.
When the parameters are already set, we call that a pre-trained model.
The jargon is a little confusing with the “pre” part because it means before — but pre-trained means it’s already trained before I even get to use it. I don’t have to retrain or refit those parameters because they’re already set.
IS THIS ALGORITHM AN “OFF THE SHELF” PRODUCT USERS CAN APPLY TO THEIR OWN USE CASES? OR IS THERE A VARIETY OF ALGORITHMS THEY CAN CHOOSE FROM?
There’s a host of different algorithms to choose from.
Researchers have come up with these different models — or architectures.
In our continual effort to confuse people with jargon, we have a name for it.
Architecture is a neural network configuration. There are specific “layers” of a neural network we bolt together to create an architecture.
That whole thing bolted together is our neural network.
Some layers are better at processing certain types of data than other ones.
For example, image data is often processed with convolutional layers.
When we look at an image, we don’t take in the entire image at once. We look at different parts of it: “There’s a plane up here” or “There are some edges down here that look like road stripes.”
We process different information in different parts of the image.
A convolution layer convolves over the image with a kernel. It looks at different parts of the image rather than the whole thing.
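To make "convolving with a kernel" concrete, here is a minimal pure-Python sketch. The tiny image and the hand-picked edge kernel are assumptions for illustration; in a real layer the kernel values are learned parameters. Strictly speaking this computes a cross-correlation, which is what deep learning libraries usually call convolution.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4x4 toy image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

Each output value only depends on a 2x2 patch of the image, and the large values mark where the edge is.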
Although you can use convolutional layers for natural language processing, what’s used is recurrent layers, which look at a sequence of things.
Text is a sequence of words or characters, depending on how you break it up. The order of that sequence matters — the relationship between things in the sequence matters. Using a type of architecture of a neural network made up of layers that process sequences of things is helpful in natural language processing.
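A minimal sketch of a recurrent layer, assuming a single hidden number and hand-picked weights (`w_input` and `w_hidden` are hypothetical; in practice they are trained). It reads the sequence one item at a time and carries a hidden state forward, so the same items in a different order give a different result.

```python
import math


def recurrent_layer(sequence, w_input=0.5, w_hidden=0.9):
    """Process items in order; the hidden state summarizes what was seen so far."""
    hidden = 0.0
    for x in sequence:
        hidden = math.tanh(w_input * x + w_hidden * hidden)
    return hidden


forward = recurrent_layer([1.0, 2.0, 3.0])
backward = recurrent_layer([3.0, 2.0, 1.0])

# Same items, different order, different final state.
print(forward, backward)
```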
There’re many things people try that work better or worse in some instances.
ARE NEURAL NETWORKS A TYPE OF AI? DO THEY BELONG TO THE DEEP LEARNING SIDE OR MACHINE LEARNING SIDE OF THINGS?
The answer to that isn’t straightforward.
People use the terms AI, machine learning, and deep learning in a host of various ways. This has made the distinctions between those terms hard to define.
On the machine learning side, many times, what people think of are models that have been used for quite some time that aren’t neural networks — decision trees, random forests, naïve Bayes and other things.
On the deep learning side, people work with neural network-based architectures that are larger in scope.
A simple way to think about their differences is this.
Machine learning models traditionally needed more expert input.
If you’re modeling text, maybe you have a unique model intended to understand how linguistics works and how text is related.
In the deep learning world, the thought process is different because you still want to know what my input data looks like, but you’re creating a big function with millions and millions of parameters that can model just about any sort of input-output relationship.
And so it’s less interpretable.
Deep learning is more of a black box versus traditional machine learning because with the latter you understand the structure of your model more.
The simplest of machine learning models, based around linear regression, might have 2-5 parameters.
Something more robust, like a random forest model, might have hundreds or thousands of parameters.
As soon as you add in convolution or recurrent layers in a deep learning model, your model sort of expands to millions or even billions of parameters.
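The jump in scale is simple arithmetic. As a rough sketch, the helpers below count parameters for a linear regression and for a stack of fully connected layers; the layer sizes are invented, and fully connected layers are used here only because their count is the easiest to write down (convolutional and recurrent layers have their own formulas).

```python
def linear_regression_params(n_features):
    """One weight per input feature plus one intercept."""
    return n_features + 1


def dense_network_params(layer_sizes):
    """Each layer has a weight per input-output pair plus one bias per output."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


print(linear_regression_params(4))                  # 5
print(dense_network_params([784, 2048, 2048, 10]))  # 5824522
```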
One distinction data-wise is if you’re going to fit a model with over a billion parameters, you will not do it with 100 images.
The scale is not right.
In the deep learning world, where you have these larger models with many parameters, you need more data to properly fit those parameters.
There are models like customer lifetime value models used in marketing and sales that require very little data to work nicely in the machine learning world.
The scale is definitely different.
That doesn’t mean that there is nothing you can do if you don’t have billions of images to train on.
A lot of industry AI is built on pre-trained models and transfer learning. You may only have 100 images in your data set, but if you can build on a model that’s already been trained on 10 million images, then you might be just fine.
DOES DEEP LEARNING REQUIRE SIGNIFICANTLY MORE COMPUTE?
There are two phases of AI work with a model.
One is the training side — you fit the parameters of your model.
The next is inference — you make predictions.
You’ve already trained all the parameters. You don’t have to retrain your parameters — every time you make a prediction, you do the inference part.
The computational needs are slightly different in both cases.
For deep learning, on the training side, specialized hardware — GPUs — comes into play.
Neural networks are structured so that, at the very core, what’s happening is matrix-based operations.
Graphics processing units, or GPUs, are great at doing those operations. To train a deep learning model, you need one or more GPUs. You won’t make any advancement on very large models if you don’t have a cluster of these GPU-enabled machines.
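To show what "matrix-based operations" means here, this is a minimal sketch of one fully connected layer written as a matrix-vector product followed by a ReLU. The weights, bias and inputs are made-up numbers; a GPU speeds up exactly this kind of multiply-and-add work by doing many of them in parallel.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dense_layer(weights, bias, inputs):
    """One fully connected layer: ReLU(W x + b)."""
    pre_activation = [z + b for z, b in zip(matvec(weights, inputs), bias)]
    return [max(0.0, z) for z in pre_activation]


weights = [[1.0, -1.0, 0.5],
           [0.0, 2.0, -1.0]]
bias = [0.5, -1.0]
inputs = [2.0, 1.0, 4.0]

output = dense_layer(weights, bias, inputs)
print(output)  # [3.5, 0.0]
```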
And suppose you want to beat the record on an object recognition benchmark. In that case, you’ll need a specialized hardware cluster with many tens or hundreds of GPUs to train your model because there’s so much data.
Where I work, we only have a handful of GPUs, some of which are on-premise. That’s sufficient for our needs because we’re not trying to break benchmarks, and we use transfer learning techniques.
For the inference side, it’s a mixed bag.
Let’s say you’re trying to do real-time speech recognition or real-time image processing.
“Here’s an image.”
If you need the answer in less than a second, you might still need specialized inference hardware: either a GPU or another processing unit for inference, like Graphcore units, that can keep up with the pace of real-time processing with a large model.
You can run many AI models for batch processing on the inference side on a CPU. There may be memory constraints and such because your model is 3 gigabytes in size, and you need to load it into memory. But it’s not as intensive on the inference side.
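A back-of-the-envelope sketch of where a figure like 3 gigabytes comes from: parameter count times bytes per parameter. The 750 million parameter count is a hypothetical chosen to match that size, and storing parameters in fewer bytes is named here only as one common kind of optimization, not something the interview specifies.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def model_size_gb(n_params, dtype="float32"):
    """Approximate memory needed just to hold the parameters, in gigabytes."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


print(model_size_gb(750_000_000))          # 3.0
print(model_size_gb(750_000_000, "int8"))  # 0.75
```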
Plus, for various models, even that can be optimized and run on things like your phone or a Raspberry Pi doing complicated operations, like image recognition and computer vision tasks.
WHEN IS IT BEST TO USE DEEP LEARNING? WHEN IS IT BEST TO USE MACHINE LEARNING?
The first question is,
Do you need to use either?
Sometimes, not at all.
An excellent way to decide is this:
In your work, can you check off one or both of these boxes?
- Is there a complicated transformation of data that needs to happen? One where even humans can’t think of how that transformation would happen.
I’ll give you an example. Machines can detect mental health and cardiac issues from a person’s voice. It’s unlikely you could do that as a human. AI models can do that and solve something a human can’t because the transformation of the data is so out of what we would think of.
- Is scale an issue in what you’re trying to do?
You can recognize cats in images better than any computer. Or as well as any computer.
But if I give you 2 million images and say, “Label all the cats,” you’ll get tired. Depending on what deadline I give you, you might not be happy.
Even if that operation can be done by a human, sometimes you still want to automate it at scale with a model.
These are two things to keep in mind when you think of machine learning or deep learning.
When are they an excellent choice?
Two things would motivate staying away from deep learning models.
One is interpretability. There’s a high burden for specific industries, such as financial or health care or government. You need to explain and audit the decisions you make.
If you want to do that and have more interpretability of how you decided, you’re probably after a simpler model.
The other thing is performance. There is specialized hardware involved in an AI project and you may not have access to that. If a simpler machine learning model can be trained on my laptop and can solve my problem, then I shouldn’t have my company spend $2 million on a specialized AI hardware cluster.
Deep learning is excellent at problems that have not been solved in other ways, such as recognizing health issues, specific computer vision tasks or complicated optimizations.
Machine learning models are excellent at processing parts going through a manufacturing line. Say you’re trying to put a chip into a board. They do an outstanding job with detecting edges, “Here’s an edge, and then I’m going to move over 3 centimeters and put in the chip”.
What they’re not doing is perception — they won’t know where the socket is so they can put in the chip.
Deep learning has more to offer for the perception side of things.
DEEP LEARNING IS LIKE A BLACK BOX
We don’t have control over what’s happening inside the algorithm. We can’t precisely document the data we put in and how exactly the transformation happened to get the result.
This may hold certain things back for the industry.
However, there is a large-scale trend where two things are happening.
One is that people are researching and providing tools for interpretability of these models that help us understand where they’re biased and what features are causing the model to make certain decisions.
The tools are gradually increasing.
The other thing is that we’re getting better at testing our work.
We’ve been in the wild west of AI for quite some time — primarily scientists and data scientists doing stuff. Not software engineers. And they do it around the idea of experiments, not production-ready systems.
I’m production-ready.
When I’m using AI, I need ways to test things. In software engineering, unit testing and CI/CD, or continuous integration/continuous deployment, have been developed and standardized over time. There are plentiful best practices for making sure you understand how your software will behave before you put it into production.
We are making progress with these techniques in the AI world. We have things like adversarial attacks, which are a way to probe the behavior of your model. We have ideas such as minimum viability testing, which helps you ensure your model behaves for many test cases in the way you expect, as the model is updated.
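A hedged sketch of what such behavior tests can look like. The keyword-based `toy_sentiment_model` is a hypothetical stand-in for a trained model, and the test cases are invented; the idea is that the same list of expected behaviors is re-run every time the model is updated, just like unit tests in a CI/CD pipeline.

```python
def toy_sentiment_model(text):
    """Hypothetical stand-in for a trained model: 'positive' or 'negative'."""
    negative_words = {"bad", "terrible", "awful"}
    words = text.lower().replace(".", "").replace("!", "").split()
    return "negative" if any(w in negative_words for w in words) else "positive"


# Behaviors we expect to hold for every version of the model.
BEHAVIOR_TESTS = [
    ("The service was great", "positive"),
    ("The service was terrible", "negative"),
    ("THE SERVICE WAS TERRIBLE!", "negative"),  # casing and punctuation
    ("What an awful experience.", "negative"),
]


def run_behavior_tests(model, cases):
    """Return (text, expected, actual) for every case the model gets wrong."""
    return [(text, expected, model(text))
            for text, expected in cases
            if model(text) != expected]


failures = run_behavior_tests(toy_sentiment_model, BEHAVIOR_TESTS)
print("failures:", failures)  # failures: []
```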
People are being more responsible in their AI development, which has a positive impact on how people perceive these things.
Anytime you can put documentation around what you’re doing and show that you’ve run it through standardized tests, that helps with transparency and sets standards.
Has it been tested in such-and-such ways? Has the industry decided they’re happy with that? Does it meet specifications?
WHERE IS AI ON THE HYPE CYCLE?
For some time, there was hype around AI and data science, and the idea that they would solve all of our problems.
If you think AI solves every problem, then you’re overestimating its utility and where we’re at.
However, on the other side, if you’re a business of any size, operating at least some type of technology, and you think AI can solve none of your problems — you’re also mistaken.
We’re somewhere in the middle.
We haven’t got to the plateau of productivity yet. But we’ve gotten over some of those initial unwarranted expectations. We’re getting more into the nitty-gritty of how this can be a new layer in our software stack.
We’re advancing best practices and standards around testing and understanding the behavior of our models — on the way to that plateau.
You should consider AI if you’re not thinking about it at all. Especially if you’re leading technology-type things at your company.
AI doesn’t need to be your primary strategy but think about it.
WHAT DO THESE DEFINITIONS MEAN? NARROW/WEAK AI, STRONG/GENERAL AI AND SUPERINTELLIGENCE?
Narrow methods are task-specific.
Think of spam detection in your inbox. It won’t do a magnificent job doing that in someone else’s inbox because it’s fine-tuned to yours. It’s not transferable. You can’t retrain it.
The next area is where things are more transferable.
Think of natural language processing and computer vision. You can reuse many of these general-purpose models Google and Facebook create for various tasks. These models multi-task and the knowledge is transferable to a variety of tasks.
AGI, or general intelligence, is learning to learn. Models generalize to several tasks on their own with little distinction between outputs from computers or humans.
WHAT DO THE NEXT FIVE YEARS LOOK LIKE FOR THE PROGRESS OF AI?
It’s hard to predict the future. I know I’ll be wrong.
Best practices and more rigorous software engineering practices are happening around AI.
Tooling goes along with those and now there’s an entire world called MLOps, operations for machine learning, geared towards training, tracking and versioning the things we do. That trend will continue.
For the models, two things are happening.
One is that they aren’t so data-specific: they’re not only text, image, video or audio models, but multimodal models that take and fuse inputs from various modalities of data and do interesting things with them.
One example of this would be removing background noise in audio. If you have a video of a person speaking, plus their audio, you can trim out the background noise because you also have the information from their lips.
It makes sense — that’s how we interact with people in a noisy environment. We look at their lips and know what they’re saying. That’s a trend I see continuing.
Last, there are various new types and architectures of models people explore that are natively different from the models we’ve been exploring for some time. Like graph-based models where we don’t process matrices internally, we process graph structured things — attractive in various areas where we have hierarchical data.
IS MULTIMODAL THE SAME AS DATA FUSION?
Fusing feeds of data from various sensors, where there’s a diversity of data, into a single model is like a multimodal model in AI. There, we’d have a video, an audio or a text feed coming together, and they’re encoded into the same model rather than handled by separate ones.
The only difference is that we’re not talking about having a separate model for each type of data and then combining the data in a rule-based way after that.
We’re talking about a single model that takes in both forms of data. That’s what we operate on jointly within the same model.
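The contrast can be sketched with two small functions. Everything here is illustrative: the feature values, weights and thresholds are made up, and a weighted sum stands in for a real multimodal network. In `late_fusion`, two separate models have already produced scores and a hand-written rule combines them; in `joint_model`, both kinds of features go into one parameterized function.

```python
def late_fusion(audio_score, video_score):
    """Two separate models, combined afterwards by a hand-written rule."""
    return audio_score > 0.5 or video_score > 0.5


def joint_model(audio_features, video_features, weights, bias):
    """One model: both modalities feed the same set of parameters."""
    fused = audio_features + video_features  # concatenate the feature lists
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return score > 0


# Made-up inputs and parameters.
speaking_rule = late_fusion(audio_score=0.4, video_score=0.6)
speaking_joint = joint_model(
    audio_features=[0.2, 0.9],
    video_features=[0.7],
    weights=[1.0, -0.5, 2.0],
    bias=-1.0,
)
print(speaking_rule, speaking_joint)  # True True
```

In the joint version the weights for audio and video are fitted together, so the model can learn how the two modalities relate instead of relying on a fixed rule.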
—————————————-
I am constantly amazed by people like Daniel, who’s great at taking complicated concepts and condensing them down into something I can understand. Something people like us can understand.
This is such an underrated skill.
The pushback I hear on this sometimes is we’re dumbing things down. There’s almost a fear that if we do that, then we’ll misrepresent the facts.
I don’t see it as dumbing down or misrepresentation. I see it as having empathy for the people we’re seeking to serve.
They may not need to know everything there is to know about machine learning, but they need someone to filter it for them. They need someone to show up and provide a summary.
Like an executive summary — they get the information they need to decide on what you’re asking them to do.
If you’re in the business of persuading people to take an action, then you’re also in the business of creating tension around that action. Things could be better; this is an opportunity; there might be another way. These are statements that create tension and lead to change.
You’ll never get to try out your ideas and test things that may not work if you can’t clearly communicate what you’re doing.
And not just communicate it clearly so that you or the people in the industry understand it, but so that the people you’re serving understand it too.
I am clearly not an expert in AI. But my job is not to be an expert. My job is to be curious and to ask questions.
What if you’re talking with someone who’s not curious? Who isn’t asking questions? Maybe someone who is already suffering from decision fatigue?
They’ll just say, “No, thanks, this is not for me.” But that “no thanks” might mean you don’t get to try out your idea. It might mean you don’t get to do exciting work or make things better by making better things.
Next time you find yourself in a position where you need to communicate a technical concept to a less technical audience, or an audience outside of our industry, take the time to explain things in such a way that they can understand it.
It’s a generous act and not pandering to the lowest common denominator.