Post by slh1234 on Feb 19, 2023 15:54:03 GMT
Not distracting from Ratty’s post nor trying to minimize anything, but if you have been reading my explanations carefully, you probably have several legitimate questions about the process I’ve begun to describe. This also answers whether a model is just the work of “some woke programmer.”
Let’s take the example of the malaria model/solution. There are actually several models in this total solution, and I didn’t develop all of them. Some are well known. I also am NOT a medical professional, nor am I, in any way, the subject matter expert on malaria. There may be professionals with skills that span both the medical research and data areas, but generally, it is considered more efficient to have separate individuals with separate areas of expertise approach complex problems like this.
First of all, medical researchers develop the questions and problem statements such as “detect malaria in images of red blood cells.” Medical professionals also define what is acceptable and where the greater risk lies. So, for example, the statement may read “detect malaria in red blood cells with greater than 95% accuracy overall, and greater than 99% accuracy in detecting positive cases.” In a case like this, it means we must have less than 1% false negatives, but can tolerate a much greater number of false positives, so long as we’re above a threshold that makes it worthwhile to not just treat everybody. The data scientist(s) do not participate in establishing the criteria other than encouraging these folks to express the question in a form that can be solved with ML/AI.
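To make those acceptance criteria concrete, here is a minimal sketch (with made-up numbers) of the two metrics involved: overall accuracy for the 95% requirement, and sensitivity (the fraction of true positives caught) for the 99% requirement:

```python
# Sketch of the acceptance criteria as metrics. Labels: 1 = parasitized,
# 0 = uninfected. Numbers below are illustrative, not from the project.

def evaluate(y_true, y_pred):
    """Return (overall accuracy, sensitivity) for binary labels."""
    assert len(y_true) == len(y_pred)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    true_pos = sum(p == 1 for _, p in positives)
    return correct / len(y_true), true_pos / len(positives)

def meets_criteria(accuracy, sensitivity):
    # >95% overall accuracy AND >99% sensitivity (i.e. <1% false negatives)
    return accuracy > 0.95 and sensitivity > 0.99

# One false positive out of eight cases: sensitivity is perfect, but
# overall accuracy (0.875) fails the 95% bar.
acc, sens = evaluate([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0])
```

Note that the two thresholds are independent: a model can pass one and fail the other, which is exactly what happened later in the experimentation.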
When I started this project, I had no idea what malaria looked like in blood cell imaging. The medical professionals took me through images and this let me see that there are clearly defined edges and color contrasts. I was looking for this because there are specific numeric representations associated with edges and contrasts in digital images. I was also provided with a data set with over 6,000 examples of red blood cell images that contain malaria, and over 6,000 examples of red blood cells that were not parasitized. This was my labeled data set, so this becomes a supervised learning experiment. At this point, I give my opinion to the medical SMEs that I think the problem is solvable using AI. (Actually, there are many sets of data available online to do this, and I’m not the first person to take on such a problem, but we needed this for a specific application, so I needed to do this for that particular application). So now comes my part of the experimentation, and when I say “my part,” I don’t mean to imply that I was the only person involved in the experimentation.
So we express a null hypothesis: that we cannot predict with greater than 95% accuracy whether or not cells have the malaria parasite, or that we cannot find greater than 99% of all positive cases. The alternative hypothesis is that we can – the opposite of the null hypothesis. We never say we have “proven” anything; rather, we use probabilities to decide whether we have invalidated the null hypothesis or not, and based on this, decisions can be made on whether or not a model can be useful.
I don’t start by just feeding images into candidate models. Instead, I need to recognize that images may be .png or .jpg or other formats, they may be different sizes, and it’s possible an image contains several red blood cells, or maybe no red blood cell at all. I need to start by prepping the data. In this case, I need to choose one specific image type, and for this, I chose .png. So the first thing that needs to be done is to ensure that the file is in this format. OpenCV already has functions that will convert other image types into .png. I start off trying not to reinvent any wheels, and in experimentation we will learn whether this is okay or not, but the first step is converting to .png. Next, I need to ensure the image contains only the area of interest, and once again, OpenCV already had functions that allow me to center images on the area of interest. So there are two pre-built tools already used in processing.
Next, I need to ensure the images are all the same size and scale because I intend to use a convolutional neural network (CNN) to take on the process (I’ll explain that in a bit). I also use OpenCV to standardize the size. Now, I need to convert each .png image into a 3-dimensional array of integers representing the height, width, and color channels of each pixel in the image. This is the actual data that the CNN will see and operate on.
As I get ready to try to train models, I have to separate the data randomly into a training and test set. In my case, I used 80% of the data for training, ensuring I had about the same number of positive and negative cases in my data set, then saved 20% for a test data set – data that will never be seen by the algorithm/model during training.
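A minimal sketch of that stratified 80/20 split, shuffling each class separately so both splits keep roughly equal positive and negative counts (in practice a library helper such as scikit-learn's `train_test_split` with `stratify` does the same job):

```python
import random

def stratified_split(items, labels, train_frac=0.8, seed=42):
    """Split (item, label) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        cls_items = [x for x, y in zip(items, labels) if y == cls]
        rng.shuffle(cls_items)                 # randomize within the class
        cut = int(len(cls_items) * train_frac)
        train += [(x, cls) for x in cls_items[:cut]]
        test += [(x, cls) for x in cls_items[cut:]]
    rng.shuffle(train)  # mix the classes back together
    rng.shuffle(test)
    return train, test

# 50 negatives + 50 positives -> 80 training pairs, 20 held-out test pairs,
# each split half positive and half negative.
train, test = stratified_split(list(range(100)), [0] * 50 + [1] * 50)
```

The held-out 20% is never touched during training; it only comes back at the validation step.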
The nice thing about a neural network in supervised training is that it is able to evaluate the patterns without the data scientist actually doing that mundane work. It simply tries weightings and combinations, compares with similar patterns it has detected, then compares with the label of parasitized or uninfected (which we actually just express as 0 or 1 for such binary classification problems). After a trip through the entire training set, it will assess its accuracy, then make additional trips through the data (epochs), each time adjusting based on whether it is getting better or worse than the previous epoch. There are several hyperparameters that affect training, such as learning rate (the amount to change weightings in each epoch – a discussion in and of itself). There is also a concern that too many epochs can cause overfitting, so there are two metrics we watch and evaluate to try to prevent this. We typically set a high number of epochs, but set early-exit policies in the training process so that when it starts trending the wrong direction for too many epochs (a number we define), we terminate training and present the best model as a candidate to be tested.
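The early-exit policy described above can be sketched as a plain loop (frameworks offer this ready-made, e.g. Keras's `EarlyStopping` callback; here `run_epoch` is a stand-in for one pass over the training data plus a validation check):

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=5):
    """Stop after `patience` epochs without improvement; keep the best epoch.

    run_epoch(epoch) -> validation accuracy for that epoch (hypothetical hook).
    """
    best_acc, best_epoch = -1.0, -1
    bad_epochs = 0
    for epoch in range(max_epochs):
        acc = run_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch   # new best candidate model
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # trending the wrong direction for too long
    return best_epoch, best_acc

# Simulated validation history: improves to 0.7, then degrades for 5 epochs,
# so training stops before the (unreachable) 0.9 and returns epoch 2.
history = [0.5, 0.6, 0.7, 0.65, 0.64, 0.63, 0.62, 0.61, 0.9]
best_epoch, best_acc = train_with_early_stopping(
    lambda e: history[e], max_epochs=len(history), patience=5)
```

The "patience" value is exactly the "too many epochs, which we define" knob mentioned above.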
A convolutional neural network is used for searching for patterns in arrays such as the 3-dimensional arrays that the .png images are converted into. A “convolution” means it takes a certain subset, such as a certain height and width, searches that, then moves a pre-set distance (the stride) to the side and searches the new area. The convolutional areas should overlap to ensure that no part of the image is left unexamined.
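Here is a bare-bones sketch of that sliding-window operation on a single 2-D channel (real CNN layers do this over all channels at once, with many learned kernels, and far more efficiently):

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Slide `kernel` across `image` with the given stride.

    A stride smaller than the kernel size makes neighboring windows overlap,
    so no part of the image is left unexamined.
    """
    kh, kw = kernel.shape
    h, w = image.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)  # one convolved value
    return out

# A 2x2 kernel of ones over a 4x4 image of ones: every window sums to 4.
overlapping = conv2d(np.ones((4, 4)), np.ones((2, 2)), stride=1)  # 3x3 output
non_overlap = conv2d(np.ones((4, 4)), np.ones((2, 2)), stride=2)  # 2x2 output
```

In a trained network the kernel values are learned weights, which is how the edge and contrast patterns the medical professionals pointed out become detectable.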
The process of training is also called “fitting,” and here, we take the candidate model that best fits the training data and move to validation or testing. In this, we call “predict” on our candidate model with the test data, and we examine how well the model performs with data it did not see during training. We gather the statistics on this and see the accuracy rates for positive and negative cases based on the outcomes we already know – this is supervised learning. If (and only if) this indicates we have met the acceptance criteria, we also need to figure the probability of this just being a statistical anomaly. When we are above the acceptance criteria, we can move on, but if not, more experimentation is needed.
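One common way to frame that "statistical anomaly" check (a simplification of what a full analysis would do) is a binomial tail probability: if the model's true accuracy were only at the null-hypothesis level, how likely is a test-set score at least this good by chance?

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of at least k successes
    out of n trials when each trial succeeds with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Sanity checks: at least 0 successes is certain; exactly all 10 successes
# at p = 0.5 has probability 0.5**10.
p_all = binom_tail(10, 10, 0.5)
p_any = binom_tail(10, 0, 0.5)
```

With the test-set size `n`, the observed number of correct predictions `k`, and the null accuracy as `p`, a very small tail probability is what justifies rejecting the null hypothesis rather than calling the result a fluke.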
In this particular case, I started out with a CNN with 2 levels of evaluation. It gave about 80% accuracy, which was encouraging evidence that the problem is solvable, but falls far short of the acceptance criteria. From there, I changed the structure of the CNN to have 4 levels, and then I got nearly 95% accuracy. I still needed to see whether I could improve that, so I changed the design of the CNN to have 6 levels of evaluation, and then I got unmistakably above the acceptance criteria for overall accuracy, but the results showed I needed to improve on the number of false negatives. I first tried going to 8 levels of evaluation, but this didn’t significantly improve model performance, and since it is significantly more expensive in terms of compute, I decided to try another approach.
In the images, I could sometimes see background noise that seemed to me to be impacting overall accuracy, so I needed to try different approaches to minimize its effects. I tried using functions in OpenCV to enhance the images using HSV (Hue, Saturation, and Value), and I had to train a new model to interpret the HSV-enhanced images. I found it to perform about the same as the model trained on raw images. I also used OpenCV to perform Gaussian blurring on the images, which required another model to be trained; once again, it performed about the same as the model on raw images. Finally, I used OpenCV to convert the images to grayscale and tried to train a model on the grayscale images, but performance was very poor, so I determined that grayscale was not useful.
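For the curious, here is a hand-rolled sketch of what the Gaussian blur preprocessing computes (the project used OpenCV's `cv2.GaussianBlur`; this version shows the mechanics on one channel):

```python
import numpy as np

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Build a normalized 2-D Gaussian weighting kernel."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()  # weights sum to 1, so overall brightness is preserved

def blur_channel(channel: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Same-size blur of one 2-D channel: each pixel becomes a Gaussian-weighted
    average of its neighborhood, smoothing out background noise."""
    pad = kernel.shape[0] // 2
    padded = np.pad(channel, pad, mode="edge")
    out = np.empty(channel.shape, dtype=float)
    h, w = channel.shape
    for i in range(h):
        for j in range(w):
            window = padded[i:i + kernel.shape[0], j:j + kernel.shape[1]]
            out[i, j] = np.sum(window * kernel)
    return out

kernel = gaussian_kernel()
flat = blur_channel(np.full((8, 8), 10.0), kernel)  # constant stays constant
```

Because the kernel is normalized, flat regions are untouched while isolated noisy pixels get averaged into their surroundings.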
Looking at the output of the 3 models that were useful, although they gave about the same accuracy, the set of misidentified images was not the same among the three. Seeing this, I tried two different approaches: run the image (converted where necessary) through all 3 models and take a vote, where 2 out of 3 makes the final determination of the prediction; and another approach where a prediction of “parasitized” from any of the 3 resulted in a final prediction of “parasitized,” and only images where all 3 models predicted “uninfected” would result in a final prediction of “uninfected.” This is the rules-based portion of the process that I say is sometimes involved in the final determination of AI processes.
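The two ensemble rules are simple enough to write down directly (1 = parasitized, 0 = uninfected):

```python
def majority_vote(preds):
    """2-out-of-3 vote over the three models' predictions."""
    return 1 if sum(preds) >= 2 else 0

def any_positive(preds):
    """'Parasitized' if ANY model says so; 'uninfected' only if all agree."""
    return 1 if any(preds) else 0

# The rules diverge exactly when the models disagree 1-vs-2:
split_case = [1, 0, 0]
```

The any-positive rule trades false positives for fewer false negatives, which is why it lines up better with the "less than 1% false negatives" criterion.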
The outcome of the experiment is that taking a vote resulted in the fewest errors overall, but it did so by minimizing the number of false positives, and didn’t decrease the number of false negatives that much. The approach of “any one of the models predicting ‘parasitized’ gives a final result of ‘parasitized’” brought the number of false negatives down to meet the acceptance criteria, so the decision was made to follow this approach.
Another step of validation was required, and for this, we took several steps. First, the data scientist’s step was to take the same structure of CNN, but divide the training and testing data differently and run through the training and testing steps again. This gave us increased confidence that our approach was not producing a statistical anomaly, and that we should see consistent results with more general data. Once this was determined, we used additional images provided to test the model/approach and ensure we stayed consistently within the acceptance criteria. Medical professionals (generalized to “Subject Matter Experts”) are involved again in this step to agree that we are, or are not, meeting the criteria.
Once the models are approved, we need to operationalize. For this, a front-end web service is built that can convert the images to the format needed, stringify each image, and submit it to a web endpoint containing the “scoring script,” which does the work of HSV enhancement, Gaussian blurring, calling the models, going through the rules-based step of testing whether any of the models gave a prediction of “parasitized,” and returning the prediction in a human-readable form. This means that medical professionals can concern themselves with medical tasks instead of needing to learn image manipulation, etc. They simply upload a set of images, and each image is returned with the prediction of “uninfected” or “parasitized,” and they can use the combination of image and label in their final diagnosis.
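The heart of that scoring script can be sketched as a single function. Everything here is hypothetical scaffolding (the model and transform arguments are stand-ins, not the project's actual code); it only illustrates how the preprocessing, the three models, and the any-positive rule chain together behind one endpoint:

```python
def score(image, raw_model, hsv_model, blur_model, to_hsv, blur):
    """Run one image through all three models and apply the any-positive rule.

    raw_model/hsv_model/blur_model: callables returning 1 (parasitized) or 0.
    to_hsv/blur: the preprocessing transforms each model was trained on.
    """
    preds = [
        raw_model(image),           # model trained on raw images
        hsv_model(to_hsv(image)),   # model trained on HSV-enhanced images
        blur_model(blur(image)),    # model trained on Gaussian-blurred images
    ]
    # Rules-based step: any single positive vote decides the final label.
    return "parasitized" if any(p == 1 for p in preds) else "uninfected"

# Toy stand-ins: identity transforms, fixed-output "models".
identity = lambda x: x
flagged = score("img", lambda x: 0, lambda x: 0, lambda x: 1, identity, identity)
clean = score("img", lambda x: 0, lambda x: 0, lambda x: 0, identity, identity)
```

Wrapping this behind a web endpoint is what lets the medical staff upload images and get back labels without touching any of the machinery.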
That is a condensed version of a single example of Machine Learning used in AI, and how the resulting model was put to use. I think from that illustration you can see that concerns like “some programmer” really show a lack of understanding of the model development and operationalization process. But also note that this only takes it through initial deployment, and doesn’t take into account the CI/CD going forward from that point. The process does produce models that are useful, and this is just one of many examples in the world around you. Translation models, natural language models, evaluations of diabetes risk, financial projections, market projections, etc. all have many AI models actually in use.