- 1. Introduction
- 2. Definitions
- 3. How neural networks learn to recognize images
- 4. Machine learning workflow
- 5. Recognizing images with a CNN
- 6. Practical application of neural networks for captcha recognition
- 7. Conclusion
One of the most common uses of TensorFlow and Keras is in image recognition and classification. If you would like to learn how to use Keras for image classification or recognition, this article will teach you how to do it.
If you do not understand the basic concepts of image recognition, it will be difficult for you to fully understand the main part of this article. Therefore, before we continue, let's define the terminology.
TensorFlow is an open source library built for Python by the Google Brain team. TensorFlow compiles many different algorithms and models, allowing the user to implement deep neural networks for use in tasks such as image recognition and classification, and natural language processing. TensorFlow is a powerful framework that functions by implementing a series of processing nodes, each representing a mathematical operation, and the entire series of nodes is called a "graph".
Keras is a high-level API (Application Programming Interface) that can use TensorFlow functions (as well as other ML libraries like Theano). Keras was designed with convenience and modularity as guiding principles. In practice, Keras makes many of TensorFlow's powerful yet often complex features simple to use, and it works with Python without any major changes or tweaks.
Image recognition refers to the task of inputting an image into a neural network and assigning some kind of label to that image. The label that the network outputs will match the predefined class. Several classes can be assigned at once, or only one. If there is only one class, the term “recognition” is usually used, while the problem of recognizing multiple classes is often called “classification”.
A subset of image classification is object detection, where specific instances of objects are identified as belonging to a certain class, for example, animals, cars or people.
A striking example of such classification is solving the most common captcha, ReCaptcha v2 from Google, where you must select from a set of pictures only those that belong to the class named in the description.
To perform image recognition or classification, the neural network must perform feature extraction. Features are the elements of the data that are of greatest interest and that will be passed through the neural network. In the specific case of image recognition, such features are groups of pixels, such as edges and points, which the network analyzes for the presence of a pattern.
Feature recognition (or feature extraction) is the process of extracting relevant features from the input image so that they can be analyzed. Many images contain annotations or metadata that help the neural network find relevant features.
Understanding how a neural network recognizes images will help you in implementing a neural network model, so let's take a quick look at the image recognition process in the following sections.
The first layer of the neural network takes in all the pixels in the image. After all the data is fed into the network, various filters are applied to the image, each forming a representation of a different part of the image. This is feature extraction, and it creates "feature maps".
This process of extracting features from an image is done with a "convolutional layer" and the convolution simply forms a representation of a portion of the image. It is from this concept of convolution that we derive the term Convolutional Neural Network (CNN), a type of neural network most commonly used in image classification and recognition.
If you want to visualize exactly how feature mapping works, imagine holding a flashlight over an image in a dark room. As you slide the beam over the picture, you learn about the features of the picture. A filter is what the network uses to form a representation of the image, and in this metaphor, the light from the flashlight is the filter.
The beam width of your flashlight determines the size of the portion of the image that you view at one time, and neural networks have a similar parameter: the filter size. The filter size determines how many pixels are examined at one time. A common filter size used in CNNs is 3, which covers both height and width, so the filter examines a 3 x 3 pixel area.
While the size of the filter covers the height and width of the filter, the depth of the filter must also be specified.
But how can a 2D image have depth?
The fact is that digital images are stored as height, width and an RGB value that determines the color of each pixel, so the "depth" being tracked is the number of color channels the image has. Grayscale (non-color) images have only 1 color channel, while color images have a depth of 3 channels.
This all means that for a filter of size 3 applied to a full color image, the final dimensions of this filter will be 3 x 3 x 3. For each pixel covered by this filter, the network multiplies the filter values by the values of the pixels themselves, and the products are summed into a single numerical value for that position. ... This process is then performed over the entire image to get a complete representation. The filter moves across the rest of the image according to a parameter called the "stride", which determines how many pixels the filter moves after it calculates a value at its current position. Strides of 1 and 2 are both common in CNNs.
The end result of all these calculations is a feature map. This process is usually done with several filters to help preserve the complexity of the image.
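To make the sliding-filter idea concrete, here is a minimal sketch in plain Python rather than Keras. The function name `convolve2d`, the toy 5 x 5 image and the vertical-edge filter are all illustrative assumptions, and the sketch handles a single channel with no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide a square kernel over a square image (no padding)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            top, left = row * stride, col * stride
            # Multiply each filter weight by the pixel beneath it, then sum.
            total = sum(
                kernel[i][j] * image[top + i][left + j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


# Toy single-channel image with a bright vertical stripe in the middle.
image = [
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 9, 0, 0],
]
# 3 x 3 filter that responds to left-to-right changes in brightness.
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

feature_map_stride1 = convolve2d(image, kernel, stride=1)  # 3 x 3 output
feature_map_stride2 = convolve2d(image, kernel, stride=2)  # 2 x 2 output
```

With a stride of 1 the filter visits every position and produces a 3 x 3 feature map; with a stride of 2 it skips every other position and produces a 2 x 2 map.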
After the feature map of an image has been created, the values representing the image are passed through an activation function or activation layer. The convolutional layer produces these values through purely linear operations (multiplications and additions), so the activation function introduces non-linearity, which the network needs because images themselves are non-linear.
The typical activation function used to achieve this is the rectified linear unit (ReLU), although other activation functions are sometimes used as well.
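As a small illustration, ReLU simply replaces every negative value with zero and leaves positive values unchanged. The helper name `relu` and the sample values below are assumptions made for this sketch.

```python
def relu(values):
    """Rectified linear unit: negative inputs become 0, the rest pass through."""
    return [max(0.0, v) for v in values]


activated = relu([-27.0, 0.0, 27.0, -1.5, 3.2])
```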
After activation, the data is sent through a pooling layer. Pooling "simplifies" an image: it takes the information that represents the image and compresses it. The pooling process makes the network more flexible and better able to recognize objects and images based on their relevant features.
When we look at an image, we, as a rule, are not concerned with all the information (for example, what is in the background of the image), but only with the signs that we are interested in - people, animals, etc.
Likewise, the CNN's pooling layer discards unnecessary parts of the image, leaving only the parts that it considers relevant, depending on the size of the pooling layer.
Since the network must make decisions about the most important parts of the image, the expectation is that it will learn only those parts of the image that really represent the essence of the object in question. This helps prevent “overfitting”, when the network learns every aspect of the training examples too well and can no longer generalize to new data because it takes irrelevant differences into account.
There are various ways to pool values, but the most commonly used is max pooling. Max pooling takes the maximum value among the pixels within one filter (within one image fragment). This discards 3/4 of the information, assuming a 2 x 2 filter is used.
The maximum pixel values are used to account for possible image distortion, and the number of parameters (the image size) is reduced to control overfitting. There are other pooling methods, such as average or sum pooling, but these are used less often because max pooling tends to give better accuracy.
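The 3/4 figure can be checked with a short sketch: a 2 x 2 max pool over a 4 x 4 feature map keeps 4 of the 16 values. The function name `max_pool2d` and the sample feature map are hypothetical, and the sketch assumes non-overlapping windows.

```python
def max_pool2d(feature_map, pool=2):
    """Keep the largest value in each non-overlapping pool x pool window."""
    rows = len(feature_map) // pool
    cols = len(feature_map[0]) // pool
    return [
        [
            max(
                feature_map[r * pool + i][c * pool + j]
                for i in range(pool)
                for j in range(pool)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 7, 8],
]
pooled = max_pool2d(feature_map)  # 16 values in, 4 values out
```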
The final layers of our CNN, the densely connected layers, require the data to be in vector form for further processing. For this reason, the data needs to be "flattened". To do this, the values are compressed into a long vector, or a column of sequentially ordered numbers.
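Flattening can be pictured as unrolling the nested feature maps into one list. The recursive helper `flatten` and the two small 2 x 2 maps below are assumptions for illustration only.

```python
def flatten(nested):
    """Unroll arbitrarily nested lists into a single flat list, in order."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat


# Two hypothetical 2 x 2 pooled feature maps.
pooled_maps = [[[6, 2], [2, 9]], [[1, 0], [3, 4]]]
vector = flatten(pooled_maps)
```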
The final layers of the CNN are densely connected layers, that is, an artificial neural network (ANN). The main function of the ANN is to analyze the input features and combine them into various attributes that will help in classification. These layers form sets of neurons that represent different parts of the object in question; a set of neurons might represent, for example, the drooping ears of a dog or the redness of an apple. When a sufficient number of these neurons are activated in response to the input image, the image is classified as that object.
The ANN calculates the error, or the difference between the computed values and the expected values in the training set. The network then undergoes backpropagation, where the influence of a given neuron on a neuron in the next layer is calculated and its influence (weight) is adjusted. This is done to optimize the performance of the model. The process is repeated over and over: this is how the network trains on the data and learns the associations between the input features and the output classes.
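The repeat-and-adjust loop can be sketched with the smallest possible "network": one weight, one input and a squared error. This is plain gradient descent under assumed values (the function name, learning rate and step count are all hypothetical), not the full backpropagation that Keras performs across many layers.

```python
def train_single_weight(x, target, weight=0.0, learning_rate=0.1, steps=20):
    """Repeatedly nudge one weight to reduce the squared prediction error."""
    losses = []
    for _ in range(steps):
        prediction = weight * x
        error = prediction - target
        losses.append(error ** 2)
        gradient = 2 * error * x  # derivative of error**2 w.r.t. weight
        weight -= learning_rate * gradient
    return weight, losses


# The ideal weight here is 2.0, because 2.0 * 1.5 == 3.0.
final_weight, losses = train_single_weight(x=1.5, target=3.0)
```

Each pass shrinks the loss a little, which is the same pattern visible in the training log of a real network.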
The neurons in the middle fully connected layers will output binary values related to the possible classes. If you have four different classes (say, dog, car, house, and person), the neuron will have a value of "1" for the class it thinks represents the image, and a value of "0" for the other classes.
The final fully connected layer, having received the output of the previous layer, assigns a probability to each of the classes, and these probabilities sum to one. If the category "dog" is assigned a value of 0.75, this means a 75% chance that the image is a dog.
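One common way to turn raw output scores into probabilities that sum to one is the softmax function. The sketch below uses made-up scores for the four example classes; the helper name `softmax` and the score values are assumptions.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["dog", "car", "house", "person"]
probabilities = softmax([2.0, 0.5, 0.1, -1.0])  # hypothetical raw scores
predicted = classes[probabilities.index(max(probabilities))]
```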
Once the image classifier has been trained, images can be passed to the CNN, which will output a guess about the content of each image.
Before we move on to an example of training an image classifier, let's take a moment to understand the machine learning workflow, or pipeline. The process of training a neural network model is fairly standard and can be divided into four different stages.
First, you will need to collect your data and format it so that the neural network can learn from it. This includes collecting images and labeling them. Even if you downloaded a dataset prepared by someone else, you will probably still need some preprocessing or preparation before you can use it for training. Preparing data is an art in and of itself, involving problems such as missing values, corrupted data, data in the wrong format, incorrect labels, etc.
In this article, we will be using a preprocessed dataset.
Creating a neural network model involves choosing various parameters and hyperparameters. You must decide on the number of layers used in your model, what the size of the input and output layers will be, what activation functions you will use, whether you will use Dropout, etc.
Understanding which parameters and hyperparameters are worth using will come over time (there is a lot to learn), but there are some basic methods you can use at the start, and we will look at some of them in our example.
After your model is created, you simply need to create an instance of the model and fit it to your training data. When training a model, the main consideration is the amount of time required for training. You can set the length of training by specifying the number of training epochs. The longer you train the model, the better its performance tends to become, but if you use too many training epochs, you risk overfitting the model.
Choosing the number of training epochs is something that you will learn to judge over time, and as a rule, you should always save the weights of the neural network between training sessions so that you do not have to start over after making some progress in training.
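One simple heuristic for choosing the number of epochs is to watch the validation loss and stop once it has not improved for a few epochs. The sketch below is only an illustration of that idea: the function name `best_epoch`, the `patience` parameter and the loss values are all invented for this example and are not results from the model in this article.

```python
def best_epoch(val_losses, patience=2):
    """Return the 1-based epoch with the lowest validation loss seen
    before `patience` epochs pass without any improvement."""
    best_index, best_loss = 0, val_losses[0]
    for index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_index, best_loss = index, loss
        elif index - best_index >= patience:
            break  # validation loss stopped improving
    return best_index + 1


# Made-up validation losses: improvement stalls after the fourth epoch.
val_losses = [1.07, 0.85, 0.74, 0.70, 0.72, 0.75, 0.69]
stop_at = best_epoch(val_losses)
```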
There are several steps to evaluating a model. The first step is to compare the performance of the model against a validation dataset: data on which the model has not been trained. You thus test the model on this new dataset and analyze its performance using various metrics.
There are various metrics for determining the performance of a neural network model, but the most common is "accuracy", which is the number of correctly classified images divided by the total number of images in your dataset.
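The accuracy calculation itself is a one-liner. The helper name and the four sample labels below are assumptions used only to show the arithmetic.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


# Three of the four hypothetical predictions are correct.
acc = accuracy(["cat", "dog", "car", "cat"], ["cat", "dog", "cat", "cat"])
```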
After you see the accuracy of the model in the validation dataset, you will probably go back and re-train the network using slightly tweaked parameters, as you are unlikely to be satisfied with the performance of your network the first time you train. You will continue to tune the parameters of your network, retrain it, and measure performance until you are satisfied with the accuracy of the network.
Finally, you will test the network performance on a test set. This is another dataset that your model has never seen before.
You might be wondering: why do we need another test dataset? After all, you already got an idea of the accuracy of your model; wasn't that the purpose of the validation set?
The thing is that all the parameter changes you made while tuning the network against the validation dataset, combined with repeated re-testing on that dataset, could cause your network to learn some idiosyncrasies of that set without generalizing as well to out-of-sample data. This is why you should give the network completely new test data.
The purpose of the test set is to check for issues such as overfitting, so that you can be more confident that your model is truly usable in the real world.
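A three-way split into training, validation and test data might be sketched as follows. The function name, the 10% fractions and the seed are assumptions; real projects often use a library helper instead.

```python
import random


def split_dataset(samples, val_fraction=0.1, test_fraction=0.1, seed=21):
    """Shuffle the samples, then carve off test and validation subsets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = split_dataset(range(1000))
```

The important property is that the three subsets never overlap, so the test set stays unseen until the very end.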
We've covered a lot already, and if all this information was perhaps a little unclear, then combining the above concepts in an example classifier trained on a dataset should finally clear everything up. So let's look at a complete example of image recognition using Keras, from loading the data to evaluating the model.
First, we need a training dataset. For this example, we will use the well-known CIFAR-10 dataset. CIFAR-10 is a large dataset containing 60,000 images representing 10 different classes of objects such as cats, airplanes and cars.
The images are full color RGB, but they are quite small, only 32 x 32. The great thing about the CIFAR-10 dataset is that it comes bundled with Keras, so loading the dataset is easy and the images themselves only need minimal preprocessing.
The first thing we need to do is import the required libraries. You will see exactly how this import works along the way, but for now, just keep in mind that we will be using Numpy and various Keras-related modules:
    import numpy
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Flatten, BatchNormalization, Activation
    from keras.layers import Conv2D, MaxPooling2D
    # Legacy names from older Keras releases; newer releases may expose these
    # as keras.constraints.MaxNorm and keras.utils.to_categorical instead.
    from keras.constraints import maxnorm
    from keras.utils import np_utils
We are going to use a fixed random seed so that you can reproduce the results obtained in this article, which is why we need numpy:
    # Set random seed for purposes of reproducibility
    seed = 21
Now we need to do one more import: the dataset itself.
from keras.datasets import cifar10
Now let's load the dataset. We can do this by simply specifying which variables we want to load the data into, and then use the load_data() function:
    # loading in the data
    (X_train, y_train), (X_test, y_test) = cifar10.load_data()
In most cases, you will need to do some preprocessing of your data to get it ready for use, but since we are using a pre-packaged dataset, this processing is minimized. One of the actions we want to do is normalize the input data.
If the range of the input values is too wide, it can adversely affect network performance. In our case, the input values are the pixels in the image, which have a value between 0 and 255.
So, to normalize the data, we can simply divide the image values by 255. To do this, we first need to convert the data to floating point, since the values are currently integers. We can do this using NumPy's astype() method, passing the data type we want:
    # normalize the inputs from 0-255 to between 0 and 1 by dividing by 255
    X_train = X_train.astype('float32')
    X_test = X_test.astype('float32')
    X_train = X_train / 255.0
    X_test = X_test / 255.0
The next thing we need to do to prepare the data for the network is to one-hot encode the labels. We will not go into the details of one-hot encoding, but be aware that the class labels cannot be used by a neural network in their raw form: they must first be encoded, and one-hot encoding turns each label into a vector of binary values.
A binary value per class works here because an image either belongs to one particular class or it does not: it cannot be somewhere in between. For one-hot encoding, use the Keras to_categorical() function. This is why we imported np_utils from Keras, as it contains to_categorical().
We also need to record the number of classes in the dataset so that we know how many neurons the final layer should have:
    # one hot encode outputs
    y_train = np_utils.to_categorical(y_train)
    y_test = np_utils.to_categorical(y_test)
    # the second dimension of the encoded labels is the number of classes
    class_num = y_test.shape[1]
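For readers who have not met one-hot encoding before, here is a rough plain-Python picture of the idea for integer labels. The helper `one_hot` is a hypothetical stand-in written for this sketch, not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into vectors with a single 1.0 entry."""
    encoded = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        encoded.append(row)
    return encoded


# Three hypothetical labels from a 10-class problem.
encoded = one_hot([3, 0, 9], num_classes=10)
```

Each encoded row has as many entries as there are classes, which is why the width of the encoded labels tells us the class count.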
We have reached the design stage of the CNN model. The first thing to do is define the format that we would like to use for the model. Keras has several different formats (blueprints) for building models, but the most commonly used is Sequential, which is why we imported it from Keras.
model = Sequential()
The first layer in our model is the convolutional layer. It will take input and pass it through convolutional filters.
When implementing this in Keras, we must specify the number of channels (filters) that we need (which is 32), the filter size (3 x 3 in our case), the input shape (when creating the first layer), the activation function and the padding.
As already mentioned, ReLU is the most common activation function, and we define padding using padding = 'same', that is, we do not change the size of the image:
    model.add(Conv2D(32, (3, 3), input_shape=X_train.shape[1:], padding='same'))
    model.add(Activation('relu'))
Note: You can also pass the activation directly to the layer and combine the commands into one line, for example:
    model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3), activation='relu', padding='same'))
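To see why padding='same' preserves the 32 x 32 size, it helps to compute the output size along one dimension. The helper below is a sketch of the usual size arithmetic for 'same' and 'valid' padding; the function name is an assumption.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="same"):
    """Output length along one spatial dimension of a conv or pooling layer."""
    if padding == "same":
        # Input is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")


same_size = conv_output_size(32, 3, stride=1, padding="same")     # stays 32
valid_size = conv_output_size(32, 3, stride=1, padding="valid")   # shrinks
pooled_size = conv_output_size(32, 2, stride=2, padding="valid")  # 2 x 2 pool
```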
We will now add a dropout layer to prevent overfitting; it randomly removes connections between layers (0.2 means it discards 20% of existing connections):
    model.add(Dropout(0.2))
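The effect of a dropout rate of 0.2 can be simulated in a few lines. This sketch uses the common "inverted dropout" formulation, in which surviving values are scaled up by 1 / (1 - rate) so the expected sum is unchanged; the function name and the all-ones input are assumptions.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors."""
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(21)
activations = [1.0] * 10000
dropped = dropout(activations, rate=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)  # close to 0.2
```

Dropout is only active during training; at prediction time all values pass through unchanged.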
We can also do batch normalization. Batch normalization normalizes the inputs to the next layer, ensuring that the network always produces activations with the distribution we want:
    model.add(BatchNormalization())
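The core of batch normalization is shifting a batch of values to zero mean and scaling it to unit variance. The sketch below shows only that step: it leaves out the learned scale and shift parameters and the moving averages that a real batch normalization layer also maintains, and the epsilon value is an assumption.

```python
def batch_normalize(batch, epsilon=1e-5):
    """Normalize a batch of values to roughly zero mean and unit variance."""
    mean = sum(batch) / len(batch)
    variance = sum((x - mean) ** 2 for x in batch) / len(batch)
    # epsilon avoids division by zero when the variance is tiny.
    return [(x - mean) / (variance + epsilon) ** 0.5 for x in batch]


normalized = batch_normalize([2.0, 4.0, 6.0, 8.0])
new_mean = sum(normalized) / len(normalized)
new_variance = sum((x - new_mean) ** 2 for x in normalized) / len(normalized)
```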
Now another convolutional layer follows, but the number of filters increases so that the network can learn more complex representations:
    model.add(Conv2D(64, (3, 3), padding='same'))
    model.add(Activation('relu'))
And here is the pooling layer, which, as discussed earlier, helps make the image classifier more robust so that it can learn the relevant patterns. We also add Dropout and Batch Normalization:
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
This is the core of the workflow in the first part of the CNN implementation: convolution, activation, dropout, pooling. Now you understand why we imported Dropout, BatchNormalization, Activation, Conv2D and MaxPooling2D.
You can vary the number of convolutional layers to your liking, but each adds computational cost. Note that when adding convolutional layers, you usually increase the number of filters as well so that the model can learn more complex representations. If the numbers chosen for these layers seem a little arbitrary, just know that it is generally recommended to increase the number of filters gradually and to use powers of 2 (2^n), which can give a slight advantage when training the model on a GPU.
It is important not to have too many pooling layers, as each one discards some of the data. Pooling too often will leave the densely connected layers with almost nothing to learn from when the data reaches them.
The number of pooling layers required depends on the task at hand; this is something you will learn to judge over time. Since the images in our set are already quite small, we won't pool more than twice.
You can now repeat these layers to give your network more representations to work with:
    model.add(Conv2D(64, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    model.add(Conv2D(128, (3, 3), padding='same'))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
After we're done with the convolutional layers, we need to flatten the data, which is why we imported the Flatten function above. We'll also add a dropout layer again:
    model.add(Flatten())
    model.add(Dropout(0.2))
We now use the imported Dense function to create the first densely connected layer. We need to specify the number of neurons in the dense layer. Note that the number of neurons in subsequent layers decreases, eventually approaching the same number of neurons as there are classes in the dataset (10 in this case). Constraining the kernel can regularize the model during training, which also helps prevent overfitting. This is why we imported maxnorm earlier.
    model.add(Dense(256, kernel_constraint=maxnorm(3)))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    model.add(Dense(128, kernel_constraint=maxnorm(3)))
    model.add(Activation('relu'))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
In this last layer, we set the number of neurons equal to the number of classes. Each neuron represents a class, so the output of this layer will be a vector of 10 values, each of which stores the probability that the image in question belongs to its class.
Finally, the softmax activation function converts the outputs into probabilities, and the neuron with the highest probability is taken as the prediction, assuming that the image belongs to that particular class:
    model.add(Dense(10))  # one neuron per class
    model.add(Activation('softmax'))
Now that we have developed the model we want to use, all that remains is to compile it. Let's specify the number of epochs to train, as well as the optimizer we want to use.
The optimizer is what will tune the weights in your network to get closer to the point with the least loss. Adam is one of the most commonly used optimizers because it performs well on most tasks:
    epochs = 25
    optimizer = 'adam'
Now let's compile the model with the selected parameters. Let's also provide a metric for the assessment.
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
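The categorical cross-entropy loss named in the compile call penalizes the network according to how little probability it assigned to the correct class. Below is a minimal sketch of that calculation for a single image; the helper name and the probability values are made up for illustration.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Negative log of the probability assigned to the true class."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0
    )


target = [1, 0, 0, 0]  # the true class is the first one
loss_confident = categorical_crossentropy(target, [0.75, 0.10, 0.10, 0.05])
loss_unsure = categorical_crossentropy(target, [0.25, 0.25, 0.25, 0.25])
```

A confident, correct prediction gives a loss near zero, while an undecided prediction is penalized more heavily.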
We can also print a summary of the model to get an idea of the model as a whole:
    model.summary()
The printout of the summary gives us some information:
Results:
    Layer (type)                   Output Shape          Param #
    =================================================================
    conv2d_1 (Conv2D)              (None, 32, 32, 32)    896
    activation_1 (Activation)      (None, 32, 32, 32)    0
    dropout_1 (Dropout)            (None, 32, 32, 32)    0
    batch_normalization_1 (Batch   (None, 32, 32, 32)    128
    conv2d_2 (Conv2D)              (None, 32, 32, 64)    18496
    activation_2 (Activation)      (None, 32, 32, 64)    0
    max_pooling2d_1 (MaxPooling2   (None, 16, 16, 64)    0
    dropout_2 (Dropout)            (None, 16, 16, 64)    0
    batch_normalization_2 (Batch   (None, 16, 16, 64)    256
    conv2d_3 (Conv2D)              (None, 16, 16, 64)    36928
    activation_3 (Activation)      (None, 16, 16, 64)    0
    max_pooling2d_2 (MaxPooling2   (None, 8, 8, 64)      0
    dropout_3 (Dropout)            (None, 8, 8, 64)      0
    batch_normalization_3 (Batch   (None, 8, 8, 64)      256
    conv2d_4 (Conv2D)              (None, 8, 8, 128)     73856
    activation_4 (Activation)      (None, 8, 8, 128)     0
    dropout_4 (Dropout)            (None, 8, 8, 128)     0
    batch_normalization_4 (Batch   (None, 8, 8, 128)     512
    flatten_1 (Flatten)            (None, 8192)          0
    dropout_5 (Dropout)            (None, 8192)          0
    dense_1 (Dense)                (None, 256)           2097408
    activation_5 (Activation)      (None, 256)           0
    dropout_6 (Dropout)            (None, 256)           0
    batch_normalization_5 (Batch   (None, 256)           1024
    dense_2 (Dense)                (None, 128)           32896
    activation_6 (Activation)      (None, 128)           0
    dropout_7 (Dropout)            (None, 128)           0
    batch_normalization_6 (Batch   (None, 128)           512
    dense_3 (Dense)                (None, 10)            1290
    activation_7 (Activation)      (None, 10)            0
    =================================================================
    Total params: 2,264,458
    Trainable params: 2,263,114
    Non-trainable params: 1,344
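The parameter counts in the summary can be reproduced by hand, which is a useful check that the layer shapes are what we expect. The helper names below are assumptions; the arithmetic assumes one bias per filter or neuron and four values per channel in each batch normalization layer, two of which (the moving mean and variance) are not trained.

```python
def conv2d_params(kernel_size, in_channels, filters):
    """Weights of a square kernel across all input channels, plus biases."""
    return kernel_size * kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per input-unit pair, plus one bias per unit."""
    return inputs * units + units


def batchnorm_params(channels):
    """Scale, shift, moving mean and moving variance for each channel."""
    return 4 * channels


layer_params = [
    conv2d_params(3, 3, 32), batchnorm_params(32),
    conv2d_params(3, 32, 64), batchnorm_params(64),
    conv2d_params(3, 64, 64), batchnorm_params(64),
    conv2d_params(3, 64, 128), batchnorm_params(128),
    dense_params(8 * 8 * 128, 256), batchnorm_params(256),
    dense_params(256, 128), batchnorm_params(128),
    dense_params(128, 10),
]
total_params = sum(layer_params)
# Moving mean and variance: two untrained values per normalized channel.
non_trainable = 2 * (32 + 64 + 64 + 128 + 256 + 128)
```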
Now we start training the model. To do this, we need to call the fit() function on the model and pass in the chosen parameters.
This is where the seed, chosen for reproducibility purposes, is used.
    numpy.random.seed(seed)
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=64)
The model trains on 50,000 samples and validates on 10,000 samples.
Running this piece of code will give:
    Epoch 1/25
       64/50000 [..............................] - ETA: 16:57 - loss: 3.1479 - acc: 0.0938
      128/50000 [..............................] - ETA: 10:12 - loss: 3.0212 - acc: 0.0938
      192/50000 [..............................] - ETA: 7:57 - loss: 2.9781 - acc: 0.1250
      256/50000 [..............................] - ETA: 6:48 - loss: 2.8830 - acc: 0.1484
      320/50000 [..............................] - ETA: 6:07 - loss: 2.8878 - acc: 0.1469
      384/50000 [..............................] - ETA: 5:40 - loss: 2.8732 - acc: 0.1458
      448/50000 [..............................] - ETA: 5:20 - loss: 2.8842 - acc: 0.1406
    ... ... ...
    49664/50000 [============================>.] - ETA: 1s - loss: 1.5160 - acc: 0.4611
    49728/50000 [============================>.] - ETA: 1s - loss: 1.5157 - acc: 0.4612
    49792/50000 [============================>.] - ETA: 1s - loss: 1.5153 - acc: 0.4614
    49856/50000 [============================>.] - ETA: 0s - loss: 1.5147 - acc: 0.4615
    49920/50000 [============================>.] - ETA: 0s - loss: 1.5144 - acc: 0.4617
    49984/50000 [============================>.] - ETA: 0s - loss: 1.5141 - acc: 0.4617
    50000/50000 [==============================] - 262s 5ms/step - loss: 1.5140 - acc: 0.4618 - val_loss: 1.0715 - val_acc: 0.6195
    End of Epoch 1
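The sample counter in the log advances in steps of 64 because of the batch size. A short calculation shows how many weight updates one epoch contains and why the last step jumps from 49984 to 50000; the variable names are chosen for this sketch.

```python
import math

samples = 50000   # size of the training set
batch_size = 64   # value passed to fit()

batches_per_epoch = math.ceil(samples / batch_size)
# The final batch holds whatever is left after the full batches.
last_batch_size = samples - (batches_per_epoch - 1) * batch_size
```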
Note that in most cases you would want a validation set that is different from the test set, so you would specify a percentage of the training data to use as the validation set. In this case, we simply pass in the test data to make sure it is set aside and not used for training. In this example, we only have test data, to keep things simpler.
We can now evaluate the model and see how it performs. Just call model.evaluate():
    # Model evaluation
    scores = model.evaluate(X_test, y_test, verbose=0)
    # scores holds [loss, accuracy]; report the accuracy
    print("Accuracy: %.2f%%" % (scores[1]*100))
And here we get the result:
And that's it! We now have a trained image recognition CNN. Not bad for a first run, but you will probably want to experiment with the model structure and parameters to try to get better performance.
Theoretical and experimental work on CNNs naturally leads to the question of using neural networks to solve practical everyday problems. One of the most relevant tasks in the field of image recognition and classification is solving captchas, in particular the most popular one today, Google ReCaptcha v2.
Despite the similarity with our example, in practice, implementing a working neural network for solving captchas appears to be an extremely costly and ineffective solution because of the constantly changing data (the pictures in the captcha). Such frequent and unpredictable updates of the incoming data entail a whole string of problems:
- the need to regularly collect and process new data
- the need for constant monitoring of the process by a person and making changes to the model along the way (including experiments with parameters)
- the need for powerful equipment to train the model 24/7
A universal solution to the problem of bypassing various captchas online
To solve captchas continuously, at high speed and at a relatively low cost, online captcha recognition services that employ real users are in great demand. In the domestic market, the leader is the RuCaptcha.com service, which compares favorably with its competitors:
- high accuracy (up to 99%) and solving speed (12 seconds for regular text captchas and 24 seconds for ReCaptcha)
- acceptable fixed prices (the price does not increase as the load on the service's servers grows): 35 rubles per 1000 solutions of ordinary captchas and 160 rubles per 1000 solutions of ReCaptcha
- refunds for the rare unsuccessful recognitions
- the technical capacity to solve huge volumes of captchas (more than 10,000 per minute)
- a simple and functional API
- ready-made libraries and code samples for various programming languages
- an attractive affiliate program that allows developers and referrers to receive up to 15% of the spending of attracted customers and 10% of the income of workers brought to the service
Any questions that arise regarding the operation of the service are promptly resolved by the support service through the ticket system.
Well, now that you've implemented your first image recognition network in Keras, it would be a good idea to play with the model and see how changing its parameters affects performance.
This will give you some intuitive understanding of the optimal values for various model parameters. You should also explore the various options and hyperparameters as you work on your model. Once you are thoroughly familiar, you can try to implement your own image classifier on a different dataset.
As for routine practical tasks such as captcha recognition, creating and training a neural network is hardly a flexible and effective solution.