STAT 436 (Spring 2023): Introduction to Feature Learning

Kris Sankaran

library(keras)

In classical machine learning, we assume that the features most relevant to prediction are already available. E.g., when we want to predict home price, we already have features about square feet and neighborhood income, which are clearly relevant to the prediction task.
In many modern problems though, we have only access to data where the most relevant features have not been directly encoded.
- In image classification, we only have raw pixel values. We may want to predict whether a pedestrian is in an image taken from a self-driving car, but we have only the pixels to work with. It would be useful to have an algorithm come up with labeled boxes like those in the examples below.
- For sentiment analysis, we want to identify whether a piece of text is a positive or negative sentiment. However, we only have access to the raw sequence of words, without any other context. Examples from the IMDB dataset are shown below.
In both of these examples, this information could be encoded manually, but it would a substantial of effort, and the manual approach could not be used in applications that are generating data constantly. In a way, the goal of these algorithms is to distill the raw data down into a succinct set of descriptors that can be used for more classical machine learning or decision making.

Figure 1: An example of the types of labels that would be useful to have, starting from just the raw image.

Example reviews from the IMDB dataset:

  positive,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari ...."
  positive,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be ..."
  negative,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to ..."
  positive,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a ..."

In these problems, the relevant features only arise as complex interactions between the raw data elements.
- To recognize a pedestrian, we need many adjacent pixels to have a particular configuration of values, leading to combinations of edges and shapes, which when viewed together, become recognizable as a person.
- To recognize a positive or negative sentiment, we need to recognize interactions between words. “The movie was good until” clearly has bad sentiment, but you cannot tell that from the isolated word counts alone.
The main idea of deep learning is to learn these more complex features one layer at a time. For image data, the first layer recognizes interactions between individual pixels. Specifically, individual features are designed to “activate” when particular pixel interactions are present. The second layer learns to recognize interactions between features in the first layer, and so on, until the learned features correspond to more “high-level” concepts, like sidewalk or pedestrian.
Below is a toy example of how an image is processed into feature activations along a sequence of layers. Each pixel within the feature maps correspond to a patch of pixels in the original image – those later in the network have a larger field of view than those early on. A pixel in a feature map has a large value if any of the image features that it is sensitive to are present within its field of vision.

A toy diagram of feature maps from the model loaded below. Early layers have fewer, but larger feature maps, while later layers have many, but small ones. The later layers typically contain higher-level concepts used in the final predictions.

Figure 2: A toy diagram of feature maps from the model loaded below. Early layers have fewer, but larger feature maps, while later layers have many, but small ones. The later layers typically contain higher-level concepts used in the final predictions.

At the end of the feature extraction process, all the features are passed into a final linear or logistic regression module that completes the regression or classification task, respectively.
It is common to refer to each feature map as a neuron. Different neurons activate when different patterns are present in the original, underlying image.

An illustration of the different spatial contexts of feature maps at different layers. An element of a feature map has a large value (orange in the picture) if the feature that it is sensitive to is present in its spatial context. Higher-level feature maps are smaller, but each pixel within it has a larger spatial context.

Figure 3: An illustration of the different spatial contexts of feature maps at different layers. An element of a feature map has a large value (orange in the picture) if the feature that it is sensitive to is present in its spatial context. Higher-level feature maps are smaller, but each pixel within it has a larger spatial context.

Below, we load a model to illustrate the concept of multilayer networks. This model has 11 layers followed by a final logistic regression layer. There are many types of layers. Each type of layer contributes in a different way to the feature learning goal, and learning how design and compose these different types of layers is one of the central concerns of deep learning. The Output Shape column describes the number and shape of feature maps associated with each layer. For example, the first layer has 32 feature maps, each of size \(148 \times 148\). Deeper parts of the network have more layers, but each is smaller. We will see how to load and inspect these features in the next lecture.

f <- tempfile()
download.file("https://uwmadison.box.com/shared/static/9wu6amgizhgnnefwrnyqzkf8glb6ktny.h5", f)
model <- load_model_hdf5(f)
model

Model: "sequential_1"
______________________________________________________________________
Layer (type)                   Output Shape                Param #    
======================================================================
conv2d_7 (Conv2D)              (None, 148, 148, 32)        896        
______________________________________________________________________
max_pooling2d_7 (MaxPooling2D) (None, 74, 74, 32)          0          
______________________________________________________________________
conv2d_6 (Conv2D)              (None, 72, 72, 64)          18496      
______________________________________________________________________
max_pooling2d_6 (MaxPooling2D) (None, 36, 36, 64)          0          
______________________________________________________________________
conv2d_5 (Conv2D)              (None, 34, 34, 128)         73856      
______________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 17, 17, 128)         0          
______________________________________________________________________
conv2d_4 (Conv2D)              (None, 15, 15, 128)         147584     
______________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 7, 7, 128)           0          
______________________________________________________________________
flatten_1 (Flatten)            (None, 6272)                0          
______________________________________________________________________
dropout (Dropout)              (None, 6272)                0          
______________________________________________________________________
dense_3 (Dense)                (None, 512)                 3211776    
______________________________________________________________________
dense_2 (Dense)                (None, 1)                   513        
======================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
______________________________________________________________________

While we will only consider image data in this course, the idea of learning complex features by composing a few types of layers is a general one. For example, in sentiment analysis, the first layer learns features that activate when specific combinations of words are present in close proximity to one another. The next layer learns interactions between phrases, and later layers are responsive to more sophisticated grammar.
Deep learning is often called a black box because these intermediate features are often complex and not directly interpretable according to human concepts. The problem is further complicated by the fact that features are “distributed” in the sense that a single human concept can be encoded by a configuration of multiple features. Conversely, the same model feature can encode several human concepts.
For this reason, a literature has grown around the question of interpreting neural networks. The field relies on visualization and interaction to attempt to understand the learned representations, with the goal of increasing the safety and scientific usability of deep learning models. While our class will not discuss how to design or develop deep learning models, we will get a taste of the interpretability literature in the next few lectures.

Introduction to Feature Learning

Citation