An introduction to compositional feature learning.
In classical machine learning, we assume that the features most relevant to prediction are already available. For example, when predicting home prices, we already have features like square footage and neighborhood income, which are clearly relevant to the prediction task.
In many modern problems, though, we only have access to data in which the most relevant features have not been directly encoded.
For both images and text, like the examples shown below, this information could be encoded manually, but doing so would take a substantial amount of effort, and the manual approach could not be used in applications that generate data constantly. In a way, the goal of feature learning algorithms is to distill the raw data down into a succinct set of descriptors that can be used for more classical machine learning or decision making.
 
Figure 1: An example of the types of labels that would be useful to have, starting from just the raw image.
Example reviews from the IMDB dataset:
  positive,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari ...."
  positive,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be ..."
  negative,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to ..."
  positive,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a ..."
In these problems, the relevant features only arise as complex interactions between the raw data elements.
The main idea of deep learning is to learn these more complex features one layer at a time. For image data, the first layer recognizes interactions between individual pixels. Specifically, individual features are designed to “activate” when particular pixel interactions are present. The second layer learns to recognize interactions between features in the first layer, and so on, until the learned features correspond to more “high-level” concepts, like sidewalk or pedestrian.
Below is a toy example of how an image is processed into feature activations along a sequence of layers. Each pixel within a feature map corresponds to a patch of pixels in the original image; pixels in feature maps later in the network have a larger field of view than those early on. A pixel in a feature map has a large value if any of the image features that it is sensitive to are present within its field of view.
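To make the idea of a feature "activating" concrete, here is a minimal base R sketch of a single filter sliding over a tiny image. The 6 x 6 image, the hand-written `edge_filter`, and the helper `convolve_2d` are all hypothetical and chosen only for illustration; in a real network the filter weights are learned rather than written by hand.

```r
# Toy 6 x 6 grayscale image: dark (0) on the left, bright (1) on the right.
image <- matrix(0, nrow = 6, ncol = 6)
image[, 4:6] <- 1

# Hypothetical 3 x 3 filter. matrix() fills column by column, so the left
# column is -1 and the right column is +1: it responds to patches that are
# dark on the left and bright on the right.
edge_filter <- matrix(c(-1, -1, -1,
                         0,  0,  0,
                         1,  1,  1), nrow = 3)

# Slide the filter over every 3 x 3 patch that fits inside the image, take
# the sum of elementwise products, then zero out negative values (ReLU).
# (As in most deep learning libraries, this is a cross-correlation.)
convolve_2d <- function(image, filter) {
  k <- nrow(filter)
  out <- matrix(0, nrow(image) - k + 1, ncol(image) - k + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- image[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- sum(patch * filter)
    }
  }
  pmax(out, 0)
}

feature_map <- convolve_2d(image, edge_filter)
feature_map
```

The resulting feature map is 4 x 4, smaller than the input, and its entries are large only in the columns whose 3 x 3 patch straddles the dark-to-bright boundary.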
 
Figure 2: A toy diagram of feature maps from the model loaded below. Early layers have fewer but larger feature maps, while later layers have many small ones. The later layers typically contain higher-level concepts used in the final predictions.
At the end of the feature extraction process, all the features are passed into a final linear or logistic regression module that completes the regression or classification task, respectively.
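The final step can be sketched in a few lines of base R. The feature values, weights, and bias below are made-up stand-ins (in a trained network they would come from the last feature extraction layer and from training, respectively); the point is only that the last module is an ordinary linear or logistic regression applied to learned features.

```r
set.seed(1)

# Hypothetical stand-ins for the flattened feature activations and for the
# learned weights and bias of the final module.
features <- runif(8)
weights <- rnorm(8)
bias <- 0.1

# Linear regression head: a weighted sum of the features.
linear_score <- sum(weights * features) + bias

# Logistic regression head: squash the same score into a probability.
probability <- plogis(linear_score)
```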
It is common to refer to each feature map as a neuron. Different neurons activate when different patterns are present in the original, underlying image.
 
Figure 3: An illustration of the different spatial contexts of feature maps at different layers. An element of a feature map has a large value (orange in the picture) if the feature that it is sensitive to is present in its spatial context. Higher-level feature maps are smaller, but each pixel within them has a larger spatial context.
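The growth in spatial context can be worked out with simple arithmetic. The sketch below assumes a stack of four 3 x 3 convolutions (stride 1), each followed by 2 x 2 max pooling (stride 2). These kernel sizes and strides are assumptions, chosen to be consistent with the layer shapes in the model summary printed later in these notes; they are not read from the model itself.

```r
# Assumed architecture: alternating 3 x 3 convolutions and 2 x 2 max pooling.
layers <- data.frame(
  name   = rep(c("conv", "pool"), times = 4),
  kernel = rep(c(3, 2), times = 4),
  stride = rep(c(1, 2), times = 4)
)

# rf: width (in input pixels) seen by one element of the current feature map.
# jump: distance (in input pixels) between neighboring elements of that map.
rf <- 1
jump <- 1
receptive_field <- numeric(nrow(layers))
for (l in seq_len(nrow(layers))) {
  rf <- rf + (layers$kernel[l] - 1) * jump
  jump <- jump * layers$stride[l]
  receptive_field[l] <- rf
}

layers$receptive_field <- receptive_field
layers
```

Under these assumptions, an element of the first convolutional feature map sees a 3 x 3 patch of the input, while an element of the last pooled feature map sees a 46 x 46 patch.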
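library(keras)

# Download the pretrained model to a temporary file and load it.
# mode = "wb" keeps the binary HDF5 file intact on Windows.
f <- tempfile()
download.file("https://uwmadison.box.com/shared/static/9wu6amgizhgnnefwrnyqzkf8glb6ktny.h5", f, mode = "wb")
model <- load_model_hdf5(f)
model
Model: "sequential_1"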
______________________________________________________________________
 Layer (type)                  Output Shape                Param #    
======================================================================
 conv2d_7 (Conv2D)             (None, 148, 148, 32)        896        
 max_pooling2d_7 (MaxPooling2  (None, 74, 74, 32)          0          
 D)                                                                   
 conv2d_6 (Conv2D)             (None, 72, 72, 64)          18496      
 max_pooling2d_6 (MaxPooling2  (None, 36, 36, 64)          0          
 D)                                                                   
 conv2d_5 (Conv2D)             (None, 34, 34, 128)         73856      
 max_pooling2d_5 (MaxPooling2  (None, 17, 17, 128)         0          
 D)                                                                   
 conv2d_4 (Conv2D)             (None, 15, 15, 128)         147584     
 max_pooling2d_4 (MaxPooling2  (None, 7, 7, 128)           0          
 D)                                                                   
 flatten_1 (Flatten)           (None, 6272)                0          
 dropout (Dropout)             (None, 6272)                0          
 dense_3 (Dense)               (None, 512)                 3211776    
 dense_2 (Dense)               (None, 1)                   513        
======================================================================
Total params: 3453121 (13.17 MB)
Trainable params: 3453121 (13.17 MB)
Non-trainable params: 0 (0.00 Byte)
______________________________________________________________________
While we will only consider image data in this course, the idea of learning complex features by composing a few types of layers is a general one. For example, in sentiment analysis, the first layer learns features that activate when specific combinations of words are present in close proximity to one another. The next layer learns interactions between phrases, and later layers are responsive to more sophisticated grammar.
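Returning to the model summary above, the parameter counts and output shapes can be reproduced by hand, which is a useful check on how the layers compose. The sketch below assumes 3 x 3 convolution kernels without padding, a 150 x 150 RGB input, and 2 x 2 pooling; none of these are stated in the summary, but they are consistent with every number it reports. The helpers `conv_params` and `dense_params` are hypothetical names introduced here.

```r
# Each filter has one weight per kernel position and input channel, plus a bias.
conv_params <- function(kernel, channels_in, filters) {
  (kernel * kernel * channels_in + 1) * filters
}

# Each output unit has one weight per input unit, plus a bias.
dense_params <- function(units_in, units_out) {
  (units_in + 1) * units_out
}

params <- c(
  conv2d_7 = conv_params(3, 3, 32),
  conv2d_6 = conv_params(3, 32, 64),
  conv2d_5 = conv_params(3, 64, 128),
  conv2d_4 = conv_params(3, 128, 128),
  dense_3  = dense_params(7 * 7 * 128, 512),
  dense_2  = dense_params(512, 1)
)
params
sum(params)

# Spatial size after each conv (loses 2 pixels) and pool (halves, rounding down).
size <- 150  # assumed input width and height
sizes <- numeric(0)
for (block in 1:4) {
  size <- size - 2
  sizes <- c(sizes, size)
  size <- size %/% 2
  sizes <- c(sizes, size)
}
sizes
```

The pooling, flatten, and dropout layers have no parameters, so the six counts above account for the full total reported in the summary.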
Deep learning is often called a black box because these intermediate features are often complex and not directly interpretable according to human concepts. The problem is further complicated by the fact that features are “distributed” in the sense that a single human concept can be encoded by a configuration of multiple features. Conversely, the same model feature can encode several human concepts.
For this reason, a literature has grown around the question of interpreting neural networks. The field relies on visualization and interaction to attempt to understand the learned representations, with the goal of increasing the safety and scientific usability of deep learning models. While our class will not discuss how to design or develop deep learning models, we will get a taste of the interpretability literature in the next few lectures.