An introduction to compositional feature learning.
In classical machine learning, we assume that the features most relevant to prediction are already available. For example, when predicting home prices, we already have features such as square footage and neighborhood income, which are clearly relevant to the prediction task.
In many modern problems, though, we only have access to data where the most relevant features have not been directly encoded.
With images or text, such as the movie reviews below, this information could be encoded manually, but doing so would take a substantial amount of effort, and the manual approach could not be used in applications that generate data constantly. In a way, the goal of deep learning algorithms is to distill the raw data down into a succinct set of descriptors that can be used for more classical machine learning or decision making.
Example reviews from the IMDB dataset:
positive,"A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari ...."
positive,"I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be ..."
negative,"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to ..."
positive,"Petter Mattei's ""Love in the Time of Money"" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money, power and success do to people in the different situations we encounter. <br /><br />This being a ..."
In these problems, the relevant features only arise as complex interactions between the raw data elements.
The main idea of deep learning is to learn these more complex features one layer at a time. For image data, the first layer recognizes interactions between individual pixels. Specifically, individual features are designed to “activate” when particular pixel interactions are present. The second layer learns to recognize interactions between features in the first layer, and so on, until the learned features correspond to more “high-level” concepts, like sidewalk or pedestrian.
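To make the layer-by-layer idea concrete, here is a minimal sketch in base R. The image, the helper `conv_relu`, and both filters are hypothetical and set by hand for illustration; in a real network the filter values would be learned from data.

```r
# Toy 6 x 6 grayscale image: dark left half (0), bright right half (1).
image <- matrix(rep(c(0, 0, 0, 1, 1, 1), each = 6), nrow = 6)

# Slide a square filter over the input (no padding, stride 1) and apply a
# ReLU, so a feature "activates" only where its pattern is present.
conv_relu <- function(x, filter) {
  k <- nrow(filter)
  out <- matrix(0, nrow(x) - k + 1, ncol(x) - k + 1)
  for (i in seq_len(nrow(out))) {
    for (j in seq_len(ncol(out))) {
      patch <- x[i:(i + k - 1), j:(j + k - 1)]
      out[i, j] <- max(0, sum(patch * filter))
    }
  }
  out
}

# Layer 1 (hand-set, hypothetical): responds to a dark-to-bright step
# between two horizontally adjacent pixels.
vertical_edge <- matrix(c(-1, 1,
                          -1, 1), nrow = 2, byrow = TRUE)
layer1 <- conv_relu(image, vertical_edge)

# Layer 2 (hand-set, hypothetical): responds when the layer-1 edge feature
# is active in two vertically adjacent positions, i.e. a longer edge.
long_edge <- matrix(c(1, 0,
                      1, 0), nrow = 2, byrow = TRUE)
layer2 <- conv_relu(layer1, long_edge)

layer1
layer2
```

The second layer never looks at pixels directly. It only combines activations from the first layer, which is the sense in which features are composed.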
Below is a toy example of how an image is processed into feature activations along a sequence of layers. Each pixel within a feature map corresponds to a patch of pixels in the original image; pixels later in the network have a larger field of view than those early on. A pixel in a feature map has a large value if any of the image features that it is sensitive to are present within its field of view.
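The growth of the field of view can be worked out with a little arithmetic. The sketch below assumes a hypothetical stack of 3 x 3 convolutions (stride 1) and 2 x 2 pooling layers (stride 2); the layer names and sizes are illustrative, not taken from any particular model.

```r
# Hypothetical stack of layers: kernel width and stride for each one.
layers <- data.frame(
  name   = c("conv_a", "pool_a", "conv_b", "pool_b"),
  kernel = c(3, 2, 3, 2),
  stride = c(1, 2, 1, 2)
)

field <- 1  # width (in input pixels) seen by one pixel of the current map
jump  <- 1  # distance (in input pixels) between neighboring map pixels
layers$field_of_view <- NA

for (i in seq_len(nrow(layers))) {
  # Each layer widens the view by (kernel - 1) steps of the current jump.
  field <- field + (layers$kernel[i] - 1) * jump
  # Strided layers spread neighboring outputs further apart in the input.
  jump <- jump * layers$stride[i]
  layers$field_of_view[i] <- field
}

layers
```

Under these assumptions, a pixel after the first convolution sees a 3 x 3 patch of the image, while a pixel after the second pooling layer sees a 10 x 10 patch.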
At the end of the feature extraction process, all the features are passed into a final linear or logistic regression module that completes the regression or classification task, respectively.
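As a rough sketch of this final step, the base R code below applies a logistic regression to a small vector of extracted features. The feature values, weights, and bias are made up for illustration; in practice they come from the earlier layers and from training.

```r
# Hypothetical flattened feature activations from the last feature layer.
features <- c(0.0, 2.0, 0.5, 1.5)

# Hypothetical learned weights and bias of the final module.
weights <- c(0.5, 1.0, -2.0, 0.4)
bias <- -1

# Linear combination of the features. For a regression task, this value
# would be the prediction itself.
linear_score <- sum(weights * features) + bias

# For binary classification, a sigmoid maps the score to a probability.
probability <- 1 / (1 + exp(-linear_score))

linear_score
probability
```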
It is common to refer to each feature map as a neuron. Different neurons activate when different patterns are present in the original, underlying image.
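library(keras)

# Download the pretrained model to a temporary file, then load it.
# mode = "wb" keeps the binary HDF5 file intact on Windows.
f <- tempfile()
download.file("https://uwmadison.box.com/shared/static/9wu6amgizhgnnefwrnyqzkf8glb6ktny.h5", f, mode = "wb")
model <- load_model_hdf5(f)
model # print the layer-by-layer summary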
Model
Model: "sequential_1"
______________________________________________________________________
Layer (type)                     Output Shape               Param #
======================================================================
conv2d_7 (Conv2D)                (None, 148, 148, 32)       896
______________________________________________________________________
max_pooling2d_7 (MaxPooling2D)   (None, 74, 74, 32)         0
______________________________________________________________________
conv2d_6 (Conv2D)                (None, 72, 72, 64)         18496
______________________________________________________________________
max_pooling2d_6 (MaxPooling2D)   (None, 36, 36, 64)         0
______________________________________________________________________
conv2d_5 (Conv2D)                (None, 34, 34, 128)        73856
______________________________________________________________________
max_pooling2d_5 (MaxPooling2D)   (None, 17, 17, 128)        0
______________________________________________________________________
conv2d_4 (Conv2D)                (None, 15, 15, 128)        147584
______________________________________________________________________
max_pooling2d_4 (MaxPooling2D)   (None, 7, 7, 128)          0
______________________________________________________________________
flatten_1 (Flatten)              (None, 6272)               0
______________________________________________________________________
dropout (Dropout)                (None, 6272)               0
______________________________________________________________________
dense_3 (Dense)                  (None, 512)                3211776
______________________________________________________________________
dense_2 (Dense)                  (None, 1)                  513
======================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0
______________________________________________________________________
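The parameter counts in this summary can be reproduced by hand. The sketch below assumes 3 x 3 convolution kernels and a 3-channel input image; neither is stated in the printout, but both are consistent with the counts shown. The helpers `conv_params` and `dense_params` are hypothetical names introduced here.

```r
# Each convolution filter has kernel * kernel * channels_in weights plus
# one bias, and there is one filter per output channel.
conv_params <- function(kernel, channels_in, channels_out) {
  (kernel * kernel * channels_in + 1) * channels_out
}

# Each dense unit has one weight per input plus one bias.
dense_params <- function(units_in, units_out) {
  (units_in + 1) * units_out
}

params <- c(
  conv2d_7 = conv_params(3, 3, 32),     # assumed 3-channel input
  conv2d_6 = conv_params(3, 32, 64),
  conv2d_5 = conv_params(3, 64, 128),
  conv2d_4 = conv_params(3, 128, 128),
  dense_3  = dense_params(7 * 7 * 128, 512),  # 6272 flattened features
  dense_2  = dense_params(512, 1)
)

params
sum(params)
```

Pooling, flatten, and dropout layers have no parameters, which is why they show 0 in the summary. Most of the parameters sit in the first dense layer, not in the convolutions.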
While we will only consider image data in this course, the idea of learning complex features by composing a few types of layers is a general one. For example, in sentiment analysis, the first layer learns features that activate when specific combinations of words are present in close proximity to one another. The next layer learns interactions between phrases, and later layers are responsive to more sophisticated grammar.
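A hand-written analogue of this idea for text is sketched below in base R. The phrase detectors are hypothetical and fixed by hand; an actual model would learn which word combinations matter and would typically work with word embeddings instead of exact string matches.

```r
# A sentence adapted from one of the example reviews, punctuation removed.
review <- "the plot is simplistic but the dialogue is witty"
words <- strsplit(tolower(review), " ")[[1]]

# All pairs of adjacent words in the review.
bigrams <- paste(head(words, -1), tail(words, -1))

# "First layer" (hypothetical): each feature activates when a specific
# combination of neighboring words is present.
phrases <- c(is_witty = "is witty",
             is_simplistic = "is simplistic",
             not_funny = "not funny")
first_layer <- sapply(phrases, function(p) as.numeric(p %in% bigrams))

# "Second layer" (hypothetical): an interaction between two phrase
# features, active only when praise and criticism both appear.
mixed_opinion <- unname(first_layer["is_witty"] * first_layer["is_simplistic"])

first_layer
mixed_opinion
```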
Deep learning is often called a black box because these intermediate features tend to be complex and not directly interpretable in terms of human concepts. The problem is further complicated by the fact that features are “distributed” in the sense that a single human concept can be encoded by a configuration of multiple features. Conversely, the same model feature can encode several human concepts.
For this reason, a literature has grown around the question of interpreting neural networks. The field relies on visualization and interaction to attempt to understand the learned representations, with the goal of increasing the safety and scientific usability of deep learning models. While our class will not discuss how to design or develop deep learning models, we will get a taste of the interpretability literature in the next few lectures.