class: title
background-size: cover

<div id="title">
Visualization in Deep Learning: <br/> Theme and Variations
</div>
<div id="links">
Slides: <a href="https://go.wisc.edu/9p83o9">https://go.wisc.edu/9p83o9</a>
</div>
<br/>
<br/>

.center[
<img src="figures/svcca_training.gif"/>
]

<div id="subtitle">
Kris Sankaran <br/>
<a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/>
16 | November | 2023 <br/>
Machine Learning Lunch Meetings
</div>

---

### Audience Question

What kinds of visualizations do you use for your research? Why do you make them?

[**https://go.wisc.edu/z03w7z**](https://go.wisc.edu/z03w7z)

---

### Variations

Visualization can help:

1. Summarize training dynamics.
2. Reason about errors and predictions.
3. Describe memory and representation learning mechanisms.

These abstractions help with:

1. Training and improving real-world models. (1, 2)
2. Guiding research through improved mental models. (1, 3)
3. Tying models into broader scientific discussion. (3)

---

### Variations

I’ll be drawing examples mainly from two projects:

.pull-left[
1. **Glacier Ecosystem Mapping**: Application of satellite image segmentation to climate change adaptation and disaster preparedness.
2. Remembrances of States Past: A visual analysis of "time warping" in sequence models.
]

.pull-right[
<img src="figures/GL082082E30346N.png" width="100%"/>
]

---

### Variations

I’ll be drawing examples mainly from two projects:

.pull-left[
1. Glacier Ecosystem Mapping: Application of satellite image segmentation to climate change adaptation and disaster preparedness.
2. **Remembrances of States Past**: A visual analysis of "time warping" in sequence models.
]

.pull-right[
<iframe width="100%" height="800" frameborder="0" src="https://observablehq.com/embed/@krisrs1128/remembrances-of-states-past@1310?cells=chart9"></iframe>
]

---

## Visualizing Training Dynamics

---

### Loss Curves

.pull-left[
These visualizations are easy to make and highlight whether the model is over- or underfitting. This has immediate consequences for model architecture, optimization hyperparameters, and regularization.
]

.pull-right[
<img src="figures/loss_example.png" width=500/>

Figure from .
]

---

### Example: Counting Crossings

In this toy problem, we apply a sequence model (with gated recurrent units) to count the number of times a curve has crossed the grey band.

.center[
<iframe width="800" height="484" frameborder="0" src="https://observablehq.com/embed/@krisrs1128/remembrances-of-states-past?cells=chart"></iframe>
]

---

### Example: Counting Crossings

The example dataset is a collection of labeled pairs:

* `\(\mathbf{x}_{i} \in \mathbb{R}^{200}\)`: A random trajectory stored as a long vector.
* `\(y_i\)`: The number of times the trajectory crosses the interval `\(\left[0, 1\right]\)`.

.center[
<iframe width="100%" height="400" frameborder="0" src="https://observablehq.com/embed/@krisrs1128/remembrances-of-states-past@1310?cells=chart9"></iframe>
]

---

### Loss Curves

It's interesting to visualize the evolution of instance-level errors, especially examples that remain difficult to predict late in training.

.center[
<iframe width="950" height="500" frameborder="0" src="https://observablehq.com/embed/@krisrs1128/remembrances-of-states-past@1311?cells=chart11"></iframe>
]

---

### <span style="color: red;">Dynamic Linking</span>

This visualization applies dynamic linking [1]. By coordinating interaction across panels, we can show several views of the same data.
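The coordination logic behind dynamic linking is easy to sketch. Below is a minimal, framework-free Python illustration (the names `LinkedSelection` and `MeanErrorPanel` are invented for this sketch; a real dashboard would use a library like Vega-Lite or Plotly Dash): panels subscribe to a shared selection and recompute whenever it changes.

```python
# Sketch of dynamic linking: panels subscribe to a shared selection
# and refresh whenever it changes. Illustrative only.

class LinkedSelection:
    def __init__(self):
        self.indices = set()   # currently selected sample indices
        self._views = []       # panels to notify on change

    def register(self, view):
        self._views.append(view)

    def update(self, indices):
        self.indices = set(indices)
        for view in self._views:
            view.refresh(self.indices)

class MeanErrorPanel:
    """Toy 'panel' that tracks the mean error of the selected examples."""
    def __init__(self, errors):
        self.errors = errors
        self.summary = None

    def refresh(self, indices):
        chosen = [self.errors[i] for i in indices]
        self.summary = sum(chosen) / len(chosen) if chosen else None

errors = [0.9, 0.1, 0.4, 0.8]
selection = LinkedSelection()
panel = MeanErrorPanel(errors)
selection.register(panel)

selection.update([0, 3])   # brushing in one view updates the other
print(panel.summary)       # mean error of examples 0 and 3
```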
---

### Activations and Gradients

.pull-left[
More than losses, it can be worthwhile to visualize feature maps and gradients throughout the training process.

These are activations from a U-Net model trained to segment glaciers [2]. The skip connections prevented the deepest layer in the architecture from ever being learned!
]

.pull-right[
<img src="figures/glacier_representations.png" width=460/>
]

---

### Dimensionality Reduction

.pull-left[
1. Dimensionality reduction can shed light on training dynamics for the full model.
2. This figure from [3] shows that unsupervised pretraining acts like a regularizer. It's also interesting because it works on function (not parameter) space.
]

.pull-right[
<img src="figures/unsupervised-pretraining.png" width=550/>
]

---

## Visualizing Errors and Predictions

---

### Visualizing Errors

We fit satellite image segmentation models to datasets on building footprints [4]. Here are examples that were flagged as being of poor quality.

.center[
<img src="figures/img_6560_kitsap4.png" width=250/>
<img src="figures/pred_6560_kitsap4.png" width=250/>
<img src="figures/mask_6560_kitsap4.png" width=250/>
]

In both cases, our worst examples are due to label noise. Can you tell what happened?

---

For problems where each sample is associated with a continuous accuracy measure, we can look at representatives from across the continuum.

.center[
<img src="figures/glacial_lakes_grid.gif" width=700/>
]

Different models can have different failure modes, and this kind of visualization compactly represents that.

---

### <span style="color: #D93611;">Small Multiples</span>

In visualization, repeating a view across many parallel instances is called small multiples. This creates information-dense views.

.center[
<img src="figures/small-multiples.jpeg" width=850/>
]

---

### <span style="color: #D93611;">Data-to-Ink</span>

.pull-left[
We removed extraneous plot elements (e.g., unnecessary tick marks and grid lines).
This aligns with the goal of maximizing the data-to-ink ratio.
]

.pull-right[
<img src="figures/glacial_lakes_grid.gif"/>
]

---

### Smoothed Error Rates

We applied PCA to the activations of the bottleneck layer in the building footprint segmentation model. The background shows smoothed error rates across training examples.

.center[
<iframe src="https://adrijanik.github.io/unet-vis#div_main" width=1000 height=350></iframe>
]

---

### Navigating Predictions

.pull-left[
How might a model's predictions guide decision-making?

We worked on a project with ICIMOD to identify regions with rapidly growing lakes. This is a risk factor for glacial lake outbursts.
]

.pull-right[
<img src="figures/nyt_glacial_lakes.png" width=450/>
]

---

### Navigating Predictions

We fit trends of estimated areas across the predicted segmentation maps. This volcano plot shows those that need more proactive monitoring.

.center[
<img src="figures/volcano_plot.gif" width=640/>
]

---

### Navigating Predictions

We fit trends of estimated areas across the predicted segmentation maps. This volcano plot shows those that need more proactive monitoring.

.center[
<img src="figures/sanka8-3215722-large.gif" width=1000/>
]

---

## Visualizing Representations

---

### Learning Abstractions

.pull-left[
A central goal of deep learning is to automatically learn higher-level abstractions from data. What visualizations can help us gauge progress?
]

.pull-right[
<img src="figures/semantic-representation.png" width=230/>

Figure from [5]
]

---

### The Role of Visualization?

Deep learning models should sense higher-order abstractions. We can evaluate this by analyzing their learned representations.

1. How do architectural components compare?
2. Why do training practices like transfer learning and normalization help?
3. How do data-driven representations relate to concepts designed by human experts?

---

### Parking lot or…?
Returning to the building footprint labeling task, here are two images we found with similar feature activations.

.pull-left[
<img src="figures/cars.png" width=400/>
]

.pull-right[
<img src="figures/graves.png" width=400/>
]

---

### Visualizing LSTMs

The classic paper [6] looked at feature activations in character-level sequence models. For example, it discovered representations related to sequence position and properties of code.

<img src="figures/lstm-activations-1.png"/>

---

### Visualizing LSTMs

This paper helped demystify the mechanics of LSTM models. We could see how gating prevented important pieces of memory from being overwritten over long stretches of text.

<img src="figures/lstm-activations-2.png"/>

---

### Counting Model

Let’s look at features from the counting model. The first few layers encode the general `\(y\)`-value of the trajectory. Later ones focus on crossings.

<iframe width="100%" height="484" frameborder="0" src="https://observablehq.com/embed/@krisrs1128/remembrances-of-states-past@1324?cells=chart7"></iframe>

---

### Comparing Representations

To compare high-dimensional representations, we need to measure multivariate association. Popular choices are CKA [7] and SVCCA [8], though also note [9].

<img src="figures/cka.png"/>

A CKA representation analysis contrasting ViT and ResNet representations across depths, from [10].

---

### Comparing Representations

To compare high-dimensional representations, we need to measure multivariate association. Popular choices are CKA [7] and SVCCA [8], though also note [9].

.center[
<img src="figures/cca_angle.png" width=500/>
]

These contrast the column spaces of feature activation matrices.

---

### Learned Representations in Science

Scientific foundation models are gaining prominence, and they are often accompanied by dimensionality reduction plots. Can we make something that encourages more precise discourse?
.center[
<iframe src="https://esmatlas.com/explore" width=950 height=350></iframe>
]

---

### Learned Representations in Science

Scientific foundation models are gaining prominence, and they are often accompanied by dimensionality reduction plots. Can we make something that encourages more precise discourse?

.center[
<img src="figures/scgpt-atlas.png" width=700/>
]

---

### Intriguing Experiment

.pull-left[
[11; 12] presented connections between linguistics and BERT embeddings. For example, they found that embedding and formal parse-tree distances were closely related.
]

.pull-right[
<img src="figures/parse-tree.png" width=500/>
]

---

### Intriguing Experiment

They also saw how the embeddings reflect word sense disambiguation and built an app to query different words interactively.

.center[
<img src="figures/word-sense.png" width=840/>
]

---

### General Lessons?

I like how these visualizations:

(1) Encourage readers to engage with existing mental models.

(2) Provide domain-relevant context for interacting with learned representations.

This seems like a promising way to balance efficiency and agency in theory building -- a way to avoid the "kaggleization of science" [13].

---

### Multi-omics Analog

How might these ideas play out in multi-omics foundation models?

* Gene regulatory networks ↔ parse trees. Both provide simple abstractions for reasoning about complex processes.
* Gene-sense disambiguation. A protein’s purpose can depend on its cellular surroundings, and foundation models may have learned to represent this.

.center[
<img src="figures/go-graph.png"/>
]

---

### Conclusion

I hope that you have learned a few visualization ideas that can help your research and collaborations. Some final thoughts:

1. Visualization bridges human and machine representations.
2. Training, evaluation, and representation analysis all provide opportunities for thoughtful visualization.

---

### References

[1] B. Shneiderman.
"The eyes have it: a task by data type taxonomy for information visualizations". In: _Proceedings 1996 IEEE Symposium on Visual Languages_ (1996), pp. 336-343. <https://api.semanticscholar.org/CorpusID:2281975>.

[2] M. Zheng, X. Miao, and K. Sankaran. "Interactive Visualization and Representation Analysis Applied to Glacier Segmentation". In: _ISPRS Int. J. Geo Inf._ 11 (2021), p. 415.

[3] D. Erhan, A. C. Courville, Y. Bengio, et al. "Why Does Unsupervised Pre-training Help Deep Learning?" In: _International Conference on Artificial Intelligence and Statistics_. 2010.

---

### References

[4] A. Janik, K. Sankaran, and A. Ortiz. "Interpreting Black-Box Semantic Segmentation Models in Remote Sensing Applications". In: _MLVis@EuroVis_. 2019. <https://api.semanticscholar.org/CorpusID:201139207>.

[5] Y. Bengio. "Learning Deep Architectures for AI". In: _Found. Trends Mach. Learn._ 2 (2007), pp. 1-127. <https://api.semanticscholar.org/CorpusID:207178999>.

[6] A. Karpathy, J. Johnson, and L. Fei-Fei. "Visualizing and Understanding Recurrent Networks". In: _ArXiv_ abs/1506.02078 (2015). <https://api.semanticscholar.org/CorpusID:988348>.

[7] A. Saha, A. Bialkowski, and S. Khalifa. "Distilling Representational Similarity using Centered Kernel Alignment (CKA)". In: _British Machine Vision Conference_. 2022. <https://api.semanticscholar.org/CorpusID:256902315>.

---

### References

[8] M. Raghu, J. Gilmer, J. Yosinski, et al. "SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability". In: _Neural Information Processing Systems_. 2017. <https://api.semanticscholar.org/CorpusID:23890457>.

[9] J. Josse and S. P. Holmes. "Measuring multivariate association and beyond". In: _Statistics Surveys_ 10 (2016), pp. 132-167. <https://api.semanticscholar.org/CorpusID:207137323>.

[10] M. Raghu, T. Unterthiner, S. Kornblith, et al. "Do Vision Transformers See Like Convolutional Neural Networks?" In: _Neural Information Processing Systems_. 2021.
<https://api.semanticscholar.org/CorpusID:237213700>.

---

### References

[11] A. Coenen, E. Reif, A. Yuan, et al. "Visualizing and Measuring the Geometry of BERT". In: _ArXiv_ abs/1906.02715 (2019). <https://api.semanticscholar.org/CorpusID:174802633>.

[12] J. Hewitt and C. D. Manning. "A Structural Probe for Finding Syntax in Word Representations". In: _North American Chapter of the Association for Computational Linguistics_. 2019. <https://api.semanticscholar.org/CorpusID:106402715>.

[13] _Fireside Chat with Christopher Manning_. <https://www.microsoft.com/en-us/research/video/fireside-chat-with-christopher-manning/>. Accessed: 2023-11-14.

---

### Navigating Predictions

We implemented a Shiny app to look up images from lakes with interesting trends.

<iframe width=1000 src="https://krisrs1128.shinyapps.io/glacial_lake_visualization/" height=450></iframe>

---

### <span style="color: #D93611;">Focus-plus-Context</span>

This is an instance of the focus-plus-context principle [14]. The idea is to let the reader zoom into patterns of interest without losing relevant context.

.center[
<img src="figures/doitree_dmoz.gif" width=800/>
]

---

### Visualizing GRU Mechanics

`\begin{align*}
{\color{#9955bb}h_{t}} &= \left(1 - {\color{#ffba00}z_t}\right) \circ {\color{#ff9966}{h_{t - 1}}} + {\color{#ffba00}z_{t}} \circ \tilde{h}_{t} \\
{\color{#ffba00}{z_t}} &= \sigma\left(W_z {\color{#ab294d}{x_t}} + U_z {\color{#ff9966}{h_{t - 1}}}\right) \\
\tilde{h}_{t} &= \tanh\left({\color{#298eab}W}{\color{#ab294d}{x_t}} + {\color{#29ab87}{U}}\left({\color{#ffa6c9}r_t} \circ {\color{#ff9966}h_{t - 1}}\right)\right)
\end{align*}`

<iframe width="100%" height="409" frameborder="0" src="https://observablehq.com/embed/@krisrs1128/remembrances-of-states-past?cells=chart5"></iframe>
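As a concrete companion to these equations, here is a minimal NumPy sketch of a single GRU step. The weights are random placeholders, and the reset gate `\(r_t\)` (not defined above) is filled in with its standard formulation `\(r_t = \sigma(W_r x_t + U_r h_{t-1})\)`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU update, mirroring the equations above."""
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev)          # update gate
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev)          # reset gate (standard form)
    h_tilde = np.tanh(p["W"] @ x_t + p["U"] @ (r_t * h_prev))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                # new hidden state

rng = np.random.default_rng(0)
d_x, d_h = 1, 4   # scalar input (one trajectory value), four hidden units
params = {k: rng.normal(size=(d_h, d_x)) for k in ["W_z", "W_r", "W"]}
params |= {k: rng.normal(size=(d_h, d_h)) for k in ["U_z", "U_r", "U"]}

# Run the recurrence over a short trajectory.
h = np.zeros(d_h)
for x in [0.2, 1.3, 0.8, -0.4]:
    h = gru_step(np.array([x]), h, params)
print(h.shape)   # hidden state after the last step
```

Because `\(z_t \in (0, 1)\)` interpolates between `\(h_{t-1}\)` and the bounded candidate `\(\tilde{h}_t\)`, the hidden state stays in `\((-1, 1)\)` when initialized at zero, which is the "gating protects memory" behavior the LSTM visualizations highlighted.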