|Date||Presenter||Topic or Paper|
Dan Hendrycks is a second-year PhD student at UC Berkeley, advised by Jacob Steinhardt and Dawn Song. His research aims to disentangle and concretize the components necessary for safe AI. This leads him to work on quantifying and improving the performance of models in unforeseen out-of-distribution scenarios, and more recently he works on machine ethics. Dan received his BS from the University of Chicago. https://twitter.com/DanHendrycks
Although ResNets and BERT achieve high accuracy on in-distribution examples, do they generalize to new distributions? In this talk I survey benchmarks in vision and NLP that measure how well models hold up when there is a discrepancy between the train and test set. The talk will draw on results in NLP from http://arxiv.org/abs/2004.06100 and recent vision results from http://arxiv.org/abs/2006.16241
|2020.08.14||Zhongqi Miao, Ziwei Liu||Open Compound Domain Adaptation|
|2020.08.21||Anna Goldie, Azalia Mirhoseini||Chip Placement with Deep Reinforcement Learning|
|2020.08.28||Arianna Ornaghi||Stereotypes in High-Stakes Decisions: Evidence from U.S. Circuit Courts|
|2020.09.04||Shrimai Prabhumoye||Controllable Text Generation|
|2020.09.11||Jesse Mu||Compositional Explanations of Neurons|
|2020.09.18||Katherine Ye||Penrose: From Mathematical Notation to Beautiful Diagrams|
|2020.09.25||Sidak Pal Singh||Model Fusion via Optimal Transport|
“A super influential reading group that has achieved cult-like status.” —John Sears
Deep Learning: Classics and Trends (DLCT) is a reading group I have been running since 2018. It started within Uber AI Labs, with the support of Zoubin, Ken and Jason, and the help of many, when we felt the need for a space to sample the overwhelmingly large number of papers, and to hold free-form, judgemental (JK) discussions; or as Piero puts it, to “ask a million questions”.
Since then, it has grown much larger, first opened up to the broader machine learning community in Uber, then to the general public in 2019. Starting March 2020, in light of COVID-19, we hold all meetings virtually, making it radically accessible to anyone from anywhere. From June 2020, DLCT operates under ML Collective, with a mission of making researchers more connected.
To me, it’s more than reading papers and attending presentations. It has started to serve as an anchor for all of us to connect every once in a while amidst all the changes, shifts of emphasis, and chaos, in the Bay Area, in AI research, and generally in this fast-paced world.
The best thing about it is the group of people that it enables to connect—seriously, the smartest and kindest researchers that I feel so lucky to have known and have worked with.
- Time: Every Friday, 12pm - 1pm Pacific Time
- Place: Virtually on Zoom (up to 100 participants)
- Format: Presentation-based. An invited speaker talks through a paper with slides; often the speaker is one of the paper's authors.
- Scope: Deep learning, old (a.k.a. “let’s revisit the 2014 GAN paper”) and new (a.k.a. “look at this blog post from yesterday”).
- Join the mailing list to receive weekly notifications of the upcoming talk.
- Nominate a speaker, a paper, or just tell me what you think.
If you are an ML researcher and have a paper that you are proud of: tell us (besides posting on Twitter)! This could be yet another platform for feedback and engagement with a community of 600+ students, scientists, ML engineers and enthusiasts.
|Date||Presenter||Topic or Paper|
Ben Mann is a Member of Technical Staff at OpenAI. He is the go-to person for data engineering, but dabbles in everything. Outside work, he blogs about a wide range of topics from ML to hiking to pooping better. One day he hopes to make superintelligent AI that is safe and beneficial for humanity.
I’ll describe our major contributions in this paper, as well as where we fell short. My work was mainly on training data, eval memorization, and the eval suite. I’ll offer deep dives on these sections.
Hanie Sedghi is a senior research scientist at Google Brain, where she leads the “Deep Phenomena” research group. Her approach is to bond theory and practice in large-scale machine learning by designing algorithms with theoretical guarantees that also work efficiently in practice. Over the recent years, she has been working on understanding deep learning phenomena and improving the training algorithms. Hanie has various publications in this area and has organized many workshops to expand the domain, such as Deep Phenomena workshop at ICML 2019 and Deep Learning Day at KDD 2020. She is an area chair at ICML, ICLR, ALT, a member of JMLR editorial board and has served as a reviewer for many prominent conferences. Hanie has mentored several junior researchers and students, and is passionate about helping people from marginalized groups. Prior to Google, she was a research scientist at Allen Institute for Artificial Intelligence and before that, a postdoctoral fellow under the supervision of professor Anima Anandkumar. Hanie got her PhD from University of Southern California with a minor in mathematics and her Masters and Bachelors at Sharif University of Technology, Iran.
We study the phenomenon that some modules of deep neural networks (DNNs) are more critical than others, meaning that rewinding their parameter values back to initialization, while keeping other modules fixed at the trained parameters, results in a large drop in the network’s performance. Our analysis reveals interesting properties of the loss landscape which leads us to propose a complexity measure, called module criticality, based on the shape of the valleys that connect the initial and final values of the module parameters. We formulate how generalization relates to the module criticality, and show that this measure is able to explain the superior generalization performance of some architectures over others, whereas earlier measures fail to do so. I will also cover our recent results on extending this to the transfer learning setting, and how module criticality predicts which layers of the network play an important role for successful transfer.
Chiyuan Zhang is a research scientist at Google Research, Brain Team. He is interested in analyzing and understanding the foundations behind the effectiveness of deep learning, as well as its connection to the cognition and learning mechanisms of the human brain. He is also interested in future directions to break the data inefficiency bottleneck in most current deep learning algorithms. Chiyuan Zhang holds a Ph.D. from MIT (2017), and Bachelor's (2009) and Master's (2012) degrees in computer science from Zhejiang University, China. His work was recognized by the INTERSPEECH best student paper award in 2014, and the ICLR best paper award in 2017.
Deep learning algorithms are well known to have a propensity for fitting the training data very well and often fit even outliers and mislabeled data points. Such fitting requires memorization of training data labels, a phenomenon that has attracted significant research interest but has not been given a compelling explanation so far. A recent work of Feldman [Fel19] proposes a theoretical explanation for this phenomenon based on a combination of two insights. First, natural image and data distributions are (informally) known to be long-tailed, that is, they have a significant fraction of rare and atypical examples. Second, in a simple theoretical model such memorization is necessary for achieving close-to-optimal generalization error when the data distribution is long-tailed. However, no direct empirical evidence for this explanation, or even an approach for obtaining such evidence, was given.
In this work we design experiments to test the key ideas in this theory. The experiments require estimation of the influence of each training example on the accuracy at each test example as well as memorization values of training examples. Estimating these quantities directly is computationally prohibitive but we show that closely-related subsampled influence and memorization values can be estimated much more efficiently. Our experiments demonstrate the significant benefits of memorization for generalization on several standard benchmarks. They also provide quantitative and visually compelling evidence for the theory put forth in [Fel19].
We are a team that has stories of hope and regret, things to gloat and to rant, and altogether lots of opinions about grad school experiences! And we are happy to answer your questions.
Why did you go to grad school? Looking back, what would you have done differently? What are today’s grad school students/applicants facing, and how can they be better supported? We wish to touch upon all these during the panel, and address any other questions from the public via Slido 👇👇
Aditya Kusupati is a first-year CS PhD student at the University of Washington, jointly advised by Ali Farhadi and Sham Kakade. His broad research interests at the moment lie in the intersection of Machine Learning, Computer Vision and Robotics (Multimodal Perception, shh! it is a secret). He is currently a Research Scientist Intern at NVIDIA Toronto Lab working with Sanja Fidler and Antonio Torralba for the summer.
Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). We propose Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at https://github.com/RAIVNLab/STR.
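For readers unfamiliar with the soft-threshold operator mentioned in the abstract above, here is a minimal sketch in plain Python. It only shows the forward computation: weights whose magnitude falls below a threshold become exactly zero, and the rest shrink toward zero by that amount. Passing a per-layer parameter `s` through a sigmoid to obtain the threshold is an assumption made for illustration, and the names `soft_threshold`, `sparsity` and `layer_weights` are hypothetical. In STR the threshold is learned during training, which this sketch does not do; see the linked repository for the authors' implementation.

```python
import math


def sigmoid(s):
    """Map an unconstrained parameter to a threshold in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def soft_threshold(weights, threshold):
    """Apply sign(w) * max(|w| - threshold, 0) to every weight."""
    out = []
    for w in weights:
        magnitude = abs(w) - threshold
        # Weights inside the threshold band are set to exactly zero.
        out.append(math.copysign(magnitude, w) if magnitude > 0 else 0.0)
    return out


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy "layer" with a fixed threshold parameter (assumed value, not learned here).
layer_weights = [0.9, -0.05, 0.3, -0.6, 0.02]
s = -2.0
threshold = sigmoid(s)
pruned = soft_threshold(layer_weights, threshold)
print(pruned, sparsity(pruned))
```

Because each layer can carry its own `s`, different layers can end up with different thresholds, which is one way to read the "non-uniform sparsity budget" in the abstract.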
Jianyu Wang is a third-year PhD student at Carnegie Mellon University, advised by professor Gauri Joshi. He has worked at Facebook AI Research and Google Research as a summer intern. Previously, Jianyu received his B.Eng in Electronic Engineering from Tsinghua University in 2017. His awards and honors include the best student paper award at NeurIPS Federated Learning Workshop (2019), and Qualcomm innovation fellowship (2018).
Distributed optimization is essential for training large models on large datasets. Multiple approaches have been proposed to reduce the communication overhead in distributed training, such as synchronizing only after performing multiple local SGD steps, and decentralized methods (e.g., using gossip algorithms) to decouple communications among workers. Although these methods run faster than AllReduce-based methods, which use blocking communication before every update, the resulting models may be less accurate after the same number of updates. Inspired by the BMUF method of Chen & Huo (2016), we propose a slow momentum (SlowMo) framework, where workers periodically synchronize and perform a momentum update, after multiple iterations of a base optimization algorithm. Experiments on image classification and machine translation tasks demonstrate that SlowMo consistently yields improvements in optimization and generalization performance relative to the base optimizer, even when the additional overhead is amortized over many updates so that the SlowMo runtime is on par with that of the base optimizer. We provide theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses. Since BMUF is a particular instance of the SlowMo framework, our results also correspond to the first theoretical convergence guarantees for BMUF.
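The "periodically synchronize and perform a momentum update" idea can be sketched on a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation: workers are simulated serially, the base optimizer is plain SGD, each worker has a hypothetical one-dimensional quadratic loss, and the exact scaling of the slow momentum buffer (dividing the displacement by the base learning rate) is an assumption. All function names and hyperparameter values are made up for this example.

```python
def local_sgd(x, grad_fn, lr, steps):
    """Run `steps` plain SGD updates on one worker, starting from x."""
    for _ in range(steps):
        g = grad_fn(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def average(points):
    """Coordinate-wise average of the workers' parameters (the sync step)."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def slowmo_outer_step(x_start, x_avg, u, lr, slow_lr, beta):
    """Slow momentum update applied once per synchronization.

    The averaged displacement of the local steps is treated like a
    gradient: it is accumulated in the buffer u, then applied to x_start.
    """
    u_new = [beta * ui + (xs - xa) / lr for ui, xs, xa in zip(u, x_start, x_avg)]
    x_new = [xs - slow_lr * lr * un for xs, un in zip(x_start, u_new)]
    return x_new, u_new


# Two simulated workers with losses 0.5 * (x - c)^2 for different targets c.
targets = [[1.0], [3.0]]
grad_fns = [lambda x, c=c: [xi - ci for xi, ci in zip(x, c)] for c in targets]

x, u = [0.0], [0.0]
for _ in range(50):
    worker_results = [local_sgd(x, g, lr=0.1, steps=5) for g in grad_fns]
    x, u = slowmo_outer_step(x, average(worker_results), u,
                             lr=0.1, slow_lr=1.0, beta=0.5)
print(x)
```

With `beta=0.0` and `slow_lr=1.0` the outer step returns the plain average of the workers, so in this sketch the slow momentum acts as an extra layer on top of periodic averaging.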
I’m currently a joint research fellow at the Gatsby Computational Neuroscience Unit and Sainsbury Wellcome Centre at UCL where I am building models of visual processing. I did my PhD at the Center for Theoretical Neuroscience at Columbia University in the lab of Ken Miller and before that I got my bachelor’s in neuroscience from the University of Pittsburgh. In addition to doing computational neuroscience research, I am also writing a popular science book on the history of the field for Bloomsbury Sigma!
Attention has been studied in psychology for over a hundred years and studies that record from neurons have aimed to understand the physical underpinnings of attentional processes for several decades. More recently, attention mechanisms have been added to artificial neural networks to enhance their performance. In this talk, I will briefly overview the study of attention in these different domains, with a focus on visual attention. I will then describe my own work using findings from neurophysiology to add feature-based attention to convolutional neural networks (CNNs). CNNs are currently some of the best models available of the primate visual system and they allow neuroscientists to probe the relationship between neural activity and task performance “in silico”. I will share how studying attention in these models can lead to a rethinking of how biological attention works.
Mark van der Wilk is a lecturer (assistant professor) at Imperial College London. He is particularly interested in how to learn appropriate inductive biases from data, instead of hand-designing them, and thinks that results from Gaussian processes may contribute to a solution. The overall aim is to make models more adaptive and data-efficient, which can be used to improve decision making and reinforcement learning.
To improve generalisation in supervised learning, it is common to encourage invariance in the solution, i.e. keeping the output relatively constant to irrelevant transformations of the input. Many techniques can be seen as introducing invariance, such as data augmentation, convolutional structure, or more general group structure.
Piero Molino is a Senior Research Scientist at Uber AI (for one more week, aka till his birthday) with a focus on machine learning for language and dialogue. Piero completed a PhD on Question Answering at the University of Bari, Italy. He founded QuestionCube, a startup that built a framework for semantic search and QA. He worked for Yahoo Labs in Barcelona on learning to rank and for IBM Watson in New York on natural language processing with deep learning, and then joined Geometric Intelligence, where he worked on grounded language understanding. After Uber acquired Geometric Intelligence, he became one of the founding members of Uber AI Labs. At Uber he works on research topics including Dialogue Systems, Language Generation, Graph Representation Learning, Computer Vision, Reinforcement Learning and Meta Learning. He also worked on several deployed systems like COTA, an ML and NLP model for Customer Support, Dialogue Systems for driver hands-free dispatch, pickup and communications, and on the Uber Eats Recommender System with graph learning. He is the author of Ludwig, a code-free deep learning toolbox backed by the Linux Foundation.
In this talk I’ll propose a historical perspective that traces the origin of current self-supervision and word embedding trends in machine learning to the structuralist ideas proposed by Ferdinand de Saussure and Ludwig Wittgenstein in the early 20th century. I will also showcase several distributional semantic models (pre deep learning approaches to learn word representations) and connect them with more modern approaches up to recent self-supervised models for language, vision and graph structured data. The intent is that by showing the origins of these ideas the audience would be better equipped to both put the current self-supervision research in perspective with respect to the broader cultural context, and learn from past research as it contained deep insights that can help inform future directions for the field.
Sebastian Risi is an Associate Professor at the IT University of Copenhagen where he co-directs the Robotics, Evolution and Art Lab (REAL). He is currently the principal investigator of a Sapere Aude: DFF Starting Grant (Innate: Adaptive Machines for Industrial Automation). He has won several international scientific awards, including multiple best paper awards, the Distinguished Young Investigator in Artificial Life 2018 award, a Google Faculty Research Award in 2019, and an Amazon Research Award in 2020. Recently he co-founded modl.ai, a company that develops AIs that can accelerate game development and enhance player engagement. More information: sebastianrisi.com
In this talk, I review a new class of genotype-to-phenotype encodings, which are not manually defined but learned from the data itself. For example, we can train a GAN on Super Mario Bros levels, allowing levels to be evolved in the latent space of a GAN that maximize desired properties such as difficulty. When the GAN is trained on a specific target domain, it becomes a compact and robust genotype-to-phenotype mapping allowing for target-based evolution. This Latent Variable Evolution (LVE) approach can also be combined with interactive evolution, allowing users to breed their own video game levels and play those discovered levels. I’ll also present our latest results on CPPN2GAN, in which a Compositional Pattern Producing Network (CPPN) can define latent vector GAN inputs as a function of geometry, which provides a way to organize level segments output by a GAN into large-scale patterns. The benefit of these data-driven encodings is that they make it easy to explore the space of high-quality solutions, for both humans and optimization algorithms.
Rowan Zellers is a 4th year PhD student at the University of Washington, working with Yejin Choi and Ali Farhadi, studying natural language processing and computer vision.
There is a fundamental gap between how humans understand and use language — in open-ended, real-world situations — and today’s NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied.
Hattie is a data scientist and research hobbyist at Uber.
Human intelligence exhibits systematic compositionality (Fodor & Pylyshyn, 1988), the capacity to understand and produce a potentially infinite number of novel combinations of known components, i.e., to make “infinite use of finite means” (Chomsky, 1965). In the context of learning from a set of training examples, we can observe compositionality as compositional generalization, which we take to mean the ability to generalize to composed test examples from one distribution after being exposed to the necessary components during training on a different distribution.
|2020.04.17||Nikhil Dev Deshmudre||[AlphaGo], [AlphaGo Zero], [Alpha Zero], [MuZero] [Slides]|
Nikhil is an engineer at Uber ATG (Uber’s self driving group). He works on increasing the realism of simulators for self driving cars. His primary focus is on road actor behavior simulation and sensor simulation.
In this talk, I’ll trace the evolution of the main ideas in DeepMind’s Go-playing ML work you’ve surely heard of. We’ll start with the original model-free AlphaGo paper and work our way through to the recent model-based MuZero.
|2020.04.03||Alyssa Dayan||Mode-Adaptive Neural Networks for Quadruped Motion Control [Slides]|
|2020.03.27||Michela Paganini||Empirical Observations in Pruned Networks & Tools for Reproducible Pruning Research|
|2020.03.20||Rapha Gontijo Lopes||Affinity and Diversity: Quantifying Mechanisms of Data Augmentation [Slides] [Recording]|
|2020.03.13||Ian Thompson||A Good View Is All You Need: Deep InfoMax (DIM) and Augmented Multiscale Deep InfoMax (AMDIM) [Slides] [Recording]|
|2020.02.28||Ashley Edwards||Estimating Q(s, s’) with Deep Deterministic Dynamics Gradients [Slides]|
|2020.02.14||Xinchen Yan||Conditional generative modeling and adversarial learning|
|2020.02.07||Yaroslav Bulatov||einsum is all you need [Slides] [Recording]|
|2020.01.31||Rosanne Liu||Selective Brain Damage: Measuring the Disparate Impact of Model Pruning|
|2020.01.24||Jeff Coggshall||ReMixMatch and FixMatch|
|2020.01.17||Rosanne Liu||Improving sample diversity of a pre-trained, class-conditional GAN by changing its class embeddings [Slides] [Recording]|
|2020.01.10||Zhuoyuan Chen||Why Build an Assistant in Minecraft?|
|2019.11.22||Rosanne Liu||On the “steerability” of generative adversarial networks [Slides] [Recording]|
|2019.11.15||Polina Binder||Learning Deep Sigmoid Belief Networks with Data Augmentation|
|2019.11.08||Sanyam Kapoor||Policy Search & Planning: Unifying Connections |
|2019.11.01||Chris Olah||Zoom in: Features and circuits as the basic unit of neural networks|
|2019.10.25||Renjie Liao||Efficient Graph Generation with Graph Recurrent Attention Networks|
|2019.10.18||Nitish Shirish Keskar, Bryan McCann||CTRL: A Conditional Transformer Language Model for Controllable Generation|
|2019.10.11||Subutai Ahmad||Sparsity in the neocortex, and its implications for machine learning|
|2019.10.04||Eli Bingham||Multiple Causes: A Causal Graphical View|
|2019.09.27||Xinyu Hu||Learning Representations for Counterfactual Inference|
|2019.09.04||Jonathan Frankle||The Latest Updates on the Lottery Ticket Hypothesis|
|2019.08.23||Ankit Jain||Knowledge-aware Graph Neural Networks with Label Smoothness Regularization for Recommender Systems [Slides]|
|2019.08.16||Jiale Zhi||Meta-Learning Neural Bloom Filters|
|2019.08.16||Ted Moskovitz||Lookahead Optimizer: k steps forward, 1 step back|
|2019.07.26||Rui Wang||Off-Policy Evaluation for Contextual Bandits and RL |
|2019.07.19||Rosanne Liu||Weight Agnostic Neural Networks [Slides] [Recording]|
|2019.07.12||Joost Huizinga||A Distributional Perspective on Reinforcement Learning|
|2019.06.28||Ashley Edwards||[ICML Preview] Learning Values and Policies from Observation |
|2019.06.21||Stanislav Fořt||[ICML Preview] Large Scale Structure of Neural Network Loss Landscapes|
|2019.06.07||Joey Bose||[ICML Preview] Compositional Fairness Constraints for Graph Embeddings|
|2019.05.31||Yulun Li||IntentNet: Learning to Predict Intention from Raw Sensor Data|
|2019.05.24||Thomas Miconi, Rosanne Liu, Janice Lan||ICLR Recap, cont.|
|2019.05.17||Aditya Rawal, Jason Yosinski||ICLR Recap|
|2019.04.26||JP Chen||3D-Aware Scene Manipulation via Inverse Graphics [Slides]|
|2019.04.19||Felipe Petroski Such||Relational Deep Reinforcement Learning|
|2019.04.12||Piero Molino, Jason Yosinski||Open mic|
|2019.04.05||Joel Lehman||The copycat project: A model of mental fluidity and analogy-making|
|2019.03.29||Rosanne Liu||Non-local Neural Networks [Slides]|
|2019.03.22||Yariv Sadan||Learning deep representations by mutual information estimation and maximization [Slides]|
|2019.03.15||Chandra Khatri||Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation|
|2019.03.01||Nikhil Dev Deshmudre||BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding|
|2019.02.22||Vashisht Madhavan||Neural Turing Machines|
|2019.02.01||Jiale Zhi||Non-delusional Q-learning and value iteration|
|2019.01.25||Yulun Li||Relational Recurrent Neural Networks|
|2019.01.18||Rui Wang||Neural Ordinary Differential Equations|
|2019.01.11||Jonathan Simon||Generating Humorous Portmanteaus using Word Embeddings  [Slides]|
|2018.12.21||Christian Perez||Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles [Slides]|
|2018.12.14||Alexandros Papangelis||Two trends in dialog |
|2018.10.26||Aditya Rawal||Stochastic Weight Averaging [Slides]|
|2018.10.12||Mahdi Namazifar||Troubling Trends in Machine Learning Scholarship|
|2018.09.28||Yariv Sadan||MINE: Mutual Information Neural Estimation [Slides]|
|2018.09.21||Jan-Matthis Lueckmann||Glow and RealNVP [Slides]|
|2018.09.14||Jane Hung||The YOLO series: v1, v2, v3|
|2018.09.07||Rosanne Liu||Pooling is Neither Necessary nor Sufficient for Appropriate Deformation Stability in CNNs [Slides]|
|2018.08.31||Alican Bozkur||Multimodal Unsupervised Image-to-Image Translation [Slides]|
|2018.08.24||Janice Lan||The Lottery Ticket Hypothesis: Finding Small, Trainable Neural Networks [Slides]|
|2018.08.17||Yariv Sadan||Opening the black box of Deep Neural Networks via Information [Slides]|
|2018.08.10||Joost Huizinga||Learning to Reinforcement Learn, and RL 2: Fast Reinforcement Learning via Slow Reinforcement Learning [Slides]|
|2018.08.03||JP Chen||Deep Convolutional Inverse Graphics Network [Slides]|
|2018.07.27||Lei Shu||Attention is all you need [Slides]|
|2018.07.06||Neeraj Pradhan||Auto encoding Variational Bayes, and ELBO|
|2018.06.29||Ankit Jain||Dynamic Routing Between Capsules [Slides]|
|2018.06.22||Xinyu Hu||Self Normalizing Neural Networks [Slides]|
|2018.06.15||John Sears||The Decline and Fall of Adam:  [Slides]|
|2018.06.08||Alex Gajewski||GANs, etc.  [Slides]|
|2018.06.01||Jason Yosinski||Sensitivity and Generalization in Neural Networks: an Empirical Study [Slides]|
Congratulations! Now that you have scrolled all the way down here, you get the reward of reading more text about the scope and vision of this reading group.
Q: What was the initial idea of organizing a reading group like this?
A: It started with the rather selfish idea that I wanted to know about papers that I don’t have time to read, and learn about topics my individual intelligence limits me from fully understanding. Besides, I enjoy being around people that are smarter and more knowledgeable than me, faster than me working out twelve math equations on one slide, braver than me to ask stupid questions, and more patient than me answering them, as well as those who value great presentations as much as I do.
Q: How much work is it for you?
A: I never travel on Fridays now.
Q: Where do you see it going?
A: I envision building a community where people work hard to tell science stories well. Each paper is a story. A great paper, apart from solid results and technical and scientific advances, stands out particularly in the way it tells the story. I hope we all value storytelling and talk-giving slightly more than we do now. This ties to an eventual wish that scientific writing moves towards being lucid and understandable. This reading group is a start.
Here is how I see different levels of storytelling playing out in this group, in the format of a one-hour presentation.
You can give a Level 0 talk, which is going through someone else’s paper—the storyline is already there. This is perhaps the most basic and involves the least work: you just need to understand it and retell it to others. (I assume as a researcher you already read papers, and this additional work of making it into a presentation would only help you understand it better yourself.) And best of all, when the audience asks hard questions, you can just say “I don’t know—not my work.”
A Level 1 talk could mean presenting one of your own papers. The bar is higher because you are expected to know every detail of the project, but also lower because you probably already do. And good background coverage that leads to the exact problem and idea always helps.
Then we have Level 2 talks, which are usually a topic formed by understanding a field (however small it is) thoroughly, and having in mind a hierarchical chart or spiderweb of a number of fields leading to that particular one. You might be citing multiple papers, drawing connections and coming up with conclusions that are mainly your own.
Q: Do you have a high bar for talks given there?
A: Yes I do. But I also know we all have to start somewhere. And I myself was a horrible presenter not too long ago (likely still am). But we all get better.