Trapit Bansal

Publications

Click the titles for more information about the work and to access paper/talk/code/data.

Conferences

[New] Diverse Distributions of Self-Supervised Tasks for Meta-Learning in NLP. Trapit Bansal, Karthick Gunasekaran, Tong Wang, Tsendsuren Munkhdalai, Andrew McCallum. In Empirical Methods in Natural Language Processing (EMNLP), 2021.

Abstract
Meta-learning considers the problem of learning an efficient learning process that can leverage its past experience to accurately solve new tasks. However, the efficacy of meta-learning crucially depends on the distribution of tasks available for training, and this is often assumed to be known a priori or constructed from limited supervised datasets. In this work, we aim to provide task distributions for meta-learning by considering self-supervised tasks automatically proposed from unlabeled text, to enable large-scale meta-learning in NLP. We design multiple distributions of self-supervised tasks by considering important aspects of task diversity, difficulty, type, domain, and curriculum, and investigate how they affect meta-learning performance. Our analysis shows that all these factors meaningfully alter the task distribution, some inducing significant improvements in downstream few-shot accuracy of the meta-learned models. Empirically, results on 20 downstream tasks show significant improvements in few-shot learning -- adding up to +4.2% absolute accuracy (on average) to the previous unsupervised meta-learning method, and perform comparably to supervised methods on the FewRel 2.0 benchmark.

Paper, Code / Data(Coming Soon)
Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks. Trapit Bansal, Rishikesh Jha, Tsendsuren Munkhdalai, Andrew McCallum. In Empirical Methods in Natural Language Processing (EMNLP), 2020. (Oral)

Abstract
Self-supervised pre-training of transformer models has revolutionized NLP applications. Such pre-training with language modeling objectives provides a useful initial point for parameters that generalize well to new tasks with fine-tuning. However, fine-tuning is still data inefficient --- when there are few labeled examples, accuracy can be low. Data efficiency can be improved by optimizing pre-training directly for future fine-tuning with few examples; this can be treated as a meta-learning problem. However, standard meta-learning techniques require many training tasks in order to generalize; unfortunately, finding a diverse set of such supervised tasks is usually difficult. This paper proposes a self-supervised approach to generate a large, rich, meta-learning task distribution from unlabeled text. This is achieved using a cloze-style objective, but creating separate multi-class classification tasks by gathering tokens-to-be blanked from among only a handful of vocabulary terms. This yields as many unique meta-training tasks as the number of subsets of vocabulary terms. We meta-train a transformer model on this distribution of tasks using a recent meta-learning framework. On 17 NLP tasks, we show that this meta-training leads to better few-shot generalization than language-model pre-training followed by finetuning. Furthermore, we show how the self-supervised tasks can be combined with supervised tasks for meta-learning, providing substantial accuracy gains over previous supervised meta-learning.

Paper, Talk, Code
Learning to Few-Shot Learn Across Diverse Natural Language Classification Tasks. Trapit Bansal, Rishikesh Jha, Andrew McCallum. In International Conference on Computational Linguistics (COLING), 2020. (Oral)

Abstract
Pre-trained transformer models have shown enormous success in improving performance on several downstream tasks. However, fine-tuning on a new task still requires large amounts of task-specific labeled data to achieve good performance. We consider this problem of learning to generalize to new tasks with a few examples as a meta-learning problem. While meta-learning has shown tremendous progress in recent years, its application is still limited to simulated problems or problems with limited diversity across tasks. We develop a novel method, LEOPARD, which enables optimization-based meta-learning across tasks with different number of classes, and evaluate different methods on generalization to diverse NLP classification tasks. LEOPARD is trained with the state-of-the-art transformer architecture and shows better generalization to tasks not seen at all during training, with as few as 4 examples per label. Across 17 NLP tasks, including diverse domains of entity typing, natural language inference, sentiment analysis, and several other text classification tasks, we show that LEOPARD learns better initial parameters for few-shot learning than self-supervised pre-training or multi-task training, outperforming many strong baselines, for example, yielding 14.6% average relative gain in accuracy on unseen tasks with only 4 examples per label.

Paper, Code, Talk
Simultaneously Linking Entities and Extracting Relations from Biomedical Text without Mention-level Supervision. Trapit Bansal, Pat Verga, Neha Choudhary, Andrew McCallum. In Association for the Advancement of Artificial Intelligence (AAAI), 2020. (Oral)

Abstract
Understanding the meaning of text often involves reasoning about entities and their relationships. This requires identifying textual mentions of entities, linking them to a canonical concept, and discerning their relationships. These tasks are nearly always viewed as separate components within a pipeline, each requiring a distinct model and training data. While relation extraction can often be trained with readily available weak or distant supervision, entity linkers typically require expensive mention-level supervision -- which is not available in many domains. Instead, we propose a model which is trained to simultaneously produce entity linking and relation decisions while requiring no mention-level annotations. This approach avoids cascading errors that arise from pipelined methods and more accurately predicts entity relationships from text. We show that our model outperforms a state-of-the art entity linking and relation extraction pipeline on two biomedical datasets and can drastically improve the overall recall of the system.

Paper, Code (Coming Soon)
A2N: Attending to Neighbors for Knowledge Graph Inference. Trapit Bansal, Da-Cheng Juan, Sujith Ravi, Andrew McCallum. In Association for Computational Linguistics (ACL short), 2019. (Oral)

Abstract
State-of-the-art models for knowledge graph completion aim at learning a fixed embedding representation of entities in a multi-relational graph which can generalize to infer unseen entity relationships at test time. This can be sub-optimal as it requires memorizing and generalizing to all possible entity relationships using these fixed representations. We thus propose a novel attention-based method to learn query-dependent representation of entities which adaptively combines the relevant graph neighborhood of an entity leading to more accurate KG completion. The proposed method is evaluated on two benchmark datasets for knowledge graph completion, and experimental results show that the proposed model performs competitively or better than existing state-of-the-art, including recent methods for explicit multi-hop reasoning. Qualitative probing offers insight into how the model can reason about facts involving multiple hops in the knowledge graph, through the use of neighborhood attention.

Paper, Code, Talk
Emergent Complexity via Multi-Agent Competition. Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, Igor Mordatch. In International Conference on Learning Representations (ICLR), 2018.

Abstract

Reinforcement learning algorithms can train agents that solve problems in complex, interesting environments. Normally, the complexity of the trained agent is closely related to the complexity of the environment. This suggests that a highly capable agent requires a complex environment for training. In this paper, we point out that a competitive multi-agent environment trained with self-play can produce behaviors that are far more complex than the environment itself. We also point out that such environments come with a natural curriculum, because for any skill level, an environment full of agents of this level will have the right level of difficulty.
This work introduces several competitive multi-agent environments where agents compete in a 3D world with simulated physics. The trained agents learn a wide variety of complex and interesting skills, even though the environment themselves are relatively simple. The skills include behaviors such as running, blocking, ducking, tackling, fooling opponents, kicking, and defending using both arms and legs. A highlight of the learned behaviors can be found here.

Paper, Code, OpenAI Blog Post
Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, Pieter Abbeel. In International Conference on Learning Representations (ICLR), 2018. (Best Paper)

Abstract
Ability to continuously learn and adapt from limited experience in nonstationary environments is an important milestone on the path towards general intelligence. In this paper, we cast the problem of continuous adaptation into the learning-to-learn framework. We develop a simple gradient-based meta-learning algorithm suitable for adaptation in dynamically changing and adversarial scenarios. Additionally, we design a new multi-agent competitive environment, RoboSumo, and define iterated adaptation games for testing various aspects of continuous adaptation strategies. We demonstrate that meta-learning enables significantly more efficient adaptation than reactive baselines in the few-shot regime. Our experiments with a population of agents that learn and compete suggest that meta-learners are the fittest.

Paper, Code, OpenAI Blog Post
Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets. Nathan Greenberg, Trapit Bansal, Patrick Verga and Andrew McCallum. In Empirical Menthods in Natural Language Processing (EMNLP short), 2018. (Oral)

Abstract
Extracting typed entity mentions from text is a fundamental component to language understanding and reasoning. While there exist substantial labeled text datasets for multiple subsets of biomedical entity types---such as genes and proteins, or chemicals and diseases---it is rare to find large labeled datasets containing labels for all desired entity types together. This paper presents a method for training a single CRF extractor from multiple datasets with disjoint or partially overlapping sets of entity types. Our approach employs marginal likelihood training to insist on labels that are present in the data, while filling in ``missing labels''. This allows us to leverage all the available data within a single model. In experimental results on the Biocreative V CDR (chemicals/diseases), Biocreative VI ChemProt (chemicals/proteins) and MedMentions (19 entity types) datasets, we show that joint training on multiple datasets improves NER F1 over training in isolation, and our methods achieve state-of-the-art results.

Paper, Code
Ask the GRU: Multi-task Learning for Deep Text Recommendations. Trapit Bansal, David Belanger, Andrew McCallum. In ACM international conference on Recommender Systems (RecSys), 2016.
Abstract
In a variety of application domains the content to be recommended to users is associated with text. This includes research papers, movies with associated plot summaries, news articles, blog posts, etc. Recommendation approaches based on latent factor models can be extended naturally to leverage text by employing an explicit mapping from text to factors. This enables recommendations for new, unseen content, and may generalize better, since the factors for all items are produced by a compactly-parametrized model. Previous work has used topic models or averages of word embeddings for this mapping. In this paper we present a method leveraging deep recurrent neural networks to encode the text sequence into a latent vector, specifically gated recurrent units (GRUs) trained end-to-end on the collaborative filtering task. For the task of scientific paper recommendation, this yields models with significantly higher accuracy. In cold-start scenarios, we beat the previous state-of-the-art, all of which ignore word order. Performance is further improved by multi-task learning, where the text encoder network is trained for a combination of content recommendation and item metadata prediction. This regularizes the collaborative filtering model, ameliorating the problem of sparsity of the observed rating matrix.

Downloads: Paper
Content Driven User Profiling for Comment-Worthy Recommendations of News and Blog Articles. Trapit Bansal, Mrinal Das, Chiranjib Bhattacharyya. In ACM international conference on Recommender Systems (RecSys), 2015.
Abstract
We consider the problem of recommending comment-worthy articles such as news and blog-posts. An article is defined to be comment-worthy for a particular user if that user is interested to leave a comment on it. We note that recommending comment-worthy articles calls for elicitation of commenting-interests of the user from the content of both the articles and the past comments made by users. We thus propose to develop content-driven user profiles to elicit these latent interests of users in commenting and use them to recommend articles for future commenting. The difficulty of modeling comment content and the varied nature of users' commenting interests make the problem technically challenging.
The problem of recommending comment-worthy articles is resolved by leveraging article and comment content through topic modeling and the co-commenting pattern of users through collaborative filtering, combined within a novel hierarchical Bayesian modeling approach. Our solution, Collaborative Correspondence Topic Models (CCTM), generates user profiles which are leveraged to provide a personalized ranking of comment-worthy articles for each user. Through these content-driven user profiles, CCTM effectively handle the ubiquitous problem of cold-start without relying on additional meta-data. The inference problem for the model is intractable with no off-the-shelf solution and we develop an efficient Monte Carlo EM algorithm. CCTM is evaluated on three real world data-sets, crawled from two blogs, ArsTechnica (AT) Gadgets (102,087 comments) and AT-Science (71,640 comments), and a news site, DailyMail (33,500 comments). We show average improvement of 14% (warm-start) and 18% (cold-start) in AUC, and 80% (warm-start) and 250% (cold-start) in Hit-Rank@5, over state of the art.

Downloads: Paper
Ordered Stick-Breaking Prior for Sequential MCMC Inference of Bayesian Nonparametric Models. Mrinal Das, Trapit Bansal, Chiranjib Bhattacharyya. In International Conference on Machine Learning (ICML), 2015.
Abstract
This paper introduces a novel stick-breaking process namely ordered stick-breaking process (OSBP), where the atoms appear in order. The choice of weights on atoms of OSBP ensure two important things; (1) that probability of adding new atoms exponentially decrease with time and (2) OSBP, though non-exchangeable, admits pre- dictive probability functions (PPFs). We apply OSBP to Bayesian nonparametric (BNP) models and find that in a sequential setting where data is arriving in mini-batches OSBP forms a natural prior over mini-batches, facilitating exchange of relevant statistical information across mini-batches by sharing the atoms of OSBP. One of the major contributions of this paper is SUMO, an MCMC algorithm for solving the inference problem arising from applying OSBP to BNP models. SUMO uses the PPFs of OSBP to obtain a Gibbs sampling based truncation-free algorithm which applies generally to BNP models. For large scale inference problems existing algorithms such as Particle Filtering (PF) are not practical and variational procedures such as TSVI (Wang & Blei, 2012) are the only alternative. SUMO is thus an important addition to MCMC family which works well empirically. For Dirichlet process mixture model (DPMM), SUMO outperforms TSVI (Wang & Blei, 2012) on perplexity by 33% on 3 datasets with million data points, which are beyond the scope of PF, using only 3GB RAM.

Downloads: Paper, Supplementary
Relating Romanized Comments to News Articles by Inferring Multi-glyphic Topical Correspondence. Goutham Tholpadi, Mrinal Das, Trapit Bansal, Chiranjib Bhattacharyya. In AAAI conference on artificial intelligence (AAAI), 2015.

A Provable SVD-based Algorithm for Learning Topics in Dominant Admixture Corpus". Trapit Bansal, Chiranjib Bhattacharyya, Ravindran Kannan. In Neural Information Processing Systems (NeurIPS), 2014.

Abstract

Topic models, such as Latent Dirichlet Allocation (LDA), posit that documents are drawn from admixtures of distributions over words, known as topics. The inference problem of recovering topics from such a collection of documents drawn from admixtures, is NP-hard. Making a strong assumption called separability, Arora et. al. (2012) gave the first provable algorithm for inference. For the widely used LDA model, Anandkumar et. al. (2012) gave a provable algorithm using clever tensor-methods. But Arora et. al. (2012) and Anandkumar et. al. (2012) do not learn topic vectors with bounded \(l_1\) error (a natural measure for probability vectors).
Our aim is to develop a model which makes intuitive and empirically supported assumptions and to design an algorithm with natural, simple components such as SVD, which provably solves the inference problem for the model with bounded \(l_1\) error. A topic in LDA and other models is essentially characterized by a group of co-occurring words. Motivated by this, we introduce topic specific Catchwords, a group of words which occur with strictly greater frequency in a topic than any other topic individually and are required to have high frequency together rather than individually. A major contribution of the paper is to show that under this more realistic assumption, which is empirically verified on real corpora, a singular value decomposition (SVD) based algorithm with a crucial pre-processing step of thresholding, can provably recover the topics from a collection of documents drawn from Dominant admixtures. Dominant admixtures are convex combination of distributions in which one distribution has a significantly higher contribution than the others. Apart from the simplicity of the algorithm, the sample complexity has near optimal dependence on \(w_0\), the lowest probability that a topic is dominant, and is better than Arora et. al. (2012). Empirical evidence shows that on several real world corpora, both Catchwords and Dominant admixture assumptions hold and the proposed algorithm substantially outperforms the state of the art Arora et. al. (2013).

Downloads: Paper, Supplementary, Code

Going Beyond Corr-LDA for Detecting Specific Comments on News & Blogs. Mrinal Das, Trapit Bansal, Chiranjib Bhattacharyya. In ACM international conference on Web Search and Data Mining (WSDM), 2014.

Abstract

Understanding user generated comments in response to news and blog posts is an important area of research. After ignoring irrelevant comments, one finds that a large fraction, approximately 50%, of the comments are very specific and can be further related to certain parts of the article instead of the entire story. For example, in a recent product review of Google Nexus 7 in ArsTechnica (a popular blog), the reviewer talks about the prospect of “Retina equipped iPad mini” in a few sentences. It is interesting that although the article is on Nexus 7, but a significant number of comments are focused on this specific point regarding “iPad ”. We pose the problem of detecting such comments as specific comments location (SCL) problem. SCL is an important open problem with no prior work. SCL can be posed as a correspondence problem between comments and the parts of the relevant article, and one could potentially use Corr-LDA type models. Unfortunately, such models do not give satisfactory performance as they are restricted to using a single topic vector per article-comments pair. In this paper we propose to go beyond the single topic vector assumption and propose a novel correspondence topic model, namely SCTM, which admits multiple topic vectors (MTV) per article-comments pair. The resulting inference problem is quite complicated because of MTV and has no off-the-shelf solution. One of the major contributions of this paper is to show that using stick-breaking process as a prior over MTV, one can derive a collapsed Gibbs sampling procedure, which empirically works well for SCL.
SCTM is rigorously evaluated on three datasets, crawled from Yahoo! News (138,000 comments) and two blogs, ArsTechnica (AT) Science (90,000 comments) and AT-Gadget (160,000 comments). We observe that SCTM performs better than Corr-LDA, not only in terms of metrics like perplexity and topic coherence but also discovers more unique topics. We see that this immediately leads to an order of magnitude improvement in F1 score over Corr-LDA for SCL.

Downloads: Paper, Presentation, Code, Data

Workshops / Competitions

[New] Self-Supervised Learning on Multi-spectral Satellite Data for Near-Term Solar Forecasting. Akansha Singh Bansal, Trapit Bansal, David Irwin. In ICML Workshop on Tackling Climate Change with Machine Learning, 2021.

Abstract
With the unprecedented increase in the photo-voltaic (PV) installations across the globe, there is a thrust for reliable and accurate predictions of solar generation. While PV output is affected by a number of factors, atmospheric factors play a dominant role in deciding the amount of down-welling solar irradiance that directly effect solar generation at any site at any given time. This paper demonstrates that using self-supervised learning on multi-spectral satellite data from the new GOES-r range of satellites can improve near-term (15 minutes) solar forecasting. We develop deep auto-regressive models using convolutional neural networks (CNN) and long short-term memory networks (LSTM) that are trained globally across many sites of interest on the raw spatio-temporal satellite observations that are available in abundance, without requiring matched site-specific solar generation data that is scarcely available. This self-supervised model provides estimates of future solar irradiance that can be directly input to a regression model trained on smaller site-specific solar generation data to provide near-term forecasts of solar power at the site. The regression implicitly models site-specific characteristics such as installation capacity, panel tilt, orientation, etc, and the self-supervised CNN-LSTM implicitly captures global atmospheric patterns affecting the solar irradiation received at the site. We show this leads to accurate end-to-end solar forecasting and holds the potential to be useful for many other applications that depend on modeling such atmospheric patterns.

Paper, Code
[New] Simultaneously Self-Attending to Text and Entities for Knowledge-Informed Text Representations. Dung Thai*, Raghuveer Thirukovalluru*, Trapit Bansal*, Andrew McCallum. In 6^th ACL Workshop on Representation Learning for NLP (RepL4NLP), 2021.

Abstract
Pre-trained language models have emerged as highly successful methods for learning good text representations. However, the amount of structured knowledge retained in such models, and how (if at all) it can be extracted, remains an open question. In this work, we aim at directly learning text representations which leverage structured knowledge about entities mentioned in the text. This can be particularly beneficial for downstream tasks which are knowledge-intensive. Our approach utilizes self-attention between words in the text and knowledge graph (KG) entities mentioned in the text. While existing methods require entity-linked data for pre-training, we train using a mention-span masking objective and a candidate ranking objective – which doesn’t require any entity-links and only assumes access to an alias table for retrieving candidates, enabling large-scale pre-training. We show that the proposed model learns knowledgeinformed text representations that yield improvements on the downstream tasks over existing methods.

Paper, Code
Unsupervised Pre-training for Biomedical Question Answering. Vaishnavi Kommaraju*, Karthick Gunasekaran*, Kun Li*, Trapit Bansal, Andrew McCallum, Ivana Williams, Ana-Maria Istrate. In CLEF 8^th Workshop on Large-Scale Biomedical Semantic Indexing and Question Answering (BioASQ), 2020. (2^nd in BioASQ Competition 8B Phase B)

Abstract
We explore the suitability of unsupervised representation learning methods on biomedical text – BioBERT, SciBert, and BioSentVec – for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task from unlabeled data designed to reason about biomedical entities in the context. Our pre-training method consists of corrupting a given context by randomly replacing some mention of a biomedical entity with a random entity mention and then querying the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that helps QA tasks and minimizes the train-test mismatch between the pre-training task and the downstream QA tasks by requiring the model to predict spans. Our experiments show that pre-training BioBERT on the proposed pre-training task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.

Paper, Competition Leaderboard
RelNet: End-to-end Modeling of Entities and Relations. Trapit Bansal, Arvind Neelakantan, Andrew McCallum. In NeurIPS Workshop on Automated Knowledge Base Construction (AKBC), 2017.
Abstract
We introduce RelNet: a new model for relational reasoning. RelNet is a memory augmented neural network which models entities as abstract memory slots and is equipped with an additional relational memory which models relations between all memory pairs. The model thus builds an abstract knowledge graph on the entities and relations present in a document which can then be used to answer questions about the document. It is trained end-to-end: only supervision to the model is in the form of correct answers to the questions. We test the model on the 20 bAbI question-answering tasks with 10k examples per task and find that it solves all the tasks with a mean error of 0.3%, achieving 0% error on 11 of the 20 tasks.

Paper
Low-Rank Hidden State Embeddings for Viterbi Sequence Labeling. Dung Thai, Shikhar Murty, Trapit Bansal, Luke Vilnis, David Belanger, Andrew McCallum . In ICML Workshop on Deep Structured Prediction, 2017.
Abstract
In textual information extraction and other sequence labeling tasks it is now common to use recurrent neural networks (such as LSTM) to form rich embedded representations of long-term input co-occurrence patterns. Representation of output co-occurrence patterns is typically limited to a hand-designed graphical model, such as a linear-chain CRF representing short-term Markov dependencies among successive labels. This paper presents a method that learns embedded representations of latent output structure in sequence data. Our model takes the form of a finite-state machine with a large number of latent states per label (a latent variable CRF), where the state-transition matrix is factorized—effectively forming an embedded representation of statetransitions capable of enforcing long-term label dependencies, while supporting exact Viterbi inference over output labels. We demonstrate accuracy improvements and interpretable latent structure in a synthetic but complex task based on CoNLL named entity recognition.

Downloads: Paper

Hi, I'm Trapit !

Highlights

Work Experience

Publications

Conferences

Workshops / Competitions

Talks