# The NLP Index

Total repos: 7,091
hits:
time: ms
Added Title Abstract Authors Paper Graph Code
11/25/2021 XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale
This paper presents XLS-R, a large-scale model for cross-lingual speech representation learning based on wav2vec 2.0. We train models with up to 2B parameters on nearly half a million hours of publicly available speech audio in 128 languages, an order of magnitude more public data than the largest known prior work. Our evaluation covers a wide range of tasks, domains, data regimes and languages, both high and low-resource. On the CoVoST-2 speech translation benchmark, we improve the previous state of the art by an average of 7.4 BLEU over 21 translation directions into English. For speech recognition, XLS-R improves over the best known prior work on BABEL, MLS, CommonVoice as well as VoxPopuli, lowering error rates by 14-34% relative on average. XLS-R also sets a new state of the art on VoxLingua107 language identification. Moreover, we show that with sufficient model size, cross-lingual pretraining can outperform English-only pretraining when translating English speech into other languages, a setting which favors monolingual pretraining. We hope XLS-R can help to improve speech processing tasks for many more languages of the world.
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli
14639
Python
11/25/2021 DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model. We have pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its exceptional performance on a wide range of downstream natural language understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an example, the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure. Furthermore, we have pre-trained a multi-lingual model mDeBERTa and observed a larger improvement over strong baselines compared to English models. For example, the mDeBERTa Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6% improvement over XLM-R Base, creating a new SOTA on this benchmark. We have made our pre-trained models and inference code publicly available at this https URL.
Pengcheng He, Jianfeng Gao, Weizhu Chen
771
Python
11/25/2021 DataCLUE: A Benchmark Suite for Data-centric NLP
Data-centric AI has recently proven to be more effective and high-performance, while traditional model-centric AI delivers fewer and fewer benefits. It emphasizes improving the quality of datasets to achieve better model performance. This field has significant potential because of its great practicability and getting more and more attention. However, we have not seen significant research progress in this field, especially in NLP. We propose DataCLUE, which is the first Data-Centric benchmark applied in NLP field. We also provide three simple but effective baselines to foster research in this field (improve Macro-F1 up to 5.7% point). In addition, we conduct comprehensive experiments with human annotators and show the hardness of DataCLUE. We also try an advanced method: the forgetting informed bootstrapping label correction method. All the resources related to DataCLUE, including datasets, toolkit, leaderboard, and baselines, is available online at this https URL
Liang Xu, Jiacheng Liu, Xiang Pan, Xiaojing Lu, Xiaofeng Hou
90
Python
11/25/2021 RATE: Overcoming Noise and Sparsity of Textual Features in Real-Time Location Estimation
Real-time location inference of social media users is the fundamental of some spatial applications such as localized search and event detection. While tweet text is the most commonly used feature in location estimation, most of the prior works suffer from either the noise or the sparsity of textual features. In this paper, we aim to tackle these two problems. We use topic modeling as a building block to characterize the geographic topic variation and lexical variation so that "one-hot" encoding vectors will no longer be directly used. We also incorporate other features which can be extracted through the Twitter streaming API to overcome the noise problem. Experimental results show that our RATE algorithm outperforms several benchmark methods, both in the precision of region classification and the mean distance error of latitude and longitude regression.
Yu Zhang, Wei Wei, Binxuan Huang, Kathleen M. Carley, Yan Zhang
5
C++
11/25/2021 Interpreting Language Models Through Knowledge Graph Extraction
Transformer-based language models trained on large text corpora have enjoyed immense popularity in the natural language processing community and are commonly used as a starting point for downstream tasks. While these models are undeniably useful, it is a challenge to quantify their performance beyond traditional accuracy metrics. In this paper, we compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process. Structured relationships from training corpora may be uncovered through querying a masked language model with probing tasks. We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements at various stages of RoBERTa's early training. We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa). This work proposes a quantitative framework to compare language models through knowledge graph extraction (GED, Graph2Vec) and showcases a part-of-speech analysis (POSOR) to identify the linguistic strengths of each model variant. Using these metrics, machine learning practitioners can compare models, diagnose their models' behavioral strengths and weaknesses, and identify new targeted datasets to improve model performance.
Vinitra Swamy, Angelika Romanou, Martin Jaggi
4
Jupyter Notebook
11/25/2021 Integrated Semantic and Phonetic Post-correction for Chinese Speech Recognition
Due to the recent advances of natural language processing, several works have applied the pre-trained masked language model (MLM) of BERT to the post-correction of speech recognition. However, existing pre-trained models only consider the semantic correction while the phonetic features of words is neglected. The semantic-only post-correction will consequently decrease the performance since homophonic errors are fairly common in Chinese ASR. In this paper, we proposed a novel approach to collectively exploit the contextualized representation and the phonetic information between the error and its replacing candidates to alleviate the error rate of Chinese ASR. Our experiment results on real world speech recognition datasets showed that our proposed method has evidently lower CER than the baseline model, which utilized a pre-trained BERT MLM as the corrector.
Yi-Chang Chen, Chun-Yen Cheng, Chien-An Chen, Ming-Chieh Sung, Yi-Ren Yeh
4
Python
11/25/2021 You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
Transformer-based models are widely used in natural language processing (NLP). Central to the transformer model is the self-attention mechanism, which captures the interactions of token pairs in the input sequences and depends quadratically on the sequence length. Training such models on longer sequences is expensive. In this paper, we show that a Bernoulli sampling attention mechanism based on Locality Sensitive Hashing (LSH), decreases the quadratic complexity of such models to linear. We bypass the quadratic cost by considering self-attention as a sum of individual tokens associated with Bernoulli random variables that can, in principle, be sampled at once by a single hash (although in practice, this number may be a small constant). This leads to an efficient sampling scheme to estimate self-attention which relies on specific modifications of LSH (to enable deployment on GPU architectures). We evaluate our algorithm on the GLUE benchmark with standard 512 sequence length where we see favorable performance relative to a standard pretrained Transformer. On the Long Range Arena (LRA) benchmark, for evaluating performance on long sequences, our method achieves results consistent with softmax self-attention but with sizable speed-ups and memory savings and often outperforms other efficient self-attention methods. Our code is available at this https URL
Zhanpeng Zeng, Yunyang Xiong, Sathya N. Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh
4
Python
11/25/2021 RoBERTuito: a pre-trained language model for social media text in Spanish
Since BERT appeared, Transformer language models and transfer learning have become state-of-the-art for Natural Language Understanding tasks. Recently, some works geared towards pre-training, specially-crafted models for particular domains, such as scientific papers, medical documents, and others. In this work, we present RoBERTuito, a pre-trained language model for user-generated content in Spanish. We trained RoBERTuito on 500 million tweets in Spanish. Experiments on a benchmark of 4 tasks involving user-generated text showed that RoBERTuito outperformed other pre-trained language models for Spanish. In order to help further research, we make RoBERTuito publicly available at the HuggingFace model hub.
Juan Manuel Perez, Damian A. Furman, Laura Alonso Alemany, Franco Luque
3
Jupyter Notebook
11/25/2021 Patent Sentiment Analysis to Highlight Patent Paragraphs
Given a patent document, identifying distinct semantic annotations is an interesting research aspect. Text annotation helps the patent practitioners such as examiners and patent attorneys to quickly identify the key arguments of any invention, successively providing a timely marking of a patent text. In the process of manual patent analysis, to attain better readability, recognising the semantic information by marking paragraphs is in practice. This semantic annotation process is laborious and time-consuming. To alleviate such a problem, we proposed a novel dataset to train Machine Learning algorithms to automate the highlighting process. The contributions of this work are: i) we developed a multi-class, novel dataset of size 150k samples by traversing USPTO patents over a decade, ii) articulated statistics and distributions of data using imperative exploratory data analysis, iii) baseline Machine Learning models are developed to utilize the dataset to address patent paragraph highlighting task, iv) dataset and codes relating to this task are open-sourced through a dedicated GIT web page: this https URL and v) future path to extend this work using Deep Learning and domain specific pre-trained language models to develop a tool to highlight is provided. This work assist patent practitioners in highlighting semantic information automatically and aid to create a sustainable and efficient patent analysis using the aptitude of Machine Learning.
Renukswamy Chikkamath, Vishvapalsinhji Ramsinh Parmar, Christoph Hewel, Markus Endres
3
Jupyter Notebook
Cheng Hsu, Cheng-Te Li, Diego Saez-Trumper, Yi-Zhan Hsu
3
Python
11/25/2021 Speaker and Time-aware Joint Contextual Learning for Dialogue-act Classification in Counselling Conversations
The onset of the COVID-19 pandemic has brought the mental health of people under risk. Social counselling has gained remarkable significance in this environment. Unlike general goal-oriented dialogues, a conversation between a patient and a therapist is considerably implicit, though the objective of the conversation is quite apparent. In such a case, understanding the intent of the patient is imperative in providing effective counselling in therapy sessions, and the same applies to a dialogue system as well. In this work, we take forward a small but an important step in the development of an automated dialogue system for mental-health counselling. We develop a novel dataset, named HOPE, to provide a platform for the dialogue-act classification in counselling conversations. We identify the requirement of such conversation and propose twelve domain-specific dialogue-act (DAC) labels. We collect 12.9K utterances from publicly-available counselling session videos on YouTube, extract their transcripts, clean, and annotate them with DAC labels. Further, we propose SPARTA, a transformer-based architecture with a novel speaker- and time-aware contextual learning for the dialogue-act classification. Our evaluation shows convincing performance over several baselines, achieving state-of-the-art on HOPE. We also supplement our experiments with extensive empirical and qualitative analyses of SPARTA.
Ganeshan Malhotra, Abdul Waheed, Aseem Srivastava, Md Shad Akhtar, Tanmoy Chakraborty
2
Python
11/25/2021 Attention-based Multi-hypothesis Fusion for Speech Summarization
Speech summarization, which generates a text summary from speech, can be achieved by combining automatic speech recognition (ASR) and text summarization (TS). With this cascade approach, we can exploit state-of-the-art models and large training datasets for both subtasks, i.e., Transformer for ASR and Bidirectional Encoder Representations from Transformers (BERT) for TS. However, ASR errors directly affect the quality of the output summary in the cascade approach. We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary. We investigate several schemes to combine ASR hypotheses. First, we propose using the sum of sub-word embedding vectors weighted by their posterior values provided by an ASR system as an input to a BERT-based TS system. Then, we introduce a more general scheme that uses an attention-based fusion module added to a pre-trained BERT module to align and combine several ASR hypotheses. Finally, we perform speech summarization experiments on the How2 dataset and a newly assembled TED-based dataset that we will release with this paper. These experiments show that retraining the BERT-based TS system with these schemes can improve summarization performance and that the attention-based fusion module is particularly effective.
Takatomo Kano, Atsunori Ogawa, Marc Delcroix, Shinji Watanabe
2
Python
Alexander R. Fabbri, Xiaojian Wu, Srini Iyer, Haoran Li, Mona Diab
1
Python
11/25/2021 Improving the robustness and accuracy of biomedical language models through adversarial training
1
Python
11/25/2021 PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages
NLP applications for code-mixed (CM) or mix-lingual text have gained a significant momentum recently, the main reason being the prevalence of language mixing in social media communications in multi-lingual societies like India, Mexico, Europe, parts of USA etc. Word embeddings are basic build-ing blocks of any NLP system today, yet, word embedding for CM languages is an unexplored territory. The major bottleneck for CM word embeddings is switching points, where the language switches. These locations lack in contextually and statistical systems fail to model this phenomena due to high variance in the seen examples. In this paper we present our initial observations on applying switching point based positional encoding techniques for CM language, specifically Hinglish (Hindi - English). Results are only marginally better than SOTA, but it is evident that positional encoding could bean effective way to train position sensitive language models for CM text.
Mohsin Ali, Kandukuri Sai Teja, Sumanth Manduru, Parth Patwa, Amitava Das
1
Jupyter Notebook
11/25/2021 Seeking Common but Distinguishing Difference, A Joint Aspect-based Sentiment Analysis Model
Aspect-based sentiment analysis (ABSA) task consists of three typical subtasks: aspect term extraction, opinion term extraction, and sentiment polarity classification. These three subtasks are usually performed jointly to save resources and reduce the error propagation in the pipeline. However, most of the existing joint models only focus on the benefits of encoder sharing between subtasks but ignore the difference. Therefore, we propose a joint ABSA model, which not only enjoys the benefits of encoder sharing but also focuses on the difference to improve the effectiveness of the model. In detail, we introduce a dual-encoder design, in which a pair encoder especially focuses on candidate aspect-opinion pair classification, and the original encoder keeps attention on sequence labeling. Empirical results show that our proposed model shows robustness and significantly outperforms the previous state-of-the-art on four benchmark datasets.
Hongjiang Jing, Zuchao Li, Hai Zhao, Shu Jiang
0
11/25/2021 IIITT@Dravidian-CodeMix-FIRE2021: Transliterate or translate? Sentiment analysis of code-mixed text in Dravidian languages
Sentiment analysis of social media posts and comments for various marketing and emotional purposes is gaining recognition. With the increasing presence of code-mixed content in various native languages, there is a need for ardent research to produce promising results. This research paper bestows a tiny contribution to this research in the form of sentiment analysis of code-mixed social media comments in the popular Dravidian languages Kannada, Tamil and Malayalam. It describes the work for the shared task conducted by Dravidian-CodeMix at FIRE 2021 by employing pre-trained models like ULMFiT and multilingual BERT fine-tuned on the code-mixed dataset, transliteration (TRAI) of the same, English translations (TRAA) of the TRAI data and the combination of all the three. The results are recorded in this research paper where the best models stood 4th, 5th and 10th ranks in the Tamil, Kannada and Malayalam tasks respectively.
Karthik Puranik, Bharathi B, Senthil Kumar B
0
Jupyter Notebook
11/25/2021 Extracting and filtering paraphrases by bridging natural language inference and paraphrasing
Paraphrasing is a useful natural language processing task that can contribute to more diverse generated or translated texts. Natural language inference (NLI) and paraphrasing share some similarities and can benefit from a joint approach. We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets. Our approach is based on bidirectional entailment; namely, if two sentences can be mutually entailed, they are paraphrases. We evaluate our approach using several large pretrained transformer language models in the monolingual and cross-lingual setting. The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
Matej Klemen, Marko Robnik-Sikonja
0
Python
11/25/2021 Multi-Attribute Relation Extraction (MARE) -- Simplifying the Application of Relation Extraction
Natural language understanding's relation extraction makes innovative and encouraging novel business concepts possible and facilitates new digitilized decision-making processes. Current approaches allow the extraction of relations with a fixed number of entities as attributes. Extracting relations with an arbitrary amount of attributes requires complex systems and costly relation-trigger annotations to assist these systems. We introduce multi-attribute relation extraction (MARE) as an assumption-less problem formulation with two approaches, facilitating an explicit mapping from business use cases to the data annotations. Avoiding elaborated annotation constraints simplifies the application of relation extraction approaches. The evaluation compares our models to current state-of-the-art event extraction and binary relation extraction methods. Our approaches show improvement compared to these on the extraction of general multi-attribute relations.
Lars Kloser, Philipp Kohl, Bodo Kraft, Albert Zundorf
0
Python
11/25/2021 Memotion Analysis through the Lens of Joint Embedding
Joint embedding (JE) is a way to encode multi-modal data into a vector space where text remains as the grounding key and other modalities like image are to be anchored with such keys. Meme is typically an image with embedded text onto it. Although, memes are commonly used for fun, they could also be used to spread hate and fake information. That along with its growing ubiquity over several social platforms has caused automatic analysis of memes to become a widespread topic of research. In this paper, we report our initial experiments on Memotion Analysis problem through joint embeddings. Results are marginally yielding SOTA.
Nethra Gunti, Sathyanarayanan Ramamoorthy, Parth Patwa, Amitava Das
0
Jupyter Notebook
11/25/2021 SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
In the summarization domain, a key requirement for summaries is to be factually consistent with the input document. Previous work has found that natural language inference (NLI) models do not perform competitively when applied to inconsistency detection. In this work, we revisit the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level). We provide a highly effective and light-weight method called SummaCConv that enables NLI models to be successfully used for this task by segmenting documents into sentence units and aggregating scores between pairs of sentences. On our newly introduced benchmark called SummaC (Summary Consistency) consisting of six large inconsistency detection datasets, SummaCConv obtains state-of-the-art results with a balanced accuracy of 74.4%, a 5% point improvement compared to prior work. We make the models and datasets available: this https URL
Philippe Laban, Tobias Schnabel, Paul N. Bennett, Marti A. Hearst
0
Jupyter Notebook
11/25/2021 Character Transformations for Non-Autoregressive GEC Tagging
We propose a character-based nonautoregressive GEC approach, with automatically generated character transformations. Recently, per-word classification of correction edits has proven an efficient, parallelizable alternative to current encoder-decoder GEC systems. We show that word replacement edits may be suboptimal and lead to explosion of rules for spelling, diacritization and errors in morphologically rich languages, and propose a method for generating character transformations from GEC corpus. Finally, we train character transformation models for Czech, German and Russian, reaching solid results and dramatic speedup compared to autoregressive systems. The source code is released at this https URL.
Milan Straka, Jakub Naplava, Jana Strakova
0
Python
11/25/2021 Meeting Summarization with Pre-training and Clustering Methods
Automatic meeting summarization is becoming increasingly popular these days. The ability to automatically summarize meetings and to extract key information could greatly increase the efficiency of our work and life. In this paper, we experiment with different approaches to improve the performance of query-based meeting summarization. We started with HMNet\cite{hmnet}, a hierarchical network that employs both a word-level transformer and a turn-level transformer, as the baseline. We explore the effectiveness of pre-training the model with a large news-summarization dataset. We investigate adding the embeddings of queries as a part of the input vectors for query-based summarization. Furthermore, we experiment with extending the locate-then-summarize approach of QMSum\cite{qmsum} with an intermediate clustering step. Lastly, we compare the performance of our baseline models with BART, a state-of-the-art language model that is effective for summarization. We achieved improved performance by adding query embeddings to the input of the model, by using BART as an alternative language model, and by using clustering methods to extract key information at utterance level before feeding the text into summarization models.
Andras Huebner, Wei Ji, Xiang Xiao
0
Python
11/25/2021 Dynamic-TinyBERT: Boost TinyBERT's Inference Efficiency by Dynamic Sequence Length
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. TinyBERT addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding. However, TinyBERT's performance drops when we reduce the number of layers by 50%, and drops even more abruptly when we reduce the number of layers by 75% for advanced NLP tasks such as span question answering. Additionally, a separate model must be trained for each inference scenario with its distinct computational budget. In this work we present Dynamic-TinyBERT, a TinyBERT model that utilizes sequence-length reduction and Hyperparameter Optimization for enhanced inference efficiency per any computational budget. Dynamic-TinyBERT is trained only once, performing on-par with BERT and achieving an accuracy-speedup trade-off superior to any other efficient approaches (up to 3.3x with <1% loss-drop). Upon publication, the code to reproduce our work will be open-sourced.
Shira Guskin, Moshe Wasserblat, Ke Ding, Gyuwan Kim
n/a
11/25/2021 Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning
The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring speaking styles of an arbitrary source speaker to a target speaker's voice using very limited amount of neutral data. This is a very challenging task since the learning algorithm needs to deal with few-shot voice cloning and speaker-prosody disentanglement at the same time. Accelerating the adaptation process for a new target speaker is of importance in real-world applications, but even more challenging. In this paper, we approach to the hard fast few-shot style transfer for voice cloning task using meta learning. We investigate the model-agnostic meta-learning (MAML) algorithm and meta-transfer a pre-trained multi-speaker and multi-prosody base TTS model to be highly sensitive for adaptation with few samples. Domain adversarial training mechanism and orthogonal constraint are adopted to disentangle speaker and prosody representations for effective cross-speaker style transfer. Experimental results show that the proposed approach is able to conduct fast voice cloning using only 5 samples (around 12 second speech data) from a target speaker, with only 100 adaptation steps. Audio samples are available online.
Songxiang Liu, Dan Su, Dong Yu
11/18/2021 NLP From Scratch Without Large-Scale Pretraining: A Simple and Efficient Framework
Pretrained language models have become the standard approach for many NLP tasks due to strong performance, but they are very expensive to train. We propose a simple and efficient learning framework, TLM, that does not rely on large-scale pretraining. Given some labeled task data and a large general corpus, TLM uses task data as queries to retrieve a tiny subset of the general corpus and jointly optimizes the task objective and the language modeling objective from scratch. On eight classification datasets in four domains, TLM achieves results better than or similar to pretrained language models (e.g., RoBERTa-Large) while reducing the training FLOPs by two orders of magnitude. With high accuracy and efficiency, we hope TLM will contribute to democratizing NLP and expediting its development.
Xingcheng Yao, Yanan Zheng, Xiaocong Yang, Zhilin Yang
125
Python
11/18/2021 StyleCLIPDraw: Coupling Content and Style in Text-to-Drawing Synthesis
Generating images that fit a given text description using machine learning has improved greatly with the release of technologies such as the CLIP image-text encoder model; however, current methods lack artistic control of the style of image to be generated. We introduce StyleCLIPDraw which adds a style loss to the CLIPDraw text-to-drawing synthesis model to allow artistic control of the synthesized drawings in addition to control of the content via text. Whereas performing decoupled style transfer on a generated image only affects the texture, our proposed coupled approach is able to capture a style in both texture and shape, suggesting that the style of the drawing is coupled with the drawing process itself. More results and our code are available at this https URL
Peter Schaldenbrand, Zhixuan Liu, Jean Oh
110
Jupyter Notebook
Renrui Zhang, Rongyao Fang, Peng Gao, Wei Zhang, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li
22
11/18/2021 TaCL: Improving BERT Pre-training with Token-aware Contrastive Learning
Masked language models (MLMs) such as BERT and RoBERTa have revolutionized the field of Natural Language Understanding in the past few years. However, existing pre-trained MLMs often output an anisotropic distribution of token representations that occupies a narrow subset of the entire representation space. Such token representations are not ideal, especially for tasks that demand discriminative semantic meanings of distinct tokens. In this work, we propose TaCL (Token-aware Contrastive Learning), a novel continual pre-training approach that encourages BERT to learn an isotropic and discriminative distribution of token representations. TaCL is fully unsupervised and requires no additional data. We extensively test our approach on a wide range of English and Chinese benchmarks. The results show that TaCL brings consistent and notable improvements over the original BERT model. Furthermore, we conduct detailed analysis to reveal the merits and inner-workings of our approach.
Yixuan Su, Fangyu Liu, Zaiqiao Meng, Lei Shu, Ehsan Shareghi, Nigel Collier
24
Python
11/18/2021 Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
Sung-Feng Huang, Chyi-Jiunn Lin, Hung-yi Lee
-
18
Python
11/18/2021 MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only and without any annotated training document provided. Most existing approaches leverage textual information in each document. However, in many domains, documents are accompanied by various types of metadata (e.g., authors, venue, and year of a research paper). These metadata and their combinations may serve as strong category indicators in addition to textual contents. In this paper, we explore the potential of using metadata to help weakly supervised text classification. To be specific, we model the relationships between documents and metadata via a heterogeneous information network. To effectively capture higher-order structures in the network, we use motifs to describe metadata combinations. We propose a novel framework, named MotifClass, which (1) selects category-indicative motif instances, (2) retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances, and (3) trains a text classifier using the pseudo training data. Extensive experiments on real-world datasets demonstrate the superior performance of MotifClass to existing weakly supervised text classification approaches. Further analysis shows the benefit of considering higher-order metadata information in our framework.
Yu Zhang, Shweta Garg, Yu Meng, Xiusi Chen, Jiawei Han
-
8
Python
11/18/2021 The Curious Layperson: Fine-Grained Image Recognition without Expert Labels
Most of us are not experts in specific fields, such as ornithology. Nonetheless, we do have general image and language understanding capabilities that we use to match what we see to expert resources. This allows us to expand our knowledge and perform novel tasks without ad-hoc external supervision. On the contrary, machines have a much harder time consulting expert-curated knowledge bases unless trained specifically with that knowledge in mind. Thus, in this paper we consider a new problem: fine-grained image recognition without expert annotations, which we address by leveraging the vast knowledge available in web encyclopedias. First, we learn a model to describe the visual appearance of objects using non-expert image descriptions. We then train a fine-grained textual similarity model that matches image descriptions with documents on a sentence-level basis. We evaluate the method on two datasets and compare with several strong baselines and the state of the art in cross-modal retrieval. Code is available at: this https URL
Subhabrata Choudhury, Iro Laina, Christian Rupprecht, Andrea Vedaldi
-
7
11/18/2021 GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations
Code-switching is the communication phenomenon where speakers switch between different languages during a conversation. With the widespread adoption of conversational agents and chat platforms, code-switching has become an integral part of written conversations in many multi-lingual communities worldwide. This makes it essential to develop techniques for summarizing and understanding these conversations. Towards this objective, we introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset - GupShup, which contains over 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English. We present a detailed account of the entire data collection and annotation processes. We analyze the dataset using various code-switching statistics. We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation. Our results show that multi-lingual mBART and multi-view seq2seq models obtain the best performances on the new dataset
Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi, Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle Lee, Anish Acharya, Rajiv Ratn Shah
7
Python
11/18/2021 Poisoning Knowledge Graph Embeddings via Relation Inference Patterns
We study the problem of generating data poisoning attacks against Knowledge Graph Embedding (KGE) models for the task of link prediction in knowledge graphs. To poison KGE models, we propose to exploit their inductive abilities which are captured through the relationship patterns like symmetry, inversion and composition in the knowledge graph. Specifically, to degrade the model's prediction confidence on target facts, we propose to improve the model's prediction confidence on a set of decoy facts. Thus, we craft adversarial additions that can improve the model's prediction confidence on decoy facts through different inference patterns. Our experiments demonstrate that the proposed poisoning attacks outperform state-of-art baselines on four KGE models for two publicly available datasets. We also find that the symmetry pattern based attacks generalize across all model-dataset combinations which indicates the sensitivity of KGE models to this pattern.
Peru Bhardwaj, John Kelleher, Luca Costabello, Declan O'Sullivan
-
6
Python
11/18/2021 Variance-Aware Machine Translation Test Sets
We release 70 small and discriminative test sets for machine translation (MT) evaluation called variance-aware test sets (VAT), covering 35 translation directions from WMT16 to WMT20 competitions. VAT is automatically created by a novel variance-aware filtering method that filters the indiscriminative test instances of the current MT test sets without any human labor. Experimental results show that VAT outperforms the original WMT test sets in terms of the correlation with human judgement across mainstream language pairs and test sets. Further analysis on the properties of VAT reveals the challenging linguistic features (e.g., translation of low-frequency words and proper nouns) for competitive MT systems, providing guidance for constructing future MT test sets. The test sets and the code for preparing variance-aware MT test sets are freely available at this https URL .
Runzhe Zhan, Xuebo Liu, Derek F. Wong, Lidia S. Chao
-
5
Ruby
11/18/2021 Learning Logic Rules for Document-level Relation Extraction
Document-level relation extraction aims to identify relations between entities in a whole document. Prior efforts to capture long-range dependencies have relied heavily on implicitly powerful representations learned through (graph) neural networks, which makes the model less transparent. To tackle this challenge, in this paper, we propose LogiRE, a novel probabilistic model for document-level relation extraction by learning logic rules. LogiRE treats logic rules as latent variables and consists of two modules: a rule generator and a relation extractor. The rule generator is to generate logic rules potentially contributing to final predictions, and the relation extractor outputs final predictions based on the generated logic rules. Those two modules can be efficiently optimized with the expectation-maximization (EM) algorithm. By introducing logic rules into neural networks, LogiRE can explicitly capture long-range dependencies as well as enjoy better interpretation. Empirical results show that LogiRE significantly outperforms several strong baselines in terms of relation performance (1.8 F1 score) and logical consistency (over 3.3 logic score). Our code is available at this https URL.
Dongyu Ru, Changzhi Sun, Jiangtao Feng, Lin Qiu, Hao Zhou, Weinan Zhang, Yong Yu, Lei Li
10
11/18/2021 PerSpeechNorm: A Persian Toolkit for Speech Processing Normalization
In general, speech processing models consist of a language model along with an acoustic model. Regardless of the language model's complexity and variants, three critical pre-processing steps are needed in language models: cleaning, normalization, and tokenization. Among mentioned steps, the normalization step is so essential to format unification in pure textual applications. However, for embedded language models in speech processing modules, normalization is not limited to format unification. Moreover, it has to convert each readable symbol, number, etc., to how they are pronounced. To the best of our knowledge, there is no Persian normalization toolkits for embedded language models in speech processing modules, So in this paper, we propose an open-source normalization toolkit for text processing in speech applications. Briefly, we consider different readable Persian text like symbols (common currencies, #, @, URL, etc.), numbers (date, time, phone number, national code, etc.), and so on. Comparison with other available Persian textual normalization tools indicates the superiority of the proposed method in speech processing. Also, comparing the model's performance for one of the proposed functions (sentence separation) with other common natural language libraries such as HAZM and Parsivar indicates the proper performance of the proposed method. Besides, its evaluation of some Persian Wikipedia data confirms the proper performance of the proposed method.
5
Python
Peru Bhardwaj, John Kelleher, Luca Costabello, Declan O'Sullivan
3
Python
11/18/2021 JaMIE: A Pipeline Japanese Medical Information Extraction System
We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analyzing performance and suggest the satisfactory annotation quality, the effective annotation strategy for targeting report types, and the superiority of the latest contextual embedding models.
2
Python
11/18/2021 Machine-in-the-Loop Rewriting for Creative Image Captioning
Machine-in-the-loop writing aims to enable humans to collaborate with models to complete their writing tasks more effectively. Prior work has found that providing humans a machine-written draft or sentence-level continuations has limited success since the generated text tends to deviate from humans' intention. To allow the user to retain control over the content, we train a rewriting model that, when prompted, modifies specified spans of text within the user's original draft to introduce descriptive and figurative elements locally in the text. We evaluate the model on its ability to collaborate with humans on the task of creative image captioning. On a user study through Amazon Mechanical Turk, our model is rated to be more helpful than a baseline infilling language model. In addition, third-party evaluation shows that users write more descriptive and figurative captions when collaborating with our model compared to completing the task alone.
2
Python
11/18/2021 Focusing on Possible Named Entities in Active Named Entity Label Acquisition
Named entity recognition (NER) aims to identify mentions of named entities in an unstructured text and classify them into the predefined named entity classes. Even though deep learning-based pre-trained language models achieve good predictive performances, many domain-specific NERtasks still require a sufficient amount of labeled data. Active learning (AL), a general framework for the label acquisition problem, has been used for the NER tasks to minimize the annotation cost without sacrificing model performance. However, heavily imbalanced class distribution of tokens introduces challenges in designing effective AL querying methods for NER. We propose AL sentence query evaluation functions which pay more attention to possible positive tokens, and evaluate these proposed functions with both sentence-based and token-based cost evaluation strategies. We also propose a better data-driven normalization approach to penalize too long or too short sentences. Our experiments on three datasets from different domains reveal that the proposed approaches reduce the number of annotated tokens while achieving better or comparable prediction performance with conventional methods.
Ali Osman Berk Sapci, Oznur Tastan, Reyyan Yeniterzi
-
1
Python
11/18/2021 Training Cross-Lingual embeddings for Setswana and Sepedi
African languages still lag in the advances of Natural Language Processing techniques, one reason being the lack of representative data, having a technique that can transfer information between languages can help mitigate against the lack of data problem. This paper trains Setswana and Sepedi monolingual word vectors and uses VecMap to create cross-lingual embeddings for Setswana-Sepedi in order to do a cross-lingual transfer. Word embeddings are word vectors that represent words as continuous floating numbers where semantically similar words are mapped to nearby points in n-dimensional space. The idea of word embeddings is based on the distribution hypothesis that states, semantically similar words are distributed in similar contexts (Harris, 1954). Cross-lingual embeddings leverages monolingual embeddings by learning a shared vector space for two separately trained monolingual vectors such that words with similar meaning are represented by similar vectors. In this paper, we investigate cross-lingual embeddings for Setswana-Sepedi monolingual word vector. We use the unsupervised cross lingual embeddings in VecMap to train the Setswana-Sepedi cross-language word embeddings. We evaluate the quality of the Setswana-Sepedi cross-lingual word representation using a semantic evaluation task. For the semantic similarity task, we translated the WordSim and SimLex tasks into Setswana and Sepedi. We release this dataset as part of this work for other researchers. We evaluate the intrinsic quality of the embeddings to determine if there is improvement in the semantic representation of the word embeddings.
Mack Makgatho, Vukosi Marivate, Tshephisho Sefara, Valencia Wagner
1
11/18/2021 LILA: Language-Informed Latent Actions
We introduce Language-Informed Latent Actions (LILA), a framework for learning natural language interfaces in the context of human-robot collaboration. LILA falls under the shared autonomy paradigm: in addition to providing discrete language inputs, humans are given a low-dimensional controller $-$ e.g., a 2 degree-of-freedom (DoF) joystick that can move left/right and up/down $-$ for operating the robot. LILA learns to use language to modulate this controller, providing users with a language-informed control space: given an instruction like "place the cereal bowl on the tray," LILA may learn a 2-DoF space where one dimension controls the distance from the robot's end-effector to the bowl, and the other dimension controls the robot's end-effector pose relative to the grasp point on the bowl. We evaluate LILA with real-world user studies, where users can provide a language instruction while operating a 7-DoF Franka Emika Panda Arm to complete a series of complex manipulation tasks. We show that LILA models are not only more sample efficient and performant than imitation learning and end-effector control baselines, but that they are also qualitatively preferred by users.
Siddharth Karamcheti, Megha Srivastava, Percy Liang, Dorsa Sadigh
-
1
Python
11/18/2021 Distinguishing Commercial from Editorial Content in News
How can we distinguish commercial from editorial content in news, or more specifically, differentiate between advertorials and regular news articles? An advertorial is a commercial message written and formatted as an article, making it harder for readers to recognize these as advertising, despite the use of disclaimers. In our research we aim to differentiate the two using a machine learning model, and a lexicon derived from it. This was accomplished by scraping 1.000 articles and 1.000 advertorials from four different Dutch news sources and classifying these based on textual features. With this setup our most successful machine learning model had an accuracy of just over $90\%$. To generate additional insights into differences between news and advertorial language, we also analyzed model coefficients and explored the corpus through co-occurrence networks and t-SNE graphs.
Timo Kats, Peter van der Putten, Jasper Schelling
1
Python
11/18/2021 Sexism Prediction in Spanish and English Tweets Using Monolingual and Multilingual BERT and Ensemble Models
The popularity of social media has created problems such as hate speech and sexism. The identification and classification of sexism in social media are very relevant tasks, as they would allow building a healthier social environment. Nevertheless, these tasks are considerably challenging. This work proposes a system to use multilingual and monolingual BERT and data points translation and ensemble strategies for sexism identification and classification in English and Spanish. It was conducted in the context of the sEXism Identification in Social neTworks shared 2021 (EXIST 2021) task, proposed by the Iberian Languages Evaluation Forum (IberLEF). The proposed system and its main components are described, and an in-depth hyperparameters analysis is conducted. The main results observed were: (i) the system obtained better results than the baseline model (multilingual BERT); (ii) ensemble models obtained better results than monolingual models; and (iii) an ensemble model considering all individual models and the best standardized values obtained the best accuracies and F1-scores for both tasks. This work obtained first place in both tasks at EXIST, with the highest accuracies (0.780 for task 1 and 0.658 for task 2) and F1-scores (F1-binary of 0.780 for task 1 and F1-macro of 0.579 for task 2).
Angel Felipe Magnosso de Paula, Roberto Fray da Silva, Ipek Baris Schlicht
0
Jupyter Notebook
11/18/2021 AI-UPV at IberLEF-2021 DETOXIS task: Toxicity Detection in Immigration-Related Web News Comments Using Transformers and Statistical Models
Angel Felipe Magnoss de Paula, Ipek Baris Schlicht
-
0
Jupyter Notebook
11/18/2021 Topic Modeling, Clade-assisted Sentiment Analysis, and Vaccine Brand Reputation Analysis of COVID-19 Vaccine-related Facebook Comments in the Philippines
Vaccine hesitancy and other COVID-19-related concerns and complaints in the Philippines are evident on social media. It is important to identify these different topics and sentiments in order to gauge public opinion, use the insights to develop policies, and make necessary adjustments or actions to improve public image and reputation of the administering agency and the COVID-19 vaccines themselves. This paper proposes a semi-supervised machine learning pipeline to perform topic modeling, sentiment analysis, and an analysis of vaccine brand reputation to obtain an in-depth understanding of national public opinion of Filipinos on Facebook. The methodology makes use of a multilingual version of Bidirectional Encoder Representations from Transformers or BERT for topic modeling, hierarchical clustering, five different classifiers for sentiment analysis, and cosine similarity of BERT topic embeddings for vaccine brand reputation analysis. Results suggest that any type of COVID-19 misinformation is an emergent property of COVID-19 public opinion, and that the detection of COVID-19 misinformation can be an unsupervised task. Sentiment analysis aided by hierarchical clustering reveal that 21 of the 25 topics extrapolated by topic modeling are negative topics. Such negative comments spike in count whenever the Department of Health in the Philippines posts about the COVID-19 situation in other countries. Additionally, the high numbers of laugh reactions on the Facebook posts by the same agency -- without any humorous content -- suggest that the reactors of these posts tend to react the way they do, not because of what the posts are about but because of who posted them.
Jasper Kyle Catapang, Jerome V. Cleofas
0
Jupyter Notebook
11/18/2021 Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval
Matching model is essential for Image-Text Retrieval framework. Existing research usually train the model with a triplet loss and explore various strategy to retrieve hard negative sentences in the dataset. We argue that current retrieval-based negative sample construction approach is limited in the scale of the dataset thus fail to identify negative sample of high difficulty for every image. We propose our TAiloring neGative Sentences with Discrimination and Correction (TAGS-DC) to generate synthetic sentences automatically as negative samples. TAGS-DC is composed of masking and refilling to generate synthetic negative sentences with higher difficulty. To keep the difficulty during training, we mutually improve the retrieval and generation through parameter sharing. To further utilize fine-grained semantic of mismatch in the negative sentence, we propose two auxiliary tasks, namely word discrimination and word correction to improve the training. In experiments, we verify the effectiveness of our model on MS-COCO and Flickr30K compared with current state-of-the-art models and demonstrates its robustness and faithfulness in the further analysis. Our code is available in this https URL.
Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Jianqing Fan
0
Python
11/18/2021 BagBERT: BERT-based bagging-stacking for multi-topic classification
This paper describes our submission on the COVID-19 literature annotation task at Biocreative VII. We proposed an approach that exploits the knowledge of the globally non-optimal weights, usually rejected, to build a rich representation of each label. Our proposed approach consists of two stages: (1) A bagging of various initializations of the training data that features weakly trained weights, (2) A stacking of heterogeneous vocabulary models based on BERT and RoBERTa Embeddings. The aggregation of these weak insights performs better than a classical globally efficient model. The purpose is the distillation of the richness of knowledge to a simpler and lighter model. Our system obtains an Instance-based F1 of 92.96 and a Label-based micro-F1 of 91.35.
Loic Rakotoson, Charles Letaillieur, Sylvain Massip, Frejus Laleye
-
0
11/18/2021 Finnish Dialect Identification: The Effect of Audio and Text
Finnish is a language with multiple dialects that not only differ from each other in terms of accent (pronunciation) but also in terms of morphological forms and lexical choice. We present the first approach to automatically detect the dialect of a speaker based on a dialect transcript and transcript with audio recording in a dataset consisting of 23 different dialects. Our results show that the best accuracy is received by combining both of the modalities, as text only reaches to an overall accuracy of 57\%, where as text and audio reach to 85\%. Our code, models and data have been released openly on Github and Zenodo.
Mika Hamalainen, Khalid Alnajjar, Niko Partanen, Jack Rueter
0
11/18/2021 Learning to Generalize Compositionally by Transferring Across Semantic Parsing Tasks
Neural network models often generalize poorly to mismatched domains or distributions. In NLP, this issue arises in particular when models are expected to generalize compositionally, that is, to novel combinations of familiar words and constructions. We investigate learning representations that facilitate transfer learning from one compositional task to another: the representation and the task-specific layers of the models are strategically trained differently on a pre-finetuning task such that they generalize well on mismatched splits that require compositionality. We apply this method to semantic parsing, using three very different datasets, COGS, GeoQuery and SCAN, used alternately as the pre-finetuning and target task. Our method significantly improves compositional generalization over baselines on the test set of the target task, which is held out during fine-tuning. Ablation studies characterize the utility of the major steps in the proposed algorithm and support our hypothesis.
Wang Zhu, Peter Shaw, Tal Linzen, Fei Sha
0
11/11/2021 OpenPrompt: An Open-source Framework for Prompt-learning
Prompt-learning has become a new paradigm in modern natural language processing, which directly adapts pre-trained language models (PLMs) to $cloze$-style prediction, autoregressive modeling, or sequence to sequence generation, resulting in promising performances on various tasks. However, no standard implementation framework of prompt-learning is proposed yet, and most existing prompt-learning codebases, often unregulated, only provide limited implementations for specific scenarios. Since there are many details such as templating strategy, initializing strategy, and verbalizing strategy, etc. need to be considered in prompt-learning, practitioners face impediments to quickly adapting the desired prompt learning methods to their applications. In this paper, we present {OpenPrompt}, a unified easy-to-use toolkit to conduct prompt-learning over PLMs. OpenPrompt is a research-friendly framework that is equipped with efficiency, modularity, and extendibility, and its combinability allows the freedom to combine different PLMs, task formats, and prompting modules in a unified paradigm. Users could expediently deploy prompt-learning frameworks and evaluate the generalization of them on different NLP tasks without constraints. OpenPrompt is publicly released at {\url{ this https URL}}.
Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun
852
Python
11/11/2021 LegalNLP -- Natural Language Processing methods for the Brazilian Legal Language
We present and make available pre-trained language models (Phraser, Word2Vec, Doc2Vec, FastText, and BERT) for the Brazilian legal language, a Python package with functions to facilitate their use, and a set of demonstrations/tutorials containing some applications involving them. Given that our material is built upon legal texts coming from several Brazilian courts, this initiative is extremely helpful for the Brazilian legal field, which lacks other open and specific tools and language models. Our main objective is to catalyze the use of natural language processing tools for legal texts analysis by the Brazilian industry, government, and academia, providing the necessary tools and accessible material.
Felipe Maia Polo, Gabriel Caiaffa Floriano Mendonca, Kaue Capellato J. Parreira, Lucka Gianvechio, Peterson Cordeiro, Jonathan Batista Ferreira, Leticia Maria Paz de Lima, Antonio Carlos do Amaral Maia, Renato Vicente
63
Python
11/11/2021 MetaICL: Learning to Learn In Context
Sewon Min, Mike Lewis, Luke Zettlemoyer, Hannaneh Hajishirzi
-
57
Python
11/11/2021 Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically-aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically-aware decoding essential for achieving consistent improvements. Data and code are available at this https URL.
Shruti Rijhwani, Daisy Rosenblum, Antonios Anastasopoulos, Graham Neubig
43
Python
11/11/2021 Skyformer: Remodel Self-Attention with Gaussian Kernel and Nystrom Method
Transformers are expensive to train due to the quadratic time and space complexity in the self-attention mechanism. On the other hand, although kernel machines suffer from the same computation bottleneck in pairwise dot products, several approximation schemes have been successfully incorporated to considerably reduce their computational cost without sacrificing too much accuracy. In this work, we leverage the computation methods for kernel machines to alleviate the high computational cost and introduce Skyformer, which replaces the softmax structure with a Gaussian kernel to stabilize the model training and adapts the Nystrï¿½m method to a non-positive semidefinite matrix to accelerate the computation. We further conduct theoretical analysis by showing that the matrix approximation error of our proposed method is small in the spectral norm. Experiments on Long Range Arena benchmark show that the proposed method is sufficient in getting comparable or even better performance than the full self-attention while requiring fewer computation resources.
Yifan Chen, Qi Zeng, Heng Ji, Yun Yang
32
Python
11/11/2021 SP-GPT2: Semantics Improvement in Vietnamese Poetry Generation
Automatic text generation has garnered growing attention in recent years as an essential step towards computer creativity. Generative Pretraining Transformer 2 (GPT2) is one of the state of the art approaches that have excellent successes. In this paper, we took the first step to investigate the power of GPT2 in traditional Vietnamese poetry generation. In the earlier time, our experiment with base GPT2 was quite good at generating the poem in the proper template. Though it can learn the patterns, including rhyme and tone rules, from the training data, like almost all other text generation approaches, the poems generated still has a topic drift and semantic inconsistency. To improve the cohesion within the poems, we proposed a new model SP-GPT2 (semantic poem GPT2) which was built on the top GPT2 model and an additional loss to constrain context throughout the entire poem. For better evaluation, we examined the methods by both automatic quantitative evaluation and human evaluation. Both automatic and human evaluation demonstrated that our approach can generate poems that have better cohesion without losing the quality due to additional loss. At the same time, we are the pioneers of this topic. We released the first computational scoring module for poems generated in the template containing the style rule dictionary. Additionally, we are the first to publish a Luc-Bat dataset, including 87609 Luc Bat poems, which is equivalent to about 2.6 million sentences, combined with about 83579 poems in other styles was also published for further exploration. The code is available at this https URL
Tuan Nguyen, Hanh Pham, Truong Bui, Tan Nguyen, Duc Luong, Phong Nguyen
20
Python
11/11/2021 LMdiff: A Visual Diff Tool to Compare Language Models
While different language models are ubiquitous in NLP, it is hard to contrast their outputs and identify which contexts one can handle better than the other. To address this question, we introduce LMdiff, a tool that visually compares probability distributions of two models that differ, e.g., through finetuning, distillation, or simply training with different parameter sizes. LMdiff allows the generation of hypotheses about model behavior by investigating text instances token by token and further assists in choosing these interesting text instances by identifying the most interesting phrases from large corpora. We showcase the applicability of LMdiff for hypothesis generation across multiple case studies. A demo is available at this http URL .
Hendrik Strobelt, Benjamin Hoover, Arvind Satyanarayan, Sebastian Gehrmann
16
Python
11/11/2021 Distilling Relation Embeddings from Pre-trained Language Models
Pre-trained language models have been found to capture a surprisingly rich amount of lexical knowledge, ranging from commonsense properties of everyday concepts to detailed factual knowledge about named entities. Among others, this makes it possible to distill high-quality word vectors from pre-trained language models. However, it is currently unclear to what extent it is possible to distill relation embeddings, i.e. vectors that characterize the relationship between two words. Such relation embeddings are appealing because they can, in principle, encode relational knowledge in a more fine-grained way than is possible with knowledge graphs. To obtain relation embeddings from a pre-trained language model, we encode word pairs using a (manually or automatically generated) prompt, and we fine-tune the language model such that relationally similar word pairs yield similar output vectors. We find that the resulting relation embeddings are highly competitive on analogy (unsupervised) and relation classification (supervised) benchmarks, even without any task-specific fine-tuning. Source code to reproduce our experimental results and the model checkpoints are available in the following repository: this https URL
Asahi Ushio, Jose Camacho-Collados, Steven Schockaert
12
Python
11/11/2021 A transfer learning based approach for pronunciation scoring
Phone-level pronunciation scoring is a challenging task, with performance far from that of human annotators. Standard systems generate a score for each phone in a phrase using models trained for automatic speech recognition (ASR) with native data only. Better performance has been shown when using systems that are trained specifically for the task using non-native data. Yet, such systems face the challenge that datasets labelled for this task are scarce and usually small. In this paper, we present a transfer learning-based approach that leverages a model trained for ASR, adapting it for the task of pronunciation scoring. We analyze the effect of several design choices and compare the performance with a state-of-the-art goodness of pronunciation (GOP) system. Our final system is 20% better than the GOP system on EpaDB, a database for pronunciation scoring research, for a cost function that prioritizes low rates of unnecessary corrections.
Marcelo Sancinetti, Jazmin Vidal, Cyntia Bonomi, Luciana Ferrer
-
11
Jupyter Notebook
11/11/2021 Benchmarking Multimodal AutoML for Tabular Data with Text Fields
We consider the use of automated supervised learning systems for data tables that not only contain numeric/categorical columns, but one or more text fields as well. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly-available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in: sample size, problem types (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), as well as how the predictive signal is decomposed between text vs. numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. Compared with human data science teams, the fully automated methodology that performed best on our benchmark (stack ensembling a multimodal Transformer with various tree models) also manages to rank 1st place when fit to the raw text/tabular data in two MachineHack prediction competitions and 2nd place (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.
Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, Alexander J. Smola
12
Python
11/11/2021 CORAA: a large corpus of spontaneous and prepared speech manually validated for speech recognition in Brazilian Portuguese
Automatic Speech recognition (ASR) is a complex and challenging task. In recent years, there have been significant advances in the area. In particular, for the Brazilian Portuguese (BP) language, there were about 376 hours public available for ASR task until the second half of 2020. With the release of new datasets in early 2021, this number increased to 574 hours. The existing resources, however, are composed of audios containing only read and prepared speech. There is a lack of datasets including spontaneous speech, which are essential in different ASR applications. This paper presents CORAA (Corpus of Annotated Audios) v1. with 291 hours, a publicly available dataset for ASR in BP containing validated pairs (audio-transcription). CORAA also contains European Portuguese audios (4.69 hours). We also present two public ASR models based on Wav2Vec 2.0 XLSR-53 and fine-tuned over CORAA. Our best model achieved a Word Error Rate of 27.35% on CORAA test set and 16.01% on Common Voice test set. When measuring the Character Error Rate, we obtained 14.26% and 5.45% for CORAA and Common Voice, respectively. CORAA corpora were assembled to both improve ASR models in BP with phenomena from spontaneous speech and motivate young researchers to start their studies on ASR for Portuguese. All the corpora are publicly available at this https URL under the CC BY-NC-ND 4.0 license.
Arnaldo Candido Junior, Edresson Casanova, Anderson Soares, Frederico Santos de Oliveira, Lucas Oliveira, Ricardo Corso Fernandes Junior, Daniel Peixoto Pinto da Silva, Fernando Gorgulho Fayet, Bruno Baldissera Carlotto, Lucas Rafael Stefanel Gris, Sandra Maria Aluisio
-
12
11/11/2021 FinEAS: Financial Embedding Analysis of Sentiment
We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS). In financial markets, news and investor sentiment are significant drivers of security prices. Thus, leveraging the capabilities of modern NLP approaches for financial sentiment analysis is a crucial component in identifying patterns and trends that are useful for market participants and regulators. In recent years, methods that use transfer learning from large Transformer-based language models like BERT, have achieved state-of-the-art results in text classification tasks, including sentiment analysis using labelled datasets. Researchers have quickly adopted these approaches to financial texts, but best practices in this domain are not well-established. In this work, we propose a new model for financial sentiment analysis based on supervised fine-tuned sentence embeddings from a standard BERT model. We demonstrate our approach achieves significant improvements in comparison to vanilla BERT, LSTM, and FinBERT, a financial domain specific BERT.
Asier Gutierrez-Fandino, Miquel Noguer i Alonso, Petter Kolm, Jordi Armengol-Estape
9
Python
11/11/2021 How to Leverage Multimodal EHR Data for Better Medical Predictions?
Healthcare is becoming a more and more important research topic recently. With the growing data in the healthcare domain, it offers a great opportunity for deep learning to improve the quality of medical service. However, the complexity of electronic health records (EHR) data is a challenge for the application of deep learning. Specifically, the data produced in the hospital admissions are monitored by the EHR system, which includes structured data like daily body temperature, and unstructured data like free text and laboratory measurements. Although there are some preprocessing frameworks proposed for specific EHR data, the clinical notes that contain significant clinical value are beyond the realm of their consideration. Besides, whether these different data from various views are all beneficial to the medical tasks and how to best utilize these data remain unclear. Therefore, in this paper, we first extract the accompanying clinical notes from EHR and propose a method to integrate these data, we also comprehensively study the different models and the data leverage methods for better medical task prediction. The results on two medical prediction tasks show that our fused model with different data outperforms the state-of-the-art method that without clinical notes, which illustrates the importance of our fusion method and the value of clinical note features. Our code is available at https: //github.com/emnlp-mimic/mimic.
Bo Yang, Lijun Wu
7
Python
11/11/2021 Enhanced Language Representation with Label Knowledge for Span Extraction
Span extraction, aiming to extract text spans (such as words or phrases) from plain texts, is a fundamental process in Information Extraction. Recent works introduce the label knowledge to enhance the text representation by formalizing the span extraction task into a question answering problem (QA Formalization), which achieves state-of-the-art performance. However, QA Formalization does not fully exploit the label knowledge and suffers from low efficiency in training/inference. To address those problems, we introduce a new paradigm to integrate label knowledge and further propose a novel model to explicitly and efficiently integrate label knowledge into text representations. Specifically, it encodes texts and label annotations independently and then integrates label knowledge into text representation with an elaborate-designed semantics fusion module. We conduct extensive experiments on three typical span extraction tasks: flat NER, nested NER, and event detection. The empirical results show that 1) our method achieves state-of-the-art performance on four benchmarks, and 2) reduces training time and inference time by 76% and 77% on average, respectively, compared with the QA Formalization paradigm. Our code and data are available at this https URL.
Pan Yang, Xin Cong, Zhenyun Sun, Xingwu Liu
7
Jupyter Notebook
11/11/2021 CLUES: Few-Shot Learning Evaluation in Natural Language Understanding
Most recent progress in natural language understanding (NLU) has been driven, in part, by benchmarks such as GLUE, SuperGLUE, SQuAD, etc. In fact, many NLU models have now matched or exceeded "human-level" performance on many tasks in these benchmarks. Most of these benchmarks, however, give models access to relatively large amounts of labeled data for training. As such, the models are provided far more data than required by humans to achieve strong performance. That has motivated a line of work that focuses on improving few-shot learning performance of NLU models. However, there is a lack of standardized evaluation benchmarks for few-shot NLU resulting in different experimental settings in different papers. To help accelerate this line of work, we introduce CLUES (Constrained Language Understanding Evaluation Standard), a benchmark for evaluating the few-shot learning capabilities of NLU models. We demonstrate that while recent models reach human performance when they have access to large amounts of labeled data, there is a huge gap in performance in the few-shot setting for most tasks. We also demonstrate differences between alternative model families and adaptation techniques in the few shot setting. Finally, we discuss several principles and choices in designing the experimental settings for evaluating the true few-shot learning performance and suggest a unified standardized approach to few-shot learning evaluation. We aim to encourage research on NLU models that can generalize to new tasks with a small number of examples. Code and data for CLUES are available at this https URL.
Subhabrata Mukherjee, Xiaodong Liu, Guoqing Zheng, Saghar Hosseini, Hao Cheng, Greg Yang, Christopher Meek, Ahmed Hassan Awadallah, Jianfeng Gao
12
Shell
11/11/2021 SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL
The Text-to-SQL task, aiming to translate the natural language of the questions into SQL queries, has drawn much attention recently. One of the most challenging problems of Text-to-SQL is how to generalize the trained model to the unseen database schemas, also known as the cross-domain Text-to-SQL task. The key lies in the generalizability of (i) the encoding method to model the question and the database schema and (ii) the question-schema linking method to learn the mapping between words in the question and tables/columns in the database schema. Focusing on the above two key issues, we propose a Structure-Aware Dual Graph Aggregation Network (SADGA) for cross-domain Text-to-SQL. In SADGA, we adopt the graph structure to provide a unified encoding model for both the natural language question and database schema. Based on the proposed unified modeling, we further devise a structure-aware aggregation method to learn the mapping between the question-graph and schema-graph. The structure-aware aggregation method is featured with Global Graph Linking, Local Graph Linking, and Dual-Graph Aggregation Mechanism. We not only study the performance of our proposal empirically but also achieved 3rd place on the challenging Text-to-SQL benchmark Spider at the time of writing.
Ruichu Cai, Jinjie Yuan, Boyan Xu, Zhifeng Hao
5
11/11/2021 DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models
Gigantic pre-trained models have become central to natural language processing (NLP), serving as the starting point for fine-tuning towards a range of downstream tasks. However, two pain points persist for this paradigm: (a) as the pre-trained models grow bigger (e.g., 175B parameters for GPT-3), even the fine-tuning process can be time-consuming and computationally expensive; (b) the fine-tuned model has the same size as its starting point by default, which is neither sensible due to its more specialized functionality, nor practical since many fine-tuned models will be deployed in resource-constrained environments. To address these pain points, we propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights. Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning - by enforcing sparsity-aware weight updates on top of the pre-trained weights; and (ii) resource-efficient inference - by encouraging a sparse weight structure towards the final fine-tuned model. We leverage sparsity in these two directions by exploiting both unstructured and structured sparse patterns in pre-trained language models via magnitude-based pruning and $\ell_1$ sparse regularization. Extensive experiments and in-depth investigations, with diverse network backbones (i.e., BERT, GPT-2, and DeBERTa) on dozens of datasets, consistently demonstrate highly impressive parameter-/training-/inference-efficiency, while maintaining competitive downstream transfer performance. For instance, our DSEE-BERT obtains about $35\%$ inference FLOPs savings with <1% trainable parameters and comparable performance to conventional fine-tuning. Codes are available in this https URL.
Xuxi Chen, Tianlong Chen, Yu Cheng, Weizhu Chen, Zhangyang Wang, Ahmed Hassan Awadallah
3
Python
11/11/2021 Paperswithtopic: Topic Identification from Paper Title Only
The deep learning field is growing rapidly as witnessed by the exponential growth of papers submitted to journals, conferences, and pre-print servers. To cope with the sheer number of papers, several text mining tools from natural language processing (NLP) have been proposed that enable researchers to keep track of recent findings. In this context, our paper makes two main contributions: first, we collected and annotated a dataset of papers paired by title and sub-field from the field of artificial intelligence (AI), and, second, we present results on how to predict a paper's AI sub-field from a given paper title only. Importantly, for the latter, short-text classification task we compare several algorithms from conventional machine learning all the way up to recent, larger transformer architectures. Finally, for the transformer models, we also present gradient-based, attention visualizations to further explain the model's classification process. All code can be found at \url{this https URL}
Daehyun Cho, Christian Wallraven
2
Jupyter Notebook
11/11/2021 Learning Personal Food Preferences via Food Logs Embedding
Diet management is key to managing chronic diseases such as diabetes. Automated food recommender systems may be able to assist by providing meal recommendations that conform to a user's nutrition goals and food preferences. Current recommendation systems suffer from a lack of accuracy that is in part due to a lack of knowledge of food preferences, namely foods users like to and are able to eat frequently. In this work, we propose a method for learning food preferences from food logs, a comprehensive but noisy source of information about users' dietary habits. We also introduce accompanying metrics. The method generates and compares word embeddings to identify the parent food category of each food entry and then calculates the most popular. Our proposed approach identifies 82% of a user's ten most frequently eaten foods. Our method is publicly available on (this https URL)
Ahmed A. Metwally, Ariel K. Leong, Aman Desai, Anvith Nagarjuna, Dalia Perelman, Michael Snyder
2
Jupyter Notebook
11/11/2021 Integrating Deep Event-Level and Script-Level Information for Script Event Prediction
Scripts are structured sequences of events together with the participants, which are extracted from the texts.Script event prediction aims to predict the subsequent event given the historical events in the script. Two kinds of information facilitate this task, namely, the event-level information and the script-level information. At the event level, existing studies view an event as a verb with its participants, while neglecting other useful properties, such as the state of the participants. At the script level, most existing studies only consider a single event sequence corresponding to one common protagonist. In this paper, we propose a Transformer-based model, called MCPredictor, which integrates deep event-level and script-level information for script event prediction. At the event level, MCPredictor utilizes the rich information in the text to obtain more comprehensive event semantic representations. At the script-level, it considers multiple event sequences corresponding to different participants of the subsequent event. The experimental results on the widely-used New York Times corpus demonstrate the effectiveness and superiority of the proposed model.
Long Bai, Saiping Guan, Jiafeng Guo, Zixuan Li, Xiaolong Jin, Xueqi Cheng
3
Python
11/11/2021 Introspective Distillation for Robust Question Answering
Question answering (QA) models are well-known to exploit data bias, e.g., the language prior in visual QA and the position bias in reading comprehension. Recent debiasing methods achieve good out-of-distribution (OOD) generalizability with a considerable sacrifice of the in-distribution (ID) performance. Therefore, they are only applicable in domains where the test distribution is known in advance. In this paper, we present a novel debiasing method called Introspective Distillation (IntroD) to make the best of both worlds for QA. Our key technical contribution is to blend the inductive bias of OOD and ID by introspecting whether a training sample fits in the factual ID world or the counterfactual OOD one. Experiments on visual QA datasets VQA v2, VQA-CP, and reading comprehension dataset SQuAD demonstrate that our proposed IntroD maintains the competitive OOD performance compared to other debiasing methods, while sacrificing little or even achieving better ID performance compared to the non-debiasing ones.
Yulei Niu, Hanwang Zhang
-
3
11/11/2021 A Case Study and Qualitative Analysis of Simple Cross-Lingual Opinion Mining
User-generated content from social media is produced in many languages, making it technically challenging to compare the discussed themes from one domain across different cultures and regions. It is relevant for domains in a globalized world, such as market research, where people from two nations and markets might have different requirements for a product. We propose a simple, modern, and effective method for building a single topic model with sentiment analysis capable of covering multiple languages simultanteously, based on a pre-trained state-of-the-art deep neural network for natural language understanding. To demonstrate its feasibility, we apply the model to newspaper articles and user comments of a specific domain, i.e., organic food products and related consumption behavior. The themes match across languages. Additionally, we obtain an high proportion of stable and domain-relevant topics, a meaningful relation between topics and their respective textual contents, and an interpretable representation for social media documents. Marketing can potentially benefit from our method, since it provides an easy-to-use means of addressing specific customer interests from different market regions around the globe. For reproducibility, we provide the code, data, and results of our study.
Gerhard Hagerer, Wing Sheung Leung, Qiaoxi Liu, Hannah Danner, Georg Groh
1
Jupyter Notebook
11/11/2021 Hierarchical Deep Residual Reasoning for Temporal Moment Localization
Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in the video, described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating the fusion manner of the two modalities. These works mainly understand the video and sentence coarsely, ignoring the fact that a sentence can be understood from various semantics, and the dominant words affecting the moment localization in the semantics are the action and object reference. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve a finer-grained localization. Furthermore, considering that videos with different resolution and sentences with different length have different difficulty in understanding, we design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner. Extensive experiments conducted on Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model compared with other state-of-the-art methods.
Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie
2
Python
11/11/2021 Handshakes AI Research at CASE 2021 Task 1: Exploring different approaches for multilingual tasks
The aim of the CASE 2021 Shared Task 1 (Hï¿½rriyeto?lu et al., 2021) was to detect and classify socio-political and crisis event information at document, sentence, cross-sentence, and token levels in a multilingual setting, with each of these subtasks being evaluated separately in each test language. Our submission contained entries in all of the subtasks, and the scores obtained validated our research finding: That the multilingual aspect of the tasks should be embraced, so that modeling and training regimes use the multilingual nature of the tasks to their mutual benefit, rather than trying to tackle the different languages separately. Our code is available at this https URL
Vivek Kalyan, Paul Tan, Shaun Tan, Martin Andrews
-
1
Python
11/11/2021 Social Media Reveals Urban-Rural Differences in Stress across China
Modeling differential stress expressions in urban and rural regions in China can provide a better understanding of the effects of urbanization on psychological well-being in a country that has rapidly grown economically in the last two decades. This paper studies linguistic differences in the experiences and expressions of stress in urban-rural China from Weibo posts from over 65,000 users across 329 counties using hierarchical mixed-effects models. We analyzed phrases, topical themes, and psycho-linguistic word choices in Weibo posts mentioning stress to better understand appraisal differences surrounding psychological stress in urban and rural communities in China; we then compared them with large-scale polls from Gallup. After controlling for socioeconomic and gender differences, we found that rural communities tend to express stress in emotional and personal themes such as relationships, health, and opportunity while users in urban areas express stress using relative, temporal, and external themes such as work, politics, and economics. These differences exist beyond controlling for GDP and urbanization, indicating a fundamentally different lifestyle between rural and urban residents in very specific environments, arguably having different sources of stress. We found corroborative trends in physical, financial, and social wellness with urbanization in Gallup polls.
Jesse Cui, Tingdan Zhang, Kokil Jaidka, Dandan Pang, Garrick Sherman, Vinit Jakhetiya, Lyle Ungar, Sharath Chandra Guntuku
-
1
R
11/11/2021 Neural sentence embedding models for semantic similarity estimation in the biomedical domain
BACKGROUND: In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set. RESULTS: With a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson's r (r=0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models' performance on the smaller contradiction subset to be poor. CONCLUSIONS: In this study we highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
Kathrin Blagec, Hong Xu, Asan Agibetov, Matthias Samwald
1
Jupyter Notebook
11/11/2021 An Explanation of In-context Learning as Implicit Bayesian Inference
Large pretrained language models such as GPT-3 have the surprising ability to do in-context learning, where the model learns to do a downstream task simply by conditioning on a prompt consisting of input-output examples. Without being explicitly pretrained to do so, the language model learns from these examples during its forward pass without parameter updates on "out-of-distribution" prompts. Thus, it is unclear what mechanism enables in-context learning. In this paper, we study the role of the pretraining distribution on the emergence of in-context learning under a mathematical setting where the pretraining texts have long-range coherence. Here, language model pretraining requires inferring a latent document-level concept from the conditioning text to generate coherent next tokens. At test time, this mechanism enables in-context learning by inferring the shared latent concept between prompt examples and applying it to make a prediction on the test example. Concretely, we prove that in-context learning occurs implicitly via Bayesian inference of the latent concept when the pretraining distribution is a mixture of HMMs. This can occur despite the distribution mismatch between prompts and pretraining data. In contrast to messy large-scale pretraining datasets for in-context learning in natural language, we generate a family of small-scale synthetic datasets (GINC) where Transformer and LSTM language models both exhibit in-context learning. Beyond the theory which focuses on the effect of the pretraining distribution, we empirically find that scaling model size improves in-context accuracy even when the pretraining loss is the same.
Sang Michael Xie, Aditi Raghunathan, Percy Liang, Tengyu Ma
10
Jupyter Notebook
11/11/2021 A cross-modal fusion network based on self-attention and residual structure for multimodal emotion recognition
The audio-video based multimodal emotion recognition has attracted a lot of attention due to its robust performance. Most of the existing methods focus on proposing different cross-modal fusion strategies. However, these strategies introduce redundancy in the features of different modalities without fully considering the complementary properties between modal information, and these approaches do not guarantee the non-loss of original semantic information during intra- and inter-modal interactions. In this paper, we propose a novel cross-modal fusion network based on self-attention and residual structure (CFN-SR) for multimodal emotion recognition. Firstly, we perform representation learning for audio and video modalities to obtain the semantic features of the two modalities by efficient ResNeXt and 1D CNN, respectively. Secondly, we feed the features of the two modalities into the cross-modal blocks separately to ensure efficient complementarity and completeness of information through the self-attention mechanism and residual structure. Finally, we obtain the output of emotions by splicing the obtained fused representation with the original representation. To verify the effectiveness of the proposed method, we conduct experiments on the RAVDESS dataset. The experimental results show that the proposed CFN-SR achieves the state-of-the-art and obtains 75.76% accuracy with 26.30M parameters. Our code is available at this https URL.
Ziwang Fu, Feng Liu, Hanyang Wang, Jiayin Qi, Xiangling Fu, Aimin Zhou, Zhibin Li
2
Python
11/11/2021 Evaluating robustness of You Only Hear Once(YOHO) Algorithm on noisy audios in the VOICe Dataset
Sound event detection (SED) in machine listening entails identifying the different sounds in an audio file and identifying the start and end time of a particular sound event in the audio. SED finds use in various applications such as audio surveillance, speech recognition, and context-based indexing and retrieval of data in a multimedia database. However, in real-life scenarios, the audios from various sources are seldom devoid of any interfering noise or disturbance. In this paper, we test the performance of the You Only Hear Once (YOHO) algorithm on noisy audio data. Inspired by the You Only Look Once (YOLO) algorithm in computer vision, the YOHO algorithm can match the performance of the various state-of-the-art algorithms on datasets such as Music Speech Detection Dataset, TUT Sound Event, and Urban-SED datasets but at lower inference times. In this paper, we explore the performance of the YOHO algorithm on the VOICe dataset containing audio files with noise at different sound-to-noise ratios (SNR). YOHO could outperform or at least match the best performing SED algorithms reported in the VOICe dataset paper and make inferences in less time.
Soham Tiwari, Kshitiz Lakhotia, Manjunath Mulimani
1
Jupyter Notebook
11/11/2021 Classification of hierarchical text using geometric deep learning: the case of clinical trials corpus
We consider the hierarchical representation of documents as graphs and use geometric deep learning to classify them into different categories. While graph neural networks can efficiently handle the variable structure of hierarchical documents using the permutation invariant message passing operations, we show that we can gain extra performance improvements using our proposed selective graph pooling operation that arises from the fact that some parts of the hierarchy are invariable across different documents. We applied our model to classify clinical trial (CT) protocols into completed and terminated categories. We use bag-of-words based, as well as pre-trained transformer-based embeddings to featurize the graph nodes, achieving f1-scores around 0.85 on a publicly available large scale CT registry of around 360K protocols. We further demonstrate how the selective pooling can add insights into the CT termination status prediction. We make the source code and dataset splits accessible.
Sohrab Ferdowsi, Nikolay Borissov, Julien Knafou, Poorya Amini, Douglas Teodoro
1
Python
11/11/2021 End-to-End Annotator Bias Approximation on Crowdsourced Single-Label Sentiment Analysis
Sentiment analysis is often a crowdsourcing task prone to subjective labels given by many annotators. It is not yet fully understood how the annotation bias of each annotator can be modeled correctly with state-of-the-art methods. However, resolving annotator bias precisely and reliably is the key to understand annotators' labeling behavior and to successfully resolve corresponding individual misconceptions and wrongdoings regarding the annotation task. Our contribution is an explanation and improvement for precise neural end-to-end bias modeling and ground truth estimation, which reduces an undesired mismatch in that regard of the existing state-of-the-art. Classification experiments show that it has potential to improve accuracy in cases where each sample is annotated only by one single annotator. We provide the whole source code publicly and release an own domain-specific sentiment dataset containing 10,000 sentences discussing organic food products. These are crawled from social media and are singly labeled by 10 non-expert annotators.
Gerhard Hagerer, David Szabo, Andreas Koch, Maria Luisa Ripoll Dominguez, Christian Widmer, Maximilian Wich, Hannah Danner, Georg Groh
1
Jupyter Notebook
11/11/2021 Outlining and Filling: Hierarchical Query Graph Generation for Answering Complex Questions over Knowledge Graph
Query graph building aims to build correct executable SPARQL over the knowledge graph for answering natural language questions. Although recent approaches perform well by NN-based query graph ranking, more complex questions bring three new challenges: complicated SPARQL syntax, huge search space for ranking, and noisy query graphs with local ambiguity. This paper handles these challenges. Initially, we regard common complicated SPARQL syntax as the sub-graphs comprising of vertices and edges and propose a new unified query graph grammar to adapt them. Subsequently, we propose a new two-stage approach to build query graphs. In the first stage, the top-$k$ related instances (entities, relations, etc.) are collected by simple strategies, as the candidate instances. In the second stage, a graph generation model performs hierarchical generation. It first outlines a graph structure whose vertices and edges are empty slots, and then fills the appropriate instances into the slots, thereby completing the query graph. Our approach decomposes the unbearable search space of entire query graphs into affordable sub-spaces of operations, meanwhile, leverages the global structural information to eliminate local ambiguity. The experimental results demonstrate that our approach greatly improves state-of-the-art on the hardest KGQA benchmarks and has an excellent performance on complex questions.
Yongrui Chen, Huiying Li, Guilin Qi, Tianxing Wu, Tenggou Wang
-
1
Python
11/11/2021 Automatic Embedding of Stories Into Collections of Independent Media
We look at how machine learning techniques that derive properties of items in a collection of independent media can be used to automatically embed stories into such collections. To do so, we use models that extract the tempo of songs to make a music playlist follow a narrative arc. Our work specifies an open-source tool that uses pre-trained neural network models to extract the global tempo of a set of raw audio files and applies these measures to create a narrative-following playlist. This tool is available at this https URL
Dylan R. Ashley, Vincent Herrmann, Zachary Friggstad, Kory W. Mathewson, Jurgen Schmidhuber
0
Python
11/11/2021 Revealing and Protecting Labels in Distributed Training
Distributed learning paradigms such as federated learning often involve transmission of model updates, or gradients, over a network, thereby avoiding transmission of private data. However, it is possible for sensitive information about the training data to be revealed from such gradients. Prior works have demonstrated that labels can be revealed analytically from the last layer of certain models (e.g., ResNet), or they can be reconstructed jointly with model inputs by using Gradients Matching [Zhu et al'19] with additional knowledge about the current state of the model. In this work, we propose a method to discover the set of labels of training samples from only the gradient of the last layer and the id to label mapping. Our method is applicable to a wide variety of model architectures across multiple domains. We demonstrate the effectiveness of our method for model training in two domains - image classification, and automatic speech recognition. Furthermore, we show that existing reconstruction techniques improve their efficacy when used in conjunction with our method. Conversely, we demonstrate that gradient quantization and sparsification can significantly reduce the success of the attack.
Trung Dang, Om Thakkar, Swaroop Ramaswamy, Rajiv Mathews, Peter Chin, Francoise Beaufays
0
Python
11/11/2021 System Combination for Grammatical Error Correction Based on Integer Programming
In this paper, we propose a system combination method for grammatical error correction (GEC), based on nonlinear integer programming (IP). Our method optimizes a novel F score objective based on error types, and combines multiple end-to-end GEC systems. The proposed IP approach optimizes the selection of a single best system for each grammatical error type present in the data. Experiments of the IP approach on combining state-of-the-art standalone GEC systems show that the combined system outperforms all standalone systems. It improves F0.5 score by 3.61% when combining the two best participating systems in the BEA 2019 shared task, and achieves F0.5 score of 73.08%. We also perform experiments to compare our IP approach with another state-of-the-art system combination method for GEC, demonstrating IP's competitive combination capability.
Ruixi Lin, Hwee Tou Ng
0
Macaulay2
11/11/2021 Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity
Speech processing systems currently do not support the vast majority of languages, in part due to the lack of data in low-resource languages. Cross-lingual transfer offers a compelling way to help bridge this digital divide by incorporating high-resource data into low-resource systems. Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages. However, scaling up speech systems to support hundreds of low-resource languages remains unsolved. To help bridge this gap, we propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages. We demonstrate the effectiveness of our approach in language family classification, speech recognition, and speech synthesis tasks.
Peter Wu, Jiatong Shi, Yifan Zhong, Shinji Watanabe, Alan W Black
0
11/11/2021 Low-Cost Algorithmic Recourse for Users With Uncertain Cost Functions
The problem of identifying algorithmic recourse for people affected by machine learning model decisions has received much attention recently. Some recent works model user-incurred cost, which is directly linked to user satisfaction. But they assume a single global cost function that is shared across all users. This is an unrealistic assumption when users have dissimilar preferences about their willingness to act upon a feature and different costs associated with changing that feature. In this work, we formalize the notion of user-specific cost functions and introduce a new method for identifying actionable recourses for users. By default, we assume that users' cost functions are hidden from the recourse method, though our framework allows users to partially or completely specify their preferences or cost function. We propose an objective function, Expected Minimum Cost (EMC), based on two key ideas: (1) when presenting a set of options to a user, it is vital that there is at least one low-cost solution the user could adopt; (2) when we do not know the user's true cost function, we can approximately optimize for user satisfaction by first sampling plausible cost functions, then finding a set that achieves a good cost for the user in expectation. We optimize EMC with a novel discrete optimization algorithm, Cost-Optimized Local Search (COLS), which is guaranteed to improve the recourse set quality over iterations. Experimental evaluation on popular real-world datasets with simulated user costs demonstrates that our method satisfies up to 25.89 percentage points more users compared to strong baseline methods. Using standard fairness metrics, we also show that our method can provide more fair solutions across demographic groups than comparable methods, and we verify that our method is robust to misspecification of the cost function distribution.
Prateek Yadav, Peter Hase, Mohit Bansal
0
Python
11/11/2021 Interpretable contrastive word mover's embedding
This paper shows that a popular approach to the supervised embedding of documents for classification, namely, contrastive Word Mover's Embedding, can be significantly enhanced by adding interpretability. This interpretability is achieved by incorporating a clustering promoting mechanism into the contrastive loss. On several public datasets, we show that our method improves significantly upon existing baselines while providing interpretation to the clusters via identifying a set of keywords that are the most representative of a particular class. Our approach was motivated in part by the need to develop Natural Language Processing (NLP) methods for the \textit{novel problem of assessing student work for scientific writing and thinking} - a problem that is central to the area of (educational) Learning Sciences (LS). In this context, we show that our approach leads to a meaningful assessment of the student work related to lab reports from a biology class and can help LS researchers gain insights into student understanding and assess evidence of scientific thought processes.
Ruijie Jiang, Julia Gouvea, Eric Miller, David Hammer, Shuchin Aeron
0
Python
11/11/2021 Discovering Non-monotonic Autoregressive Orderings with Variational Inference
The predominant approach for language modeling is to process sequences from left to right, but this eliminates a source of information: the order by which the sequence was generated. One strategy to recover this information is to decode both the content and ordering of tokens. Existing approaches supervise content and ordering by designing problem-specific loss functions and pre-training with an ordering pre-selected. Other recent works use iterative search to discover problem-specific orderings for training, but suffer from high time complexity and cannot be efficiently parallelized. We address these limitations with an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data -- no domain knowledge required. The learner contains an encoder network and decoder language model that perform variational inference with autoregressive orders (represented as permutation matrices) as latent variables. The corresponding ELBO is not differentiable, so we develop a practical algorithm for end-to-end optimization using policy gradients. We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass. Permutations then serve as target generation orders for training an insertion-based Transformer language model. Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
Xuanlin Li, Brandon Trabucco, Dong Huk Park, Michael Luo, Sheng Shen, Trevor Darrell, Yang Gao
2
Python
11/11/2021 Hidden Markov Based Mathematical Model dedicated to Extract Ingredients from Recipe Text
Natural Language Processing (NLP) is a branch of artificial intelligence that gives machines the ability to decode human languages. Partof-speech tagging (POS tagging) is a pre-processing task that requires an annotated corpus. Rule-based and stochastic methods showed remarkable results for POS tag prediction. On this work, I performed a mathematical model based on Hidden Markov structures and I obtained a high-level accuracy of ingredients extracted from text recipe with performances greater than what traditional methods could make without unknown words consideration.
Zied Baklouti (UP, ENIT)
-
0
Python
11/11/2021 Amendable Generation for Dialogue State Tracking
In task-oriented dialogue systems, recent dialogue state tracking methods tend to perform one-pass generation of the dialogue state based on the previous dialogue state. The mistakes of these models made at the current turn are prone to be carried over to the next turn, causing error propagation. In this paper, we propose a novel Amendable Generation for Dialogue State Tracking (AG-DST), which contains a two-pass generation process: (1) generating a primitive dialogue state based on the dialogue of the current turn and the previous dialogue state, and (2) amending the primitive dialogue state from the first pass. With the additional amending generation pass, our model is tasked to learn more robust dialogue state tracking by amending the errors that still exist in the primitive dialogue state, which plays the role of reviser in the double-checking process and alleviates unnecessary error propagation. Experimental results show that AG-DST significantly outperforms previous works in two active DST datasets (MultiWOZ 2.2 and WOZ 2.0), achieving new state-of-the-art performances.
Xin Tian, Liankai Huang, Yingzhan Lin, Siqi Bao, Huang He, Yunyi Yang, Hua Wu, Fan Wang, Shuqi Sun
n/a
11/4/2021 s2s-ft: Fine-Tuning Pretrained Transformer Encoders for Sequence-to-Sequence Learning
Pretrained bidirectional Transformers, such as BERT, have achieved significant improvements in a wide variety of language understanding tasks, while it is not straightforward to directly apply them for natural language generation. In this paper, we present a sequence-to-sequence fine-tuning toolkit s2s-ft, which adopts pretrained Transformers for conditional generation tasks. Inspired by UniLM, we implement three sequence-to-sequence fine-tuning algorithms, namely, causal fine-tuning, masked fine-tuning, and pseudo-masked fine-tuning. By leveraging the existing pretrained bidirectional Transformers, experimental results show that s2s-ft achieves strong performance on several benchmarks of abstractive summarization, and question generation. Moreover, we demonstrate that the package s2s-ft supports both monolingual and multilingual NLG tasks. The s2s-ft toolkit is available at this https URL.
Hangbo Bao, Li Dong, Wenhui Wang, Nan Yang, Furu Wei
3814
Python
11/4/2021 WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing
Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM is built based on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation. We first equip the Transformer structure with gated relative position bias to improve its capability on recognition tasks. For better speaker discrimination, we propose an utterance mixing training strategy, where additional overlapped utterances are created unsupervisely and incorporated during model training. Lastly, we scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks.
Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Micheal Zeng, Furu Wei
3814
Python
11/4/2021 SgSum: Transforming Multi-document Summarization into Sub-graph Selection
Most of existing extractive multi-document summarization (MDS) methods score each sentence individually and extract salient sentences one by one to compose a summary, which have two main drawbacks: (1) neglecting both the intra and cross-document relations between sentences; (2) neglecting the coherence and conciseness of the whole summary. In this paper, we propose a novel MDS framework (SgSum) to formulate the MDS task as a sub-graph selection problem, in which source documents are regarded as a relation graph of sentences (e.g., similarity graph or discourse graph) and the candidate summaries are its sub-graphs. Instead of selecting salient sentences, SgSum selects a salient sub-graph from the relation graph as the summary. Comparing with traditional methods, our method has two main advantages: (1) the relations between sentences are captured by modeling both the graph structure of the whole document set and the candidate sub-graphs; (2) directly outputs an integrate summary in the form of sub-graph which is more informative and coherent. Extensive experiments on MultiNews and DUC datasets show that our proposed method brings substantial improvements over several strong baselines. Human evaluation results also demonstrate that our model can produce significantly more coherent and informative summaries compared with traditional MDS methods. Moreover, the proposed architecture has strong transfer ability from single to multi-document input, which can reduce the resource bottleneck in MDS tasks. Our code and results are available at: \url{this https URL}.
Moye Chen, Wei Li, Jiachen Liu, Xinyan Xiao, Hua Wu, Haifeng Wang
1072
Python
11/4/2021 SCENIC: A JAX Library for Computer Vision Research and Beyond
Scenic is an open-source JAX library with a focus on Transformer-based models for computer vision research and beyond. The goal of this toolkit is to facilitate rapid experimentation, prototyping, and research of new vision architectures and models. Scenic supports a diverse range of vision tasks (e.g., classification, segmentation, detection)and facilitates working on multi-modal problems, along with GPU/TPU support for multi-host, multi-device large-scale training. Scenic also offers optimized implementations of state-of-the-art research models spanning a wide range of modalities. Scenic has been successfully used for numerous projects and published papers and continues serving as the library of choice for quick prototyping and publication of new research ideas.
Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay
-
434
Python
11/4/2021 Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing. Together with better performance come larger model sizes. This imposes challenges to the memory wall of the current accelerator hardware such as GPU. It is never ideal to train large models such as Vision Transformer, BERT, and GPT on a single GPU or a single machine. There is an urgent demand to train models in a distributed environment. However, distributed training, especially model parallelism, often requires domain expertise in computer systems and architecture. It remains a challenge for AI researchers to implement complex distributed training solutions for their models. In this paper, we introduce Colossal-AI, which is a unified parallel training system designed to seamlessly integrate different paradigms of parallelization techniques including data parallelism, pipeline parallelism, multiple tensor parallelism, and sequence parallelism. Colossal-AI aims to support the AI community to write distributed models in the same way as how they write models normally. This allows them to focus on developing the model architecture and separates the concerns of distributed training from the development process. The documentations can be found at this https URL and the source code can be found at this https URL.
Zhengda Bian, Hongxin Liu, Boxiang Wang, Haichen Huang, Yongbin Li, Chuanrui Wang, Fan Cui, Yang You
397
Python
11/4/2021 Training Verifiers to Solve Math Word Problems
State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
-
142
Python
11/4/2021 TEXTOIR: An Integrated and Visualized Platform for Text Open Intent Recognition
TEXTOIR is the first integrated and visualized platform for text open intent recognition. It is composed of two main modules: open intent detection and open intent discovery. Each module integrates most of the state-of-the-art algorithms and benchmark intent datasets. It also contains an overall framework connecting the two modules in a pipeline scheme. In addition, this platform has visualized tools for data and model management, training, evaluation and analysis of the performance from different aspects. TEXTOIR provides useful toolkits and convenient visualized interfaces for each sub-module (Toolkit code: this https URL), and designs a framework to implement a complete process to both identify known intents and discover open intents (Demo code: this https URL).
Hanlei Zhang, Xiaoteng Li, Hua Xu, Panpan Zhang, Kang Zhao, Kai Gao
-
25
Python
11/4/2021 SciCap: Generating Captions for Scientific Figures
Researchers use figures to communicate rich, complex information in scientific papers. The captions of these figures are critical to conveying effective messages. However, low-quality figure captions commonly occur in scientific articles and may decrease understanding. In this paper, we propose an end-to-end neural framework to automatically generate informative, high-quality captions for scientific figures. To this end, we introduce SCICAP, a large-scale figure-caption dataset based on computer science arXiv papers published between 2010 and 2020. After pre-processing - including figure-type classification, sub-figure identification, text normalization, and caption text selection - SCICAP contained more than two million figures extracted from over 290,000 papers. We then established baseline models that caption graph plots, the dominant (19.2%) figure type. The experimental results showed both opportunities and steep challenges of generating captions for scientific figures.
Ting-Yao Hsu, C. Lee Giles, Ting-Hao 'Kenneth' Huang
-
15