Showing 1–50 of 67 results for author: Vosoughi, S

Searching in archive cs.
  1. arXiv:2407.01408  [pdf, other]

    cs.CV cs.AI cs.LG

    Semantic Compositions Enhance Vision-Language Contrastive Learning

    Authors: Maxwell Aladago, Lorenzo Torresani, Soroush Vosoughi

    Abstract: In the field of vision-language contrastive learning, models such as CLIP capitalize on matched image-caption pairs as positive examples and leverage within-batch non-matching pairs as negatives. This approach has led to remarkable outcomes in zero-shot image classification, cross-modal retrieval, and linear evaluation tasks. We show that the zero-shot classification and retrieval capabilities of… ▽ More

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2406.15981  [pdf, other]

    cs.CL

    Serial Position Effects of Large Language Models

    Authors: Xiaobo Guo, Soroush Vosoughi

    Abstract: Large Language Models (LLMs) have shown remarkable capabilities in zero-shot learning applications, generating responses to queries using only pre-training information without the need for additional fine-tuning. This represents a significant departure from traditional machine learning approaches. Previous research has indicated that LLMs may exhibit serial position effects, such as primacy and re… ▽ More

    Submitted 22 June, 2024; originally announced June 2024.

  3. arXiv:2406.07791  [pdf, other]

    cs.CL cs.AI

    Judging the Judges: A Systematic Investigation of Position Bias in Pairwise Comparative Assessments by LLMs

    Authors: Lin Shi, Weicheng Ma, Soroush Vosoughi

    Abstract: LLM-as-a-Judge offers a promising alternative to human judges across various tasks, yet inherent biases, particularly position bias - a systematic preference for answers based on their position in the prompt - compromise its effectiveness. Our study investigates this issue by developing a framework to systematically study and quantify position bias using metrics such as repetitional consistency, p… ▽ More

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: 70 pages, around 200 figures and subfigures

  4. arXiv:2406.03479  [pdf, other]

    cs.CL

    MODABS: Multi-Objective Learning for Dynamic Aspect-Based Summarization

    Authors: Xiaobo Guo, Soroush Vosoughi

    Abstract: The rapid proliferation of online content necessitates effective summarization methods, among which dynamic aspect-based summarization stands out. Unlike its traditional counterpart, which assumes a fixed set of known aspects, this approach adapts to the varied aspects of the input text. We introduce a novel multi-objective learning framework employing a Longformer-Encoder-Decoder for this task. T… ▽ More

    Submitted 17 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

  5. arXiv:2405.16584  [pdf, other]

    cs.CL

    MentalManip: A Dataset For Fine-grained Analysis of Mental Manipulation in Conversations

    Authors: Yuxin Wang, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi

    Abstract: Mental manipulation, a significant form of abuse in interpersonal conversations, presents a challenge to identify due to its context-dependent and often subtle nature. The detection of manipulative language is essential for protecting potential victims, yet the field of Natural Language Processing (NLP) currently faces a scarcity of resources and research on this topic. Our study addresses this ga… ▽ More

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: Accepted at ACL 2024

  6. arXiv:2402.10554  [pdf, other]

    cs.CL

    Disordered-DABS: A Benchmark for Dynamic Aspect-Based Summarization in Disordered Texts

    Authors: Xiaobo Guo, Soroush Vosoughi

    Abstract: Aspect-based summarization has seen significant advancements, especially in structured text. Yet, summarizing disordered, large-scale texts, like those found in social media and customer feedback, remains a significant challenge. Current research largely targets predefined aspects within structured texts, neglecting the complexities of dynamic and disordered environments. Addressing this gap, we i… ▽ More

    Submitted 17 June, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

  7. arXiv:2311.01732  [pdf, other]

    cs.CL

    Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models

    Authors: Sean Xie, Soroush Vosoughi, Saeed Hassanpour

    Abstract: Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), but their lack of interpretability has been a major concern. Current methods for interpreting LLMs are post hoc, applied after inference time, and have limitations such as their focus on low-level features and lack of explainability at higher level text units. In this work, we introduce proto-l… ▽ More

    Submitted 11 November, 2023; v1 submitted 3 November, 2023; originally announced November 2023.

    Comments: Accepted to the Findings of EMNLP 2023

  8. arXiv:2310.12334  [pdf, other]

    cs.CV

    Improving Representation Learning for Histopathologic Images with Cluster Constraints

    Authors: Weiyi Wu, Chongyang Gao, Joseph DiPalma, Soroush Vosoughi, Saeed Hassanpour

    Abstract: Recent advances in whole-slide image (WSI) scanners and computational capabilities have significantly propelled the application of artificial intelligence in histopathology slide analysis. While these strides are promising, current supervised learning approaches for WSI analysis come with the challenge of exhaustively labeling high-resolution slides - a process that is both labor-intensive and tim… ▽ More

    Submitted 14 November, 2023; v1 submitted 18 October, 2023; originally announced October 2023.

    Comments: Accepted by ICCV2023

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 21404-21414

  9. arXiv:2310.03291  [pdf, other]

    cs.CV

    Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction

    Authors: Yiren Jian, Tingkai Liu, Yunzhe Tao, Chunhui Zhang, Soroush Vosoughi, Hongxia Yang

    Abstract: In this paper, we introduce $\text{EVL}_{\text{Gen}}$, a streamlined framework designed for the pre-training of visually conditioned language generation models with high computational demands, utilizing frozen pre-trained large language models (LLMs). The conventional approach in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive p… ▽ More

    Submitted 21 February, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

  10. Joint Latent Topic Discovery and Expectation Modeling for Financial Markets

    Authors: Lili Wang, Chenghan Huang, Chongyang Gao, Weicheng Ma, Soroush Vosoughi

    Abstract: In the pursuit of accurate and scalable quantitative methods for financial market analysis, the focus has shifted from individual stock models to those capturing interrelations between companies and their stocks. However, current relational stock methods are limited by their reliance on predefined stock relationships and the exclusive consideration of immediate effects. To address these limitation… ▽ More

    Submitted 31 May, 2023; originally announced July 2023.

    Comments: In Advances in Knowledge Discovery and Data Mining 2023 (PAKDD 2023)

  11. arXiv:2307.07063  [pdf, other]

    cs.CV cs.LG

    Bootstrapping Vision-Language Learning with Decoupled Language Pre-training

    Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi

    Abstract: We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, with a focus on determining the most relevant visual features for corresponding text. Our approach diverges by concentrating on the language component, speci… ▽ More

    Submitted 19 December, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

    Comments: Accepted to NeurIPS 2023 (spotlight). The code is available at https://github.com/yiren-jian/BLIText

  12. arXiv:2306.01012  [pdf, other]

    cs.LG cs.AI cs.SI

    Graph-Level Embedding for Time-Evolving Graphs

    Authors: Lili Wang, Chenghan Huang, Weicheng Ma, Xinyuan Cao, Soroush Vosoughi

    Abstract: Graph representation learning (also known as network embedding) has been extensively researched with varying levels of granularity, ranging from nodes to graphs. While most prior work in this area focuses on node-level representation, limited research has been conducted on graph-level embedding, particularly for dynamic or temporal networks. However, learning low-dimensional graph-level representa… ▽ More

    Submitted 31 May, 2023; originally announced June 2023.

    Comments: In Companion Proceedings of the ACM Web Conference 2023

  13. arXiv:2305.16960  [pdf, ps, other]

    cs.CL cs.AI cs.CY cs.HC

    Training Socially Aligned Language Models on Simulated Social Interactions

    Authors: Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, Soroush Vosoughi

    Abstract: Social alignment in AI systems aims to ensure that these models behave according to established societal values. However, unlike humans, who derive consensus on value judgments through social interaction, current language models (LMs) are trained to rigidly replicate their training corpus in isolation, leading to subpar generalization in unfamiliar scenarios and vulnerability to adversarial attack… ▽ More

    Submitted 28 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Code, data, and models can be downloaded via https://github.com/agi-templar/Stable-Alignment

  14. arXiv:2302.06120  [pdf, other]

    q-bio.QM cs.LG

    Knowledge from Large-Scale Protein Contact Prediction Models Can Be Transferred to the Data-Scarce RNA Contact Prediction Task

    Authors: Yiren Jian, Chongyang Gao, Chen Zeng, Yunjie Zhao, Soroush Vosoughi

    Abstract: RNA, whose functionality is largely determined by its structure, plays an important role in many biological activities. The prediction of pairwise structural proximity between each nucleotide of an RNA sequence can characterize the structural information of the RNA. Historically, this problem has been tackled by machine learning models using expert-engineered features and trained on scarce labeled… ▽ More

    Submitted 18 January, 2024; v1 submitted 13 February, 2023; originally announced February 2023.

    Comments: The code is available at https://github.com/yiren-jian/CoT-RNA-Transfer

  15. arXiv:2302.03183  [pdf, other]

    cs.CL

    Capturing Topic Framing via Masked Language Modeling

    Authors: Xiaobo Guo, Weicheng Ma, Soroush Vosoughi

    Abstract: Differential framing of issues can lead to divergent world views on important issues. This is especially true in domains where the information presented can reach a large audience, such as traditional and social media. Scalable and reliable measurement of such differential framing is an important first step in addressing them. In this work, based on the intuition that framing affects the tone and… ▽ More

    Submitted 6 February, 2023; originally announced February 2023.

    Comments: In Findings of EMNLP 2022

    Journal ref: In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 6811-6825) (2022, December)

  16. arXiv:2301.00355  [pdf, ps, other]

    cs.CL cs.AI cs.CY

    Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits

    Authors: Ruibo Liu, Chenyan Jia, Ge Zhang, Ziyu Zhuang, Tony X Liu, Soroush Vosoughi

    Abstract: We present Second Thought, a new learning paradigm that enables language models (LMs) to re-align with human values. By modeling the chain-of-edits between value-unaligned and value-aligned text, with LM fine-tuning and additional refinement through reinforcement learning, Second Thought not only achieves superior performance in three value alignment benchmark datasets but also shows strong human-… ▽ More

    Submitted 4 January, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

    Comments: In proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

  17. arXiv:2210.05359  [pdf, other]

    cs.CL cs.AI

    Mind's Eye: Grounded Language Model Reasoning through Simulation

    Authors: Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai

    Abstract: Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm t… ▽ More

    Submitted 11 October, 2022; originally announced October 2022.

  18. arXiv:2210.03057  [pdf, other]

    cs.CL cs.AI cs.LG

    Language Models are Multilingual Chain-of-Thought Reasoners

    Authors: Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei

    Abstract: We evaluate the reasoning abilities of large language models in multilingual settings. We introduce the Multilingual Grade School Math (MGSM) benchmark, by manually translating 250 grade-school math problems from the GSM8K dataset (Cobbe et al., 2021) into ten typologically diverse languages. We find that the ability to solve MGSM problems via chain-of-thought prompting emerges with increasing mod… ▽ More

    Submitted 6 October, 2022; originally announced October 2022.

  19. arXiv:2209.09433  [pdf, other]

    cs.CL

    Non-Linguistic Supervision for Contrastive Learning of Sentence Embeddings

    Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi

    Abstract: Semantic representation learning for sentences is an important and well-studied problem in NLP. The current trend for this task involves training a Transformer-based sentence encoder through a contrastive objective with text, i.e., clustering sentences with semantically similar meanings and scattering others. In this work, we find the performance of Transformer models as sentence encoders can be i… ▽ More

    Submitted 19 September, 2022; originally announced September 2022.

    Comments: Accepted to NeurIPS 2022

  20. arXiv:2209.05707  [pdf, ps, other]

    cs.CL cs.LG

    Robin: A Novel Online Suicidal Text Corpus of Substantial Breadth and Scale

    Authors: Daniel DiPietro, Vivek Hazari, Soroush Vosoughi

    Abstract: Suicide is a major public health crisis. With more than 20,000,000 suicide attempts each year, the early detection of suicidal intent has the potential to save hundreds of thousands of lives. Traditional mental health screening methods are time-consuming, costly, and often inaccessible to disadvantaged populations; online detection of suicidal intent using machine learning offers a viable alternat… ▽ More

    Submitted 12 September, 2022; originally announced September 2022.

    Comments: 10 pages, 4 figures

  21. arXiv:2205.12254  [pdf, other]

    cs.CL cs.LG

    Interpretation Quality Score for Measuring the Quality of interpretability methods

    Authors: Sean Xie, Soroush Vosoughi, Saeed Hassanpour

    Abstract: Machine learning (ML) models have been applied to a wide range of natural language processing (NLP) tasks in recent years. In addition to making accurate decisions, the necessity of understanding how models make their decisions has become apparent in many applications. To that end, many interpretability methods that help explain the decision processes of ML models have been developed. Yet, there… ▽ More

    Submitted 24 May, 2022; originally announced May 2022.

  22. arXiv:2205.01308  [pdf, other]

    cs.CL cs.AI

    Contrastive Learning for Prompt-Based Few-Shot Language Learners

    Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi

    Abstract: The impressive performance of GPT-3 using natural language prompts and in-context learning has inspired work on better fine-tuning of moderately-sized models under this paradigm. Following this line of work, we present a contrastive learning framework that clusters inputs from the same class for better generality of models trained with only limited examples. Specifically, we propose a supervised c… ▽ More

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: accepted to NAACL 2022

  23. arXiv:2205.01307  [pdf, other]

    cs.CL cs.AI

    Embedding Hallucination for Few-Shot Language Fine-tuning

    Authors: Yiren Jian, Chongyang Gao, Soroush Vosoughi

    Abstract: Few-shot language learners adapt knowledge from a pre-trained model to recognize novel classes from a few-labeled sentences. In such settings, fine-tuning a pre-trained language model can cause severe over-fitting. In this paper, we propose an Embedding Hallucination (EmbedHalluc) method, which generates auxiliary embedding-label pairs to expand the fine-tuning dataset. The hallucinator is trained… ▽ More

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: accepted to NAACL 2022

  24. arXiv:2204.08123  [pdf, other]

    cs.CL cs.AI cs.LG

    Non-Parallel Text Style Transfer with Self-Parallel Supervision

    Authors: Ruibo Liu, Chongyang Gao, Chenyan Jia, Guangxuan Xu, Soroush Vosoughi

    Abstract: The performance of existing text style transfer models is severely limited by the non-parallel datasets on which the models are trained. In non-parallel datasets, no direct mapping exists between sentences of the source and target style; the style transfer models thus only receive weak supervision of the target sentences during training, which often leads the model to discard too much style-indepe… ▽ More

    Submitted 17 April, 2022; originally announced April 2022.

    Comments: In ICLR 2022

  25. arXiv:2204.03084  [pdf, other]

    cs.CL cs.AI cs.LG

    Knowledge Infused Decoding

    Authors: Ruibo Liu, Guoqing Zheng, Shashank Gupta, Radhika Gaonkar, Chongyang Gao, Soroush Vosoughi, Milad Shokouhi, Ahmed Hassan Awadallah

    Abstract: Pre-trained language models (LMs) have been shown to memorize a substantial amount of knowledge from the pre-training corpora; however, they are still limited in recalling factually correct knowledge given a certain context. Hence, they tend to suffer from counterfactual or hallucinatory generation when used in knowledge-intensive natural language generation (NLG) tasks. Recent remedies to this pr… ▽ More

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: In ICLR 2022

  26. arXiv:2203.16464  [pdf, other]

    cs.LG cs.AI

    Towards Interpretable Deep Reinforcement Learning Models via Inverse Reinforcement Learning

    Authors: Sean Xie, Soroush Vosoughi, Saeed Hassanpour

    Abstract: Artificial intelligence, particularly through recent advancements in deep learning, has achieved exceptional performances in many tasks in fields such as natural language processing and computer vision. In addition to desirable evaluation metrics, a high level of interpretability is often required for these models to be reliably utilized. Therefore, explanations that offer insight into the process… ▽ More

    Submitted 1 March, 2024; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Paper accepted to ICPR 2022

  27. arXiv:2203.14498  [pdf, other]

    cs.CL cs.AI cs.LG

    EnCBP: A New Benchmark Dataset for Finer-Grained Cultural Background Prediction in English

    Authors: Weicheng Ma, Samiha Datta, Lili Wang, Soroush Vosoughi

    Abstract: While cultural backgrounds have been shown to affect linguistic expressions, existing natural language processing (NLP) research on culture modeling is overly coarse-grained and does not examine cultural differences among speakers of the same language. To address this problem and augment NLP models with cultural background features, we collect, annotate, manually validate, and benchmark EnCBP, a f… ▽ More

    Submitted 28 March, 2022; originally announced March 2022.

    Comments: In Findings of ACL 2022

  28. Emotion-based Modeling of Mental Disorders on Social Media

    Authors: Xiaobo Guo, Yaojia Sun, Soroush Vosoughi

    Abstract: According to the World Health Organization (WHO), one in four people will be affected by mental disorders at some point in their lives. However, in many parts of the world, patients do not actively seek professional diagnosis because of stigma attached to mental illness, ignorance of mental health and its associated symptoms. In this paper, we propose a model for passively detecting mental disorde… ▽ More

    Submitted 23 January, 2022; originally announced January 2022.

    Comments: Proceedings of the 20th IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT)

  29. arXiv:2109.07023  [pdf, other]

    cs.SI cs.AI cs.LG

    Embedding Node Structural Role Identity Using Stress Majorization

    Authors: Lili Wang, Chenghan Huang, Weicheng Ma, Ying Lu, Soroush Vosoughi

    Abstract: Nodes in networks may have one or more functions that determine their role in the system. As opposed to local proximity, which captures the local context of nodes, the role identity captures the functional "role" that nodes play in a network, such as being the center of a group, or the bridge between two groups. This means that nodes far apart in a network can have similar structural role identiti… ▽ More

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: In CIKM 2021

  30. arXiv:2109.07016  [pdf, other]

    cs.LG cs.AI cs.SI

    Graph Embedding via Diffusion-Wavelets-Based Node Feature Distribution Characterization

    Authors: Lili Wang, Chenghan Huang, Weicheng Ma, Xinyuan Cao, Soroush Vosoughi

    Abstract: Recent years have seen a rise in the development of representational learning methods for graph data. Most of these methods, however, focus on node-level representation learning at various scales (e.g., microscopic, mesoscopic, and macroscopic node embedding). In comparison, methods for representation learning on whole graphs are currently relatively sparse. In this paper, we propose a novel unsup… ▽ More

    Submitted 14 September, 2021; originally announced September 2021.

    Comments: In CIKM 2021

  31. arXiv:2109.05748  [pdf, other]

    cs.LG cs.CL

    GradTS: A Gradient-Based Automatic Auxiliary Task Selection Method Based on Transformer Networks

    Authors: Weicheng Ma, Renze Lou, Kai Zhang, Lili Wang, Soroush Vosoughi

    Abstract: A key problem in multi-task learning (MTL) research is how to select high-quality auxiliary tasks automatically. This paper presents GradTS, an automatic auxiliary task selection method based on gradient calculation in Transformer-based models. Compared to AUTOSEM, a strong baseline method, GradTS improves the performance of MT-DNN with a bert-base-cased backend model, from 0.33% to 17.93% on 8 na… ▽ More

    Submitted 13 September, 2021; originally announced September 2021.

    Comments: In EMNLP 2021

  32. Language Model Augmented Relevance Score

    Authors: Ruibo Liu, Jason Wei, Soroush Vosoughi

    Abstract: Although automated metrics are commonly used to evaluate NLG systems, they often correlate poorly with human judgements. Newer metrics such as BERTScore have addressed many weaknesses in prior metrics such as BLEU and ROUGE, which rely on n-gram matching. These newer methods, however, are still limited in that they do not consider the generation context, so they cannot properly reward generated te… ▽ More

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: In ACL 2021

  33. Contributions of Transformer Attention Heads in Multi- and Cross-lingual Tasks

    Authors: Weicheng Ma, Kai Zhang, Renze Lou, Lili Wang, Soroush Vosoughi

    Abstract: This paper studies the relative importance of attention heads in Transformer-based models to aid their interpretability in cross-lingual and multi-lingual tasks. Prior research has found that only a few attention heads are important in each mono-lingual Natural Language Processing (NLP) task and pruning the remaining heads leads to comparable or improved performance of the model. However, the impa… ▽ More

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: In ACL 2021

  34. Modulating Language Models with Emotions

    Authors: Ruibo Liu, Jason Wei, Chenyan Jia, Soroush Vosoughi

    Abstract: Generating context-aware language that embodies diverse emotions is an important step towards building empathetic NLP systems. In this paper, we propose a formulation of modulated layer normalization -- a technique inspired by computer vision -- that allows us to use large-scale language models for emotional response generation. In automatic and human evaluation on the MojiTalk dataset, our propos… ▽ More

    Submitted 17 August, 2021; originally announced August 2021.

    Comments: Findings of ACL 2021

  35. arXiv:2106.09923  [pdf, other]

    cs.SI cs.AI

    Embedding Heterogeneous Networks into Hyperbolic Space Without Meta-path

    Authors: Lili Wang, Chongyang Gao, Chenghan Huang, Ruibo Liu, Weicheng Ma, Soroush Vosoughi

    Abstract: Networks found in the real-world are numerous and varied. A common type of network is the heterogeneous network, where the nodes (and edges) can be of different types. Accordingly, there have been efforts at learning representations of these heterogeneous networks in low-dimensional space. However, most of the existing heterogeneous network embedding methods suffer from the following two drawbacks… ▽ More

    Submitted 18 June, 2021; originally announced June 2021.

    Comments: In proceedings of the 35th AAAI Conference on Artificial Intelligence

  36. arXiv:2105.03075  [pdf, other]

    cs.CL cs.AI cs.LG

    A Survey of Data Augmentation Approaches for NLP

    Authors: Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

    Abstract: Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensi… ▽ More

    Submitted 1 December, 2021; v1 submitted 7 May, 2021; originally announced May 2021.

    Comments: Accepted to ACL 2021 Findings. GitHub repo with paper list at https://github.com/styfeng/DataAug4NLP ; Talk at https://www.youtube.com/watch?v=kNBVesKUZCk&ab_channel=StevenFeng ; Podcast at https://www.youtube.com/watch?v=qmqyT_97Poc&ab_channel=GradientFlow and https://thedataexchange.media/data-augmentation-in-natural-language-processing

  37. arXiv:2104.14795  [pdf, other]

    cs.CL cs.AI

    Mitigating Political Bias in Language Models Through Reinforced Calibration

    Authors: Ruibo Liu, Chenyan Jia, Jason Wei, Guangxuan Xu, Lili Wang, Soroush Vosoughi

    Abstract: Current large-scale language models can be politically biased as a result of the data they are trained on, potentially causing serious problems when they are deployed in real-world settings. In this paper, we describe metrics for measuring political bias in GPT-2 generation and propose a reinforcement learning (RL) framework for mitigating political biases in generated text. By using rewards from… ▽ More

    Submitted 30 April, 2021; originally announced April 2021.

    Comments: In proceedings of the 35th AAAI Conference on Artificial Intelligence

  38. BigGreen at SemEval-2021 Task 1: Lexical Complexity Prediction with Assembly Models

    Authors: Aadil Islam, Weicheng Ma, Soroush Vosoughi

    Abstract: This paper describes a system submitted by team BigGreen to LCP 2021 for predicting the lexical complexity of English words in a given context. We assemble a feature engineering-based model with a deep neural network model founded on BERT. While BERT itself performs competitively, our feature engineering-based model helps in extreme cases, e.g., separating instances of easy and neutral difficulty. O… ▽ More

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021). Colocated with ACL-IJCNLP 2021

  39. Lone Pine at SemEval-2021 Task 5: Fine-Grained Detection of Hate Speech Using BERToxic

    Authors: Yakoob Khan, Weicheng Ma, Soroush Vosoughi

    Abstract: This paper describes our approach to the Toxic Spans Detection problem (SemEval-2021 Task 5). We propose BERToxic, a system that fine-tunes a pre-trained BERT model to locate toxic text spans in a given text and utilizes additional post-processing steps to refine the boundaries. The post-processing steps involve (1) labeling character offsets between consecutive toxic tokens as toxic and (2) assig… ▽ More

    Submitted 8 April, 2021; originally announced April 2021.

    Comments: 7 pages, 3 figures. Accepted at SemEval-2021 Workshop, ACL-IJCNLP 2021

  40. Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning

    Authors: Jason Wei, Chengyu Huang, Soroush Vosoughi, Yu Cheng, Shiqi Xu

    Abstract: Few-shot text classification is a fundamental NLP task in which a model aims to classify text into a large number of categories, given only a few training examples per category. This paper explores data augmentation -- a technique particularly suitable for training with limited data -- for this few-shot, highly-multiclass text classification setting. On four diverse text classification tasks, we f… ▽ More

    Submitted 12 March, 2021; originally announced March 2021.

    Comments: To appear at NAACL 2021

  41. Feature Selection for Multivariate Time Series via Network Pruning

    Authors: Kang Gu, Soroush Vosoughi, Temiloluwa Prioleau

    Abstract: In recent years, there has been an ever increasing amount of multivariate time series (MTS) data in various domains, typically generated by a large family of sensors such as wearable devices. This has led to the development of novel learning methods on MTS data, with deep learning models dominating the most recent advancements. Prior literature has primarily focused on designing new network archit… ▽ More

    Submitted 21 October, 2021; v1 submitted 11 February, 2021; originally announced February 2021.

    Comments: In ICDM 2021 Workshop on Systematic Feature Engineering for Time-Series Data Mining (SFE-TSDM)

  42. arXiv:2101.05469  [pdf, other]

    cs.CL

    Text Augmentation in a Multi-Task View

    Authors: Jason Wei, Chengyu Huang, Shiqi Xu, Soroush Vosoughi

    Abstract: Traditional data augmentation aims to increase the coverage of the input distribution by generating augmented examples that strongly resemble original samples in an online fashion where augmented examples dominate training. In this paper, we propose an alternative perspective -- a multi-task view (MTV) of data augmentation -- in which the primary task trains on original examples and the auxiliary… ▽ More

    Submitted 14 January, 2021; originally announced January 2021.

    Comments: Accepted to EACL 2021

  43. arXiv:2101.01391  [pdf, other]

    cs.CL cs.AI

    Political Depolarization of News Articles Using Attribute-aware Word Embeddings

    Authors: Ruibo Liu, Lili Wang, Chenyan Jia, Soroush Vosoughi

    Abstract: Political polarization in the US is on the rise. This polarization negatively affects the public sphere by contributing to the creation of ideological echo chambers. In this paper, we focus on addressing one of the factors that contributes to this polarity, polarized media. We introduce a framework for depolarizing news articles. Given an article on a certain topic with a particular ideological sl…

    Submitted 19 April, 2021; v1 submitted 5 January, 2021; originally announced January 2021.

    Comments: In Proceedings of the 15th International AAAI Conference on Weblogs and Social Media (ICWSM 2021)

  44. arXiv:2012.13675  [pdf, other]

    cs.SI cs.AI cs.CL

    Social media data reveals signal for public consumer perceptions

    Authors: Neeti Pokhriyal, Abenezer Dara, Benjamin Valentino, Soroush Vosoughi

    Abstract: Researchers have used social media data to estimate various macroeconomic indicators about public behaviors, mostly as a way to reduce surveying costs. One of the most widely cited economic indicators is the consumer confidence index (CCI). Numerous studies in the past have focused on using social media, especially Twitter data, to predict CCI. However, the strong correlations disappeared when those mo…

    Submitted 25 December, 2020; originally announced December 2020.

    Comments: In Proceedings of the ACM International Conference on AI in Finance (ICAIF '20)

  45. Multi-modal Identification of State-Sponsored Propaganda on Social Media

    Authors: Xiaobo Guo, Soroush Vosoughi

    Abstract: The prevalence of state-sponsored propaganda on the Internet has become a cause for concern in recent years. While much effort has been made to identify state-sponsored Internet propaganda, the problem remains far from being solved because the ambiguous definition of propaganda leads to unreliable data labelling, and the huge amount of potential predictive features causes the models to be inex…

    Submitted 23 December, 2020; originally announced December 2020.

    Comments: Proceedings of the 25th International Conference on Pattern Recognition (ICPR 2020)

  46. Improvements and Extensions on Metaphor Detection

    Authors: Weicheng Ma, Ruibo Liu, Lili Wang, Soroush Vosoughi

    Abstract: Metaphors are ubiquitous in human language. The metaphor detection task (MD) aims at detecting and interpreting metaphors from written language, which is crucial in natural language understanding (NLU) research. In this paper, we introduce a pre-trained Transformer-based model into MD. Our model outperforms the previous state-of-the-art models by large margins in our evaluations, with relative imp…

    Submitted 7 December, 2020; originally announced December 2020.

  47. Dartmouth CS at WNUT-2020 Task 2: Informative COVID-19 Tweet Classification Using BERT

    Authors: Dylan Whang, Soroush Vosoughi

    Abstract: We describe the systems developed for the WNUT-2020 shared task 2, identification of informative COVID-19 English Tweets. BERT is a highly performant model for Natural Language Processing tasks. We increased BERT's performance in this classification task by fine-tuning BERT and concatenating its embeddings with Tweet-specific features and training a Support Vector Machine (SVM) for classification…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: Proceedings of the 6th Workshop on Noisy User-generated Text (W-NUT) at EMNLP 2020

  48. Big Green at WNUT 2020 Shared Task-1: Relation Extraction as Contextualized Sequence Classification

    Authors: Chris Miller, Soroush Vosoughi

    Abstract: Relation and event extraction is an important task in natural language processing. We introduce a system which uses contextualized knowledge graph completion to classify relations and events between known entities in a noisy text environment. We report results which show that our system is able to effectively extract relations and events from a dataset of wet lab protocols.

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: Proceedings of the 6th Workshop on Noisy User-generated Text (W-NUT) at EMNLP 2020

  49. An Empirical Survey of Unsupervised Text Representation Methods on Twitter Data

    Authors: Lili Wang, Chongyang Gao, Jason Wei, Weicheng Ma, Ruibo Liu, Soroush Vosoughi

    Abstract: The field of NLP has seen unprecedented achievements in recent years. Most notably, with the advent of large-scale pre-trained Transformer-based language models, such as BERT, there has been a noticeable improvement in text representation. It is, however, unclear whether these improvements translate to noisy user-generated text, such as tweets. In this paper, we present an experimental survey of a…

    Submitted 7 December, 2020; originally announced December 2020.

    Comments: In proceedings of the 6th Workshop on Noisy User-generated Text (W-NUT) at EMNLP 2020

  50. arXiv:2012.02954  [pdf, other]

    cs.CL cs.LG

    Enhanced Offensive Language Detection Through Data Augmentation

    Authors: Ruibo Liu, Guangxuan Xu, Soroush Vosoughi

    Abstract: Detecting offensive language on social media is an important task. The ICWSM-2020 Data Challenge Task 2 is aimed at identifying offensive content using a crowd-sourced dataset containing 100k labelled tweets. The dataset, however, suffers from class imbalance, where certain labels are extremely rare compared with other classes (e.g., the hateful class is only 5% of the data). In this work, we prese…

    Submitted 5 December, 2020; originally announced December 2020.

    Comments: In ICWSM 2020 Data Challenge. Online