
Label-Aware Pseudo-Training Sample Generation for Text Classification

  • Arash Yousefi Jordehi
  • Seyed Abolghasem Mirroshandel
  • Owen Rambow
  • Guilan University

Research output: Contribution to journal, Article, peer-review

Abstract

Deep learning models excel at many Natural Language Processing (NLP) tasks, but outside zero-shot and few-shot settings their performance depends on ample training data, which poses a challenge in fields where datasets are small. Several approaches can address this scarcity of training data, such as multi-task learning and data augmentation. We propose a data augmentation algorithm that leverages Large Language Models (LLMs). It subtly alters sentences by inserting random words and uses LLMs to find the most fitting replacements within their embedding space. Taking inspiration from Prompt Tuning, the method shifts the focus from optimizing the input prompt to updating the embedding vectors of the inserted tokens by maximizing the conditional generation probability. This allows a large number of samples to be generated while implicitly benefiting from the knowledge within LLMs. Our extensive experiments on various benchmark text classification tasks show a substantial improvement over the non-augmented results.
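
The abstract gives no implementation details, so the following is only a toy sketch of the loop it describes: insert a random word, update only that word's embedding vector by gradient ascent on a label-conditioned log-probability, then replace it with the nearest vocabulary token in embedding space. Everything named here is an assumption made for illustration: the function `augment`, the random embedding table `E`, and the linear softmax scorer `W`, which stands in for an LLM's conditional generation probability. The scorer mean-pools embeddings, so unlike a real LLM it ignores word order.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over a 1-D score vector.
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def augment(token_ids, label, E, W, rng, steps=50, lr=0.5):
    """Return (new_token_ids, insert_position) for one pseudo-sample.

    E: (vocab, dim) embedding table, kept frozen (hypothetical stand-in).
    W: (labels, dim) linear scorer, a toy substitute for an LLM.
    """
    pos = int(rng.integers(0, len(token_ids) + 1))  # random insertion point
    rand_id = int(rng.integers(0, E.shape[0]))      # random word to insert
    fixed = E[token_ids]                            # original tokens stay frozen
    v = E[rand_id].copy()                           # only this vector is trained
    n = len(token_ids) + 1

    for _ in range(steps):
        h = (fixed.sum(axis=0) + v) / n             # mean-pooled sentence vector
        p = np.exp(log_softmax(W @ h))              # toy p(label | sentence)
        onehot = np.zeros_like(p)
        onehot[label] = 1.0
        grad_v = W.T @ (onehot - p) / n             # gradient of log p[label] w.r.t. v
        v += lr * grad_v                            # ascent step on the inserted token

    # Map the tuned vector back to a real token via cosine similarity.
    sims = (E @ v) / (np.linalg.norm(E, axis=1) * np.linalg.norm(v) + 1e-12)
    new_id = int(np.argmax(sims))
    return token_ids[:pos] + [new_id] + token_ids[pos:], pos

rng = np.random.default_rng(0)
E = rng.normal(size=(20, 8))   # 20-word toy vocabulary, 8-dim embeddings
W = rng.normal(size=(3, 8))    # 3 toy class labels
augmented, position = augment([1, 4, 7], label=2, E=E, W=W, rng=rng)
print(augmented, position)
```

Calling `augment` repeatedly with different random insertions yields many label-conditioned variants of one sentence. In a realistic setting the scorer would be a pretrained language model and the gradient would come from automatic differentiation rather than the closed form used here.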

Original language: English
Article number: 22
Journal: Journal of Artificial Intelligence Research
Volume: 85
State: Published - 2026
