4/15/2023

Labeljoy qty field

Data imbalance is a frequently occurring problem in classification tasks, where the number of samples in one category far exceeds the number in the others. Quite often, the minority class is of great importance: it represents the concepts of interest and is often challenging to obtain in real-life scenarios and applications. Imagine a customer dataset for bank loans: the majority of instances belong to the non-defaulter class and only a small number of customers are labeled as defaulters, yet in such a highly imbalanced dataset, accuracy on the defaulter label matters more than accuracy on the non-defaulter label. A lack of sufficient samples across all class labels results in data imbalance, which causes poor classification performance when the model is trained.

Synthetic data generation and oversampling techniques such as SMOTE and ADASYN can address this issue for statistical data, yet such methods suffer from overfitting and substantial noise. While such techniques have proved useful for synthetic numerical and image data generation using GANs, the effectiveness of approaches proposed for textual data, which can retain grammatical structure, context, and semantic information, has yet to be evaluated. In this paper, we address this issue by assessing text sequence generation algorithms coupled with grammatical validation on domain-specific, highly imbalanced datasets for text classification. We exploit the recently proposed GPT-2 and LSTM-based text generation models to introduce balance in highly imbalanced text datasets. The experiments presented in this paper on three highly imbalanced datasets from different domains show that the performance of the same deep neural network models improves by up to 17% when the datasets are balanced using generated text.

Albeit challenging, many models can be proposed to provide shared background domain knowledge as well as lexicon and grammar validation for language understanding. The accuracy achieved by various existing language models is far below the threshold needed for them to be employed successfully in various text sequence generation tasks. Language models can be evaluated subjectively by humans or linguistic experts, or objectively using performance evaluation metrics.
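
The balancing idea described in this post (generate extra text for under-represented classes and keep only candidates that pass a validation check) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `balance_with_generated_text`, `toy_generator`, and the `is_valid` parameter are hypothetical names introduced here. The toy generator merely shuffles the words of existing samples as a stand-in for a GPT-2 or LSTM generator, and the default validity check only rejects empty strings where a real pipeline would run a grammar check.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def toy_generator(label: str, seeds: List[str],
                  _rng: random.Random = random.Random(0)) -> str:
    """Hypothetical stand-in for a GPT-2 / LSTM generator.

    It only shuffles the words of one existing sample of the class,
    so it does NOT preserve grammar; it exists to make the sketch runnable.
    """
    words = _rng.choice(seeds).split()
    _rng.shuffle(words)
    return " ".join(words)


def balance_with_generated_text(
    samples: List[Sample],
    generate: Callable[[str, List[str]], str],
    is_valid: Callable[[str], bool] = lambda text: bool(text.strip()),
    max_attempts_per_sample: int = 5,
) -> List[Sample]:
    """Top up every class to the size of the largest class.

    `generate(label, seeds)` proposes a new text for `label`;
    `is_valid(text)` is an assumed hook for grammatical validation.
    """
    if not samples:
        return []

    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    balanced = list(samples)  # original samples are kept unchanged

    for label, count in counts.items():
        seeds = [text for text, lab in samples if lab == label]
        needed = target - count
        # Bound the work so a generator that never validates cannot loop forever.
        budget = max_attempts_per_sample * needed
        while needed > 0 and budget > 0:
            budget -= 1
            candidate = generate(label, seeds)
            if is_valid(candidate):
                balanced.append((candidate, label))
                needed -= 1
    return balanced


if __name__ == "__main__":
    data = ([("loan repaid on time", "non_defaulter")] * 8
            + [("missed three payments", "defaulter")] * 2)
    balanced = balance_with_generated_text(data, toy_generator)
    print(Counter(label for _, label in balanced))
```

In a real setting, `generate` would wrap a fine-tuned text generation model and `is_valid` a grammar checker; both are left as plain callables here so the balancing logic stays independent of any particular library.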