Knowledge Agora



Scientific Article details

Title The Effect of Training Data Size on Disaster Classification from Twitter
ID_Doc 65149
Authors Effrosynidis, D; Sylaios, G; Arampatzis, A
Year 2024
Published Information, 15, 7
DOI 10.3390/info15070393
Abstract In the realm of disaster-related tweet classification, this study presents a comprehensive analysis of various machine learning algorithms, shedding light on crucial factors influencing algorithm performance. The exceptional efficacy of simpler models is attributed to the quality and size of the dataset, enabling them to discern meaningful patterns. While powerful, complex models are time-consuming and prone to overfitting, particularly with smaller or noisier datasets. Hyperparameter tuning, notably through Bayesian optimization, emerges as a pivotal tool for enhancing the performance of simpler models. A practical guideline for algorithm selection based on dataset size is proposed, consisting of Bernoulli Naive Bayes for datasets below 5000 tweets and Logistic Regression for larger datasets exceeding 5000 tweets. Notably, Logistic Regression shines with 20,000 tweets, delivering an impressive combination of performance, speed, and interpretability. A further improvement of 0.5% is achieved by applying ensemble and stacking methods.
Author Keywords event detection; disaster; twitter; classification; ensemble
Document Type Other
Open Access Open Access
Source Emerging Sources Citation Index (ESCI)
EID WOS:001277212600001
WoS Category Computer Science, Information Systems
Research Area Computer Science
PDF https://doi.org/10.3390/info15070393
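
The selection guideline stated in the abstract can be sketched as a small helper. This is only an illustration under stated assumptions: the function name `suggest_classifier` and the constant `TWEET_THRESHOLD` are hypothetical, and the abstract does not say which model to use for exactly 5000 tweets, so the sketch arbitrarily assigns that case to Logistic Regression.

```python
# Threshold taken from the abstract's guideline (5000 tweets).
TWEET_THRESHOLD = 5000


def suggest_classifier(n_tweets: int) -> str:
    """Return the name of the model the abstract's guideline suggests.

    Assumption: a dataset of exactly 5000 tweets is treated as "larger",
    because the abstract leaves that boundary case unspecified.
    """
    if n_tweets < 0:
        raise ValueError("n_tweets must be non-negative")
    if n_tweets < TWEET_THRESHOLD:
        # Datasets below 5000 tweets: Bernoulli Naive Bayes.
        return "BernoulliNB"
    # Larger datasets: Logistic Regression.
    return "LogisticRegression"


if __name__ == "__main__":
    for size in (1000, 5000, 20000):
        print(size, suggest_classifier(size))
```

The returned strings happen to match the scikit-learn class names `BernoulliNB` and `LogisticRegression`, but this record does not state which library the authors used, so that mapping is an assumption as well.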