Knowledge Agora

Title	The Effect of Training Data Size on Disaster Classification from Twitter
ID_Doc	65149
Authors	Effrosynidis, D; Sylaios, G; Arampatzis, A
Year	2024
Published	Information, 15, 7
Abstract	In the realm of disaster-related tweet classification, this study presents a comprehensive analysis of various machine learning algorithms, shedding light on crucial factors influencing algorithm performance. The exceptional efficacy of simpler models is attributed to the quality and size of the dataset, enabling them to discern meaningful patterns. While powerful, complex models are time-consuming and prone to overfitting, particularly with smaller or noisier datasets. Hyperparameter tuning, notably through Bayesian optimization, emerges as a pivotal tool for enhancing the performance of simpler models. A practical guideline for algorithm selection based on dataset size is proposed, consisting of Bernoulli Naive Bayes for datasets below 5000 tweets and Logistic Regression for larger datasets exceeding 5000 tweets. Notably, Logistic Regression shines with 20,000 tweets, delivering an impressive combination of performance, speed, and interpretability. A further improvement of 0.5% is achieved by applying ensemble and stacking methods.
PDF	https://doi.org/10.3390/info15070393
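
The dataset-size guideline stated in the abstract can be sketched as a small selection helper. This is only an illustrative sketch, not the authors' code: the function name `pick_classifier` and the constant `THRESHOLD` are hypothetical, and the abstract does not say what to do at exactly 5000 tweets, so the tie is assigned to Logistic Regression here as an assumption. The returned strings match the scikit-learn class names `BernoulliNB` and `LogisticRegression`, although the record does not state which library the study used.

```python
# Hypothetical helper illustrating the abstract's rule of thumb.
THRESHOLD = 5000  # tweet count named in the abstract


def pick_classifier(n_tweets: int) -> str:
    """Return the name of the suggested classifier for a training set size."""
    if n_tweets < 0:
        raise ValueError("n_tweets must be non-negative")
    if n_tweets < THRESHOLD:
        # Below 5000 tweets: Bernoulli Naive Bayes.
        return "BernoulliNB"
    # Assumption: exactly 5000 is treated like the larger-dataset case.
    return "LogisticRegression"


if __name__ == "__main__":
    for size in (1000, 4999, 5000, 20000):
        print(size, pick_classifier(size))
```

The returned name could then be mapped to an estimator in whatever library is in use; hyperparameter tuning and the ensemble and stacking steps mentioned in the abstract are outside this sketch.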
