Workshop Proceedings of the 17th International AAAI Conference on Web and Social Media
Workshop: Novel Evaluation Approaches for Text Classification Systems (NEATCLasS)
DOI: 10.36190/2023.56
Social media research currently faces a data-sharing problem, as social media platforms prohibit full data distribution in their terms of service. Until recent changes to the platform, Twitter was an exception: academics could legally share Tweet and user IDs with peers, who could then re-collect the data using the Academic API endpoints. This work investigates how Twitter data is currently shared in two domains of harmful online communication: abusive language and social bot detection. We find that the frequently used intermediate strategy of sharing Twitter IDs suffers from substantial data loss, which makes computational results incomparable. Moreover, recent changes to the API bring additional expenses and longer collection times that may affect the feasibility of research projects. All of these aspects further fuel the reproducibility crisis that social media analytics currently faces. To improve the situation, we propose several best practices for research projects that use ID-based datasets in their experiments and provide recommendations for researchers who want to share their Twitter data with peers.