Set of obfuscated spam dataset by using LeetSpeak transformations

De Mendizabal, Iñaki Velez; Vidriales, Xabier; Fernandes, Vitor Basto; Ezpeleta, Enaitz; Méndez, José Ramón; Zurutuza, Urko

doi:10.5281/ZENODO.6373652

De Mendizabal, Iñaki Velez ¹
Vidriales, Xabier ¹
Fernandes, Vitor Basto ²
Ezpeleta, Enaitz ¹
Méndez, José Ramón ³
Zurutuza, Urko ¹

1 Universidad de Mondragón/Mondragon Unibertsitatea, Mondragón, Spain (ROR: https://ror.org/00wvqgd19)
2 Instituto Universitário de Lisboa (ISCTE-IUL)
3 Universidade de Vigo, Vigo, Spain (ROR: https://ror.org/05rdf8595)

Publisher: Zenodo

Publication year: 2022

Type: Dataset

DOI: 10.5281/ZENODO.6373652 (open access)

Abstract

Spammers often use LeetSpeak and other text-hiding tricks when distributing unsolicited content. To evaluate deobfuscation techniques and their impact on spam content classification, we preprocessed several popular public datasets to partially obfuscate their text. The transformed datasets are: the YouTube Spam Collection [2, 3] (https://www.dt.fee.unicamp.br/~tiago/youtubespamcollection/); a subset of YouTube Comments [4, 5] (http://mlg.ucd.ie/yt/); CSDMC2010 (http://csmining.org/index.php/spam-email-datasets-.html); and TREC2007 (https://plg.uwaterloo.ca/~gvcormac/treccorpus07/).
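
To make the idea of partial obfuscation concrete, the following is a minimal Python sketch, not the procedure used to build this dataset. The substitution table `LEET_MAP`, the per-character probability `rate`, and the function names `obfuscate` and `deobfuscate` are all assumptions introduced for illustration; the record does not state which mapping or obfuscation ratio the authors applied.

```python
import random

# Hypothetical substitution table (assumption): the mapping actually used
# for the dataset is not given in this record.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def obfuscate(text, rate=0.5, seed=None):
    """Replace each mappable character with probability `rate`.

    `rate` is an assumed knob for "partial" obfuscation:
    0.0 leaves the text unchanged, 1.0 replaces every mappable character.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


def deobfuscate(text):
    """Naive inverse lookup: map each digit in the table back to its letter.

    This is lossy: it lowercases restored letters and also rewrites
    digits that were genuine numbers in the original message.
    """
    inverse = {v: k for k, v in LEET_MAP.items()}
    return "".join(inverse.get(ch, ch) for ch in text)
```

With these assumptions, `obfuscate("free money", rate=1.0)` returns `"fr33 m0n3y"`, and `deobfuscate` maps that string back to `"free money"`. The naive inverse also shows why deobfuscation is worth evaluating separately: a message containing real digits, such as a price or phone number, would be corrupted by a plain table lookup.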
