Hate speech and social acceptance of migrants in Europe: Analysis of tweets with geolocation (ICPSR doi:10.17903/FK2/G83HNY)

Document Description

Citation

Title:

Hate speech and social acceptance of migrants in Europe: Analysis of tweets with geolocation

Identification Number:

doi:10.17903/FK2/G83HNY

Distributor:

Κατάλογος Δεδομένων SoDaNet

Date of Distribution:

2024-04-30

Version:

1

Bibliographic Citation:

Arcila Calderón, Carlos, 2024, "Hate speech and social acceptance of migrants in Europe: Analysis of tweets with geolocation", https://doi.org/10.17903/FK2/G83HNY, Κατάλογος Δεδομένων SoDaNet, version 1, UNF:6:5DjJEZExD6zQ5T6FgUfbAA== [fileUNF]

Holdings Information:

https://doi.org/10.17903/FK2/G83HNY

Study Description

Citation

Title:

Hate speech and social acceptance of migrants in Europe: Analysis of tweets with geolocation

Identification Number:

doi:10.17903/FK2/G83HNY

Authoring Entity:

Arcila Calderón, Carlos (University of Salamanca)

Software used in Production:

Other

Distributor:

Κατάλογος Δεδομένων SoDaNet

Date of Distribution:

2024-04-30

Holdings Information:

https://doi.org/10.17903/FK2/G83HNY

Study Scope

Topic Classification:

Media

Abstract:

The data project comprises a large-scale longitudinal analysis (2015-2020) of online hate speech on Twitter (N=847,978). A tweet database was generated by collecting tweets through Twitter's Application Programming Interface (API), specifically the v2 full-archive search endpoint on the Academic Research product track (https://developer.twitter.com/en/docs/tutorials/getting-historical-tweets-using-the-full-archive-search-endpoint), which provides access to the historical archive of messages since Twitter was created in 2006. To download the tweets, we first defined the search filter by keyword and geographic zone using the Python programming language and the NLTK (https://www.nltk.org), TensorFlow (https://www.tensorflow.org), Keras (https://keras.io) and NumPy (https://numpy.org) libraries. We established generic words directly related to the topic, taking into account linguistic agreement in Spanish (i.e., gender and number inflections) but without considering adjectives, for instance: migrant, migrants, immigrant, immigrants, refugee (both masculine and feminine forms in Spanish), refugees (both masculine and feminine forms in Spanish), asylum seeker, asylum seekers (the keywords are available as supplementary materials at https://doi.org/10.6084/m9.figshare.16708945.v3). For hate speech detection in tweets, we took as a basis a tool (http://pharminterface.usal.es) created and validated by Vrysis et al. (2021) (https://doi.org/10.3390/fi13030080).
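The keyword and geographic filter described above can be sketched as a query builder for the v2 full-archive search endpoint. This is a minimal illustration, not the study's collection script: the Spanish keyword forms are a hypothetical subset (the actual list is in the supplementary materials), the use of the `place_country:` operator for the geographic zones is an assumption, and the function names are invented for this example. No request is sent; the code only assembles the URL. Access terms for this endpoint may have changed since the data were collected.

```python
from urllib.parse import urlencode

# Twitter API v2 full-archive search endpoint (Academic Research track).
SEARCH_ALL_URL = "https://api.twitter.com/2/tweets/search/all"

# Hypothetical subset of Spanish keyword forms (gender and number inflections);
# the study's real keyword list is published separately.
KEYWORDS_ES = [
    "migrante", "migrantes",
    "inmigrante", "inmigrantes",
    "refugiado", "refugiada", "refugiados", "refugiadas",
    "solicitante de asilo", "solicitantes de asilo",
]


def build_query(keywords, country_code):
    """Join keywords with OR and restrict to tweets geotagged in one country.

    Multi-word keywords are quoted so they match as exact phrases.
    Using place_country: as the geographic filter is an assumption.
    """
    terms = [f'"{k}"' if " " in k else k for k in keywords]
    return f"({' OR '.join(terms)}) place_country:{country_code}"


def build_request_url(query, start_time, end_time, max_results=500):
    """Assemble the GET URL; timestamps are ISO 8601 strings in UTC."""
    params = {
        "query": query,
        "start_time": start_time,
        "end_time": end_time,
        "max_results": max_results,  # full-archive search accepts 10 to 500
        "tweet.fields": "created_at,geo,lang",
    }
    return f"{SEARCH_ALL_URL}?{urlencode(params)}"


url = build_request_url(
    build_query(KEYWORDS_ES, "ES"),
    start_time="2015-01-01T00:00:00Z",
    end_time="2020-12-31T23:59:59Z",
)
```

A real client would send this URL with a bearer token in the `Authorization` header and follow the `next_token` pagination field until the result set is exhausted.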
For this research, the tool has been retrained with:

  1. unsupervised dictionary-based term detection; and
  2. a supervised approach (machine learning with neural networks),

using a corpus of 90,977 short messages, of which 15,761 were in Greek (4,359 with hate toward immigrants), 46,012 in Spanish (11,117 with hate toward immigrants) and 29,204 in Italian (5,848 with hate toward immigrants). This corpus comes from two sources:

  1. the import of messages already classified in other databases (n=57,328, of which 5,362 are generic messages in Greek, 23,787 are generic messages and 9,727 are messages with hate toward immigrants in Spanish, and 18,452 are generic messages in Italian); and
  2. messages manually coded by trained local analysts (in Spain, Greece and Italy), using at least two coders per message and discarding every message without 100% intercoder agreement (the level of agreement in the tests was 94%) (n=33,649, of which 6,040 are messages about immigration without hate and 4,359 are messages with hate toward immigrants in Greek; 11,108 are messages about immigration without hate and 1,390 are messages with hate toward immigrants in Spanish; and 4,904 are messages about immigration without hate and 5,848 are messages with hate toward immigrants in Italian).

The corpus was divided into 80% training and 20% test. In the models, embeddings were used for language representation and Recurrent Neural Networks (RNN) for supervised text classification. Specifically, the model consists of an embedding input layer built from the 1,000 most frequent words with 8 dimensions, two hidden Gated Recurrent Unit (GRU) layers with 64 neurons each, and a dense output layer with one neuron and softmax activation (the model is compiled with the Adam optimizer and the Sparse Categorical Crossentropy loss).
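The network described above can be sketched in Keras as follows. This is an illustrative reconstruction from the abstract, not the authors' code. One detail is an assumption: the abstract reports a single output neuron with softmax, but a softmax over one unit always returns 1.0, so this sketch uses two output units (no hate / hate), which is the shape Sparse Categorical Crossentropy expects for integer labels 0 and 1. Variable sequence length and the function name `build_model` are also choices made for this example.

```python
import tensorflow as tf

VOCAB_SIZE = 1000   # the 1,000 most frequent words
EMBED_DIM = 8       # embedding dimensions
GRU_UNITS = 64      # neurons per GRU layer
NUM_CLASSES = 2     # assumption: two softmax units (no hate / hate)


def build_model():
    """Embedding -> GRU -> GRU -> Dense(softmax), compiled with Adam."""
    model = tf.keras.Sequential()
    # Input: sequences of integer word indices, any length.
    model.add(tf.keras.Input(shape=(None,), dtype="int32"))
    model.add(tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM))
    # The first GRU returns the full sequence so the second GRU can consume it.
    model.add(tf.keras.layers.GRU(GRU_UNITS, return_sequences=True))
    model.add(tf.keras.layers.GRU(GRU_UNITS))
    model.add(tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model
```

Training would then call `model.fit` on the 80% split of padded index sequences with 0/1 labels and evaluate on the remaining 20%.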

Time Period:

2015-01-01 to 2020-12-31

Unit of Analysis:

Media unit: Text

Methodology and Processing

Mode of Data Collection:

Content coding

Mode of Data Collection:

Other

Type of Research Instrument:

Programming script

Data Access

File Description--f6053

File: tweets_HMB.tab

  • Number of cases: 882346

  • No. of variables per record: 8

  • Type of File: text/tab-separated-values

Notes:

UNF:6:5DjJEZExD6zQ5T6FgUfbAA==
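A minimal sketch for reading the tab-separated file and checking that every record carries the 8 variables listed above. It assumes the first row is a header and that the file is UTF-8 encoded; neither is stated in this record, and the function name `read_tab` is hypothetical. Variable names are not assumed; consult Codebook_Tweets.pdf for them.

```python
import csv

EXPECTED_VARIABLES = 8  # "No. of variables per record" in the file description


def read_tab(handle, expected_columns=EXPECTED_VARIABLES):
    """Read a tab-separated file object and validate the column count.

    Returns (header, rows). Raises ValueError if the header or any record
    does not have the expected number of fields.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumption: first row holds the variable names
    if len(header) != expected_columns:
        raise ValueError(
            f"header has {len(header)} fields, expected {expected_columns}"
        )
    rows = []
    for record_no, row in enumerate(reader, start=1):
        if len(row) != expected_columns:
            raise ValueError(
                f"record {record_no} has {len(row)} fields, "
                f"expected {expected_columns}"
            )
        rows.append(row)
    return header, rows


# Usage (assumes the file has been downloaded to the working directory):
#   with open("tweets_HMB.tab", newline="", encoding="utf-8") as fh:
#       header, rows = read_tab(fh)
#   print(len(rows))  # compare with the number of cases reported above
```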

Other Study-Related Materials

Label:

Codebook_Tweets.pdf

Notes:

application/pdf