Many computational social science projects examine online discourse surrounding a specific event, such as a natural disaster, a sporting match, or a political event. A costly bottleneck in this work is collecting relevant social media posts. Typically, an initial corpus is collected using a high-precision, low-recall keyword search and then filtered manually by human judges.
We propose a distantly supervised approach that uses social media posts sharing a news article as a relevance signal: a Siamese architecture is trained to minimize the distance between the embeddings of articles and the embeddings of their comments. Researchers can then filter a corpus automatically by keeping the comments most similar to news articles describing the event of interest, which can substantially reduce the time and money spent on manual annotation.
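The training objective and the filtering step can be illustrated with a minimal sketch. It is not the project's implementation: the contrastive loss, the cosine distance, the `margin` and `threshold` values, and the function names `contrastive_loss` and `filter_posts` are all assumptions made for illustration, and the embeddings are taken as given rather than produced by a trained encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrastive_loss(article_vec, post_vec, is_pair, margin=0.5):
    """Assumed objective: pull true (article, comment) pairs together and
    push mismatched pairs at least `margin` apart in cosine distance."""
    distance = 1.0 - cosine_similarity(article_vec, post_vec)
    if is_pair:
        # A post that shared this article: penalize any remaining distance.
        return distance ** 2
    # A post paired with an unrelated article: penalize only if too close.
    return max(0.0, margin - distance) ** 2


def filter_posts(post_vecs, article_vecs, threshold=0.8):
    """Return indices of posts whose best similarity to any event article
    meets the (hypothetical) threshold."""
    kept = []
    for index, post in enumerate(post_vecs):
        best = max(cosine_similarity(post, article) for article in article_vecs)
        if best >= threshold:
            kept.append(index)
    return kept


# Toy 2-d embeddings standing in for encoder outputs.
articles = [[1.0, 0.0]]
posts = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
relevant = filter_posts(posts, articles)
```

In this toy setup only the first post lies close enough to the article embedding to be kept. In practice the threshold would need to be tuned on a small labeled sample, and a ranked list may be more useful than a hard cutoff.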
Paper/technical report to be posted shortly.