Many computational social science projects examine online discourse surrounding a specific event, such as a natural disaster, a sporting match, or a political event. A costly bottleneck in this work is collecting relevant social media posts. Typically, an initial corpus is collected with a high-recall, low-precision keyword search and then filtered manually by human judges.
We design an efficient Siamese architecture that minimizes the distance between embeddings of news articles and their comments. By allowing researchers to automatically filter corpora, retaining the comments most similar to news articles describing the event of interest, this approach can greatly reduce the time and money spent on manual annotation.
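
As a rough illustration of the filtering step only, the sketch below keeps the comments whose embeddings lie close to at least one article embedding. It assumes the embeddings have already been produced by some trained encoder. Cosine similarity, the `threshold` value, and the names `cosine_similarity` and `filter_comments` are illustrative assumptions, not details taken from the architecture described here.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_comments(
    article_embeddings: Sequence[Sequence[float]],
    comment_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.8,  # assumed cut-off; would need tuning on labeled data
) -> List[int]:
    """Return indices of comments similar enough to at least one article."""
    kept = []
    for index, comment in enumerate(comment_embeddings):
        # Score each comment by its best match among the event's articles.
        best = max(
            (cosine_similarity(comment, article) for article in article_embeddings),
            default=0.0,
        )
        if best >= threshold:
            kept.append(index)
    return kept


# Toy two-dimensional embeddings, hand-written for illustration only.
articles = [[1.0, 0.0]]
comments = [[0.9, 0.1], [0.0, 1.0], [1.0, 1.0]]
relevant = filter_comments(articles, comments, threshold=0.8)
```

Raising the threshold trades recall for precision, so in practice it would likely be chosen against a small hand-labeled sample of the keyword-collected corpus.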