IO2-A. A short review of research in automatic detection of misinformation

Automatic detection of misinformation has received enormous attention in recent years due to the exponential growth of social media users, who continuously (re)post informative content, making classical fact-checking procedures impractical. The main interest is in the recognition of fake news, that is, intentional disinformation, in posts related to political and economic topics, but in principle some of the algorithmic results reported in the literature are applicable to other topics, to diverse information sources, and, to some extent, also to unintentional misinformation.

In the following short survey we describe some of the most recent findings in this field; the reader is referred to [9] and [12] for a detailed coverage of the topic.

Misinformation detection is typically addressed using Machine Learning (ML) techniques. An informative source on the Internet, such as a Wikipedia page, a social media post, or a blog article, has several constituents, but the most discriminative one undoubtedly remains the text. In general, the multimedia contents of an article are characterized through “tags”, i.e. textual labels that can be analyzed as proper sentences. As a consequence, the main task in automated misinformation detection is sentence or text classification: this Natural Language Processing (NLP) task consists in assigning a piece of text, or even a single sentence, a label that describes its content.
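
As a minimal illustration of the text classification task (a toy sketch with invented labels and sentences, not taken from any of the cited works), a bag-of-words classifier can be trained in a few lines with scikit-learn:

```python
# Minimal text classification sketch (hypothetical toy data, not from the cited works).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: each text carries a label describing its content.
texts = [
    "Scientists confirm vaccine passed all clinical trials",
    "Miracle cure hidden by doctors, share before deleted!",
    "Central bank raises interest rates by a quarter point",
    "Secret plan to abolish cash revealed by anonymous insider",
]
labels = ["reliable", "suspicious", "reliable", "suspicious"]

# CountVectorizer builds the bag-of-words matrix; MultinomialNB is a classic baseline.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

print(clf.predict(["Doctors hide miracle cure from the public"]))
# -> ['suspicious'] on this toy data
```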

In the Deep Learning era, text classification has obtained important results from the application of Convolutional Neural Networks (CNN) to vector embeddings derived from the ensemble of the words in the text [13][5], while tasks related to understanding a sentence, such as Question Answering (QA) or Part-Of-Speech (POS) tagging, leverage Recurrent Neural Networks (RNN) to capture the relations between words according to their position along the sentence.
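
The following is a minimal sketch of a CNN sentence classifier in the spirit of [5], written in PyTorch with invented hyper-parameters (vocabulary size, embedding dimension, filter sizes); it is meant only to show the shape of the architecture, not the exact models used in the cited papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Sketch of a CNN sentence classifier in the spirit of Kim (2014) [5]."""
    def __init__(self, vocab_size=10_000, embed_dim=100,
                 num_filters=64, kernel_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel size, sliding over word positions.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes
        )
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):              # (batch, seq_len)
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                  # (batch, embed_dim, seq_len)
        # Convolve, then max-pool over time: each filter keeps its strongest match.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(features)               # class logits

model = TextCNN()
logits = model(torch.randint(0, 10_000, (8, 40)))  # batch of 8 sentences, 40 tokens each
print(logits.shape)  # torch.Size([8, 2])
```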

As in every ML application, the focus in automatic misinformation detection is on three topics: the availability of rich annotated datasets, a suitable modeling of both the problem and the data, and the best choice of learning algorithm. Building a huge corpus of annotated articles is a very hard task: it involves downloading raw articles over a long period of time, filtering out posts that are not related to the task (e.g. Wikipedia disambiguation pages, or posts reporting emotional states or opinions), and manually labeling the remaining data. The problem of obtaining suitable data for misinformation detection, along with an interesting review of some well-established datasets, has recently been discussed in [1]. There the authors also address the need for different cues when assessing the veracity of a news article, such as the reputation of the website.
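
As a toy illustration of the filtering step (the heuristics below are invented for the example, not a pipeline described in the cited works), irrelevant posts can be discarded with simple rules before manual labeling:

```python
# Hypothetical pre-filtering heuristics for a raw article dump (illustrative only).
import re

OPINION_MARKERS = re.compile(r"\b(i think|i feel|imho|in my opinion)\b", re.IGNORECASE)

def keep_for_labeling(title: str, body: str) -> bool:
    """Return True if the article is worth sending to human annotators."""
    if "(disambiguation)" in title.lower():   # e.g. Wikipedia disambiguation pages
        return False
    if OPINION_MARKERS.search(body):          # posts reporting opinions or emotions
        return False
    if len(body.split()) < 20:                # too short to fact-check
        return False
    return True

print(keep_for_labeling("Mercury (disambiguation)", "Mercury may refer to..."))  # False
```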

In general, all the ML approaches in this field model the problem of detecting a false article using additional sources of information beyond the veracity label of the article and the vector representation of its textual components. Conroy, Rubin, and colleagues [2][6][7] report an interesting review of the approaches to fake news detection from the methodological point of view. The article is analyzed through its surface-level linguistic structure, its style of writing, and the semantic analysis of its text, but insights coming from Social Network Analysis, such as the spreading of an article within a community, are also taken into consideration. Moreover, the authors propose the use of satirical cues to detect particular styles strictly connected with the writing of a false post.
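
As a hedged illustration of surface-level stylistic cues (the specific features below are invented for the example and are not the exact feature set of [2][6][7]), a style vector might count markers typical of satirical or sensationalist writing:

```python
# Illustrative surface-level style features (invented feature set, not from [2][6][7]).
import string

def style_features(text: str) -> dict:
    words = text.split()
    n_words = max(len(words), 1)
    return {
        "exclamations_per_word": text.count("!") / n_words,
        "questions_per_word": text.count("?") / n_words,
        "all_caps_ratio": sum(w.isupper() and len(w) > 1 for w in words) / n_words,
        "avg_word_length": sum(len(w.strip(string.punctuation)) for w in words) / n_words,
    }

print(style_features("SHOCKING!!! You WON'T believe what happened next!"))
```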

As regards data modeling, the main approach relies on the use of so-called Bag Of Words (BOW) techniques, i.e. vector representations of the single words in the text that, as already mentioned, do not retain the structure of the sentence. In this respect, the article is a document belonging to a corpus, and the word embedding is often built from the TF-IDF measure, which weights the frequency of each word in a document against the number of documents in the corpus that contain it: tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. Other approaches use word2vec, a well-known embedding built by a neural network trained to predict words from the contexts in which they occur.

BOW techniques have several limitations in retaining the deep semantics of the text, which derives mainly from the text structure. A widespread technique for extending BOW approaches without losing semantics is Paragraph Vector [4], a fixed-length representation of a document, learned from variable-length texts, which is trained to predict the words occurring in the document; vectors for previously unseen documents are then inferred with the same objective. This is a suitable representation for fake news detection using CNN.

A well-accepted technique for topic representation in document corpora is Latent Dirichlet Allocation (LDA), a Bayesian scheme that describes each document in terms of a set of latent topics, which in turn are represented as probability distributions over the words of the corpus. A tweet credibility classification work relying on LDA features is reported in [3]. Toy sketches of the TF-IDF, Paragraph Vector, and LDA representations, together with a CSI-style architecture, are given at the end of this subsection.

As regards classification techniques, a brief outline of some recent approaches follows. The authors of [8] propose the Capture, Score, Integrate (CSI) model, which integrates a Recurrent Neural Network (RNN) and a fully connected network to describe both the temporal evolution of an article and its level of diffusion among different users, in order to assess fakeness. It relies on the notion of article engagement, a triple stating that user ui interacts with article aj at time t. The network also uses the text of the article, described in a suitable vector form. Two datasets, from Twitter and Weibo respectively, are used for the experiments.
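
A minimal TF-IDF sketch with scikit-learn (the documents are invented for the example):

```python
# TF-IDF document vectors on a toy corpus (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the president signed the bill",
    "the bill was never signed",
    "aliens signed a secret treaty",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # (3 docs, |vocabulary|) sparse matrix

# Words occurring in few documents (e.g. "aliens") get high weights;
# words occurring everywhere (e.g. "signed") get low ones.
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[2].round(2))))
```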
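
A sketch of Paragraph Vector [4] using the gensim Doc2Vec implementation (the hyper-parameters are invented toy settings; a real model needs far more data):

```python
# Paragraph Vector (Doc2Vec) sketch with gensim (toy data, illustrative only).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words="the senate passed the budget".split(), tags=[0]),
    TaggedDocument(words="budget talks collapsed in the senate".split(), tags=[1]),
    TaggedDocument(words="celebrity spotted with secret twin".split(), tags=[2]),
]
# vector_size, min_count, epochs are toy settings for this tiny corpus.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Fixed-length vector for a new, unseen document (inferred, as described above).
vec = model.infer_vector("the budget passed the senate".split())
print(vec.shape)  # (50,)
```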
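
A sketch of LDA topic extraction with scikit-learn (toy corpus and two topics, invented for the example; [3] uses LDA-derived features for tweet credibility):

```python
# LDA topic modeling sketch (toy corpus, illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "vaccine trial results published in medical journal",
    "new vaccine approved after clinical trial",
    "stock market falls as interest rates rise",
    "central bank raises interest rates again",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document becomes a probability distribution over the two latent topics;
# these topic proportions can then feed a credibility classifier, as in [3].
print(lda.transform(counts).round(2))
```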
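
Finally, a highly simplified sketch of a CSI-style architecture in PyTorch (the dimensions, inputs, and the way user features are summarized are invented for the illustration; the original model in [8] is more elaborate):

```python
# CSI-style sketch [8]: an RNN "captures" the temporal engagement sequence of an
# article, a small network "scores" the users who spread it, and the two parts
# are "integrated" for classification. Dimensions are invented for illustration.
import torch
import torch.nn as nn

class CSISketch(nn.Module):
    def __init__(self, engage_dim=32, user_dim=16, hidden=64):
        super().__init__()
        self.capture = nn.LSTM(engage_dim, hidden, batch_first=True)
        self.score = nn.Sequential(nn.Linear(user_dim, 1), nn.Sigmoid())
        self.integrate = nn.Linear(hidden + 1, 2)

    def forward(self, engagements, user_feats):
        # engagements: (batch, time, engage_dim) per-article engagement features
        # user_feats:  (batch, num_users, user_dim) features of spreading users
        _, (h, _) = self.capture(engagements)
        user_score = self.score(user_feats).mean(dim=1)   # average user score
        return self.integrate(torch.cat([h[-1], user_score], dim=1))

model = CSISketch()
logits = model(torch.randn(4, 10, 32), torch.randn(4, 20, 16))
print(logits.shape)  # torch.Size([4, 2])
```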

In [10] an approach to the binary classification of hoax vs. non-hoax posts on Facebook, based on the users who “liked” each article, is proposed and compared with a well-known binary classifier, logistic regression. The proposed method is based on harmonic Boolean label crowdsourcing, an iterative learning scheme aimed at building a consensus hoax/non-hoax label for each article from the labels provided by each user, along with their indications about whether a post is vandalism, violates some community guideline, and so on. The dataset is labelled manually, starting from a selection of posts collected through the Facebook Graph API.
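
A toy sketch of the logistic regression baseline on user interactions (the encoding below, one Boolean feature per user, follows the general idea of [10], but the data and dimensions are invented):

```python
# Hoax classification from "who liked what" (toy data in the spirit of [10]).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = posts, columns = users; X[i, j] = 1 if user j liked post i.
X = np.array([
    [1, 1, 0, 0, 0],   # liked mostly by users 0-1
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],   # liked mostly by users 3-4
    [0, 0, 1, 1, 1],
])
y = np.array([0, 0, 1, 1])   # 0 = non-hoax, 1 = hoax

clf = LogisticRegression().fit(X, y)
print(clf.predict([[1, 1, 0, 0, 1]]))  # mostly 'non-hoax' likers -> [0] on this toy data
```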

In [11] the author presents a dataset of short statements, extracted from Politifact.com over the period 2007-2016 and labelled manually with their degree of veracity, along with other metadata such as the context, the subject, the speaker, and so on. The idea is that the surface-level linguistic form of a statement conveys insights about its possible fakeness. The author also presents a deep neural architecture where two Convolutional Neural Networks (CNN) separately learn the vector embeddings representing the text words and the metadata, the latter treated as a sequence of terms; an RNN is then used to combine the metadata sequence with the CNN processing the text.
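
A simplified sketch of such a hybrid text-plus-metadata classifier (PyTorch, with invented dimensions; it mirrors the general idea of [11], not its exact architecture):

```python
# Hybrid text + metadata classifier sketch in the spirit of [11] (dimensions invented).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridClassifier(nn.Module):
    def __init__(self, vocab=8000, meta_vocab=500, embed=64, filters=32, classes=6):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, embed)
        self.meta_emb = nn.Embedding(meta_vocab, embed)
        self.text_cnn = nn.Conv1d(embed, filters, kernel_size=3)
        self.meta_cnn = nn.Conv1d(embed, filters, kernel_size=2)
        self.meta_rnn = nn.LSTM(filters, filters, batch_first=True)
        self.out = nn.Linear(filters * 2, classes)   # six degrees of veracity

    def forward(self, text_ids, meta_ids):
        t = F.relu(self.text_cnn(self.text_emb(text_ids).transpose(1, 2)))
        t = t.max(dim=2).values                       # pooled text features
        m = F.relu(self.meta_cnn(self.meta_emb(meta_ids).transpose(1, 2)))
        _, (h, _) = self.meta_rnn(m.transpose(1, 2))  # RNN over metadata sequence
        return self.out(torch.cat([t, h[-1]], dim=1))

model = HybridClassifier()
print(model(torch.randint(0, 8000, (4, 30)), torch.randint(0, 500, (4, 8))).shape)
# torch.Size([4, 6])
```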

 

References to research in automatic detection of misinformation

  1. Asr, F. T., & Taboada, M. (2018). The Data Challenge in Misinformation Detection: Source Reputation vs. Content Veracity. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER) (pp. 10-15). 
  2. Conroy, N. J., Rubin, V. L., & Chen, Y. (2015, November). Automatic deception detection: Methods for finding fake news. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community (p. 82). American Society for Information Science.
  3. Ito, J., Song, J., Toda, H., Koike, Y., & Oyama, S. (2015, May). Assessment of tweet credibility with LDA features. In Proceedings of the 24th International Conference on World Wide Web (pp. 953-958). ACM. 
  4. Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and documents. In International Conference on Machine Learning (pp. 1188-1196).
  5. Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. 
  6. Rubin, V. L., Chen, Y., & Conroy, N. J. (2015, November). Deception detection for news: three types of fakes. In Proceedings of the 78th ASIS&T Annual Meeting: Information Science with Impact: Research in and for the Community (p. 83). American Society for Information Science.
  7. Rubin, V., Conroy, N., Chen, Y., & Cornwell, S. (2016). Fake news or truth? using satirical cues to detect potentially misleading news. In Proceedings of the Second Workshop on Computational Approaches to Deception Detection (pp. 7-17).
  8. Ruchansky, N., Seo, S., & Liu, Y. (2017, November). Csi: A hybrid deep model for fake news detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (pp. 797-806). ACM. 
  9. Shu, K., Sliva, A., Wang, S., Tang, J., & Liu, H. (2017). Fake news detection on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter, 19(1), 22-36.
  10. Tacchini, E., Ballarin, G., Della Vedova, M. L., Moret, S., & de Alfaro, L. (2017). Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506.
  11. Wang, W. Y. (2017). "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
  12. Wu, L., Morstatter, F., Hu, X., & Liu, H. (2016). Mining misinformation in social media. Big Data in Complex and Social Networks, 123-152. 
  13. Zhang, X., Zhao, J., & LeCun, Y. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems (pp. 649-657).

 
