Introduction

The problem of fake news has grown steadily in the internet age, where thoughts and information can be conveyed to the whole world at the press of a button. Fake news is spread everywhere and by everyone, knowingly or unknowingly, from WhatsApp messages forwarded by our uncles to large media houses. False information is shared for political gain or to spread propaganda against a particular caste, creed, race or religion, which makes this a critical issue in today’s world.
We present a solution to the task of differentiating fake news from authentic news using Deep Learning architectures. Automated detection of fake news is hard because the model must not only understand nuances in natural language but also judge how related or unrelated the reported news is to real news, which goes beyond a plain binary classification task. To address this, we present a Neural Network architecture combined with Natural Language Processing techniques to accurately predict the stance of a given news article.

Dataset

To train our Neural Network for this task, we used two datasets comprising news articles from various media houses and sources, published around the time of the 2016 US Presidential election.

  1. Fake News from Kaggle
  2. Fake real news_dataset from GitHub

These two datasets were merged into one larger dataset for training and validation, randomly shuffled, and split with a test ratio of 0.35 (a relatively high value, chosen to guard against overfitting). You can find the merged dataset here.
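The merge, shuffle and split described above can be sketched in a few lines. This is only an illustration using the standard library: the function name `merge_shuffle_split`, the seed, and the toy rows are assumptions, not the code from the notebooks.

```python
import random

def merge_shuffle_split(datasets, test_ratio=0.35, seed=42):
    """Concatenate datasets, shuffle the rows, and split into (train, test)."""
    merged = [row for dataset in datasets for row in dataset]
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    rng.shuffle(merged)
    n_test = round(len(merged) * test_ratio)
    return merged[n_test:], merged[:n_test]

# Toy stand-ins for the two sources; each row is (text, label), 1 = fake, 0 = real.
kaggle_rows = [(f"kaggle article {i}", i % 2) for i in range(60)]
github_rows = [(f"github article {i}", i % 2) for i in range(40)]

train_rows, test_rows = merge_shuffle_split([kaggle_rows, github_rows])
print(len(train_rows), len(test_rows))  # 65 35
```

Shuffling before the split matters here because the merged list would otherwise keep all rows from one source together.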

Exploratory Data Analysis

  1. Event Rates
    • Train Dataset: 0 (Real) - 10333 (49.24%), 1 (Fake) - 10654 (50.76%)
    • Test Dataset: 0 (Real) - 5564 (49.23%), 1 (Fake) - 5738 (50.77%)
    • Overall: 0 (Real) - 15897 (49.16%), 1 (Fake) - 16428 (50.84%)
  2. Sentiment Analysis of Dataset
    • We trained a model with only a CNN backbone on the IMDB movie rating dataset, with labels 1 (Positive) and 0 (Negative).
    • Applying this sentiment classifier to the text of our fake news dataset, we find that 65.3% of the articles are negative and 34.7% are positive.
  3. Word Count Boxplot
    • The mean word count is 446.307 and the standard deviation of the word count is 428.3.
    • 25% of the articles have a word count below 163 and 75% have a word count below 621.
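The event rates and word-count statistics above are simple to recompute. The sketch below assumes labels are stored as 0/1 integers and that words are split on whitespace; the helper names `class_shares` and `word_count_summary` are hypothetical, and the population standard deviation is an arbitrary choice, since the document does not say which variant was used.

```python
from collections import Counter
import statistics

def class_shares(labels):
    """Return {label: (count, percent of total)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (n, round(100 * n / total, 2)) for label, n in sorted(counts.items())}

def word_count_summary(texts):
    """Mean, standard deviation and quartiles of whitespace-delimited word counts."""
    counts = [len(text.split()) for text in texts]
    q1, median, q3 = statistics.quantiles(counts, n=4)
    return {
        "mean": statistics.mean(counts),
        "std": statistics.pstdev(counts),  # population std; an assumption
        "q1": q1,
        "median": median,
        "q3": q3,
    }

# Rebuild the train-set labels from the reported counts and recompute the rates.
train_labels = [0] * 10333 + [1] * 10654
print(class_shares(train_labels))  # {0: (10333, 49.24), 1: (10654, 50.76)}

# Toy texts only; the real input would be the article bodies.
toy_texts = ["one two three", "one two", "one two three four five", "one"]
print(word_count_summary(toy_texts))
```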

Preprocessing and Text Cleaning

The text of every news article is processed before being fed into the model for training. The cleaning steps were performed in sequence using the Regex library.

The notebook GNR652_Fake_News_Detection_DL_preprocessing.ipynb in the src folder contains the code.
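As a rough idea of what sequential regex-based cleaning looks like, here is a minimal sketch using Python's built-in `re` module. The specific steps (lowercasing, removing URLs, removing non-letter characters, collapsing whitespace) are assumptions chosen for illustration; the notebook may apply different or additional steps.

```python
import re

# Patterns for the assumed cleaning steps.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
WHITESPACE_RE = re.compile(r"\s+")

def clean_text(text):
    """Apply the assumed cleaning steps in sequence and return the cleaned string."""
    text = text.lower()                  # 1. normalise case
    text = URL_RE.sub(" ", text)         # 2. drop URLs
    text = NON_LETTER_RE.sub(" ", text)  # 3. drop digits, punctuation and symbols
    text = WHITESPACE_RE.sub(" ", text)  # 4. collapse runs of whitespace
    return text.strip()

print(clean_text("BREAKING: Read more at https://example.com/story?id=42 !!!"))
# breaking read more at
```

The order matters: URLs are removed before punctuation, otherwise fragments such as "https" and "com" would survive as ordinary words.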

Model Building

Our Neural Network comprises 6 different kinds of layers.

The notebook GNR652Fake_News_Detection_Model_building(1).ipynb in the src folder contains the code. GloVe is used to obtain pre-trained vector representations of words for the embedding layer. To download Stanford’s GloVe 100d word embedding file, click here.
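To show how a GloVe file can feed an embedding layer, the sketch below parses the GloVe text format (one token per line, followed by its vector components separated by spaces) and builds an embedding matrix for a vocabulary. The helper names, the toy 3-dimensional vectors, and the convention that word indices start at 1 with row 0 reserved for padding are assumptions for illustration, not the notebook's code.

```python
import io

def load_glove(lines):
    """Parse GloVe text format into a {token: vector} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 = padding (assumed)
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None and len(vec) == dim:
            matrix[idx] = vec
    return matrix

# Tiny 3-dimensional stand-in for the 100d file.
sample = io.StringIO("news 0.1 0.2 0.3\nfake -0.4 0.5 0.6\n")
vectors = load_glove(sample)

word_index = {"fake": 1, "news": 2, "hoax": 3}  # hypothetical tokenizer vocabulary
matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(matrix)
# [[0.0, 0.0, 0.0], [-0.4, 0.5, 0.6], [0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]
```

For the real file, an open text file handle can be passed to `load_glove` with `dim=100`; the resulting matrix would then typically be used as the initial weights of the embedding layer.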

Results

The model was trained for 100 epochs. The imgs folder contains the accuracy and loss plots. Prediction accuracy is 89.60% on the test dataset and 93.74% on the train dataset.

Inferences