Fake News Project

Group 1 - shared repository related to the Fake News Project in Data Science 2023.
The code has only been tested on Linux (Arch, NixOS, Fedora); it will not work on Windows, since the scripts use UNIX paths.

Specs

The code was run on several machines. The large multithreaded tasks (tokenisation, TF-IDF, SVD, UMAP, etc.) and the XGBoost, small DNN, and logistic regression models were all trained on the server with the specs below. Other tasks (big DNN, SQL, data exploration, etc.) were done on two other machines.

Server

  • 40 GiB DDR3 RAM at 1333 MHz (2×8 GiB + 6×4 GiB), plus 20 GiB of virtual compressed RAM (ZRAM)
  • Intel Core i7-3820 CPU
  • P9X79 motherboard

Arch machine

  • 7.6 GiB RAM, 2 GiB swap
  • Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz

Pipeline

To reproduce our pipeline on the FakeNews dataset:

  1. Run download_data.sh, which downloads the dataset, extracts it, and removes carriage returns (see the sketch after this list).
  2. Enter the virtual Python environment by running pipenv shell.
  3. Go through the steps in preprocess.ipynb.
  4. Choose which script to run, one of:
    • simple_model/simple_model.py
    • complex_models/bigdnn_complex.py
    • complex_models/dnn_complex.py
    • complex_models/xgboost_complex.py
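
The cleanup in step 1 is just a byte-level substitution. As a rough illustration, here is a hypothetical Python equivalent of what download_data.sh does after downloading and extracting (the file names are placeholders, not the real ones):

    # Strip carriage returns from the extracted corpus, 1 MiB at a time.
    with open("news_raw.csv", "rb") as src, open("news.csv", "wb") as dst:
        for block in iter(lambda: src.read(1 << 20), b""):
            dst.write(block.replace(b"\r", b""))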

Some of these scripts require copious amounts of RAM (30+ GiB). However, bigdnn_complex.py (ironically) does everything in chunks, except when fitting the TF-IDF vectoriser. That fit can, however, be done rather effectively on a single parquet file from test.parquet, making it possible to train and run predictions with this model on a low-resource computer, even on the entire dataset (which is what we did on the Arch machine). Other files, such as tokenise.py, exist as headless alternatives to some of the steps in the notebook, so they can be run from a terminal.
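
A minimal sketch of that chunked pattern, assuming a folder of parquet chunks and a text column named content (both assumptions; the real file layout and column names may differ):

    # Fit the TF-IDF vectoriser on one parquet chunk, then transform the
    # whole corpus chunk by chunk so it never has to sit in RAM at once.
    from pathlib import Path
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    chunks = sorted(Path("data/chunks").glob("*.parquet"))  # hypothetical layout
    vectoriser = TfidfVectorizer(max_features=50_000)       # hypothetical cap
    vectoriser.fit(pd.read_parquet(chunks[0])["content"])   # fit on one chunk only

    for chunk in chunks:
        X = vectoriser.transform(pd.read_parquet(chunk)["content"])
        # ...train or predict on X here, one chunk at a time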

  • Running umap_fake.py or umap_liar.py produces beautiful unsupervised maps of the corpus.
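
A sketch of the general recipe such a script might follow, assuming TF-IDF features reduced with SVD before UMAP (as the Specs section suggests) and hypothetical input and column names:

    # TF-IDF -> truncated SVD -> 2-D UMAP -> scatter plot of the corpus.
    import pandas as pd
    import matplotlib.pyplot as plt
    import umap
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    df = pd.read_parquet("data/sample.parquet")          # hypothetical input
    X = TfidfVectorizer(max_features=20_000).fit_transform(df["content"])
    X = TruncatedSVD(n_components=100).fit_transform(X)  # densify before UMAP
    emb = umap.UMAP().fit_transform(X)                   # defaults to 2 components

    plt.scatter(emb[:, 0], emb[:, 1], c=df["label"].factorize()[0], s=2)
    plt.savefig("umap_fake.png")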

LIAR

To prepare the dataset:

  1. Run liar/download_data.sh
  2. Run liar/tokenise.py

The LIAR dataset is now ready, and you can run predict_on_liar.py to check how the models perform on this out-of-domain data.
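
For flavour, a hedged sketch of such an out-of-domain check; the model and file names are invented for illustration, since how the real script loads its models is not specified here:

    # Evaluate a previously trained model on the tokenised LIAR data.
    import pandas as pd
    from joblib import load
    from sklearn.metrics import classification_report

    liar = pd.read_parquet("liar/liar.parquet")  # hypothetical output of tokenise.py
    vectoriser = load("tfidf.joblib")            # hypothetical saved artefacts
    model = load("simple_model.joblib")

    X = vectoriser.transform(liar["content"])
    print(classification_report(liar["label"], model.predict(X)))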

Dependencies

pipenv is required to ensure that your Python environment matches ours.
Nix flakes were used to install pipenv on our server.