NLP Classification

Predicting Scientific Rigor Through Text

Reddit, often dubbed “the front page of the internet,” is one of the largest and most comprehensively documented human conversations of our time.

With 330 million monthly users, Reddit is home to forums and communities for nearly every topic imaginable. This makes it a strong candidate for natural language processing (NLP) models, since its posts reflect the way people communicate in writing day to day.

Given the prominence of social media as a communication platform, I was curious to test whether a computer could use short-form text to distinguish a substantiated claim from an unsubstantiated one.

This project uses NLP modeling techniques to predict which of two subreddits a piece of text came from: r/Science or r/EverythingScience. The two subreddits are nearly identical in subject matter; in both, users post articles on news relevant to the scientific community. The main difference between them is that r/Science requires all posts to be peer-reviewed studies, while r/EverythingScience does not.
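To make the task concrete, here is a minimal sketch of a two-class text classifier in plain Python. It is an illustration only, not the project's actual pipeline: the choice of multinomial Naive Bayes, the `NaiveBayesTextClassifier` class, the `tokenize` helper, and the example titles are all assumptions invented for this sketch, and the titles are not real Reddit posts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical helper: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so drop them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up titles for illustration only; not real Reddit data.
titles = [
    "Randomized controlled trial finds drug reduces blood pressure, study in journal reports",
    "Peer-reviewed study of cohort links sleep duration to memory",
    "Journal study: meta-analysis of trials shows vaccine efficacy",
    "Why scientists are worried about funding cuts this year",
    "Opinion: what the new space telescope means for astronomy fans",
    "Interview with a climate scientist about her career",
]
labels = ["r/Science"] * 3 + ["r/EverythingScience"] * 3

model = NaiveBayesTextClassifier().fit(titles, labels)
prediction = model.predict("New study in journal reports trial results")
print(prediction)
```

The sketch treats each post as a bag of words and picks the subreddit whose training vocabulary makes the post most likely. A real experiment would need far more data, a held-out test set, and probably a library implementation rather than a hand-rolled one.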

Check out the GitHub Repo for this project