{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Sentiment classification with VADER" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading the IMDB dataset\n", "Only the test data is loaded, since VADER is lexicon-based and does not require training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# download only once\n", "from urllib import request\n", "url = \"https://goo.gl/mg8bsD\"\n", "response = request.urlopen(url)\n", "text = response.read().decode('utf-8')\n", "with open('imdb_test.txt',mode='w',encoding='utf-8') as outputfile:\n", "    outputfile.write(text)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('imdb_test.txt',mode='r',encoding='utf-8') as inputfile:\n", "    text = inputfile.read()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import csv\n", "x_test = list()\n", "y_test = list()\n", "# each row is tab-separated: review text, then the 0/1 label\n", "with open('imdb_test.txt', encoding='utf-8', newline='') as infile:\n", "    reader = csv.reader(infile, delimiter='\\t')\n", "    for row in reader:\n", "        x_test.append(row[0])\n", "        y_test.append(int(row[1]))\n", "x_test[0],y_test[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating the VADER classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk\n", "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n", "nltk.download('vader_lexicon', quiet=True)  # the analyzer needs the VADER lexicon resource\n", "vader = SentimentIntensityAnalyzer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vader.polarity_scores('not the best experience I had')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "vader.polarity_scores(':P')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification of test 
data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = list()\n", "for text in x_test:\n", "    scores.append(vader.polarity_scores(text)['compound'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "list(zip(scores,y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to assign labels only when the confidence is very high, that is, when the absolute value of the compound score exceeds 0.99.\n", "This is not useful when you must classify all documents.\n", "It is useful instead when you want to bootstrap a training set." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "selection = []\n", "for score,label in zip(scores,y_test):\n", "    if abs(score)>0.99:\n", "        selection.append((score,label))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(selection),len(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation of accuracy" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy = 0\n", "for prediction,correct in zip(scores, y_test):\n", "    if (prediction>0 and correct==1) or (prediction<=0 and correct==0):\n", "        accuracy += 1\n", "print(len(scores),accuracy/len(scores))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following evaluation is not fair, because it is computed only on the selected subset and not on the full test set.\n", "Yet it shows that the subset of documents that get a label is more accurately labeled.\n", "So if this subset is used as a training set for a supervised learning algorithm, you can expect to learn a better classifier." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "accuracy = 0\n", "for prediction,correct in selection:\n", "    if (prediction>0 and correct==1) or (prediction<=0 and correct==0):\n", "        accuracy += 1\n", "print(len(selection),accuracy/len(selection))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (TA)", "language": "python", "name": "ta" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }