
Testing the data: first steps before analysis

Last week I looked at mining news data from Event Registry.  Like any good data scientist, I’m going to check the quality of my data before using it in an analysis.  But what does it mean to check the quality of the data?  Well, that really depends on how you want to use it.  Here, my first project is to look at comparisons between words in order to find implicit bias and compare that bias across different news sources.  So… we probably want to measure the quality of the comparisons, right?  Right!

Building the models

I already showed some basics of building a Word2Vec model earlier in my blog, so I’m not going to go over that in great detail.  Actually, I’m really just going to wave my magic wand and say *poof*.  Oh look!  The data from last week has magically turned into a Word2Vec model.

Checking Comparison Accuracy

In several posts, I have used a collection of pre-built analogies to test the accuracy of my model, but I never gave a thorough explanation of what it is that I am doing.

The link to pre-built analogies is just a loooong list with each line containing four words, e.g. brother sister king queen.  When I assess the accuracy of the model, I look at each line and count whether, given the first three words, my model accurately predicts the fourth.  The analogies come pre-packed into categories:

Category                      Example
capital-common-countries      Athens Greece Baghdad Iraq
capital-world                 Abuja Nigeria Accra Ghana
currency                      Algeria dinar Argentina peso
city-in-state                 Chicago Illinois Houston Texas
family                        boy girl brother sister
gram1-adjective-to-adverb     amazing amazingly apparent apparently
gram2-opposite                acceptable unacceptable aware unaware
gram3-comparative             bad worse big bigger
gram4-superlative             bad worst big biggest
gram5-present-participle      code coding dance dancing
gram6-nationality-adjective   Albania Albanian Argentina Argentinean
gram7-past-tense              dancing danced decreasing decreased
gram8-plural                  banana bananas bird birds
gram9-plural-verbs            decrease decreases describe describes

So how did the data do?

Well, not so great.

Conservative news sources: [chart: analogy accuracy by category]

Neutral news sources: [chart: analogy accuracy by category]

Liberal news sources: [chart: analogy accuracy by category]

Why did liberal sources perform so much better?

This is a simple case of numbers.  I was able to identify and use four news sources for my conservative data, eight sources for neutral data, and twelve sources for liberal data.  In these HUGE vector spaces, more data means a better model.  Right now I see two problems with using these data sets:

  1. They are just not accurate enough.  To rely on comparisons to show implicit bias, I want to make sure my models actually work for, you know, comparing things.
  2. They are not accurate enough relative to each other.  This is SO IMPORTANT!  I could probably pull out some ridiculous examples of “comparisons” from conservative sources, but if, because of my data collection method, those sources are systematically worse at comparisons, we can’t pit the sources against one another.

So what are you going to do about it?

Great question!  I’m glad you asked it.  As I see it, there are a few ways in particular that I can go about fixing this:

  1. The main limitation with Event Registry is that I could only download articles from the last month.  I could download a new set each month and aggregate the results until all of the sources perform equally on basic comparisons.
  2. What if I found a more powerful way to download news articles?  There are a few methods out there for scraping large amounts of data from the internet, but sadly none of them are quite as user-friendly as Event Registry.
  3. I might do some other analysis on the data in the meantime.  Is there some way to feed an article into an algorithm and have that algorithm classify it on a scale from liberal to conservative?  I bet a keyword analysis would still work well on this data, because you’re mostly looking at which words are (and aren’t) there.
  4. All of the above.

I might let this project simmer on the back burner while I churn my brain over fun new ways to overcome these challenges, but never fear!  I will have wonderful data science-y (or chemistry-y) content for you next week no matter what!
