Last week I looked at mining news data from Event Registry. Like any good data scientist, I’m going to check the quality of my data before using it in an analysis. But what does it mean to check the quality of the data? Well, that really depends on how you want to use it. Here, my first project is to look at comparisons between words in order to find implicit bias and compare that bias across different news sources. So… we probably want to measure the quality of the comparisons, right? Right!
Building the models
I already showed some basics of building a Word2Vec model earlier in my blog, so I’m not going to go over that in great detail. Actually, I’m really just going to wave my magic wand and say *poof*. Oh look! The data from last week has magically turned into a Word2Vec model.
Checking Comparison Accuracy
In several posts [1, 2], I have used a collection of pre-built analogies to test the accuracy of my model, but I never gave a thorough explanation of what it is that I am doing.
The pre-built analogies are just a loooong list, with each line containing four words, e.g. `brother sister king queen`. When I assess the accuracy of the model, I look at each line and count whether, given the first three words, my model accurately predicts the fourth [3]. The analogies come pre-packaged into categories:
| Category | Example |
| --- | --- |
| capital-common-countries | Athens Greece Baghdad Iraq |
| capital-world | Abuja Nigeria Accra Ghana |
| currency | Algeria dinar Argentina peso |
| city-in-state | Chicago Illinois Houston Texas |
| family | boy girl brother sister |
| gram1-adjective-to-adverb | amazing amazingly apparent apparently |
| gram2-opposite | acceptable unacceptable aware unaware |
| gram3-comparative | bad worse big bigger |
| gram4-superlative | bad worst big biggest |
| gram5-present-participle | code coding dance dancing |
| gram6-nationality-adjective | Albania Albanian Argentina Argentinean |
| gram7-past-tense | dancing danced decreasing decreased |
| gram8-plural | banana bananas bird birds |
| gram9-plural-verbs | decrease decreases describe describes |
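Under the hood, that prediction is the classic vector-offset trick: take vector(*sister*) − vector(*brother*) + vector(*king*) and find the nearest word in the vocabulary. gensim has this built in, but a bare-bones sketch of the scoring logic, using a tiny hand-made vocabulary rather than a real model, looks like this:

```python
import numpy as np

# Toy embedding table; in practice these vectors come from the trained model.
vecs = {
    "brother": np.array([1.0, 0.0]),
    "sister":  np.array([1.0, 1.0]),
    "king":    np.array([0.0, 0.0]),
    "queen":   np.array([0.0, 1.0]),
}

def predict_fourth(a, b, c, vocab):
    """Return the word whose vector is most cosine-similar to b - a + c,
    excluding the three query words themselves."""
    target = vocab[b] - vocab[a] + vocab[c]
    best, best_sim = None, -np.inf
    for word, vec in vocab.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target) + 1e-12)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

print(predict_fourth("brother", "sister", "king", vecs))  # prints "queen"
```

Scoring a model is then just running this over every line of the analogy file and counting hits per category; gensim's `KeyedVectors.evaluate_word_analogies` automates exactly that loop.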
So how did the data do?
Well, not so great.
Conservative news sources:
```
capital-common-countries: 41/462 ( 8.87%)
capital-world: 32/868 ( 3.69%)
currency: 0/68 ( 0.00%)
city-in-state: 39/1569 ( 2.49%)
family: 83/342 (24.27%)
gram1-adjective-to-adverb: 0/756 ( 0.00%)
gram2-opposite: 8/342 ( 2.34%)
gram3-comparative: 36/1056 ( 3.41%)
gram4-superlative: 27/420 ( 6.43%)
gram5-present-participle: 45/702 ( 6.41%)
gram6-nationality-adjective: 15/1095 ( 1.37%)
gram7-past-tense: 87/1406 ( 6.19%)
gram8-plural: 17/812 ( 2.09%)
gram9-plural-verbs: 17/552 ( 3.08%)
total: 447/10450 ( 4.28%)
```
Neutral news sources:
```
capital-common-countries: 59/380 (15.53%)
capital-world: 68/704 ( 9.66%)
currency: 1/178 ( 0.56%)
city-in-state: 47/1403 ( 3.35%)
family: 101/306 (33.01%)
gram1-adjective-to-adverb: 9/812 ( 1.11%)
gram2-opposite: 19/342 ( 5.56%)
gram3-comparative: 249/1260 (19.76%)
gram4-superlative: 138/702 (19.66%)
gram5-present-participle: 234/870 (26.90%)
gram6-nationality-adjective: 126/849 (14.84%)
gram7-past-tense: 406/1482 (27.40%)
gram8-plural: 85/930 ( 9.14%)
gram9-plural-verbs: 137/600 (22.83%)
total: 1679/10818 (15.52%)
```
Liberal news sources:
```
capital-common-countries: 126/420 (30.00%)
capital-world: 154/805 (19.13%)
currency: 0/40 ( 0.00%)
city-in-state: 49/901 ( 5.44%)
family: 214/342 (62.57%)
gram1-adjective-to-adverb: 46/930 ( 4.95%)
gram2-opposite: 50/420 (11.90%)
gram3-comparative: 543/1190 (45.63%)
gram4-superlative: 246/702 (35.04%)
gram5-present-participle: 418/992 (42.14%)
gram6-nationality-adjective: 288/907 (31.75%)
gram7-past-tense: 643/1560 (41.22%)
gram8-plural: 197/1056 (18.66%)
gram9-plural-verbs: 251/600 (41.83%)
total: 3225/10865 (29.68%)
```
Why did liberal sources perform so much better?
This is a simple case of numbers. I was able to identify and use four news sources for my conservative data, eight sources for neutral data, and twelve sources for liberal data. In these HUGE vector spaces, more data means a better model. Right now I see two problems with using these data sets:
- They are just not accurate enough. To rely on comparisons to show implicit bias, I want to make sure my models actually work for, you know, comparing things.
- They are not accurate enough relative to each other. This is SO IMPORTANT! I could probably pull out some ridiculous examples of “comparisons” that conservative sources had, but if, because of my data collection method, those sources are systematically worse at comparisons, then we can’t fairly pit the sources against one another.
So what are you going to do about it?
Great question! I’m glad you asked it. As I see it, there are a few ways I could go about fixing this [4]:
- The main limitation with Event Registry is that I could only download articles from the last month. I could download a new set each month and aggregate the results until all of the sources perform equally on basic comparisons.
- What if I found a more powerful way to download news articles? There are a few methods out there for scraping large amounts of data from the internet, but sadly none of them are quite as user-friendly as Event Registry.
- I might do some other analysis on the data in the meantime. Is there some way to feed an article into an algorithm and have that algorithm classify it on a scale from liberal to conservative? I bet a keyword analysis would still work well on this data, because you’re mostly looking at words that aren’t there.
- All of the above. [5]
I might let this project simmer on the back burner while I churn my brain over fun new ways to overcome these challenges, but never fear! I will have wonderful data science-y (or chemistry-y) content for you next week no matter what!