I’m a big fan of Wordle. Everybody likes pretty tag clouds, but until recently, I’ve had no practical use for the tool.
What with the forthcoming election and all, and being in marketing, I thought it might be interesting to use Wordle to distill each of the four national parties’ websites into a tag cloud. Each cloud would reflect the terms that the party uses most frequently on its English-language website. With an assist from Ask Metafilter, I got them done. I’ll explain a little more about how after the clouds.
As usual, click for larger versions:
What Conclusions Can We Draw?
That’s more a question for you than me, as I haven’t spent much time trying to grok what these clouds tell us (yes, I used ‘grok’). What jumps out at you?
How Did We Make Them?
First, I grabbed a complete copy of each party’s website. I just stuck with HTML files, so if a party hosts a lot of PDFs with unique content, then that’s not reflected. The sites, of course, ended up being different sizes, and I’m relying on my site-copying software, so I can’t be certain I got all the pages.
Then we concatenated each set of HTML files into one gigantic file. Using some scripty-magic, we generated a list of the top 100 or 250 words, with each word repeated as many times as it appears on the original site.
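For anyone who wants to try this at home, here is a rough sketch of what that step could look like in Python. It is for illustration only and is not the script we actually used: the function names, the word-matching pattern, and the assumption that the pages are UTF-8 `.html` files in one folder are all made up for the example.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from one HTML document and return its text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_words(text):
    """Count lowercase words (runs of letters; digits and punctuation are ignored)."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def top_words_as_text(counts, n=100):
    """Repeat each of the n most common words once per occurrence."""
    return " ".join(" ".join([word] * freq) for word, freq in counts.most_common(n))


def site_to_wordle_input(site_dir, n=100):
    """Combine every .html file under site_dir (hypothetical folder) into one word list."""
    counts = Counter()
    for path in sorted(Path(site_dir).rglob("*.html")):
        html = path.read_text(encoding="utf-8", errors="ignore")
        counts.update(count_words(html_to_text(html)))
    return top_words_as_text(counts, n)
```

Calling `site_to_wordle_input("some-party-site", n=250)` on a folder of copied pages would give one long string of repeated words to paste into Wordle. A tag-stripping pass like this won't catch everything, which is one reason the manual cleanup still matters.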
I went through each of these to clean out most or all of the leftover HTML code, navigational terms like ‘email’ or ‘newsletter’, and French words. The French is why we used 250 words in some cases. For some sites, I downloaded both the French and English versions of the site, so I needed to remove the French. By working with a 250-word file, I was able to clean out the French and still have a sizable list of words.
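I did this cleanup by hand, but the same idea can be sketched in a few lines of Python. Everything here is an assumption for the sake of the example: the drop lists are short, hypothetical samples (only ‘email’ and ‘newsletter’ come from my actual cleanup), and the function name is invented. It shows why starting from 250 words helps: after the unwanted terms are removed, there are still enough left to fill a top 100.

```python
from collections import Counter

# Hypothetical sample drop lists; real ones would be much longer.
HTML_LEFTOVERS = {"nbsp", "href", "div", "span"}
NAV_TERMS = {"email", "newsletter"}
FRENCH_WORDS = {"le", "la", "les", "des", "et", "pour", "nous"}


def clean_top_words(counts, pool_size=250, keep=100,
                    drop=HTML_LEFTOVERS | NAV_TERMS | FRENCH_WORDS):
    """Take the pool_size most frequent words, remove unwanted ones, keep the top `keep`."""
    pool = counts.most_common(pool_size)
    cleaned = [(word, freq) for word, freq in pool if word not in drop]
    return cleaned[:keep]


# Tiny made-up example: five counted words, two survivors requested.
counts = Counter({"economy": 5, "les": 4, "email": 3, "families": 2, "div": 1})
survivors = clean_top_words(counts, pool_size=5, keep=2)
```

A fixed list like this will miss things a human eye catches, and it will wrongly drop any word that is legitimate in English but also appears on the French list, so it's a starting point rather than a replacement for reading through the results.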
In short, it’s somewhat unscientific, but I’m optimistic that the clouds represent a reasonably fair reflection of each site’s top content. If anyone wants to work with the content I copied, I’m happy to share it. I’m not going to publish the complete sites here, though, as I expect that would constitute a copyright violation.
These are amazing.
It’s a little depressing if you take it all in.
It’s kind of a lesson on political SEO 😛
Hey, great clouds.
I’m not certain, but I suspect that analysis of these results would be easier if we had the list of words and frequencies not as a cloud. One could then take any words that appear in all four and see the relative importance of those words for each party.
The clouds look pretty though. 🙂
Interesting. A similar project might be to pick 100 or so words and make a word cloud from each website, filtered so that only the frequencies for those pre-selected words show up in the visualization.
Interesting how large Harper looms in the Liberal cloud.
Also interesting to note that these clouds, from the websites, probably differ considerably from what you’d get from parsing the parties’ advertising. “Dion” would be a lot bigger on the Conservative site, for instance, if the ads I’ve seen are any indication.
Brilliant! May I request you repeat the process for your neighbors to the south? You’d only have to do it twice… 🙂 Although I don’t think we’d see anything surprising pop out of a Republican or Democrat cloud. Thanks.
I too am intrigued by how large “Harper” looms in the Liberal cloud — my two cents says they should be going pro-Dion, not anti-Harper. But, maybe that’s why they don’t pay me the big bucks… Anyway, fascinating.
This would be SO well supplemented with a content analysis of the websites, with word frequencies and stuff. I use NVivo (qualitative research software) and maybe it could help shed some more light. This is a great first analysis. The second-tier analysis would need to be a study of the rhetoric (and yes, I would think that those who have degrees in English should know very well how to do that; I have some idea, but nothing really solid).
These clouds are fascinating. You mention that you copied the HTML from the sites with some software…which software did you use? I think your idea could be applied to lots of situations where we’d like to compare what is being said by an organization or a political party’s representatives to what their website information actually indicates.
I used a Windows program called WinHTTrack.
Hi
I want to change the tag cloud into a text viewer tool that runs locally on my hard drive and turn it into a piece of software. How can I import a large amount of data, say an 80m text file? What else should I know to achieve that? Could you shed some light on this issue?