12 December, 2003

The Gender Genie

As reported by Nature, academics have formulated an algorithm which, when applied to a block of text, can distinguish the gender of the author. It seems to be based on a weighted count of keywords in the text; for example, the total number of instances of 'because' is multiplied by 55 and added to an overall 'female' total. The final 'male' total is compared to the 'female' total.
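For the curious, here is a minimal sketch of how such a weighted keyword count might work. Only the weight of 55 for 'because' comes from the description above; the other keywords, their weights, and the function names `score` and `classify` are placeholders I've made up for illustration, not the researchers' actual values or code.

```python
import re
from collections import Counter

# Only 'because' = 55 is taken from the post; every other keyword
# and weight below is a hypothetical placeholder.
FEMALE_WEIGHTS = {"because": 55, "with": 50}
MALE_WEIGHTS = {"the": 10, "a": 5}


def score(text, weights):
    """Sum weight * occurrences for each keyword found in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return sum(weight * counts[word] for word, weight in weights.items())


def classify(text):
    """Return (male_total, female_total, verdict) for a block of text."""
    male = score(text, MALE_WEIGHTS)
    female = score(text, FEMALE_WEIGHTS)
    if male > female:
        verdict = "male"
    elif female > male:
        verdict = "female"
    else:
        verdict = "ambiguous"
    return male, female, verdict
```

Calling `classify` on a block of text gives back both totals and whichever label has the larger total, which is all the comparison described above amounts to.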

As one might expect, someone has used the algorithm to build an online test: a 'Gender Genie'. Have a go. I tested it with a few blog posts, which overwhelmingly confirmed I'm male; some of the 'male' totals were nearly double the 'female' totals.

The 'Nature' article explains that, to oversimplify, men speak in terms of objects ('informational' style, focusing on categorisation), and women in terms of relationships ('involved' style, focusing on personalisation).

I was gratified to recognise a flaw for myself, before reading that it's something that the researchers considered too ;) The subject of the text must matter - the parameters that categorise (oops) males and females might also distinguish between factual and opinion pieces, or between narrative fiction and non-fiction; the algorithm might be dissecting the content rather than the author. Indeed, the program can tell fiction from non-fiction with 98% accuracy. However, when told the genre in advance (and hence invoking a further weighting, presumably), the algorithm separates male from female with 80% accuracy.

When I ran this entry through the Gender Genie, the 'male' total was 1077 against a 'female' total of 253, defining me as distinctly male, or maybe just distinctly impersonal in my writing!

That's after I told the Genie that this was a blog entry. I'm curious about the basis of that division, as it had a radical effect on the totals. Defined as 'fiction', the totals are 577 'male' to 366 'female', and they are exactly the same when defined as 'non-fiction'; still male, but by a smaller margin.

Interestingly, when I tested it with a couple of Helen's e-mails, the result was ambiguous, e.g. 283 'male' to 291 'female'. Perhaps it's that the test text was descriptive rather than discursive, but H's degree is in linguistics; perhaps that's skewed her written style.
I know that from the age of 15 my education focused strongly on report-writing and essay-based exams in factual subjects, where the impersonal passive voice ("It was found", rather than "I found") was the only acceptable mode of writing; it's difficult to throw off a certain precision of phrasing.

