LDA DOA? LDA Topic Modelling Versus Topic Mapping

Posted by Craig on February 24th, 2015

I apologize for the sensationalist (and somewhat senselessly constructed) title, but what is a blog for if not to craft such things?

One of the pillars of the emerging DH scene has been the use of Latent Dirichlet Allocation (LDA) based topic modelling. For those not in the know, this is a machine learning technique that uses statistical properties of words and documents to discover “topics” that occur in the texts. For example, if your corpus contains a significant number of documents about pet care, you might expect the algorithm to return a topic full of cat-related words. This is the barest of explanations; if you’d like to know more, please see the Blei paper (here: https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf) for a more technical introduction. If I remember correctly, Matt Jockers also has an introduction to the concept aimed at humanists in Macroanalysis.
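
For the curious, here is what the bare mechanics of an LDA run look like in the gensim library I use later in this post. This is a toy sketch of my own; the three pet-care-and-whaling documents and the two-topic setting are invented purely for illustration:

# A toy LDA run with gensim. The documents and the topic count are
# placeholders invented only to illustrate the idea.
from gensim import corpora, models

docs = [
    ["cat", "litter", "vet", "kibble", "scratching"],
    ["dog", "leash", "vet", "kibble", "fetch"],
    ["whale", "ship", "harpoon", "sea", "captain"],
]

dictionary = corpora.Dictionary(docs)        # word <-> id mapping
bow = [dictionary.doc2bow(d) for d in docs]  # bag-of-words vectors

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

# With luck, one topic skews toward pet words and the other toward the sea.
for topic_id, words in lda.show_topics(num_topics=2, num_words=5):
    print(topic_id, words)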

No matter your level of familiarity, the important point is that LDA topic modelling has found use in numerous projects within (and without) the humanities. And now at least one generally accepted part of it is under a sort of attack.

The salvo was launched in this paper: “A high-reproducibility and high-accuracy method for automated topic classification,” by a group of scientists working out of Northwestern University. 1

So what’s the issue? According to Lancichinetti et al., the tests they ran on standard LDA-based topic modelling algorithms indicate that:

“PLSA and the standard optimization algorithm implemented with LDA (variational inference) are systematically unable to find the global maximum of the likelihood landscape…these algorithms have surprisingly low accuracy and reproducibility, especially when topic sizes are unequal. Taken together, the results in this section clearly demonstrate that the approach taken by standard topic-model algorithms for exploring the likelihood landscape is extremely inefficient, whether one starts from random initial conditions or by randomly seeding the topics using a sample of documents.” 2

As I understand it, the authors discovered that standard LDA topic modelling implementations using these optimization techniques will especially have issues with corpora that are “oligarchic,” where a small subset of topics dominates the rest in size. I can’t say I consider it my place to either contest or confirm this conclusion (after all, I probably understood little more than half of the paper). What intrigues me is that Lancichinetti et al. provide an alternative.

They call their alternative topic mapping, and it relies on a network/graph-based approach to the documents and the words therein. From the paper:

“One can take a corpus and construct a bipartite network of words and documents, where a word and document are connected if the word appears in the document. This bipartite network can be projected onto a unipartite network of words by connecting words that coappear in a document. In the language corpora, separating documents using distinct languages is as trivial as finding the connected components of word network.” 3
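
To make the quoted idea concrete, here is a small sketch of my own (not the authors’ code; their method goes further, applying community detection to this network rather than bare connected components). The toy bilingual documents are invented, and I use the networkx library:

from itertools import combinations
import networkx as nx

# Toy corpus in two "languages" that share no vocabulary; the two
# groups of words should end up in separate connected components.
docs = {
    "doc_en_1": ["whale", "ship", "sea"],
    "doc_en_2": ["ship", "captain", "sea"],
    "doc_fr_1": ["baleine", "navire", "mer"],
    "doc_fr_2": ["navire", "capitaine", "mer"],
}

# Project the bipartite word-document network onto words: connect
# every pair of words that co-appear in a document.
word_graph = nx.Graph()
for words in docs.values():
    word_graph.add_edges_from(combinations(set(words), 2))

for component in nx.connected_components(word_graph):
    print(sorted(component))
# Prints two components: the English words and the French words.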

Thanks to the era of limitless source repository space and cross-platform standards, there already exists an implementation of topic mapping, presumably (given the username) written by Andrea Lancichinetti themself: https://bitbucket.org/andrealanci/topicmapping.

So in the spirit of the great humanistic tradition of bricolage (a.k.a. mucking around), I decided to give topic mapping a spin. I thought it might be most useful to do a qualitative comparison against traditional LDA topic modelling techniques as provided by the excellent gensim Python library by Radim Řehůřek and by Mallet (as wrapped by gensim). I haven’t dumped the code I wrote anywhere, but if anyone is dying to see it, let me know. It’s fairly boilerplate; roughly along the lines of the sketch after the next paragraph.

I assembled a small corpus of 13 Hawthorne and Melville texts (I can provide a complete list for the interested, but, as we will see, the topics will mostly out). This isn’t the best corpus for testing the hypothesis put forth by Lancichinetti et al. (at all), but my intention here was just to see whether any qualitative differences showed up in a trial run. I hypothesized that given 13 texts there should be 13 topics, especially since I did not strip proper names out of each text. I used the nltk stopwords list to remove stopwords, but I did not “stem” the texts as per Lancichinetti et al.’s methodology, which would count “star” and “stars”, for example, as the same token (another minor strike).
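
The boilerplate mentioned above ran roughly like this. It is a reconstruction rather than the exact script: the corpus directory and the Mallet path are placeholders, and the LdaMallet wrapper shown is the one gensim shipped at the time (it was removed in gensim 4):

# Rough reconstruction of the comparison pipeline; paths are placeholders.
import os
import re

from gensim import corpora, models
from gensim.models.wrappers import LdaMallet  # gensim 3.x-era wrapper
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))

def tokenize(path):
    # Lowercase, split on runs of letters, drop stopwords. No stemming,
    # so "star" and "stars" remain distinct tokens.
    with open(path) as f:
        text = f.read().lower()
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOP]

corpus_dir = "corpus"  # placeholder: the 13 plain-text files
titles = sorted(os.listdir(corpus_dir))
texts = [tokenize(os.path.join(corpus_dir, t)) for t in titles]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]

num_topics = 14  # settled on after the topic-mapping run described below

lda = models.LdaModel(bow, num_topics=num_topics, id2word=dictionary,
                      passes=50)
for line in lda.show_topics(num_topics=num_topics, num_words=10):
    print(line)

mallet = LdaMallet("/path/to/mallet", corpus=bow,  # placeholder path
                   num_topics=num_topics, id2word=dictionary)
for line in mallet.show_topics(num_topics=num_topics, num_words=10):
    print(line)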

Here’s where one major difference of topic mapping stands out: while traditional topic modelling requires an a priori guess at the number of topics in the corpus, topic mapping does not. Logically, then, I would run it first and use the results to guide my traditional LDA runs. The code compiled easily (I used OS X; any *nix environment should work fine) and ran easily from the command line. However, despite the relatively inconsequential size of my corpus, my work machine (which is fairly beefy) could barely handle everything going on. Ultimately, it took well over an hour (a few hours, perhaps? Sunday time is of no consequence) to finish. That done, here are the results:
topic: 0 #words: 12060 pt: 0.0907336
old one would upon man said like little yet could mr long may house might though among hand young heart even two us new face time shall great life seemed people till peter many still within day good come made never whole years must away well world door lady black forth head ever death every came eyes woman almost first back another street look without mind perhaps men town figure beneath whose let see dead three much province earth far round night light found cried stood moment half along voice thought around place children spirit nothing left make last youth

topic: 1 #words: 8540 pt: 0.0445213
one zenobia hollingsworth would said priscilla little like could man us life upon might old never well heart great seemed world much yet made among come may woman hand must first two good time eyes zenobias make many shall long another mr away ever still thought face let better however nothing enough know lady love even see answered far towards day young take coverdale think kind rather perhaps look say often blithedale whole looked way poor though thus back least within girl true without part quite room women tell hardly men purpose nature new whether came matter half every beautiful

topic: 2 #words: 10362 pt: 0.056817
one man would sir said upon good like old may little though say much confidence dont way well think yet friend know must see could still sort ugh might nature yes never time come something less china take without kind long last seemed cosmopolitan oh aster barber dear go indeed let world true talk look two men first stranger nothing hand thought poor mind thing heart friends boy tell doctor made case put herb even life shall trust money fellow pray back certain another indian since words make better gentleman orchis away day things business wine air human small charity

topic: 3 #words: 11864 pt: 0.0611614
upon like one sea old us would long though samoa little time many said jarl much round yet media came last could must water still king good chapter thus day yillah seemed far every fish boat man might away first ever three great way hand side air full among two night men things till made well may high without however craft thing sun babbalanja soon mardi ship never donjalolo place land thou almost days let white dead thought island along sight head kings oh yoomy light went times nothing toward right eyes come lord islands hard even canoe true heart

topic: 4 #words: 12270 pt: 0.0671587
said babbalanja lord media one like old us upon cried mohi yoomy mardi yet would many must man long king though may last thus great things thou round men oh let oro still thy much every good came say come time ye land chapter full know never way without made ever first sea see life eyes sun away vivenza among three bello alma hand could go heart day thee beard well abrazza gods kings far wine another thing yillah ah till none air seemed thousand soon tell even little vee soul night make right death think others isle dominora might

topic: 5 #words: 17256 pt: 0.135343
whale one like upon man sea old ship ahab ye would whales though yet head time long still captain great said two boat seemed must white last way thou see little round three sperm say may well men first stubb every us queequeg much good could hand side never look ever deck almost go even water thing boats away might come starbuck made sir chapter day life ships among many fish seen far back world line oh cried eyes without aye sort right thought part know night air crew whole take god half let tell hands thus things whaling thee

topic: 6 #words: 8376 pt: 0.0445435
one upon would said man might could old life yet hand heart like may little giovanni seemed well world made thou young must face even good still nature many eyes owen away shall much beatrice beautiful love came nothing among human head within another figure come time thought however thy voice spirit though ever aylmer cried first without whether let us mother great thus day long every whole far forth men reuben poor see mr annie almost brown father never pipe goodman know take say deep woman city perhaps two whose georgiana garden words moment light make bosom truth looked

topic: 7 #words: 10629 pt: 0.0625761
one upon us old like long time little could would two doctor men man made came among round good island nothing captain went ship first sea way last natives many said great well day every thus much away place tahiti going though several soon never quite sailors three might part small deck still mate chapter morning however side along head must without far rest night say something left hand native right ashore see house consul ghost come go ever looking water land called seemed thought hard set young looked islands told found po enough present make white end sailor jermin

topic: 8 #words: 13962 pt: 0.0957169
pierre one would thou isabel thee upon yet still lucy man though thy little world could long said old mother first time never like shall thing come must well seemed love life things ever heart see soul two house go may hand face thought without last room know oh thus glendinning far way father nothing day might brother god every say made strange sweet young eyes mind let great men much good even indeed us night many sister came entirely hath light away felt look mrs toward general door tell half pierres portrait always round whole glen feel book dear

topic: 9 #words: 11315 pt: 0.0705161
one would like upon old could sea little said man though thought must sailors ship time much never every harry made last captain long good great well liverpool seemed thing day many looking way see yet first men went go two place deck going might almost among head round us new hand three looked even used something came ships till called look make know nothing mate think back young cabin sailor away house boy world water may along come morning night take eyes sort side ever without dock poor still board soon say indeed found perhaps book full street home

topic: 10 #words: 8414 pt: 0.0506771
hester would thou little one pearl old child man might could upon life minister prynne said like letter heart mother mr good may yet scarlet hand long never made seemed dimmesdale even must another eyes still thee time many new ever much woman thy world years come day well roger nature human face men look within chillingworth shall came answered people see house indeed better whether among bosom physician whose thus great forest place however seen first though without looked young public stood speak make forth far truth since two kind soul black mothers nothing light smile less along thought

topic: 11 #words: 9777 pt: 0.0640407
hepzibah one phoebe old would pyncheon clifford little house said could man might judge like long life great made much never upon well must may time seemed still yet door shop day young two world heart good pyncheons hand cousin however window face seven without look even make see poor street almost ever many gables first another years away way better half nothing back human come moment character perhaps indeed hepzibahs let garden family alice know eyes death holgrave kind far shall maule whole smile within new us towards among whether since thus mr along go mind take dead hardly

topic: 12 #words: 10568 pt: 0.0679189
one upon would valley us could kory little toby time every natives might two like many among never typee house soon however old long day still made islanders seemed side three place appeared great hand sea part first nukuheva towards young way moment almost water along men ever much last feet seen whole head fruit another savages island mehevi nothing well without man may away islands thus several good said bay thought pi together even although ground typees people must view around mind back whose taboo appearance life small trees far cocoanut indeed looking saw chief nearly near companion yet

topic: 13 #words: 13486 pt: 0.0882759
man one upon men like captain would war deck old time ship sea must said though top officers may every navy many jacket long white day yet sailors jack two first us frigate main last board among ships much never round gun even good could hands hand quarter well three see night still made almost little say sailor sir ever commodore american without way mess crew chapter let arms whole away great master wars called might neversink seamen mast head officer sail cried guns lieutenant life room water fore four law world shall things indeed found times part thus side
So we end up with 14 topics, most of them clearly signalling one text. Topic 13 is the most generic — while it might be Melville’s “White-Jacket” it might also just be an amalgam of seafaring words, accounting for the “extra” topic.
With 14 as the lucky number, I ran my gensim LDA task:
topic #0 (0.071): 0.000*whale + 0.000*upon + 0.000*like + 0.000*old + 0.000*one + 0.000*pierre + 0.000*man + 0.000*said + 0.000*sea + 0.000*ye

topic #1 (0.071): 0.000*hepzibah + 0.000*phoebe + 0.000*pierre + 0.000*pyncheon + 0.000*upon + 0.000*one + 0.000*like + 0.000*old + 0.000*would + 0.000*pyncheons

topic #2 (0.071): 0.007*upon + 0.007*us + 0.006*one + 0.006*like + 0.005*said + 0.004*many + 0.004*old + 0.004*long + 0.003*lord + 0.003*would

topic #3 (0.071): 0.009*one + 0.007*man + 0.006*upon + 0.006*like + 0.005*captain + 0.005*ship + 0.005*sea + 0.005*would + 0.004*men + 0.004*time

topic #4 (0.071): 0.177*pierre + 0.046*isabel + 0.035*lucy + 0.017*glendinning + 0.011*pierres + 0.010*delly + 0.007*tartan + 0.007*guitar + 0.005*millthorpe + 0.004*isabels

topic #5 (0.071): 0.000*one + 0.000*pierre + 0.000*ugh + 0.000*man + 0.000*would + 0.000*said + 0.000*like + 0.000*cosmopolitan + 0.000*yet + 0.000*aster

topic #6 (0.071): 0.000*hepzibah + 0.000*phoebe + 0.000*pyncheon + 0.000*one + 0.000*like + 0.000*would + 0.000*upon + 0.000*said + 0.000*pyncheons + 0.000*man

topic #7 (0.071): 0.007*would + 0.006*one + 0.006*little + 0.006*could + 0.006*might + 0.005*old + 0.004*house + 0.004*man + 0.004*said + 0.004*upon

topic #8 (0.071): 0.000*pierre + 0.000*isabel + 0.000*lucy + 0.000*would + 0.000*one + 0.000*glendinning + 0.000*could + 0.000*upon + 0.000*yet + 0.000*like

topic #9 (0.071): 0.016*zenobia + 0.013*hollingsworth + 0.011*priscilla + 0.008*docks + 0.004*zenobias + 0.004*women + 0.004*cutter + 0.003*us + 0.003*coverdale + 0.003*blithedale

topic #10 (0.071): 0.000*ahab + 0.000*queequeg + 0.000*stubb + 0.000*one + 0.000*upon + 0.000*starbuck + 0.000*nantucket + 0.000*whale + 0.000*man + 0.000*like

topic #11 (0.071): 0.010*thou + 0.007*yet + 0.006*thee + 0.005*old + 0.005*thy + 0.005*still + 0.004*mother + 0.004*one + 0.004*world + 0.004*would

topic #12 (0.071): 0.000*hepzibah + 0.000*phoebe + 0.000*pyncheon + 0.000*pyncheons + 0.000*one + 0.000*hepzibahs + 0.000*holgrave + 0.000*maule + 0.000*pierre + 0.000*maules

topic #13 (0.071): 0.010*one + 0.010*man + 0.008*would + 0.007*said + 0.005*upon + 0.005*may + 0.005*old + 0.005*though + 0.005*good + 0.004*like
and Mallet
topic #0 (3.571): 0.013*house + 0.009*hepzibah + 0.008*phoebe + 0.007*pyncheon + 0.007*door + 0.006*clifford + 0.006*mr + 0.006*street + 0.005*judge + 0.004*peter

topic #1 (3.571): 0.016*man + 0.011*long + 0.011*time + 0.009*life + 0.007*made + 0.006*great + 0.005*present + 0.005*small + 0.005*back + 0.005*day

topic #2 (3.571): 0.011*ship + 0.011*sailors + 0.009*captain + 0.008*thought + 0.007*made + 0.007*mate + 0.007*deck + 0.006*cabin + 0.006*ships + 0.006*great

topic #3 (3.571): 0.006*life + 0.006*world + 0.006*heart + 0.005*eyes + 0.005*character + 0.005*human + 0.004*beautiful + 0.004*kind + 0.004*hand + 0.004*love

topic #4 (3.571): 0.064*pierre + 0.018*thou + 0.017*thee + 0.017*isabel + 0.013*thy + 0.013*lucy + 0.010*mother + 0.009*world + 0.008*soul + 0.007*house

topic #5 (3.571): 0.023*man + 0.015*war + 0.014*captain + 0.014*deck + 0.014*men + 0.010*officers + 0.009*top + 0.008*jack + 0.008*ship + 0.008*jacket

topic #6 (3.571): 0.006*day + 0.006*time + 0.005*head + 0.004*moment + 0.004*side + 0.004*hand + 0.003*people + 0.003*passed + 0.003*white + 0.003*feet

topic #7 (3.571): 0.015*round + 0.008*men + 0.008*long + 0.008*air + 0.007*high + 0.006*full + 0.005*white + 0.005*things + 0.004*sun + 0.004*thousand

topic #8 (3.571): 0.042*whale + 0.016*ahab + 0.015*ye + 0.015*ship + 0.015*whales + 0.010*captain + 0.009*thou + 0.009*sperm + 0.008*boat + 0.008*stubb

topic #9 (3.571): 0.023*babbalanja + 0.020*media + 0.020*lord + 0.013*yoomy + 0.012*mohi + 0.012*mardi + 0.012*king + 0.012*cried + 0.008*chapter + 0.007*yillah

topic #10 (3.571): 0.012*thou + 0.010*hester + 0.008*child + 0.007*mother + 0.007*heart + 0.007*pearl + 0.006*man + 0.006*minister + 0.006*thy + 0.005*hand

topic #11 (3.571): 0.021*man + 0.016*sir + 0.013*good + 0.012*dont + 0.010*confidence + 0.008*friend + 0.007*nature + 0.007*sort + 0.006*china + 0.005*dear

topic #12 (3.571): 0.010*valley + 0.010*natives + 0.008*island + 0.008*kory + 0.007*house + 0.006*doctor + 0.006*toby + 0.006*islands + 0.005*islanders + 0.005*long

topic #13 (3.571): 0.022*sea + 0.010*water + 0.009*hand + 0.008*thing + 0.008*side + 0.007*eyes + 0.007*long + 0.006*time + 0.006*sight + 0.006*night
These traditional LDA methods follow mostly along the lines of our hypothesis, but more categories seem ambiguous or doubled up (e.g., topics 6 and 7 of Mallet might both be “Moby-Dick”, or might just be ambiguous). So in this limited test, topic mapping seems “better”. I didn’t mess around much with settings, and changing the number of topics might help, but that drags us away from our initial hypothesis. There is, however, another consideration in mind here. Since we are evaluating qualitatively, there might be an advantage to overfitting, in that it could lead to the creation of novel subtopics that topic mapping would not produce; topic mapping, for its part, fit one topic per work quite well, which suggests possible authorship attribution applications. In any case, some more experimentation with larger corpora, removing character proper names, and stemming seems warranted, but topic mapping certainly seems to hold the ability to discover myriad new textual relationships interesting to the critic.
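
As a parting sketch, the one-topic-per-text check can itself be automated. This builds on the hypothetical pipeline above (lda, bow, and titles are the placeholder names from that sketch):

# Print each document's dominant topic; if the hypothesis holds, the
# 13 texts should map onto 13 distinct topics.
for title, doc_bow in zip(titles, bow):
    topic_id, weight = max(lda.get_document_topics(doc_bow),
                           key=lambda pair: pair[1])
    print("%s: topic %d (%.2f)" % (title, topic_id, weight))
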
  1. Full citation: Andrea Lancichinetti, M. Irmak Sirer, Jane X. Wang, Daniel Acuna, Konrad Körding, and Luís A. Nunes Amaral, “A high-reproducibility and high-accuracy method for automated topic classification,” Physical Review X 5, 011007 (2015).
  2. Lancichinetti et al., p. 4.
  3. Ibid., p. 5.

Comments

Sridhar Nerur says:

Hi,
Thanks for your article comparing LDA vs Topic Mapping. Would you mind sharing the Python code that you used for the comparison? If you are unable to, I fully understand.
Best regards,
Sridhar

Craig says:

Sorry Sridhar, I’ve really neglected the blog. If you still need the code for whatever reason I’d be glad to throw it on Github.
