Team:Kent/Methodology

Methodology


MATLAB topic modelling toolbox written by Steyvers, University of California, Irvine [10] was used. This toolkit contained different types of Gibbs sampler.

We used their Gibbs Sampler HMM LDA because this algorithm is more suitable for limited data sets than the standard LDA Gibbs sampler as there is no need to exclude stop words, such as full stop and comma, from the corpus of text, in other words, it takes different syntactic patterns of individual words in account. For example, as the topics are generated from words whose appearance varies between documents, some terms such as “the”, “I”, “is”, etc. which tend to occur in every document without much variation are less important to the construction of topics than words whose appearance varies across documents, e.g. “environment”, “DNA”, “bacteria”, etc. Therefore, in HMM LDA model, syntactic states of tokens do influence the generation of topics [11].



Gibbs Sampler HMM LDA Input


The table below provides the list of input variables that is required for the Gibbs Sampler HMM LDA and their description.

Variable Description
WS 1xN vector where WS(k) contains the vocabulary index of the kth word token, and is the number of word tokens. Word indices are non-zero based unless it is a sentence marker (“.”) which has a word index of 0
DS 1xN vector where DS(k) contains the document index of the kth word token, and is the number of word tokens. Document indices is also non-zero based
WO 1xW cell array of strings where WO(k) contains the kth vocabulary item and is the number of distinctive vocabulary item. In other words, a dictionary of all the words contained in the whole corpus of text that is being analysed.


Our code to convert text documents into this input file is available here. Here is the list of subjects that were analyse using the model:
Source Publish date of the texts Total number of documents (text files)
PubMed entries with the key word “Magnetosome” 1980 - 2016 429
PubMed entries with the key word “Synthetic Biology” 2013 - 2016 9628
iGEM team abstracts 2013 - 2015 693
iGEM team Human practices/Policy and Practices 2013 - 2015 572
Note that the number of documents does not add up to the number of iGEM teams each year, this is because some teams did not have abstract or human practices/P&P.
We ran each of the source’s texts separately.

Here is our code to split PubMed text files when viewed in MEDLINE format.

Teams abstracts, Human practices and P&P were scraped manually.
Parameters Description Value Reference
T Number of topics to be extracted. This was determined by looking at final results. The model was running with 10, 15 and 25 number of topics and the best one was chosen. T=15 for for all source except iGEM team abstracts.
T=25 for iGEM team abstracts
NS Number of syntactic states 12 [12]
N Number of iterations 200 [12]
α Alpha hyperparameter 50/T [12]
β Beta hyperparameter 0.01 [12]
γ Gamma hyperparameter 0.1 [12]
SEED Seed for random number generator 2 [12]


Gibbs Sampler HMM LDA Output


The Gibbs Sampler HMM LDA outputs the following [10]:

Variable Description
WP WxT matrix where W(i,j) contains the number of times words has been assigned to topic .
DP DxT matrix where DP(i,j) contains the number of times a word in document has been assigned to topic .
MP WxNS matrix where MP(i,j) contains the number of times word has been assigned to syntactic state .
Z and X Both has the size 1xn containing the topic and HMM state assignments respectively.


Topic Function Input and Output


The topic writing function reads the following inputs, which are the outputs from Gibbs Sampler HMM LDA [10]:

  • WP
  • BETA (β)
  • WO


For formatting of the result in output text file, the following inputs can be added and set to user preferences.

Variable Description
K The number most likely entities per topic (K) . We chose K=7.
E The threshold on the topics listing in S (string) in which is an output variable.
M The number of columns in the output text file (formatting).
String in programming is a sequence of characters.

Visualisation - Evaluation


The result can be evaluated either using Human-in-the-loop or metric. We used metric to assess the output of our model. Visualize topic function reads the following outputs from Gibbs Sampler HMM LDA [10]:

  • DP
  • Alpha (α)
  • The topic strings S output from the topic function.


The i and j counts in DP are transformed to probability distributions over documents for each topic. For reminder, DP(i,j) contains the number of times a word in document i has been assigned to topic j. The symmetrized KL- distance (Kullback-Leibler) between document distributions is calculated for each topic pair.

In general, KL-distance is difference between two probability distributions and topics are probability distributions. We have edited the code so that the visualisation function draws lines between each topic with a degree of thickness and opacity. The lines become darker and thicker with shorter distances and fainter and light with longer distances, between each topic. Furthermore, instead of just text appearing we have added circles and their size determines its probability of the topic. Bigger mean higher probably, vice-versa. The visualizer is set to only show top seven words in each topic.



Our version of the topic visualizer function is available here.

Other Methods for Evaluation


For interests, results can be evaluated either using Human-in- the-loop or metric. Human-in- the-loop means human interaction with the model and there are two methods in which the humans can interact with this topic model:

  • Word intrusion


    One word from a cluster is swapped with another cluster (intruder) and see whether a human can reliably tell which one is an intruder. If humans are able to find the intruder means that the trained topic is topically coherent (good) [14]. If not the topic has no discernible theme (bad) [15].

  • Topic intrusion


    In this method, an intrusion topic is inserted to a document that contains topics that are assigned to that document. Similar to word intrusion, humans are asked to identify the intrusion topic [13]



Human-in-the-loop is very costly as it is time consuming. Metrics are used to assess how good the model and there are several methods. Here is a list of some [15].
  • Cosine similarity
  • Size (#tokens assigned)
  • Within-doc rank
  • Similarity to corpus-wide distribution
  • Locally-frequent words
  • Co-doc Coherence




email twitter instagram facebook