Methodology
We used the MATLAB topic modelling toolbox written by Steyvers at the University of California, Irvine [10]. This toolbox contains several types of Gibbs sampler.
We used its Gibbs Sampler HMM LDA because this algorithm is more suitable for limited data sets than the standard LDA Gibbs sampler: there is no need to exclude stop words and punctuation, such as full stops and commas, from the corpus of text, because the model takes the different syntactic patterns of individual words into account. Since topics are generated from words whose appearance varies between documents, terms such as “the”, “I” and “is”, which tend to occur in every document without much variation, are less important to the construction of topics than words whose appearance does vary across documents, e.g. “environment”, “DNA” and “bacteria”. In the HMM LDA model, the syntactic states of tokens therefore influence the generation of topics [11].
Gibbs Sampler HMM LDA Input
The table below lists the input variables required by the Gibbs Sampler HMM LDA and describes each of them.
Variable | Description |
---|---|
WS | 1xN vector where WS(k) contains the vocabulary index of the kth word token and N is the total number of word tokens. Word indices start at 1, except for the sentence marker (“.”), which has a word index of 0 |
DS | 1xN vector where DS(k) contains the document index of the kth word token and N is the total number of word tokens. Document indices also start at 1 |
WO | 1xW cell array of strings where WO(k) contains the kth vocabulary item and W is the number of distinct vocabulary items. In other words, a dictionary of all the words contained in the whole corpus of text being analysed. |
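For illustration, the sketch below shows one way these three variables could be built from pre-tokenised documents. It is a toy example with assumed variable names, not the toolbox code or our conversion script.

```matlab
% Illustrative sketch: build WS, DS and WO from a cell array of documents,
% where docs{d} is a cell array of lower-case word tokens and sentence
% boundaries are marked with the token '.'.
docs = {{'synthetic','biology','is','growing','.'}, ...
        {'magnetosome','synthesis','in','bacteria','.'}};   % toy corpus

WO = {};    % vocabulary: dictionary of all distinct words in the corpus
WS = [];    % vocabulary index of every word token
DS = [];    % document index of every word token

for d = 1:numel(docs)
    for k = 1:numel(docs{d})
        tok = docs{d}{k};
        if strcmp(tok, '.')
            idx = 0;                          % sentence marker gets index 0
        else
            idx = find(strcmp(WO, tok), 1);   % look the token up in WO
            if isempty(idx)
                WO{end+1} = tok;              % new vocabulary item
                idx = numel(WO);
            end
        end
        WS(end+1) = idx;
        DS(end+1) = d;
    end
end
```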
Our code to convert text documents into these input variables is available here. Here is the list of sources that were analysed using the model:
Source | Publication dates of the texts | Total number of documents (text files) |
---|---|---|
PubMed entries with the key word “Magnetosome” | 1980 - 2016 | 429 |
PubMed entries with the key word “Synthetic Biology” | 2013 - 2016 | 9628 |
iGEM team abstracts | 2013 - 2015 | 693 |
iGEM team Human practices/Policy and Practices | 2013 - 2015 | 572 |
We ran the model on each source’s texts separately.
Here is our code for splitting PubMed text files exported in MEDLINE format.
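As a rough illustration of this step (with an assumed input file name; it is not the linked script itself), the following sketch splits a MEDLINE-format export into one text file per record:

```matlab
% Illustrative sketch: split a PubMed export saved in MEDLINE format into
% one text file per record, using the blank line between records as the
% delimiter. The input file name is hypothetical.
raw = fileread('pubmed_result.txt');
records = regexp(raw, '\r?\n\s*\r?\n', 'split');   % records are blank-line separated
for r = 1:numel(records)
    if isempty(strtrim(records{r})), continue; end
    fid = fopen(sprintf('doc_%04d.txt', r), 'w');  % one output file per record
    fprintf(fid, '%s\n', records{r});
    fclose(fid);
end
```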
Team abstracts, Human Practices and P&P pages were scraped manually.
Parameters | Description | Value | Reference |
---|---|---|---|
T | Number of topics to be extracted. This was determined by inspecting the final results: the model was run with 10, 15 and 25 topics and the best result was chosen. | T=15 for all sources except iGEM team abstracts; T=25 for iGEM team abstracts | |
NS | Number of syntactic states | 12 | [12] |
N | Number of iterations | 200 | [12] |
α | Alpha hyperparameter | 50/T | [12] |
β | Beta hyperparameter | 0.01 | [12] |
γ | Gamma hyperparameter | 0.1 | [12] |
SEED | Seed for random number generator | 2 | [12] |
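Putting the parameters together, a sketch of the sampler call is shown below. The argument order of GibbsSamplerHMMLDA and the OUTPUT verbosity flag are assumptions and should be checked against the toolbox documentation [10].

```matlab
% Sketch of running the sampler with the parameter values in the table above.
T      = 15;      % number of topics (25 for the iGEM team abstracts)
NS     = 12;      % number of syntactic states
N      = 200;     % number of Gibbs sampling iterations
ALPHA  = 50 / T;  % alpha hyperparameter
BETA   = 0.01;    % beta hyperparameter
GAMMA  = 0.1;     % gamma hyperparameter
SEED   = 2;       % seed for the random number generator
OUTPUT = 1;       % verbosity of progress messages (assumed setting)

% Assumed argument order; check the toolbox documentation [10].
[WP, DP, MP, Z, X] = GibbsSamplerHMMLDA(WS, DS, T, NS, N, ...
                                        ALPHA, BETA, GAMMA, SEED, OUTPUT);
```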
Gibbs Sampler HMM LDA Output
The Gibbs Sampler HMM LDA outputs the following [10]:
Variable | Description |
---|---|
WP | WxT matrix where WP(i,j) contains the number of times word i has been assigned to topic j. |
DP | DxT matrix where DP(i,j) contains the number of times a word in document i has been assigned to topic j. |
MP | WxNS matrix where MP(i,j) contains the number of times word i has been assigned to syntactic state j. |
Z and X | Both are 1xN vectors containing the topic and HMM state assignments of each word token, respectively. |
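As a standard post-processing step (not part of the toolbox output), the raw counts in WP can be converted into smoothed per-topic word probabilities, for example:

```matlab
% Sketch: convert WP (W x T counts) into P(word | topic) using beta smoothing.
WPfull = full(WP);                    % in case WP is stored as a sparse matrix
phi = bsxfun(@rdivide, WPfull + BETA, ...
             sum(WPfull, 1) + size(WPfull, 1) * BETA);
% phi(i,j) estimates the probability of word i under topic j.
```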
Topic Function Input and Output
The topic writing function takes the following inputs, which come from the Gibbs Sampler HMM LDA run [10]:
- WP
- BETA (β)
- WO
To control the formatting of the results in the output text file, the following inputs can also be provided and set according to user preference.
Variable | Description |
---|---|
K | The number of most likely entities (words) listed per topic. We chose K=7. |
E | The threshold applied to the topic listings in the output string S. |
M | The number of columns in the output text file (formatting). |
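A minimal sketch of this step is given below. The function name WriteTopics, its argument order (WP, BETA, WO, K, E, M, FILENAME) and the values of E and M are assumptions to be checked against the toolbox documentation [10].

```matlab
% Sketch: write the K most likely words per topic to a text file.
K = 7;      % number of most likely words listed per topic (our choice)
E = 0.001;  % threshold on the listed entries (illustrative value)
M = 5;      % number of columns in the output text file (illustrative value)
S = WriteTopics(WP, BETA, WO, K, E, M, 'topics.txt');
```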
Visualisation - Evaluation
The results can be evaluated either with a human-in-the-loop approach or with a metric. We used a metric to assess the output of our model. The topic visualisation function reads the following variables from the Gibbs Sampler HMM LDA run and the topic function [10]:
- DP
- Alpha (α)
- The topic strings S output from the topic function.
The counts in DP are transformed into a probability distribution over documents for each topic. As a reminder, DP(i,j) contains the number of times a word in document i has been assigned to topic j. The symmetrised KL (Kullback-Leibler) distance between these document distributions is then calculated for each pair of topics.
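A minimal sketch of this calculation is given below, assuming the DP counts and ALPHA value defined above; the toolbox's own visualisation code may organise it differently.

```matlab
% Sketch: symmetrised KL distance between the document distributions of each
% pair of topics, computed from DP (D x T counts) with alpha smoothing.
theta = full(DP) + ALPHA;                        % smooth the counts with alpha
theta = bsxfun(@rdivide, theta, sum(theta, 1));  % column j = P(document | topic j)
nTopics = size(theta, 2);
KLdist = zeros(nTopics);
for i = 1:nTopics
    for j = 1:nTopics
        p = theta(:, i);
        q = theta(:, j);
        % symmetrised KL: 0.5*KL(p||q) + 0.5*KL(q||p)
        KLdist(i, j) = 0.5 * sum(p .* log2(p ./ q)) + ...
                       0.5 * sum(q .* log2(q ./ p));
    end
end
```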
In general, the KL distance measures the difference between two probability distributions, and each topic here is a probability distribution over documents. We edited the code so that the visualisation function draws lines between each pair of topics whose thickness and opacity depend on this distance: the lines become darker and thicker for shorter distances, and fainter and thinner for longer distances. Furthermore, instead of plain text we added circles whose size represents the probability of the topic: a bigger circle means a higher probability, and vice versa. The visualiser is set to show only the top seven words in each topic.
Our version of the topic visualizer function is available here.
Other Methods for Evaluation
For interest, results can be evaluated either using human-in-the-loop approaches or metrics. Human-in-the-loop means human interaction with the model, and there are two ways in which humans can interact with this topic model:
- Word intrusion
One word in a topic’s word cluster is swapped with a word from another cluster (the intruder), and a human is asked to identify the intruder. If humans can reliably find the intruder, the trained topic is topically coherent (good) [14]; if not, the topic has no discernible theme (bad) [15].
- Topic intrusion
In this method, an intruder topic is inserted among the topics assigned to a document. Similar to word intrusion, humans are asked to identify the intruder topic [13].
Metric-based evaluation methods include:
- Cosine similarity
- Size (#tokens assigned)
- Within-doc rank
- Similarity to corpus-wide distribution
- Locally-frequent words
- Co-doc Coherence