We live in an information age, in which information is abundant and easily accessible. Information is not only available; it is a crucial part of modern society. A familiar example is Google, which uses topic model algorithms (a branch of machine learning) to cluster the terms entered into its search engine. This allows it to segment markets, learn people’s preferences, predict search terms and build content recommendation systems; the targeted Google ads we encounter every day are one visible result. Google is not the only company using this tool: large companies such as Facebook, Amazon and Yahoo are all taking advantage of topic models and constantly improving them.

Knowing the capabilities of topic models, we decided to apply one to analyse our project aims quantitatively as part of our integrated human practices. A Big Data approach such as topic modelling could prove to be a powerful quantitative tool for iGEM projects. Our main aim was to analyse our hypothesis and its impact on society and on the synthetic biology community.

Topic Model

A topic model allows us to identify broad and subtle patterns in a particular subject that we would otherwise be unable to detect, because we cannot process such a huge collection of texts ourselves [1]. This is very useful as it provides a simple way to analyse large volumes of unlabelled text, in other words to identify the topics that a given piece of content is about. One application is text summarisation, which condenses the data presented to the user; search engines such as Google are an example of technology that uses it.

Background – What is Topic Modelling?

Topic modelling is one of many applications of machine learning, a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed [2]. A machine learning program looks for patterns in data and uses what it finds to adjust its own behaviour accordingly. There are two main groups of machine learning applications: supervised learning and unsupervised learning.

As seen in figure 1, if the dataset contains labels the task is classed as supervised learning. Within supervised learning, a categorical label makes the task classification and a quantitative label makes it regression [3]. In unsupervised learning there are no such labels, so the algorithm instead clusters the data or finds latent variables or structure in it [4].
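The difference between the two groups can be sketched in a few lines of Python. This is only an illustration: the data values, the label names and the helper functions `classify` and `two_means` are invented for this example and are not part of our project code. The supervised function needs the labels to make a prediction, while the unsupervised one receives only the raw values and has to find the groups itself.

```python
# Toy 1-D data, invented for illustration only.
labelled = [(1.0, "short"), (1.5, "short"), (8.0, "long"), (9.0, "long")]
unlabelled = [1.0, 1.5, 8.0, 9.0]


def classify(x, examples):
    """Supervised: return the label of the nearest labelled example (1-nearest neighbour)."""
    nearest = min(examples, key=lambda pair: abs(pair[0] - x))
    return nearest[1]


def two_means(points, iterations=10):
    """Unsupervised: split points into two groups with a basic k-means (k = 2)."""
    centres = [min(points), max(points)]
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            # Assign each point to the closer of the two current centres.
            idx = 0 if abs(p - centres[0]) <= abs(p - centres[1]) else 1
            groups[idx].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return groups


prediction = classify(2.0, labelled)   # uses the labels
clusters = two_means(unlabelled)       # never sees a label
print(prediction)
print(clusters)
```

The labels here are categorical, so the supervised half is a classification task; with numeric labels the same idea would become regression.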

Clustering, or cluster analysis, groups objects so that the objects in one cluster are more similar to each other than to objects in other clusters of the same dataset [5]. To give a more general understanding, here are the characteristics of topic models:

Exploratory: used to discover, browse or search large collections of unlabelled text data
Latent variables: extract hidden thematic structure that is not labelled. These can be abstract topics, each a cluster of words that together carry some meaning
Clustering: groups both words and documents into clusters

Types of Clustering and Topic Model Clustering

In machine learning there are several different kinds of clustering:

Hard clustering: each instance of the dataset can belong to only one cluster, because the boundaries between clusters are strict
Hierarchical clustering: clusters contain sub-clusters, which have parent and child relationships with one another
Soft (fuzzy) clustering: each instance can belong to every cluster to a certain degree, which can be thought of as a percentage or a probability. This is the kind used by topic modelling
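The contrast between hard and soft clustering can be shown with a minimal Python sketch. The two cluster centres, the `sharpness` parameter and the function names are assumptions made up for this illustration; the soft memberships are computed with a simple softmax over distances, which is one possible way to obtain degrees that sum to 1 and not the method a topic model itself uses.

```python
import math

# Two hypothetical cluster centres on a line, invented for illustration.
centres = {"A": 0.0, "B": 10.0}


def hard_assign(x):
    """Hard clustering: the point belongs to exactly one cluster (the nearest)."""
    return min(centres, key=lambda name: abs(x - centres[name]))


def soft_assign(x, sharpness=0.5):
    """Soft clustering: a degree of membership per cluster, summing to 1."""
    weights = {name: math.exp(-sharpness * abs(x - c)) for name, c in centres.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


point = 4.0
print(hard_assign(point))
print(soft_assign(point))
```

For the point 4.0, hard clustering answers only "A", whereas soft clustering reports that the point belongs mostly to A (about 73%) and partly to B (about 27%).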

Understanding the Machine Learning Algorithm

It is important to know what a program does before asking how it works. The black box in figure 3 represents the algorithm of a topic model. As the figure illustrates, the input is a collection of text documents and there are three outputs:

Clusters of words, where each cluster defines a topic
The frequency of the words appearing in each topic
The distribution of topics in each document. For example, politics and culture might both appear in the same document.

Furthermore, one word can belong to several clusters, because words have different meanings in different contexts. This type of modelling represents that as well.
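To make these outputs concrete, here is a hand-written Python sketch of what they might look like for two topics and two documents. Every word, number and name in it is invented for illustration, and a real model returns unnamed clusters; the labels "politics" and "culture" are the kind of name a human would attach afterwards.

```python
# Hypothetical topic model output; all words and numbers are invented.
# Outputs 1 and 2: clusters of words, with the weight of each word in its topic.
topics = {
    "politics": {"election": 0.5, "party": 0.3, "vote": 0.2},
    "culture": {"festival": 0.5, "party": 0.3, "music": 0.2},
}
# Output 3: the share of each topic in each document.
doc_topics = {
    "doc_1": {"politics": 0.7, "culture": 0.3},
    "doc_2": {"politics": 0.1, "culture": 0.9},
}


def top_words(topic, n=2):
    """Return the n highest-weighted words of a topic."""
    weights = topics[topic]
    return sorted(weights, key=weights.get, reverse=True)[:n]


def topics_containing(word):
    """Return every topic in which the word appears."""
    return [name for name, weights in topics.items() if word in weights]


def word_probability(word, doc):
    """Probability of a word in a document: sum over topics of share * word weight."""
    return sum(share * topics[name].get(word, 0.0)
               for name, share in doc_topics[doc].items())


print(top_words("politics"))
print(topics_containing("party"))
print(word_probability("festival", "doc_2"))
```

The word "party" appears in both topics, which is how the model captures a word having different meanings in different contexts.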

Latent Dirichlet Allocation

One of the most popular algorithms in topic modelling is Latent Dirichlet Allocation, commonly known as LDA. There is, however, another model also abbreviated LDA: Linear Discriminant Analysis, which is not a topic model and is used for supervised learning. The two LDAs are unrelated, so it is important to state early which one is being referred to. Here we are only concerned with Latent Dirichlet Allocation.
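For readers who want to see Latent Dirichlet Allocation in action, the following is a minimal sketch that fits it by collapsed Gibbs sampling, one common inference method. It is not necessarily the implementation used in our analysis. The function name `lda_gibbs`, the toy corpus and the hyperparameter values `alpha` and `beta` are assumptions chosen for illustration, and only the Python standard library is used.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of lists of words."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                     # words of doc d in topic k
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                                   # all words in topic k
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed estimates: topic shares per document and word weights per topic.
    theta = [[(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
             for counts, doc in zip(doc_topic, docs)]
    phi = [{w: (tw[w] + beta) / (total + vocab_size * beta) for w in vocab}
           for tw, total in zip(topic_word, topic_total)]
    return theta, phi


# Toy corpus, invented for illustration.
docs = [
    "election vote party election".split(),
    "vote party election vote".split(),
    "festival music party festival".split(),
    "music festival music party".split(),
]
theta, phi = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 2) for p in row])
```

Here `theta` holds the distribution of topics in each document and `phi` holds the weight of every word in each topic. Each row of both is a probability distribution, which is the soft clustering described earlier. Because the sampler is random, the topics found depend on the seed and on the number of iterations.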

Team:Kent/Model
