We live in an information age, in which information is abundant and easily accessible. Not only is it
available, it is also a crucial part of modern society. An easy example is Google, which uses
topic model algorithms (a branch of machine learning) to cluster the terms entered into its search engine.
This allows the company to perform market segmentation, learn people’s preferences, predict search terms and build a
content recommendation system. A familiar result is the annoying Google ads that we deal with every day. Google
is not the only company using this tool: big companies such as Facebook, Amazon and Yahoo
are all taking advantage of topic models and constantly improving them.
Knowing the capabilities of topic models, we decided to apply them to analyse our project aims quantitatively as part of our integrated human practices. A Big Data approach such as topic modelling could prove to be a powerful quantitative tool for iGEM projects. Our main aim was therefore to analyse our hypothesis and its impact on society and on the synthetic biology community.
Topic models allow us to identify broad and subtle patterns in a particular subject that we would otherwise be unable to detect, because we cannot process such a huge collection of texts ourselves. This is very useful as it provides a simple way to analyse large volumes of unlabelled text, in other words, to identify the topics that a given piece of content is about. One application is text summarisation, which condenses the data presented to the user; search engines such as Google are an example of technology that uses it.
Background – What is Topic Modelling?
Topic modelling is one of many applications of machine learning, a type of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. A machine learning program looks for patterns in data and uses the patterns it finds to adjust its own behaviour accordingly. There are two main groups of machine learning applications: supervised learning and unsupervised learning.
As seen in figure 1, if the dataset contains labels, the task is classed as supervised learning. Within supervised learning, a categorical label makes the task classification and a quantitative label makes it regression. In unsupervised learning there are no such labels, so the data are instead clustered or searched for latent variables/structure.
Clustering, or cluster analysis, groups objects so that those in the same cluster are more similar to each other than to objects in other clusters. For a more general understanding, here are the characteristics of topic models:
- Exploratory: to discover, browse or search large collections of unlabelled text data
- Latent variables: extracting hidden (unlabelled) thematic structure. This could be abstract topics, each a cluster of words that together carry some meaning
- Clustering: clusters of words and collections of documents
Types of Clustering and Topic Model Clustering
In machine learning there are several kinds of clustering:
Hard clustering is where each instance of the dataset can belong to only one cluster, because the dividing lines between clusters are hard. Hierarchical clustering contains sub-clusters within each cluster, with ‘parent and child’ relationships between them. Soft (fuzzy) clustering is the kind used by topic modelling: each instance can belong to every cluster to a certain degree, which can be thought of as a percentage or a probability.
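The difference between hard and soft clustering can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the one-dimensional data point, the two cluster centres, the function names and the `sharpness` parameter are hypothetical, and the soft memberships are computed with a simple softmax over negative squared distances rather than with any particular topic model.

```python
import math

def hard_assignment(point, centroids):
    """Hard clustering: return the index of the single nearest centroid."""
    distances = [abs(point - c) for c in centroids]
    return distances.index(min(distances))

def soft_assignment(point, centroids, sharpness=1.0):
    """Soft clustering: return one membership degree per cluster (they sum to 1)."""
    # Closer centroids receive larger weights (softmax over negative squared distance).
    weights = [math.exp(-sharpness * (point - c) ** 2) for c in centroids]
    total = sum(weights)
    return [w / total for w in weights]

centroids = [0.0, 4.0]  # hypothetical cluster centres
point = 1.5             # hypothetical data point

hard = hard_assignment(point, centroids)
soft = soft_assignment(point, centroids, sharpness=0.5)

print("hard cluster:", hard)
print("soft memberships:", [round(m, 3) for m in soft])
```

The hard assignment returns a single cluster index, whereas the soft assignment returns a degree of membership for every cluster, which is the form of output a topic model gives for the topics in a document.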
Understanding the Machine Learning Algorithm
It is important to know what a program does before asking how it works. The black box in figure 3 represents the algorithm of a topic model. As illustrated above, the input is a collection of text documents and there are three outputs:
- Clusters of words, where each cluster defines a topic
- The frequency of the words appearing in each topic
- The distribution of topics in documents. For example, politics and culture might appear in the same document.
Latent Dirichlet Allocation
One of the most popular algorithms in topic modelling is Latent Dirichlet Allocation, commonly known as LDA. There is, however, another model (not a topic model) that is also abbreviated LDA: Linear Discriminant Analysis, which is used for supervised learning. The two LDAs are unrelated, so it is important to state early on which one is being referred to. Here we are concerned only with Latent Dirichlet Allocation.
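To make the input and the three outputs described earlier more concrete, here is a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling, one common way of estimating the model. It is not necessarily the method or software used in this project. The function name `fit_lda`, the hyperparameter values `alpha` and `beta`, the number of iterations and the toy corpus are all assumptions chosen for illustration.

```python
import random

def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative sketch only).

    docs is a list of token lists. Returns (topic_word, doc_topic, vocab):
    topic_word[k][v] estimates P(word v | topic k),
    doc_topic[d][k] estimates P(topic k | document d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic_counts = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word_counts = [[0] * V for _ in range(K)]   # times word v is assigned to topic k
    topic_totals = [0] * K                            # total tokens assigned to topic k
    assignments = []                                  # current topic of every token

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            doc_topic_counts[d][k] += 1
            topic_word_counts[k][word_id[w]] += 1
            topic_totals[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic_counts[d][k] -= 1
                topic_word_counts[k][v] -= 1
                topic_totals[k] -= 1
                # Unnormalised conditional probability of each topic for this token.
                weights = [
                    (doc_topic_counts[d][t] + alpha)
                    * (topic_word_counts[t][v] + beta) / (topic_totals[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_counts[d][k] += 1
                topic_word_counts[k][v] += 1
                topic_totals[k] += 1

    # Turn the final counts into smoothed probability distributions.
    topic_word = [
        [(topic_word_counts[k][v] + beta) / (topic_totals[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    doc_topic = [
        [(doc_topic_counts[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    return topic_word, doc_topic, vocab

# Hypothetical toy corpus: the input is a collection of tokenised documents.
docs = [
    "gene dna cell gene protein".split(),
    "dna protein cell gene".split(),
    "vote election party vote".split(),
    "party election vote policy".split(),
    "gene policy vote dna".split(),
]
topic_word, doc_topic, vocab = fit_lda(docs, num_topics=2)

# Outputs 1 and 2: the most probable words of each topic, with their weights.
for k, dist in enumerate(topic_word):
    top = sorted(zip(vocab, dist), key=lambda pair: -pair[1])[:3]
    print(f"topic {k}:", [(w, round(p, 2)) for w, p in top])

# Output 3: the distribution of topics in each document.
for d, dist in enumerate(doc_topic):
    print(f"doc {d}:", [round(p, 2) for p in dist])
```

Each row of `topic_word` is a probability distribution over the vocabulary and each row of `doc_topic` is a probability distribution over topics, which is the soft clustering behaviour described above. On a corpus this small the topics found depend heavily on the random seed, so the printed numbers should be read as an illustration of the output format rather than as meaningful results.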