Using digital form of the actual scripts of the 'Star Trek' science fiction series to perform interesting NLP tasks and answering some questions on Topic Modelling, Character properties and the plot as a whole.
somjit101/NLP-Star-Trek-Scripts
The dataset used is available on Kaggle Datasets. It consists of .json files with the full scripts of all Star Trek series, processed into character lines:
Structure:
allserieslines = {seriesname : {episode number : {character : alllines } }}
e.g. allserieslines['DS9']['episode 0']['SISKO']
Text scraped from http://www.chakoteya.net/StarTrek/index.html
Preliminary browsing suggests that 'DS9' has the fewest typos and missing-space errors, making it a good-quality dataset for our purposes.
Answering some questions from the given dataset using Natural Language Processing and Machine Learning techniques
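The nested structure above can be navigated with plain dictionary lookups. The following is a minimal sketch, assuming each character maps to a list of line strings; the helper names and the sample data are made up for illustration and are not taken from the notebook.

```python
import json


def load_all_series_lines(path):
    """Load the nested {series: {episode: {character: lines}}} dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def character_lines(all_series_lines, series, episode, character):
    """Return one character's lines in one episode, or [] if any key is missing."""
    return all_series_lines.get(series, {}).get(episode, {}).get(character, [])


# Tiny made-up sample with the same shape as the dataset (not real script text).
sample = {
    "DS9": {
        "episode 0": {
            "SISKO": ["Report.", "On screen."],
            "KIRA": ["Yes, Commander."],
        }
    }
}

sisko_lines = character_lines(sample, "DS9", "episode 0", "SISKO")
print(len(sisko_lines))
```

Using `.get` with empty defaults avoids a `KeyError` when a character does not appear in a given episode.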
To count the number of lines for each character, we count that character's entries across all episodes. Arranging the characters in descending order of line count and plotting a bar graph, we get:
- SISKO
- KIRA
- BASHIR
- O'BRIEN
- ODO
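The line count described above can be sketched as follows. This is an illustration under the assumption that each character maps to a list of lines; the function name and the toy episodes are hypothetical.

```python
from collections import Counter


def count_lines(series_episodes):
    """Sum the number of lines per character across all episodes of one series."""
    counts = Counter()
    for episode in series_episodes.values():
        for character, lines in episode.items():
            counts[character] += len(lines)
    return counts


# Made-up toy data: {episode: {character: [lines]}}.
toy_series = {
    "episode 0": {"SISKO": ["a", "b", "c"], "KIRA": ["d"]},
    "episode 1": {"SISKO": ["e"], "KIRA": ["f", "g"], "ODO": ["h"]},
}

line_counts = count_lines(toy_series)
ranking = line_counts.most_common()  # descending order of line count
print(ranking)
```

The `ranking` list is already sorted in descending order, so it can be passed directly to a bar plot.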
This means we are interested in the character who utters the most words, so we count the number of words spoken by each character.
Again, plotting a bar graph of the number of words each character utters over the whole series, in descending order, we obtain:
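Counting words instead of lines only changes what is summed per line. The sketch below assumes whitespace tokenization, which may differ from the tokenization used in the notebook; the names and toy data are hypothetical.

```python
from collections import Counter


def count_words(series_episodes):
    """Sum whitespace-separated words per character across all episodes."""
    counts = Counter()
    for episode in series_episodes.values():
        for character, lines in episode.items():
            counts[character] += sum(len(line.split()) for line in lines)
    return counts


# Made-up toy data, not real script text.
toy_series = {
    "episode 0": {
        "SISKO": ["Report.", "On screen."],
        "KIRA": ["Yes, Commander, right away."],
    }
}

word_counts = count_words(toy_series)
most_talkative = word_counts.most_common(1)[0][0]
print(most_talkative, word_counts[most_talkative])
```

In this toy example KIRA has fewer lines than SISKO but more words, which is why the two questions can have different answers.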
Question 3. What are the main topics each main character talks about? Define how you determined main character.
Looking at the bar plots above, we notice a sharp fall in the number of lines/words for characters beyond 'DAX'.
Hence, we assume there are 7 Main Characters:
- SISKO
- KIRA
- BASHIR
- O'BRIEN
- ODO
- QUARK
- DAX
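The cutoff above was chosen by inspecting the bar plots. One possible way to automate a similar choice is to cut the ranking at the largest relative drop between consecutive characters. This is only a sketch of that heuristic, not the method used in the notebook; the function, its `min_rank` parameter, and the counts are all hypothetical.

```python
def main_characters(counts, min_rank=2):
    """Keep the characters ranked before the largest relative drop in counts.

    At least `min_rank` characters are always kept.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    best_cut, best_drop = len(ranked), 0.0
    for i in range(min_rank, len(ranked)):
        prev, cur = ranked[i - 1][1], ranked[i][1]
        drop = (prev - cur) / prev if prev else 0.0
        if drop > best_drop:
            best_cut, best_drop = i, drop
    return [name for name, _ in ranked[:best_cut]]


# Made-up counts for placeholder characters.
toy_counts = {"CHAR_A": 1000, "CHAR_B": 950, "CHAR_C": 900, "CHAR_D": 300, "CHAR_E": 280}
mains = main_characters(toy_counts)
print(mains)
```

On real data the long tail of minor characters contains large relative drops between very small counts (for example from 2 lines to 1), so the ranking would likely need to be truncated to the top few dozen characters before applying this heuristic.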
Now, we use Latent Dirichlet Allocation (LDA) for Topic Modelling of the dialogues of the main characters in the series.
LDA assigns the text in a document to one or more topics and is primarily useful in an unsupervised setting such as this. It builds a topics-per-document model and a words-per-topic model, both with Dirichlet priors.
- Each document is modeled as a multinomial distribution of topics and each topic is modeled as a multinomial distribution of words.
- LDA assumes that every chunk of text we feed into it contains words that are somehow related. Therefore, choosing the right corpus of data is crucial.
- It also assumes documents are produced from a mixture of topics. Those topics then generate words based on their probability distribution.
More information about LDA can be found in the original paper.
Note: We assume that the series is built around at most 10 frequent topics.
On training a multicore LDA model on our corpus with Bag-of-Words encoded text data, we obtained the following 10 topics in the script:
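The generative assumptions listed above can be made concrete with a small simulation: draw a topic mixture for a document from a Dirichlet distribution, then draw each word by first picking a topic and then picking a word from that topic. This sketch only illustrates the generative story, it does not fit a model; the two toy topics, their word probabilities, and the `alpha` values are made up.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, rng):
    """Generate `length` words from a mixture of topics, as LDA assumes."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta)[0]  # pick a topic
        vocab, probs = zip(*topic_word[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])  # pick a word from it
    return theta, words


rng = random.Random(0)
# Hypothetical word distributions for two toy topics.
topic_word = [
    {"shields": 0.5, "warp": 0.3, "phasers": 0.2},
    {"feel": 0.4, "sorry": 0.4, "love": 0.2},
]
theta, doc = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta, doc)
```

Training LDA runs this story in reverse: given only the words, it infers the topic mixtures and the word distributions that most plausibly produced them.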
- Topic 0: Actions
- Topic 1: Spaceship Commands
- Topic 2: Spaceship
- Topic 3: Extraterrestrial Life
- Topic 4: Information
- Topic 5: Intellect
- Topic 6: Leadership
- Topic 7: Emotions
- Topic 8: Feelings
- Topic 9: Time and Events
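The Bag-of-Words encoding that feeds the LDA model can be sketched in plain Python. Each document becomes a sparse list of (token id, count) pairs, which is the general shape that topic-modelling libraries such as gensim consume. The helper names, the lowercase whitespace tokenization, and the example lines are assumptions for illustration.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Encode a tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())


# Made-up example lines, not real script text.
raw_lines = ["Raise shields", "Shields up red alert", "Red alert"]
docs = [line.lower().split() for line in raw_lines]
vocab = build_vocabulary(docs)
bow_corpus = [to_bow(doc, vocab) for doc in docs]
print(bow_corpus)
```

Word order is discarded in this representation, which matches the LDA assumption that a document is characterized by which words occur and how often.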
Finally, we test the dialogues of each main character to find out the TOP 2 topics they talk about throughout the series.
Based on the obtained results we see that the Main Characters talk about the following in the series:
Character | Top 2 Topics |
---|---|
SISKO | Spaceship, Time and Events |
KIRA | Extraterrestrial Life, Time and Events |
BASHIR | Intellect, Time and Events |
O'BRIEN | Time and Events, Actions |
ODO | Intellect, Spaceship |
QUARK | Emotions, Extraterrestrial Life |
DAX | Time and Events, Actions |
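One way the top 2 topics per character could be derived from a trained model is to sum the per-document topic probabilities over all of a character's dialogue and keep the two largest totals. This is an assumed aggregation rule, not necessarily the one used in the notebook, and the probabilities below are made-up toy numbers rather than model output.

```python
TOPIC_NAMES = [
    "Actions", "Spaceship Commands", "Spaceship", "Extraterrestrial Life",
    "Information", "Intellect", "Leadership", "Emotions", "Feelings",
    "Time and Events",
]


def top_topics(doc_topic_probs, k=2):
    """Return the ids of the k topics with the highest summed probability."""
    n_topics = len(doc_topic_probs[0])
    totals = [0.0] * n_topics
    for probs in doc_topic_probs:
        for topic_id, p in enumerate(probs):
            totals[topic_id] += p
    ranked = sorted(range(n_topics), key=lambda t: totals[t], reverse=True)
    return ranked[:k]


# Toy topic distributions for two documents of one hypothetical character.
toy_probs = [
    [0.05, 0.05, 0.40, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.20],
    [0.05, 0.05, 0.30, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.30],
]
top_topic_names = [TOPIC_NAMES[t] for t in top_topics(toy_probs)]
print(top_topic_names)
```

Running the same function on the documents of a single season instead of the whole series would give season-level results.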
Based on the results shown in the iPython Notebook, here are the top 2 topics each of the main characters talks about as the seasons change (one table per season):
Character | Top 2 Topics |
---|---|
SISKO | Intellect, Time and Events |
KIRA | Time and Events, Spaceship |
BASHIR | Time and Events, Actions |
O'BRIEN | Time and Events, Spaceship Commands |
ODO | Time and Events, Intellect |
QUARK | Extraterrestrial Life, Spaceship Commands |
DAX | Time and Events, Actions |
Character | Top 2 Topics |
---|---|
SISKO | Time and Events, Spaceship |
KIRA | Time and Events, Actions |
BASHIR | Intellect, Spaceship Commands |
O'BRIEN | Time and Events, Actions |
ODO | Time and Events, Intellect |
QUARK | Extraterrestrial Life, Emotions |
DAX | Time and Events, Spaceship |
Character | Top 2 Topics |
---|---|
SISKO | Spaceship, Time and Events |
KIRA | Time and Events, Extraterrestrial Life |
BASHIR | Intellect, Actions |
O'BRIEN | Time and Events, Actions |
ODO | Spaceship, Intellect |
QUARK | Emotions, Extraterrestrial Life |
DAX | Time and Events, Actions |
Character | Top 2 Topics |
---|---|
SISKO | Spaceship, Time and Events |
KIRA | Time and Events, Extraterrestrial Life |
BASHIR | Feelings, Time and Events |
O'BRIEN | Time and Events, Intellect |
ODO | Spaceship, Feelings |
QUARK | Emotions, Intellect |
DAX | Time and Events, Feelings |
Character | Top 2 Topics |
---|---|
SISKO | Spaceship, Time and Events |
KIRA | Extraterrestrial Life, Spaceship |
BASHIR | Intellect, Time and Events |
O'BRIEN | Time and Events, Spaceship |
ODO | Time and Events, Intellect |
QUARK | Emotions, Information |
DAX | Spaceship, Time and Events |
Character | Top 2 Topics |
---|---|
SISKO | Extraterrestrial Life, Spaceship |
KIRA | Feelings, Extraterrestrial Life |
BASHIR | Leadership, Time and Events |
O'BRIEN | Time and Events, Intellect |
ODO | Intellect, Feelings |
QUARK | Emotions, Information |
DAX | Feelings, Leadership |
Character | Top 2 Topics |
---|---|
SISKO | Feelings, Spaceship |
KIRA | Leadership, Extraterrestrial Life |
BASHIR | Intellect, Leadership |
O'BRIEN | Time and Events, Actions |
ODO | Intellect, Spaceship |
QUARK | Emotions, Feelings |
DAX | Actions, Spaceship Commands |
Let us investigate the similarities between the Main Characters in the series based on their dialogues, and how these similarities change over the seasons. A close look at the results reveals the influence the characters have on each other over time in the Star Trek universe, which, in my opinion, is a very interesting insight.
We have used the Sentence BERT (Pre-trained Transformer Network - RoBERTa Large) model to encode sentences into 1024-dimensional embedded vectors.
As we can observe from the heatmaps of the similarity matrices for the main characters, the similarities are lower in the first season, suggesting that the characters are new and not yet influenced by one another. As the seasons progress, we notice a gradual shift towards higher similarities between the characters, with Season 4 and Season 5 showing significantly higher average similarities. This indicates that the characters themselves begin to influence the scriptwriting as the show progresses.
Lastly, Season 7, the final season, shows an abnormally high degree of similarity between the characters, except DAX. This could be due to the following reasons:
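Once each sentence has an embedding, a character-by-character similarity matrix can be computed. The sketch below assumes that a character is represented by the mean of their sentence embeddings and that similarity is cosine similarity; both are assumptions, since the exact aggregation is only shown in the notebook. The 3-dimensional toy vectors stand in for the real 1024-dimensional embeddings.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(sentence_embeddings):
    """Build a character-by-character cosine similarity matrix."""
    names = sorted(sentence_embeddings)
    centroids = {name: mean_vector(sentence_embeddings[name]) for name in names}
    matrix = [[cosine(centroids[a], centroids[b]) for b in names] for a in names]
    return names, matrix


# Made-up toy embeddings, one list of sentence vectors per character.
toy_embeddings = {
    "SISKO": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "KIRA": [[0.9, 0.1, 0.0]],
    "QUARK": [[0.0, 0.0, 1.0]],
}
names, matrix = similarity_matrix(toy_embeddings)
print(names)
```

Computing one such matrix per season and plotting each as a heatmap gives a season-by-season view of how the similarities change.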
- Since we do not have the episode scripts from the last 3 episodes, our embeddings of the dialogues are not as accurate.
- Reduced participation of the main characters during the last season, e.g. some dying off, being eliminated or being outcast.
- The entirety of Season 7 focusing on a limited set of plot concepts, causing heavy similarities between the dialogues of the main characters.
If I had more time, I would try to model the interactions between some of the important characters in the show by finding the top supporting characters each of them most frequently shares scenes with. Building a connected graph of these scene associations, with the characters as vertices and edge strengths proportional to the screen time shared between any 2 of them, would allow a deeper understanding of the substructures of the different scenes, settings and moods of the show, in conjunction with the topics covered by each character.
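The proposed graph could be prototyped as a weighted adjacency map. This is a sketch of the idea only: the dataset described above groups lines by episode, so the scene-level input used here (a list of characters plus a shared-time weight per scene) is a hypothetical format that would have to be extracted separately.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(scenes):
    """Build an undirected weighted graph from (characters, weight) scenes.

    The edge weight between two characters is the total weight of the
    scenes they share.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for characters, weight in scenes:
        for a, b in combinations(sorted(set(characters)), 2):
            graph[a][b] += weight
            graph[b][a] += weight
    return graph


def top_partner(graph, character):
    """Return the character sharing the most scene weight with `character`."""
    neighbours = graph[character]
    return max(neighbours, key=neighbours.get)


# Made-up scenes: (characters present, shared screen time in arbitrary units).
toy_scenes = [
    (["SISKO", "KIRA"], 3),
    (["SISKO", "KIRA", "ODO"], 2),
    (["QUARK", "ODO"], 5),
]
graph = build_cooccurrence_graph(toy_scenes)
print(top_partner(graph, "ODO"))
```

Storing both directions of each edge keeps neighbour lookups simple at the cost of duplicating the weights.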