ARCHIVED: Text analysis algorithms supported by CyberDH at IU

This content has been archived, and is no longer maintained by Indiana University. Information here may no longer be accurate, and links may no longer be available or reliable.

At Indiana University, the Research Technologies Cyberinfrastructure for Digital Humanities and Creative Activities (CyberDH) group supports text analysis primarily through the use of Jupyter Notebooks with scripts coded in Python and/or R. These notebooks are annotated for beginning coders. Third-party tools are also supported, including Cytoscape for creating network graphs, InPhO for topic modeling, and Voyant and AntConc for more general text analysis needs, such as word frequency or finding keywords in context (KWIC).

Algorithms	Availability	Output
Topic modeling
LDA: Latent Dirichlet Allocation (LDA) is used to find topics in the dataset and to determine which topics are most associated with which chunks (documents, lines, sentences, etc.) of your dataset. Currently, LDA is available only as a Jupyter Notebook and is coded only in Python.	GitHub	Tables, stacked histogram, interactive graph that includes a bubble chart with a histogram
LSA: Latent Semantic Analysis (LSA) is used to compare documents to one another and to determine which documents are most similar to each other. Currently, LSA is available only as a Jupyter Notebook and is coded only in Python.	GitHub	Table, heatmap
Word2Vec: Word2Vec is a group of related models used to produce word embeddings. By plugging in a word of interest, you can see which other words are most similar to it or appear in similar contexts in your text(s). Currently, Word2Vec is available only as a Jupyter Notebook and is coded only in Python.	GitHub	Table, scatter plot
Frequency counts
Ngram frequency: This determines the most frequently used ngrams (bigrams, trigrams, etc.) in a dataset. This is available in both notebook and script form, and is coded in both R and Python.	GitHub	Histogram, word cloud
Word frequency: This determines the most frequently used words in a dataset. This is available in both notebook and script form, and is coded in both R and Python.	GitHub	Histogram, word cloud
Streamgraph: This is both an algorithm and a type of graph. A streamgraph lets you compare the usage of several words across the corpus, showing where in the text each word becomes more or less frequent relative to the others. Currently, this is available in both notebook and script form, and is coded in both R and Python.	GitHub	Streamgraph
Sentiment analysis
VADER: Valence Aware Dictionary and sEntiment Reasoner (VADER) sentiment analysis is used to determine whether opinions expressed in textual data are generally positive, negative, or neutral. The VADER algorithm was created primarily for use with social media, but has also been adapted for use with other forms of textual data. Currently, VADER is available only as a Jupyter Notebook and is coded only in Python.	GitHub	Bar graphs, line graph, histograms
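
For readers new to coding, the frequency counts listed in the table above can be illustrated with a minimal Python sketch that uses only the standard library. This is not the CyberDH notebook code. The function names (`tokenize`, `word_frequency`, `ngram_frequency`, `counts_by_segment`), the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration. The last function produces the kind of per-segment counts a streamgraph could be drawn from, but it does not draw the graph itself.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (hypothetical, very simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency(tokens, top_n=5):
    """Return the top_n most frequent words with their counts."""
    return Counter(tokens).most_common(top_n)


def ngram_frequency(tokens, n=2, top_n=5):
    """Return the top_n most frequent n-grams (bigrams when n=2, trigrams when n=3)."""
    # Zipping the token list against shifted copies of itself yields consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams).most_common(top_n)


def counts_by_segment(tokens, words, segments=4):
    """Count each word of interest in consecutive, roughly equal slices of the text."""
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    rows = []
    for start in range(0, len(tokens), size):
        chunk = Counter(tokens[start:start + size])
        rows.append({word: chunk[word] for word in words})
    return rows


# Made-up sample text, used only to demonstrate the functions.
sample = "the cat sat on the mat and the cat saw the dog on the mat"
tokens = tokenize(sample)

print(word_frequency(tokens, top_n=3))
print(ngram_frequency(tokens, n=2, top_n=3))
print(counts_by_segment(tokens, ["cat", "mat"], segments=3))
```

In practice, a real corpus would usually also need stop-word removal and more careful tokenization before the counts are plotted as a histogram, word cloud, or streamgraph.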