Archive for the ‘Text Mining and Natural Language Processing’ Category

Charting & graphing with Tableau Public

Friday, May 4th, 2018

They say, “A picture is worth a thousand words”, and through use of something like Tableau this can become a reality in text mining.

After extracting features from a text, you will have almost invariably created lists. Each of the items on the lists will be characterized with bits of context, thus transforming the raw data into information. These lists will probably take the shape of matrices (sets of rows & columns), but other data structures exist as well, such as networked graphs. Once the data has been transformed into information, you will want to make sense of the information — turn the information into knowledge. Charting & graphing the data is one way to make that happen.

For example, the reader may have associated each word in a text with a part-of-speech, and then this association was applied across a corpus. The reader might then ask, “To what degree are words associated with each part-of-speech similar or different across items in the corpus? Do different items include similar or different personal pronouns, and therefore, are some documents more male, more female, or more gender neutral?” Alternatively, suppose the named entities have been extracted from items in a corpus, then the reader may want to know, “What countries, states, and/or cities are mentioned in the text/corpus, and to what degree? Are these texts ‘European’ in scope?”

A charting & graphing application like Tableau (or Tableau Public) can address these questions. [1] The first can be answered by enabling the reader to select one or more items from a corpus, select one or more parts-of-speech, count & tabulate the intersecting words, and display the result as a word cloud. The second question can be addressed similarly. Allow the reader to select items from a corpus, extract the names of places (countries, states, and cities), and plot the geographic coordinates on a global map. Once these visualizations are complete, they can be saved on the Web for others to use, for example:

[published examples: a word cloud and a map]

Creating visualizations with Tableau (or Tableau Public) takes practice. Not only does the reader need to have structured data in hand, but one also needs to be patient in learning the interface. To the author’s mind, the whole thing is reminiscent of the venerable HyperCard program from the 1980s, where one was presented with a number of “cards”, and programming interfaces were created by placing “objects” on them.

[screenshot: the Tableau interface]

This workshop comes with two previously created Tableau workbooks located in the etc directory (word-clouds.twbx and maps.twbx). Describing the process to create them is beyond the scope of this workshop, but an outline follows:

  1. amass sets of data, like parts-of-speech or named entities
  2. import the data into Tableau
  3. in the case of the named entities, convert the data to “Geographic Roles”
  4. drag data elements to the Marks, Rows, or Columns cards
  5. make liberal use of the Show Me feature
  6. drag data elements to the Filters card
  7. observe the visualizations and turn your information into knowledge

Tableau is not really intended to be used against textual data/information; Tableau is more useful and more functional when applied to tabular numeric data. After all, the program is called… Tableau. This does not mean Tableau cannot be exploited by the text miner. It just means its use requires practice and an ability to articulate a question to be answered with the help of a visualization.
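For example, the following is a minimal, hypothetical sketch of coercing a set of pre-processed part-of-speech files into a single “tidy”, tab-delimited table ready to be imported into Tableau. The directory name and the three-column (token, lemma, part-of-speech) layout are assumptions; adjust them to the data actually at hand:

# roll a directory of tab-delimited part-of-speech files into one long table for Tableau
import csv, glob, os
from collections import Counter

rows = [ ]
for path in glob.glob( 'parts-of-speech/*.tsv' ):
    document = os.path.basename( path )
    counts   = Counter()
    with open( path, encoding='utf-8' ) as handle:
        for row in csv.reader( handle, delimiter='\t' ):
            if len( row ) < 3 : continue
            lemma, pos = row[ 1 ], row[ 2 ]
            counts[ ( lemma.lower(), pos ) ] += 1
    for ( lemma, pos ), count in counts.items():
        rows.append( [ document, lemma, pos, count ] )

with open( 'tableau.tsv', 'w', newline='', encoding='utf-8' ) as handle:
    writer = csv.writer( handle, delimiter='\t' )
    writer.writerow( [ 'document', 'lemma', 'pos', 'count' ] )
    writer.writerows( rows )

The resulting file contains one row per document/lemma/part-of-speech combination, a shape that drags nicely onto Tableau’s Rows, Columns, and Filters cards.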

Links

[1] Tableau Public – https://public.tableau.com/

Extracting parts-of-speech and named entities with Stanford tools

Thursday, April 26th, 2018

Extracting specific parts-of-speech as well as “named entities”, and then counting & tabulating them can be quite insightful.

Parts-of-speech include nouns, verbs, adjectives, adverbs, etc. Named entities are specific types of nouns, including but not limited to, the names of people, places, organizations, dates, times, money amounts, etc. By creating features out of parts-of-speech and/or named entities, the reader can answer questions such as:

  • What is discussed in this document?
  • What do things do in this document?
  • How are things described, and how might those descriptions be characterized?
  • To what degree is the text male, female, or gender neutral?
  • Who is mentioned in the text?
  • To what places are things referring?
  • What happened in the text?

There are a number of tools enabling the reader to extract parts-of-speech, including the venerable Brill part-of-speech tagger (implemented in a number of programming languages), CLAWS, Apache OpenNLP, and a specific part of the Stanford NLP suite of tools called the Stanford Log-linear Part-Of-Speech Tagger. [1] Named entities can be extracted with the Stanford Named Entity Recognizer (NER). [2] This workshop exploits the Stanford tools.

The Stanford Log-linear Part-Of-Speech Tagger is written in Java, making it a bit difficult for most readers (the author included) to use in the manner it was truly designed to be used. Luckily, the distribution comes with a command-line interface allowing the reader to use the tagger sans any Java programming. Because any part-of-speech or named entity extraction application is the result of a machine learning process, it is necessary to use a previously created computer model. The Stanford tools come with quite a few models from which to choose. The command-line interface also enables the reader to specify different types of output: tagged, XML, tab-delimited, etc. Because of all these options, and because the whole thing uses Java “archives” (read programming libraries or modules), the command-line interface is daunting, to say the least.

After downloading the distribution, the reader ought to be able to change to the bin directory, and execute either one of the following commands:

  • $ stanford-postagger-gui.sh
  • > stanford-postagger-gui.bat

The result will be a little window prompting for a sentence. Upon entering a sentence, tagged output will result. This is a toy interface, but demonstrates things quite nicely.

[screenshot: the part-of-speech tagger GUI]

The full-blown command-line interface is a bit more complicated. From the command line one can do either of the following, depending on the operating system:

  • $ stanford-postagger.sh models/english-left3words-distsim.tagger walden.txt
  • > stanford-postagger.bat models\english-left3words-distsim.tagger walden.txt

The result will be a long stream of tagged sentences, which I find difficult to parse. Instead, I prefer the inline XML output, which is much more difficult to execute but much more readable. Try either:

  • $ java -cp stanford-postagger.jar: edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/english-left3words-distsim.tagger -outputFormat inlineXML -outputFormatOptions lemmatize -textFile walden.txt
  • > java -cp stanford-postagger.jar; edu.stanford.nlp.tagger.maxent.MaxentTagger -model models\english-left3words-distsim.tagger -outputFormat inlineXML -outputFormatOptions lemmatize -textFile walden.txt

In these cases, the result will be a long string of ill-formed XML. With a bit of massaging, this XML is much easier to parse with just about any computer programming language, believe it or not. The tagger can also be run in server mode, which makes batch processing a whole lot easier. The workshop’s distribution comes with a server and a client application for exploiting these capabilities, but, unfortunately, they won’t run on Windows computers unless some sort of Linux shell has been installed. Some people can issue the following command to launch the server from the workshop’s distribution:

$ ./bin/pos-server.sh

The reader can run the client like this:

$ ./bin/pos-client.pl walden.txt

The result will be a well-formed XML file, which can be redirected to a file, processed by another script converting it into a tab-delimited file, and finally saved to a second file for reading by a spreadsheet, database, or data analysis tool:

$ ./bin/pos-client.pl walden.txt > walden.pos; ./bin/pos2tab.pl walden.pos > walden.tsv

For the purposes of this workshop, the whole of the harvested data has been pre-processed with the Stanford Log-linear Part-Of-Speech Tagger. The result has been mirrored in the parts-of-speech folder/directory. The reader can open the files in the parts-of-speech folder/directory for analysis. For example, you might open them in OpenRefine and try to see what verbs appear most frequently in a given document. My guess is the answer will be the lemmas “be” or “have”. The next set of most frequently used verb lemmas will probably be more indicative of the text.
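The same sort of tally can also be scripted. Below is a minimal sketch; the file name and the three-column (token, lemma, part-of-speech) layout are assumptions, so season to taste:

# tally the most frequent verb lemmas in one pre-processed, tab-delimited file
import csv
from collections import Counter

counts = Counter()
with open( 'parts-of-speech/walden.tsv', encoding='utf-8' ) as handle:
    for row in csv.reader( handle, delimiter='\t' ):
        if len( row ) < 3 : continue
        lemma, pos = row[ 1 ], row[ 2 ]
        # Penn Treebank verb tags all begin with "VB"
        if pos.startswith( 'VB' ) : counts[ lemma.lower() ] += 1

print( counts.most_common( 10 ) )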

The process of extracting named entities with the Stanford NER is very similar. The original Stanford NER distribution comes with a number of jar files, models, and configuration/parameter files. After downloading the distribution, the reader can run a little GUI application, import some text, and run the NER. The result will look something like this:

[screenshot: the named entity recognizer GUI]

The simple command-line interface takes a single file as input, and it outputs a stream of tagged sentences. For example:

  • $ ner.sh walden.txt
  • > ner.bat walden.txt

Each tag denotes an entity (i.e. the name of a person, the name of a place, the name of an organization, etc.). Like the result of all machine learning algorithms, the tags are not necessarily correct, but upon closer examination, most of them are pretty close. Like the POS Tagger, this workshop’s distribution comes with a set of scripts/programs that can make the Stanford NER tool locally available as a server. It also comes with a simple client to query the server. Like the workshop’s POS tool, the reader (with a Macintosh or Linux computer) can extract named entities all in two goes:

$ ./bin/pos-server.sh
$ ./bin/pos-client.pl walden.txt > walden.ner; ./bin/pos2tab.pl walden.ner > walden.tsv

Like the workshop’s pre-processed part-of-speech files, the workshop’s corpus has been pre-processed with the NER tool. The pre-processed files ought to be in a folder/directory named… named-entities. And like the parts-of-speech files, the “ner” files are tab-delimited text files readable by spreadsheets, databases, OpenRefine, etc. For example, you might open one of them in OpenRefine and see what names of people trend in a given text. Try to create a list of places (which is not always easy), export them to a file, and open them with Tableau Public for the purposes of making a geographic map.

Extracting parts-of-speech and named entities straddles simple text mining and natural language processing. Simple text mining is often about counting & tabulating features (words) in a text. These features have little context sans proximity to other features. On the other hand, parts-of-speech and named entities denote specific types of things, namely specific types of nouns, verbs, adjectives, etc. While these things do not necessarily denote meaning, they do provide more context than simple features. Extracting parts-of-speech and named entities is (more or less) an easy text mining task with more benefit than cost. Extracting parts-of-speech and named entities goes beyond the basics.

Links

[1] Stanford Log-linear Part-Of-Speech Tagger – https://nlp.stanford.edu/software/tagger.shtml
[2] Stanford Named Entity Recognizer (NER) – https://nlp.stanford.edu/software/CRF-NER.shtml

Creating a plain text version of a corpus with Tika

Wednesday, April 25th, 2018

It is imperative to create plain text versions of corpus items.

Text mining cannot be done without plain text data. This means HTML files need to be rid of markup. It means PDF files need to have been “born digitally” or to have been processed with optical character recognition (OCR), and then the underlying text needs to be extracted. Word processor files need to be converted to plain text, and the result saved accordingly. The days of plain ol’ ASCII text files need to be forgotten. Instead, the reader needs to embrace Unicode, and whenever possible, make sure characters in the text files are encoded as UTF-8. With UTF-8 encoding, one gets all of the nice accent marks so foreign to United States English, but one also gets all of the pretty emoticons increasingly sprinkling our day-to-day digital communications. Moreover, the data needs to be as “clean” as possible. When it comes to OCR, do not fret too much. Given the large amounts of data the reader will process, “bad” OCR (OCR with less than 85% accuracy) can still be quite effective.
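Re-encoding is usually painless. Here is a minimal sketch that reads a file as UTF-8, falls back to Latin-1 if that fails (an assumption; the source encoding may be something else entirely), and writes out a UTF-8 copy; the file names are stand-ins:

# read a file, guess between UTF-8 and Latin-1, and save a UTF-8 copy
def to_utf8( source, destination ):
    try:
        text = open( source, encoding='utf-8' ).read()
    except UnicodeDecodeError:
        text = open( source, encoding='latin-1' ).read()
    with open( destination, 'w', encoding='utf-8' ) as handle:
        handle.write( text )

to_utf8( 'input.txt', 'output.txt' )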

Converting harvested data into plain text used to be laborious as well as painful, but then a Java application called Apache Tika came on the scene. [1] Tika comes in two flavors: application and server. The application version can take a single file as input, and it can output metadata as well as any underlying text. The application can also work in batch mode taking a directory as input and saving the results to a second directory. Tika’s server version is much more expressive, more powerful, and very HTTP-like, but it requires more “under the hood” knowledge to exploit to its fullest potential.

For the sake of this workshop, versions of the Tika application and Tika server are included in the distribution, and they have been saved in the lib directory with the names tika-desktop.jar and tika-server.jar. The reader can run the desktop/GUI version of the Tika application by merely double-clicking on it. The result will be a dialog box.

[screenshots: the Tika dialog box and its result]

Drag a PDF (or just about any) file on to the window, and Tika will extract the underlying text. To use the command-line interface, something like this could be run to output the help text:

  • $ java -jar ./lib/tika-desktop.jar --help
  • > java -jar .\lib\tika-desktop.jar --help

And then something like these commands to process a single file or a whole directory of files:

  • $ java -jar ./lib/tika-desktop.jar -t <filename>
  • $ java -jar ./lib/tika-desktop.jar -t -i <input directory> -o <output directory>
  • > java -jar .\lib\tika-desktop.jar -t <filename>
  • > java -jar .\lib\tika-desktop.jar -t -i <input directory> -o <output directory>

Try transforming a few files individually as well as in batch. What does the output look like? To what degree is it readable? To what degree has the formatting been lost? Text mining does not take formatting into account, so there is no huge loss in this regard.

Without some sort of scripting, the use of Tika to convert harvested data into plain text can still be tedious. Consequently, the whole of the workshop’s harvested data has been pre-processed with a set of Perl and bash scripts (which probably won’t work on Windows computers unless some sort of Linux shell has been installed):

  • $ ./bin/tika-server.sh – runs Tika in server mode on TCP port 8080, and waits patiently for incoming connections
  • $ ./bin/tika-client.pl – takes a file as input, sends it to the server, and returns the plain text while handling the HTTP magic in the middle (see the sketch after this list)
  • $ ./bin/file2txt.sh – a front-end to the second script taking a file and directory name as input, transforming the file into plain text, and saving the result with the same name but in the given directory and with a .txt extension
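For what it is worth, the client’s role can also be played by a few lines of Python. The sketch below assumes the Tika server is already running and listening on localhost port 8080 (as per tika-server.sh) and that the third-party requests module is installed:

# send a file to a running Tika server, and print the extracted plain text
import requests, sys

file = sys.argv[ 1 ]
with open( file, 'rb' ) as handle:
    response = requests.put( 'http://localhost:8080/tika', data=handle, headers={ 'Accept': 'text/plain' } )
print( response.text )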

The entirety of the harvested data has been transformed into plain text for the purposes of this workshop. (“Well, almost all.”) The result has been saved in the folder/directory named “corpus”. Peruse the corpus directory. Compare & contrast its contents with the contents of the harvest directory. Can you find any omissions, and if so, then can you guess why/how they occurred?

Links

[1] Apache Tika – https://tika.apache.org/

Identifying themes and clustering documents using MALLET

Wednesday, April 25th, 2018

Topic modeling is an unsupervised machine learning process. It is used to create clusters (read “subsets”) of documents, and each cluster is characterized by sets of one or more words. Topic modeling is good at answering questions like, “If I were to describe this collection of documents in a single word, then what might that word be? How about two?” or making statements like, “Once I identify clusters of documents of interest, allow me to read/analyze those documents in greater detail.” Topic modeling can also be used for keyword (“subject”) assignment; topics can be identified and then documents can be indexed using those terms. In order for a topic modeling process to work, a set of documents first needs to be assembled. The topic modeler then, at the very least, takes an integer as input, which denotes the number of topics desired. All other possible inputs can be assumed, such as the use of a stop word list or the number of times the topic modeler ought to iterate internally before it “thinks” it has come to the best conclusion.

MALLET is the grand daddy of topic modeling tools, and it supports other functions such as text classification and parsing. [1] It is essentially a set of Java-based libraries/modules designed to be incorporated into Java programs or executed from the command line.

A subset of MALLET’s functionality has been implemented in a program called topic-modeling-tool, and the tool bills itself as “A GUI for MALLET’s implementation of LDA.” [2] Topic-modeling-tool provides an easy way to read what possible themes exist in a set of documents or how the documents might be classified. It does this by creating topics, displaying the results, and saving the data used to create the results for future use. Here’s one way:

  1. Create a set of plain text files, and save them in a single directory.
  2. Run/launch topic-modeling-tool.
  3. Specify where the set of plain text files exist.
  4. Specify where the output will be saved.
  5. Denote the number of topics desired.
  6. Execute the command with “Learn Topics”.

The result will be a set of HTML, CSS, and CSV files saved in the output location. The “answer” can also be read in the tool’s console.

A more specific example is in order. Here’s how to answer the question, “If I were to describe this corpus in a single word, then what might that one word be?”:

  1. Repeat Steps #1-#4, above.
  2. Specify a single topic to be calculated.
  3. Press “Optional Settings…”.
  4. Specify “1” as the number of topic words to print.
  5. Press okay.
  6. Execute the command with “Learn Topics”.
[screenshots: the Topic Modeling Tool and the resulting topics]

What one word can be used to describe your collection?

Iterate the modeling process by slowly increasing the number of desired topics and number of topic words. Personally, I find it interesting to implement a matrix of topics to words. For example, start with one topic and one word. Next, denote two topics with two words. Third, specify three topics with three words. Continue the process until the sets of words (“topics”) seem to make intuitive sense. After a while you may observe clear semantic distinctions between each topic as well as commonalities between each of the topic words. Distinctions and commonalities may include genders, places, names, themes, numbers, OCR “mistakes”, etc.

Links

Introduction to the NLTK

Wednesday, April 25th, 2018

The venerable Python Natural Language Toolkit (NLTK) is well worth the time of anybody who wants to do text mining more programmatically. [0]

For much of my career, Perl has been the language of choice when it came to processing text, but in the recent past it seems to have fallen out of favor. I really don’t know why. Maybe it is because so many other computer languages have come into existence in the past couple of decades: Java, PHP, Python, R, Ruby, Javascript, etc. Perl is more than capable of doing the necessary work. Perl is well-supported, and there are a myriad of supporting tools/libraries for interfacing with databases, indexers, TCP networks, data structures, etc. On the other hand, few people are being introduced to Perl; people are being introduced to Python and R instead. Consequently, the Perl community is shrinking, and the communities for other languages are growing. Writing something in a “dead” language is not very intelligent, though that may be over-stating the case. On the other hand, I’m not going to be able to communicate with very many people if I speak Latin and everybody else is speaking French, Spanish, or German. It behooves the reader to write software in a language apropos to the task as well as a language used by many others.

Python is a good choice for text mining and natural language processing. The Python NLTK provides functionality akin to much of what has been outlined in this workshop, but it goes much further. More specifically, it interfaces with WordNet, a sort of thesaurus on steroids. It interfaces with MALLET, the Java-based classification & topic modeling tool. It is very well-supported and continues to be maintained. Moreover, Python is mature in & of itself. There are a host of Python “distributions/frameworks”. There are any number of supporting libraries/modules for interfacing with the Web, databases & indexes, the local file system, etc. Even more importantly for text mining (and natural language processing) techniques, Python is supported by a set of robust machine learning libraries/modules called scikit-learn. If the reader wants to write text mining or natural language processing applications, then Python is really the way to go.

In the etc directory of this workshop’s distribution is a “Jupyter Notebook” named “An introduction to the NLTK.ipynb”. [1] Notebooks are a sort of interactive Python interface. After installing Jupyter, the reader ought to be able to run the Notebook. This specific Notebook introduces the use of the NLTK. It walks you through the processes of reading a plain text file; parsing the file into words (“features”); normalizing the words; counting & tabulating the results; graphically illustrating the results; and finding co-occurring words, words with similar meanings, and words in context. It also dabbles a bit into parts-of-speech and named entity extraction.

[screenshot: the Notebook]

The heart of the Notebook’s code follows. Given a sane Python installation, one can run this program by saving it with a name like introduction.py, saving a file named walden.txt in the same directory, changing to the given directory, and then running the following command:

python introduction.py

The result ought to be a number of textual outputs in the terminal window as well as a few graphics.

Errors may occur, probably because other Python libraries/modules have not been installed. Follow the error messages’ instructions, and try again. Remember, “Your mileage may vary.”

# configure; using an absolute path, define the location of a plain text file for analysis
FILE = 'walden.txt'

# import / require the use of the Toolkit
from nltk import *

# slurp up the given file; display the result
handle = open( FILE, 'r')
data   = handle.read()
print( data )

# tokenize the data into features (words); display them
features = word_tokenize( data )
print( features )

# normalize the features to lower case and exclude punctuation
features = [ feature for feature in features if feature.isalpha() ]
features = [ feature.lower() for feature in features ]
print( features )

# create a list of (English) stopwords, and then remove them from the features
from nltk.corpus import stopwords
stopwords = stopwords.words( 'english' )
features  = [ feature for feature in features if feature not in stopwords ]

# count & tabulate the features, and then plot the results -- season to taste
frequencies = FreqDist( features )
plot = frequencies.plot( 10 )

# create a list of unique words (hapaxes); display them
hapaxes = frequencies.hapaxes()
print( hapaxes )

# count & tabulate ngrams from the features -- season to taste; display some
ngrams      = ngrams( features, 2 )
frequencies = FreqDist( ngrams )
print( frequencies.most_common( 10 ) )

# create a list of each token's length, and plot the result; How many "long" words are there?
lengths = [ len( feature ) for feature in features ]
plot    = FreqDist( lengths ).plot( 10 )

# initialize a stemmer, stem the features, count & tabulate, and output
from nltk.stem import PorterStemmer
stemmer     = PorterStemmer()
stems       = [ stemmer.stem( feature ) for feature in features ]
frequencies = FreqDist( stems )
print( frequencies.most_common( 10 ) )

# re-create the features and create a NLTK Text object, so other cool things can be done
features = word_tokenize( data )
text     = Text( features )

# count & tabulate, again; list a given word -- season to taste
frequencies = FreqDist( text )
print( frequencies[ 'love' ] )

# do keyword-in-context searching against the text (concordancing)
text.concordance( 'love' )

# create a dispersion plot of given words
plot = text.dispersion_plot( [ 'love', 'war', 'man', 'god' ] )

# output the "most significant" bigrams, considering surrounding words (size of window) -- season to taste
text.collocations( num=10, window_size=4 )

# given a set of words, what words are nearby
text.common_contexts( [ 'love', 'war', 'man', 'god' ] )

# list the words (features) most associated with the given word
text.similar( 'love' )

# create a list of sentences, and display one -- season to taste
sentences = sent_tokenize( data )
sentence  = sentences[ 14 ]
print( sentence )

# tokenize the sentence and parse it into parts-of-speech, all in one go
sentence = pos_tag( word_tokenize( sentence ) )
print( sentence )

# extract named entities from a sentence, and print the results
entities = ne_chunk( sentence )
print( entities )

# done
quit()

Links

Using Voyant Tools to do some “distant reading”

Tuesday, April 24th, 2018

Voyant Tools is often the first go-to tool used by either: 1) new students of text mining and the digital humanities, or 2) people who know what kind of visualization they need/want. [1] Voyant Tools is also one of the longest supported tools described in this bootcamp.

As stated in the Tool’s documentation: “Voyant Tools is a web-based text reading and analysis environment. It is a scholarly project that is designed to facilitate reading and interpretive practices for digital humanities students and scholars as well as for the general public.” To that end it offers a myriad of visualizations and tabular reports characterizing a given text or texts. Voyant Tools works quite well, but like most things, the best use comes with practice, a knowledge of the interface, and an understanding of what the reader wants to express. To all these ends, Voyant Tools counts & tabulates the frequencies of words, plots the results in a number of useful ways, supports topic modeling, and enables the comparison of documents across a corpus. Examples include but are not limited to: word clouds, dispersion plots, networked analysis, “stream graphs”, etc.

[example visualizations: dispersion chart; network diagram; “stream” chart; word cloud; concordance; topic modeling]

Voyant Tools’ initial interface consists of six panes. Each pane encloses a feature/function of Voyant. In the author’s experience, Voyant Tools is better experienced by first expanding one of the panes to a new window (“Export a URL”), and then deliberately selecting one of the tools from the “window” icon in the upper left-hand corner. There will then be displayed a set of about two dozen tools for use against a document or corpus.

[screenshots: initial layout; focused layout]

Using Voyant Tools the reader can easily ask and answer the following sorts of questions:

  • What words or phrases appear frequently in this text?
  • How do those words trend throughout the given text?
  • What words are used in context with a given word?
  • If the text were divided into T topics, then what might those topics be?
  • Visually speaking, how do given texts or sets of text cluster together?

After a more thorough examination of the reader’s corpus, and after making the implicit more explicit, Voyant Tools can be more informative. Randomly clicking through its interface is usually daunting to the novice. While Voyant Tools is easy to use, it requires a combination of text mining knowledge and practice in order to be used effectively. Only then will useful “distant” reading be done.

[1] Voyant Tools – https://voyant-tools.org/

Using a concordance (AntConc) to facilitate searching keywords in context

Monday, April 23rd, 2018

A concordance is one of the oldest of text mining tools, dating back to at least the 13th century, when concordances were used to analyze and “read” religious texts. Stated in modern-day terms, concordances are key-word-in-context (KWIC) search engines. Given a text and a query, concordances search for the query in the text, and return both the query as well as the words surrounding the query. For example, a query for the word “pond” in a book called Walden may return something like the following:

  1.    the shore of Walden Pond, in Concord, Massachuset
  2.   e in going to Walden Pond was not to live cheaply 
  3.    thought that Walden Pond would be a good place fo
  4.    retires to solitary ponds to spend it. Thus also 
  5.    the woods by Walden Pond, nearest to where I inte
  6.    I looked out on the pond, and a small open field 
  7.   g up. The ice in the pond was not yet dissolved, t
  8.   e whole to soak in a pond-hole in order to swell t
  9.   oping about over the pond and cackling as if lost,
  10.  nd removed it to the pond-side by small cartloads,
  11.  up the hill from the pond in my arms. I built the 

The use of a concordance enables the reader to learn the frequency of the given query as well as how it is used within a text (or corpus).

Digital concordances offer a wide range of additional features. For example, queries can be phrases or regular expressions. Search results can be sorted by the words on the left or on the right of the query. Queries can be clustered by the proximity of their surrounding words, and the results can be sorted accordingly. Queries and their nearby terms can be scored not only by their frequencies but also by the probability of their existence. Concordances can calculate the position of a query in a text and illustrate the result in the form of a dispersion plot or histogram.
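Under the hood, keyword-in-context searching is little more than slicing windows of text around each match. Here is a minimal Python sketch; the file name, query, and window width are arbitrary choices:

# a bare-bones keyword-in-context (KWIC) search
import re

text  = open( 'walden.txt', encoding='utf-8' ).read().replace( '\n', ' ' )
query = 'pond'
width = 30
for match in re.finditer( query, text, flags=re.IGNORECASE ):
    left  = text[ max( 0, match.start() - width ) : match.start() ]
    right = text[ match.end() : match.end() + width ]
    print( f'{left:>{width}} {match.group()} {right}' )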

[screenshot: AntConc]

AntConc is a free, cross-platform concordance program that does all of the things listed above, as well as a few others. [1] The interface is not as polished as some other desktop applications, and sometimes the usability can be frustrating. On the other hand, given practice, the use of AntConc can be quite illuminating. After downloading and running AntConc, give these tasks a whirl:

  • use the File menu to open a single file
  • use the Word List tab to list token (word) frequencies
  • use the Settings/Tool Preferences/Word List Category to denote a set of stop words
  • use the Word List tab to regenerate word frequencies
  • select a word of interest from the frequency list to display the KWIC; sort the result
  • use the Concordance Plot tab to display the dispersion plot
  • select the Collocates tab to see what words are near the selected word
  • sort the collocates by frequency and/or word; use the result to view the concordance

The use of a concordance is often done just after the creation of a corpus. (Remember, a corpus can include one or more text files.) But the use of a concordance is much more fruitful and illuminating if the features of a corpus are previously made explicit. Concordances know nothing about parts-of-speech nor grammar. Thus they have little information about the words they are analyzing. To a concordance, every word is merely a token — the tiniest bit of data. Features, on the other hand, are more akin to information because they have value. It is better to be aware of the information at your disposal as opposed to simple data. Do not rush to the use of a concordance before you have some information at hand.

[1] AntConc – http://www.laurenceanthony.net/software/antconc/

Word clouds with Wordle

Sunday, April 22nd, 2018

A word cloud, sometimes called a “tag cloud”, is a fun, easy, and popular way to visualize the characteristics of a text. Usually used to illustrate the frequency of words in a text, a word cloud makes some features (“words”) bigger than others, sometimes colorizes the features, and amasses the result in a sort of “bag of words” fashion.

Many people disparage the use of word clouds. This is probably because word clouds have been overused, the characteristics they illustrate are sometimes sophomoric, or too much value has been given to their meaning. Despite these facts, a word cloud is an excellent way to initialize the analysis of texts.

There are many word cloud applications and programming libraries, but Wordle is probably the easiest to use as well as the most popular. † [1] To get started, use your Web browser and go to the Wordle site. Click the Create tab and type some text into the resulting text box. Submit the form. Your browser may ask for permission to run a Java application, and if granted, the result ought to be a simple word cloud. The next step is to play with Wordle’s customizations: fonts, colors, layout, etc. To begin doing useful analysis, open a file from the workshop’s corpus, and copy/paste it into Wordle. What does the result tell you? Copy/paste a different file into Wordle and then compare/contrast the two word clouds.

By default, Wordle makes an effort to normalize the input. It removes stop words, lower-cases letters, removes numbers, etc. Wordle then counts & tabulates the frequencies of each word to create the visualization. But the frequency of words only tells one part of a text’s story. There are other measures of interest. For example, the reader might want to create a word cloud of ngram frequencies, the frequencies of parts-of-speech, or even the log-likelihood scores of significant words. To create these sorts of visualizations as word clouds, the reader must first create a colon-delimited list of features/scores, and then submit them under Wordle’s Advanced tab. The challenging part of this process is creating the list of features/scores, and the process can be done using a combination of the tools described in the balance of the workshop.
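As a trivial illustration, the following sketch writes colon-delimited word/frequency pairs suitable for pasting into the Advanced tab; the file names are stand-ins, and the scores could just as easily be ngram counts or log-likelihood values:

# write colon-delimited "feature:score" pairs for Wordle's Advanced tab
from collections import Counter

words       = open( 'walden.txt', encoding='utf-8' ).read().lower().split()
frequencies = Counter( word for word in words if word.isalpha() )
with open( 'wordle.txt', 'w', encoding='utf-8' ) as handle:
    for word, score in frequencies.most_common( 100 ):
        handle.write( f'{word}:{score}\n' )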

† Since Wordle is a Web-based Java application, it is also a good test case to see whether or not Java is installed and configured on your desktop computer.

[1] Wordle – http://www.wordle.net

An introduction to the NLTK: A Jupyter Notebook

Thursday, April 12th, 2018

The attached file introduces the reader to the Python Natural Language Toolkit (NLTK).

The Python NLTK is a set of modules and corpora enabling the reader to do natural language processing against corpora of one or more texts. It goes beyond text mining and provides tools to do machine learning, but this Notebook barely scratches that surface.

This is my first Python Jupyter Notebook. As such I’m sure there will be errors in implementation, style, and functionality. For example, the Notebook may fail because the value of FILE is too operating system dependent, or the given file does not exist. Other failures may/will include the lack of additional modules. In these cases, simply read the error messages and follow the instructions. “Your mileage may vary.”

That said, through the use of this Notebook, the reader ought to be able to get a flavor for what the Toolkit can do without the need to completely understand the Python language.

What is text mining, and why should I care?

Wednesday, March 28th, 2018

[This is the first of a number of postings on the topic of text mining. More specifically, this is the first draft of an introductory section of a hands-on bootcamp scheduled for ELAG 2018. As I write the bootcamp’s workbook, I hope to post things here. Your comments are most welcome. –ELM]

Text mining is a process used to identify, enumerate, and analyze syntactic and semantic characteristics of a corpus, where a corpus is a collection of documents usually in the form of plain text files. The purpose of this process is to bring to light previously unknown facts, look for patterns & anomalies in the facts, and ultimately have a better understanding of the corpus as a whole.

The simplest of text mining processes merely count & tabulate a document’s “tokens” (usually words but sometimes syllables). The counts & tabulations are akin to the measurements and observations made in the physical and social sciences. Statistical methods can then be applied to the observations for the purposes of answering questions such as:

  • What is the average length of documents in the collection, and do they exhibit a normal distribution?
  • What are the most common words/phrases in a document?
  • What are the most common words/phrases in a corpus?
  • What are the unique words/phrases in a document?
  • What are the infrequent words/phrases in a corpus?
  • What words/phrases exist in every document and to what extent?
  • Where do given words/phrases appear in a text?
  • What other words surround a given word/phrase?
  • What words/phrases are truly representative of a document or corpus?
  • If a document or corpus were to be described in a single word, then what would that word be? How about described in three words? How about describing a document with three topics where each topic is denoted with five words?

The answers to these questions bring to light a corpus’s previously unknown features, enabling the reader to use & understand a corpus more fully. Given the answers to these sorts of questions, a person can learn when Don Quixote actually tilts at windmills, to what degree Thoreau’s Walden uses the word “ice” in the same breath as “pond”, or how the definition of “scientific practice” has evolved over time.

Given models created from the results of natural language processing, other characteristics (sentences, parts-of-speech, named entities, etc.) can be parsed. These values can also be counted & tabulated enabling the reader to answer new sets of questions:

  • How difficult is a document to read?
  • What is being discussed in a corpus? To what degree are those things the names of people, organizations, places, dates, money amounts, etc.? What percentage of the personal pronouns are male, female, or neutral?
  • What is the action in a corpus? What things happen in a document? Are things explained? Said? Measured?
  • How are things in the corpus described? Overall, are the connotations positive or negative? Do the connotations evolve within a document?

The documents in a corpus are often associated with metadata such as authors, titles, dates, subjects/keywords, numeric rankings, etc. This metadata can be combined with measurements & observations to answer questions like:

  • How have the use of specific words/phrases waxed & waned over time?
  • To what degree do authors write about a given concept?
  • What are the significant words/phrases associated with a given genre?
  • Are there correlations between words/phrases and a given document’s usefulness score?

Again, text mining is a process, and the process usually includes the following steps:

  1. Articulating a research question
  2. Amassing a corpus to study
  3. Coercing the corpus into a form amenable to computer processing
  4. Taking measurements and making observations
  5. Analyzing the results and drawing conclusions

Articulating a research question can be as informally stated as, “I’d like to know more about this corpus” or “I’d like to garner an overview of the corpus before I begin reading it in earnest.” On the other hand, articulating a research question can be as formal as a dissertation’s thesis statement. The purpose of articulating a research question — no matter how formal — is to give you a context for your investigations. Knowing a set of questions to answer helps you determine what tools you will employ in your inquiries.

Creating a corpus is not always as easy as you might think. The corpus can be as small as a single document, or as large as millions. The “documents” in the corpus can be anything from tweets from a Twitter feed, Facebook postings, survey comments, magazine or journal articles, reference manuals, books, screen plays, musical lyrics, etc. The original documents may have been born digital or not. If not, then they will need to be digitized in one way or another. It is better if each item in the corpus is associated with metadata, such as authors, titles, dates, keywords, etc. Actually obtaining the documents may be impeded by copyrights, licensing restrictions, or hardware limitations. Once the corpus is obtained, it is useful to organize it into a coherent whole. There are a lot of possibilities when it comes to corpus creation.

Coercing a corpus into a form amenable to computer processing is a chore in and of itself. In all cases, the document’s text needs to be in “plain” text. This means the document includes only characters, numbers, punctuation marks, and a limited number of symbols. Plain text files include no graphical formatting. No bold, no italics, no “codes” denoting larger or smaller fonts, etc. Documents are usually manifested as files on a computer’s file system. The files are usually brought together as lists, and each item in the list has many attributes — the metadata describing each item. Furthermore, each document may need to be normalized, and normalization may include changing the case of all letters to lower case, parsing the document into words (usually called “features”), identifying the lemmas or stems of a word, eliminating stop/function words, etc. Coercing your corpus into a coherent whole is not to be underestimated. Remember the old adage, “Garbage in, garbage out.”

Ironically, taking measurements and making observations is the easy part. There are a myriad of tools for this purpose, and the bulk of this workshop describes how to use them. One important note: it is imperative to format the measurements and observations in a way amenable to analysis. This usually means a tabular format where each column denotes a different observable characteristic. Without formatting measurements and observations in tabular formats, it will be difficult to chart and graph any results.

Analyzing the results and drawing conclusions is the subprocess of looping back to Step #1. It is where you attempt to actually answer the questions previously asked. Keep in mind that human interpretation is a necessary part of this subprocess. Text mining does not present you with truth, only facts. It is up to you to interpret the facts. For example, suppose the month is January and the thermometer outside reads 32º Fahrenheit (0º Centigrade), then you might think nothing is amiss. On the other hand, suppose the month is August, and the thermometer still reads 32º, then what might you think? “It is really cold,” or maybe, “The thermometer is broken.” Either way, you bring context to the observations and interpret things accordingly. Text mining analysis works in exactly the same way.

Finally, text mining is not a replacement for the process of traditional reading. Instead, it ought to be considered as complementary, supplemental, and a natural progression of traditional reading. With the advent of ubiquitous globally networked computers, the amount of available data and information continues to grow exponentially. Text mining provides a means to “read” massive amounts of text quickly and easily. The process is akin to the inclusion of now-standard parts of a scholarly book: title page, verso complete with bibliographic and provenance metadata, table of contents, preface, introduction, divisions into sections and chapters, footnotes, bibliography, and a back-of-the-book index. All of these features make a book’s content more accessible. Text mining processes applied to books are the next step in accessibility. Text mining is often described as “distant” or “scalable” reading, and it is often contrasted with “close” reading. This is a false dichotomy, but only after text mining becomes more the norm will the dichotomy fade.

All that said, the totality of this hands-on workshop is based on the following outline:

  1. What is text mining, and why should I care?
  2. Creating a corpus
  3. Creating a plain text version of a corpus with Tika
  4. Using Voyant Tools to do some “distant” reading
  5. Using a concordance, like AntConc, to facilitate searching keywords in context
  6. Creating a simple word list with a text editor
  7. Cleaning & analyzing word lists with OpenRefine
  8. Charting & graphing word lists with Tableau Public
  9. Increasing meaning by extracting parts-of-speech with the Stanford POS Tagger
  10. Increasing meaning by extracting named entities with the Stanford NER
  11. Identifying themes and clustering documents using MALLET

By the end of the workshop you will have increased your ability to:

  • identify patterns, anomalies, and trends in a corpus
  • practice both “distant” and “scalable” reading
  • enhance & complement your ability to do “close” reading
  • use & understand any corpus of poetry or prose

The workshop is operating system agnostic, and all the software is freely available on the ‘Net, or already installed on your computer. Active participation requires zero programming, but readers must bring their own computer, and they must be willing to learn how to use a text editor such as NotePad++ or BBEdit. NotePad, WordPad and TextEdit are totally insufficient.

How to do text mining in 69 words

Tuesday, August 15th, 2017

Doing just about any type of text mining is a matter of: 0) articulating a research question, 1) acquiring a corpus, 2) cleaning the corpus, 3) coercing the corpus into a data structure one’s software can understand, 4) counting & tabulating characteristics of the corpus, and 5) evaluating the results of Step #4. Everybody wants to do Step #4 & #5, but the initial steps usually take more time than desired.


Some automated analysis of Richard Baxter’s works

Saturday, June 13th, 2015


This page describes a corpus named baxter. It is a programmatically generated report against the full text of all the writings of Richard Baxter (an English Puritan church leader, poet, and hymn-writer) as found in Early English Books Online. It was created using a (fledgling) tool called the EEBO Workset Browser.

General statistics

An analysis of the corpus’s metadata provides an overview of what and how many things it contains, when things were published, and the sizes of its items:

  • Number of items – 140
  • Publication date range – 1650 to 1697 (histogram : boxplot)
  • Sizes in pages – 1 to 1258 (histogram : boxplot)
  • Total number of pages – 33507
  • Average number of pages per item – 239

Possible correlations between numeric characteristics of records in the catalog can be illustrated through a matrix of scatter plots. As you would expect, there is almost always a correlation between the number of pages and the number of words. Do others exist? For more detail, browse the catalog.

Notes on word usage

By counting and tabulating the words in each item of the corpus, it is possible to measure additional characteristics:

Perusing the list of all words in the corpus (and their frequencies) as well as all unique words can prove to be quite insightful. Are there one or more words in these lists connoting an idea of interest to you, and if so, then to what degree do these words occur in the corpus?

To begin to see how words of your choosing occur in specific items, search the collection.

Through the creation of locally defined “dictionaries” or “lexicons”, it is possible to count and tabulate how specific sets of words are used across a corpus. This particular corpus employs three such dictionaries — sets of: 1) “big” names, 2) “great” ideas, and 3) colors. Their frequencies are listed below:

The distribution of words (histograms and boxplots) and the frequency of words (wordclouds), and how these frequencies “cluster” together can be illustrated:

Items of interest

Based on the information above, the following items (and their associated links) are of possible interest:

  • Shortest item (1 p.) – Short instructions for the sick: Especially who by contagion, or otherwise, are deprived of the presence of a faithfull pastor. / By Richard Baxter. (TEI : HTML : plain text)
  • Longest item (1258 p.) – A Christian directory, or, A summ of practical theologie and cases of conscience directing Christians how to use their knowledge and faith, how to improve all helps and means, and to perform all duties, how to overcome temptations, and to escape or mortifie every sin : in four parts … / by Richard Baxter. (TEI : HTML : plain text)
  • Oldest item (1650) – The saints everlasting rest, or, A treatise of the blessed state of the saints in their enjoyment of God in glory wherein is shewed its excellency and certainty, the misery of those that lose it, the way to attain it, and assurance of it, and how to live in the continual delightful forecasts of it and now published by Richard Baxter … (TEI : HTML : plain text)
  • Most recent (1697) – Mr. Richard Baxter’s last legacy in select admonitions and directions to all sober dissenters. (TEI : HTML : plain text)
  • Most thoughtful item – Short instructions for the sick: Especially who by contagion, or otherwise, are deprived of the presence of a faithfull pastor. / By Richard Baxter. (TEI : HTML : plain text)
  • Least thoughtful item – Dattodiad y qwestiwn mawr, beth sydd raid i ni ei wneuthur fel y byddom gadwedig. Athrawiaethau i fuchedd sanctaidd. / O waith y disinydd parchedig Mr. Richard Baxter. (TEI : HTML : plain text)
  • Biggest name dropper – R. Baxter’s sence of the subscribed articles of religion (TEI : HTML : plain text)
  • Fewest quotations – Additions to the poetical fragments of Rich. Baxter written for himself and communicated to such as are more for serious verse than smooth. (TEI : HTML : plain text)
  • Most colorful – The certainty of the worlds of spirits and, consequently, of the immortality of souls of the malice and misery of the devils and the damned : and of the blessedness of the justified, fully evinced by the unquestionable histories of apparitions, operations, witchcrafts, voices &c. / written, as an addition to many other treatises for the conviction of Sadduces and infidels, by Richard Baxter. (TEI : HTML : plain text)
  • Ugliest – Richard Baxter his account to his dearly beloved, the inhabitants of Kidderminster, of the causes of his being forbidden by the Bishop of Worcester to preach within his diocess with the Bishop of Worcester’s letter in answer thereunto : and some short animadversions upon the said bishops letter. (TEI : HTML : plain text)

Marrying close and distant reading: A THATCamp project

Sunday, April 12th, 2015

The purpose of this page is to explore and demonstrate some of the possibilities of marrying close and distant reading. By combining both of these processes, there is hope that greater comprehension and understanding of a corpus can be gained when compared to using close or distant reading alone. (This text might also be republished at http://dh.crc.nd.edu/sandbox/thatcamp-2015/ as well as http://nd2015.thatcamp.org/2015/04/07/close-and-distant/.)

To give this exploration a go, two texts are being used to form a corpus: 1) Machiavelli’s The Prince and 2) Emerson’s Representative Men. Both texts were printed and bound into a single book (codex). The book is intended to be read in the traditional manner, and the layout includes extra wide margins allowing the reader to liberally write/draw in the margins. While the glue was drying on the book, the plain text versions of the texts were evaluated using a number of rudimentary text mining techniques, with the results made available here. Both the traditional reading as well as the text mining are aimed towards answering a few questions. How do both Machiavelli and Emerson define a “great” man? What characteristics do “great” men have? What sorts of things have “great” men accomplished?

Comparison

The Prince vs. Representative Men:

  • Author – Niccolò di Bernardo dei Machiavelli (1469 – 1527) vs. Ralph Waldo Emerson (1803 – 1882)
  • Title – The Prince vs. Representative Men
  • Date – 1532 vs. 1850
  • Fulltext – plain text, HTML, PDF, TEI/XML (both)
  • Length – 31,179 words vs. 59,600 words
  • Fog score – 23.1 vs. 14.6
  • Flesch score – 33.5 vs. 52.9
  • Kincaid score – 19.7 vs. 11.5
  • Frequencies – unigrams, bigrams, trigrams, quadgrams, quintgrams (both)
  • Parts-of-speech – nouns, pronouns, adjectives, verbs, adverbs (both)
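For reference, the Fog, Flesch, and Kincaid numbers above are standard readability measures. The sketch below computes them from scratch using the textbook formulas; the syllable counter is a crude vowel-group heuristic, fathom.pl (which generated the report) may differ in its details, and the file name is a stand-in:

# compute Gunning fog, Flesch reading ease, and Flesch-Kincaid grade level
import re

def syllables( word ):
    # crude heuristic: count runs of vowels
    return max( 1, len( re.findall( r'[aeiouy]+', word.lower() ) ) )

def readability( text ):
    sentences      = max( 1, len( re.findall( r'[.!?]+', text ) ) )
    words          = re.findall( r"[A-Za-z']+", text )
    count          = max( 1, len( words ) )
    syllable_count = sum( syllables( word ) for word in words )
    complex_count  = sum( 1 for word in words if syllables( word ) >= 3 )
    fog     = 0.4 * ( ( count / sentences ) + 100 * ( complex_count / count ) )
    flesch  = 206.835 - 1.015 * ( count / sentences ) - 84.6 * ( syllable_count / count )
    kincaid = 0.39 * ( count / sentences ) + 11.8 * ( syllable_count / count ) - 15.59
    return fog, flesch, kincaid

print( readability( open( 'prince.txt', encoding='utf-8' ).read() ) )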

Search

Search for “man or men” in The Prince. Search for “man or men” in Representative Men.

Observations

I observe this project to be a qualified success.

First, I was able to print and bind my book, and while the glue is still drying, I’m confident the final results will be more than usable. The real tests of the bound book are to see if: 1) I actually read it, 2) I annotate it using my personal method, and 3) I am able to identify answers to my research questions, above.


[photographs: bookmaking tools; almost done]

Second, the text mining services turned out to be more of a compare & contrast methodology as opposed to a question-answering process. For example, I can see that one book was written hundreds of years before the other. The second book is almost twice as long as the first. Readability score-wise, Machiavelli is almost certainly written for the more educated, and Emerson is easier to read. The frequencies and parts-of-speech are enumerative, but not necessarily illustrative. There are a number of ways the frequencies and parts-of-speech could be improved. For example, just about everything could be visualized into histograms or word clouds. The verbs ought to be lemmatized. The frequencies ought to be depicted as ratios relative to the sizes of the texts. Other measures could be created as well. For example, my Great Books Coefficient could be employed.

How do Emerson and Machiavelli define a “great” man? Hmmm… Well, I’m not sure. It is relatively easy to get “definitions” of men in both books (The Prince or Representative Men). And network diagrams illustrating what words are used “in the same breath” as the word man in both works are not very dissimilar:


[network diagrams: “man” in The Prince; “man” in Representative Men]

I think I’m going to have to read the books to find the answer. Really.

Code

Bunches o’ code was written to produce the reports:

  • concordance.cgi – the simple search engine
  • fathom.pl – used to compute the readability scores
  • file2pos.py – create a parts-of-speech file for later use
  • network.cgi – used to display words used “in the same breath” as a given word
  • ngrams.pl – compute ngrams
  • pos.py – count and tabulate parts-of-speech from a previously created file

You can download this entire project — code and all — from http://dh.crc.nd.edu/sandbox/thatcamp-2015/reports/thatcamp-2015.tar.gz or ./wp-content/uploads/2015/04/thatcamp-2015.tar.gz.

Text mining Charles Dickens

Saturday, December 4th, 2010

This posting outlines how a person can do a bit of text mining against three works by Charles Dickens using a set of two Perl modules — Lingua::EN::Ngram and Lingua::Concordance.

Lingua::EN::Ngram

I recently wrote a Perl module called Lingua::EN::Ngram. Its primary purpose is to count all the ngrams (two-word phrases, three-word phrases, n-word phrases, etc.) in a given text. For two-word phrases (bigrams) it will order the output according to a statistical probability (t-score). Given a number of texts, it will count the ngrams common across the corpus. As of version 0.02 it supports non-ASCII characters, making it possible to correctly read and parse a greater number of Romance languages — meaning it correctly interprets characters with diacritics. Lingua::EN::Ngram is available from CPAN.
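Readers who prefer Python can get a comparable bigram ranking from the NLTK’s collocation finder. This is an analog of the module’s t-score ordering, not the module itself, and the file name is a stand-in:

# rank bigrams by Student's t-score with the NLTK
from nltk import word_tokenize
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text     = open( 'christmas-carol.txt', encoding='utf-8' ).read().lower()
words    = [ word for word in word_tokenize( text ) if word.isalpha() ]
measures = BigramAssocMeasures()
finder   = BigramCollocationFinder.from_words( words )
print( finder.nbest( measures.student_t, 10 ) )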

Lingua::Concordance

Concordances are just about the oldest of textual analysis tools. Originally developed in the Late Middle Ages to analyze the Bible, they are essentially KWIC (keyword in context) indexes used to search and display ngrams within the greater context of a work. Given a text (such as a book or journal article) and a query (regular expression), Lingua::Concordance can display the occurrences of the query in the text as well as map their locations across the entire text. In a previous blog posting I used Lingua::Concordance to compare & contrast the use of the phrase “good man” in the works of Aristotle, Plato, and Shakespeare. Lingua::Concordance too is available from CPAN.

Charles Dickens

In keeping with the season, I wondered about Charles Dickens’s A Christmas Carol. How often is the word “Christmas” used in the work, and where? In terms of size, how does A Christmas Carol compare to some of Dickens’s other works? Are there sets of commonly used words or phrases between those texts?

Answering the first question was relatively easy. The word “Christmas” occurs eighty-six (86) times, and twenty-two (22) of those occurrences are in the first ten percent (10%) of the story. The following bar chart illustrates these facts:

bar chart
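
For what it is worth, the counting and mapping behind such a chart is straightforward. Here is a sketch, in Python rather than the Perl modules described above, that tallies where a given word falls across the tenths of a text; the file name and query are supplied on the command line:

  # tally where a given word occurs across the tenths of a text (illustrative sketch)
  import re, sys
  from collections import Counter

  file, query = sys.argv[1], sys.argv[2]

  with open(file, encoding='utf-8') as handle:
      text = handle.read().lower()

  # record the offset of each occurrence and assign it to a tenth of the text
  offsets = [match.start() for match in re.finditer(query.lower(), text)]
  deciles = Counter(int(10 * offset / len(text)) for offset in offsets)

  print(query, 'occurs', len(offsets), 'times')
  for tenth in range(10):
      print('%3d%% %s' % (tenth * 10, '*' * deciles[tenth]))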

The length of books (or just about any text) measured in pages is ambiguous, at best. A much more meaningful measure is the number of words. The following table lists the sizes, in words, of three Dickens stories:

story               size in words
A Christmas Carol          28,207
Oliver Twist              156,955
David Copperfield         355,203

For some reason I thought A Christmas Carol was much longer.

A long time ago I calculated the average size (in words) of the books in my Alex Catalogue. Once I figured this out, I discovered I could describe items in the collection based on relative sizes. The following “dial” charts bring the point home. Each one of the books is significantly different in size:

christmas carol
A Christmas Carol
oliver twist
Oliver Twist
david copperfield
David Copperfield

If you were pressed for time, then which story would you be able to read?

After looking for common ngrams between texts, I discovered that “taken with a violent fit of” appears in both David Copperfield and A Christmas Carol. Interesting!? Moreover, the phrase “violent fit” appears in all three works. Specifically, characters in these three Dickens stories have violent fits of laughter, crying, trembling, and coughing. By concatenating the stories together and applying concordancing methods I see there are quite a number of violent things in the three stories:

  n such breathless haste and violent agitation, as seemed to betoken so
  ood-night, good-night!' The violent agitation of the girl, and the app
  sberne) entered the room in violent agitation. 'The man will be taken,
  o understand that, from the violent and sanguinary onset of Oliver Twi
  one and all, to entertain a violent and deeply-rooted antipathy to goi
  eep a little register of my violent attachments, with the date, durati
  cal laugh, which threatened violent consequences. 'But, my dear,' said
  in general, into a state of violent consternation. I came into the roo
  artly to keep pace with the violent current of her own thoughts: soon 
  ts and wiles have brought a violent death upon the head of one worth m
   There were twenty score of violent deaths in one long minute of that 
  id the woman, making a more violent effort than before; 'the mother, w
   as it were, by making some violent effort to save himself from fallin
  behind. This was rather too violent exercise to last long. When they w
   getting my chin by dint of violent exertion above the rusty nails on 
  en who seem to have taken a violent fancy to him, whether he will or n
  peared, he was taken with a violent fit of trembling. Five minutes, te
  , when she was taken with a violent fit of laughter; and after two or 
  he immediate precursor of a violent fit of crying. Under this impressi
  and immediately fell into a violent fit of coughing: which delighted T
  of such repose, fell into a violent flurry, tossing their wild arms ab
   and accompanying them with violent gesticulation, the boy actually th
  ght I really must have laid violent hands upon myself, when Miss Mills
   arm tied up, these men lay violent hands upon him -- by doing which, 
   every aggravation that her violent hate -- I love her for it now -- c
   work himself into the most violent heats, and deliver the most wither
  terics were usually of that violent kind which the patient fights and 
   me against the donkey in a violent manner, as if there were any affin
   to keep down by force some violent outbreak. 'Let me go, will you,--t
  hands with me - which was a violent proceeding for him, his usual cour
  en.' 'Well, sir, there were violent quarrels at first, I assure you,' 
  revent the escape of such a violent roar, that the abused Mr. Chitling
  t gradually resolved into a violent run. After completely exhausting h
  , on which he ever showed a violent temper or swore an oath, was this 
  ullen, rebellious spirit; a violent temper; and an untoward, intractab
  fe of Oliver Twist had this violent termination or no. CHAPTER III REL
  in, and seemed to presage a violent thunder-storm, when Mr. and Mrs. B
  f the theatre, are blind to violent transitions and abrupt impulses of
  ming into my house, in this violent way? Do you want to rob me, or to

These observations simply raise other questions. Is violence a common theme in Dickens’s works? What other adjectives are used to a greater or lesser degree in Dickens’s works? How does the use of these adjectives differ from that of other authors of the same time period or elsewhere in the canon of English literature?
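
Finding phrases shared across texts, like “violent fit” above, amounts to intersecting lists of ngrams. Here is a minimal sketch, assuming each story has been saved as a plain text file and the file names are given on the command line; it is an illustration, not the code used for this posting:

  # list the two-word phrases appearing in every given text (illustrative sketch)
  import re, sys

  n = 2  # the length of the phrases to compare

  def ngrams(file):
      """Return the set of n-word phrases found in a plain text file."""
      with open(file, encoding='utf-8') as handle:
          words = re.findall(r"[a-z']+", handle.read().lower())
      return set(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

  # intersect the ngram sets of all the given files and print what they share
  common = set.intersection(*(ngrams(file) for file in sys.argv[1:]))
  for phrase in sorted(common):
      print(' '.join(phrase))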

Summary

With the combination of the Internet, copious amounts of freely available full text, and ubiquitous as well as powerful desktop computing, it is now possible to analyze texts in ways that were not feasible twenty years ago. While the application of computing techniques to texts dates back to at least Father Busa’s concordance work in the 1960s, it has only been in the last decade that digital humanities has come into its own. The application of digital humanities to library work offers great opportunities for the profession. Their goals are similar and their tools are complementary. From my point of view, their combination is a marriage made in heaven.

A .zip file of the texts and scripts used to do the analysis is available for you to download and experiment with yourself. Enjoy.

Lingua::EN::Bigram (version 0.03)

Monday, August 23rd, 2010

I uploaded version 0.03 of Lingua::EN::Bigram to CPAN today, and it now supports not just bigrams, trigrams, and quadgrams, but ngrams of arbitrary phrase length.

In order to test it out, I quickly gathered together some of my more recent essays, concatenated them together, and applied Lingua::EN::Bigram against the result. Below is a list of the top 10 most common bigrams, trigrams, and quadgrams:

  bigrams                 trigrams                  quadgrams
  52  great ideas         36  the number of         25  the number of times
  43  open source         36  open source software  13  the total number of
  38  source software     32  as well as            10  at the same time
  29  great books         28  number of times       10  number of words in
  24  digital humanities  27  the use of            10  when it comes to
  23  good man            25  the great books       10  total number of documents
  22  full text           23  a set of              10  open source software is
  22  search results      20  eric lease morgan      9  number of times a
  20  lease morgan        20  a number of            9  as well as the
  20  eric lease          19  total number of        9  through the use of

Not surprising since I have been writing about the Great Books, digital humanities, indexing, and open source software. Re-affirming.

Lingua::EN::Bigram is available locally as well as from CPAN.

Text mining against NGC4Lib

Friday, June 25th, 2010

I “own” a mailing list called NGC4Lib. Its purpose is to provide a forum for the discussion of all things “next generation library catalog”. As of this writing, there are about 2,000 subscribers.

Lately I have been asking myself, “What sorts of things get discussed on the list and who participates in the discussion?” I thought I’d try to answer this question with a bit of text mining. This analysis only covers the current year to date, 2010.

Author names

Even though there are as many as 2,000 subscribers, only a tiny few actually post comments. The following pie and line charts illustrate the point without naming any names. As you can see, eleven (11) people contribute 50% of the postings.

posters
11 people post 50% of the messages

The line chart illustrates the same point differently; a few people post a lot. We definitely have a long tail going on here.

posters
They definitely represent a long tail
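
The arithmetic behind the “eleven people post half the messages” observation is a simple cumulative sum. A sketch, assuming the sender of each posting has already been extracted from the mailing list archive into a file named senders.txt, one name per line per message (the file name is made up for the example):

  # how many posters account for half of the messages? (illustrative sketch)
  from collections import Counter

  # senders.txt is assumed to hold one sender name per line, one line per message
  with open('senders.txt', encoding='utf-8') as handle:
      counts = Counter(line.strip() for line in handle if line.strip())

  # walk down the list of posters, most prolific first, until half the total is reached
  total, running = sum(counts.values()), 0
  for rank, (name, count) in enumerate(counts.most_common(), start=1):
      running += count
      if running >= total / 2:
          print(rank, 'people account for 50% of', total, 'messages')
          break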

Subject lines

The most frequently used individual subject line words more or less reflect traditional library cataloging practices. MARC. MODS. Cataloging. OCLC. But also notice how the word “impasse” is included. This may reflect something about the list.

subject words
The subject words look “traditional”

I’m not quite sure what to make of the most commonly used subject word bigrams.

subject bigrams
‘Don’t know what to make of these bigrams

Body words

The most frequently used individual words in the body of the postings tell a nice story. Library. Information. Data. HTTP. But notice what is not there — books. I also don’t see things like collections, acquisitions, public, services, value, or evaluation. Hmm…

body words
These tell a nice story

The most frequently used bigrams in the body of the messages tell an even more interesting story because they are dominated by the names of people and things.

body bigrams
Names of people and things

The phrases “information services” and “technical services” do not necessarily fit that description. Using a concordance to see how these words were being used, I discovered they were overwhelmingly a part of one or more persons’ email signatures or job descriptions. Not what I was hoping for. (Sigh.)

Conclusions

Based on these observations, as well as my personal experience, I believe the NGC4Lib mailing list needs more balance. It needs more balance in a couple of ways:

  1. There are too few people who post the majority of the content. The opinions of eleven people do not, IMHO, represent the ideas and beliefs of more than 2,000. I am hoping these few people understand this and will moderate themselves accordingly.
  2. The discussion is too much focused, IMHO, on traditional library cataloging. There is so much more to the catalog than metadata. We need to be asking questions about what it contains, how that stuff is selected and how it gets in there, what the stuff is used for, and how all of this fits into the broader, worldwide information environment. We need to be discussing issues of collection and dissemination, not just organization. Put another way, I wish I had not used the word “catalog” in the name of the list because I think the word brings along too many connotations and preconceived ideas.

As the owner of the list, what will I do? Frankly, I don’t know. Your thoughts and comments are welcome.

Tarzan of the Apes

Monday, December 1st, 2008

This is a simple word cloud of Edgar Rice Burroughs’ Tarzan of the Apes:


tarzan  little  clayton  great  jungle  before  d’arnot  jane  back  about  cabin  mr  toward  porter  professor  saw  again  time  philander  eyes  strange  know  first  here  though  never  old  turned  many  after  black  forest  left  hand  own  thought  day  knew  beneath  body  head  see  young  life  long  found  most  girl  lay  village  face  tribe  wild  away  tree  until  ape  down  must  seen  far  within  door  white  few  much  esmeralda  savage  above  once  dead  mighty  ground  stood  side  last  trees  apes  cried  thing  among  moment  took  hands  new  off  without  almost  beast  huge  alone  close  just  tut  canler  nor  way  knife  small  

I found this story to have quite a number of similarities with James Fenimore Cooper’s The Last of the Mohicans. The central character in both was super human. Both include some sort of wilderness. In The Last of the Mohicans it was the forest. In Tarzan it was the jungle. In both cases the wilderness was inhabited by savages: Indians, apes, or pirates. Both included damsels in distress who were treated in a rather Victorian manner and were sought after by an unwanted lover. Both included characters with little common sense: David and Professor Porter.

I found Tarzan much more readable and story-like compared to The Last of the Mohicans. It can really be divided into two parts. The first half is a character development: who is Tarzan, and how did he become who he is? The second half is a love story, more or less, where Tarzan pursues his love. I found it rather distasteful that Tarzan was a man of “breeding”. I don’t think people are to be bred like animals.