Archive for June, 2009

Lingua::EN::Bigram (version 0.01)

Tuesday, June 23rd, 2009

Below is the POD (Plain Old Documentation) file describing a Perl module I wrote called Lingua::EN::Bigram.

The purpose of the module is to: 1) extract all of the two-word phrases from a given text, and 2) rank each phrase according to its probability of occurrence. Very nice for doing textual analysis. For example, by applying this module to Mark Twain’s Adventures of Tom Sawyer it becomes evident that the significant two-word phrases are names of characters in the story. Ralph Waldo Emerson’s Essays: First Series, on the other hand, returns action statements — instructions. Henry David Thoreau’s Walden returns “walden pond” and descriptions of pine trees. Interesting.

The code is available here or on CPAN.

NAME

Lingua::EN::Bigram – Calculate significant two-word phrases based on frequency and/or T-Score

SYNOPSIS

  use Lingua::EN::Bigram;
  $bigram = Lingua::EN::Bigram->new;
  $bigram->text( 'All men by nature desire to know. An indication of this...' );
  $tscore = $bigram->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    print "$$tscore{ $_ }\t" . "$_\n";
  
  }

DESCRIPTION

This module is designed to: 1) pull out all of the two-word phrases (collocations or “bigrams”) in a given text, and 2) list these phrases according to their frequency and/or T-Score. Using this module it is possible to create a list of the most common two-word phrases in a text as well as order them by their probable occurrence, thus implying significance.
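Conceptually, the bigram extraction is nothing more than a sliding window over the list of tokens. Here is a minimal sketch of the idea, assuming $text holds the content to be analyzed and using a naive tokenization for illustration; it is not the module’s actual implementation:

  # pair each token with the token that follows it
  my @words = split /\s+/, lc( $text );
  my @bigrams;
  for my $i ( 0 .. $#words - 1 ) {
  
    push @bigrams, "$words[ $i ] $words[ $i + 1 ]";
  
  }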

METHODS

new

Create a new, empty bigram object:

  # initialize
  $bigram = Lingua::EN::Bigram->new;

text

Set or get the text to be analyzed:

  # set the attribute
  $bigram->text( 'All good things must come to an end...' );
  
  # get the attribute
  $text = $bigram->text;

words

Return a list of all the tokens in a text. Each token will be a word or punctuation mark:

  # get words
  @words = $bigram->words;

word_count

Return a reference to a hash whose keys are tokens and whose values are the number of times each token occurs in the text:

  # get word count
  $word_count = $bigram->word_count;
  
  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } <=> $$word_count{ $a } } keys %$word_count ) {
  
    print $$word_count{ $_ }, "\t$_\n";
  
  }

bigrams

Return a list of all bigrams in the text. Each item will be a pair of tokens, and the tokens may be words or punctuation marks:

  # get bigrams
  @bigrams = $bigram->bigrams;

bigram_count

Return a reference to a hash whose keys are bigrams and whose values are the frequency of each bigram in the text:

  # get bigram count
  $bigram_count = $bigram->bigram_count;
  
  # list the bigrams according to frequency
  foreach ( sort { $$bigram_count{ $b } <=> $$bigram_count{ $a } } keys %$bigram_count ) {
  
    print $$bigram_count{ $_ }, "\t$_\n";
  
  }

tscore

Return a reference to a hash whose keys are bigrams and whose values are T-Scores — a probabilistic calculation determining the significance of a bigram occurring in the text:

  # get t-score
  $tscore = $bigram->tscore;
  
  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    print "$$tscore{ $_ }\t" . "$_\n";
  
  }
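For the curious, a T-Score compares a bigram’s observed frequency with the frequency expected if its two words occurred independently. The following is a sketch of that standard calculation built on the methods above; it assumes the bigram keys are space-delimited word pairs (as the examples suggest) and is not necessarily identical to the module’s internals:

  # approximate the t-score for each bigram: (observed - expected) / sqrt(observed)
  my @words        = $bigram->words;
  my $word_count   = $bigram->word_count;
  my $bigram_count = $bigram->bigram_count;
  my $n            = scalar @words;
  my %tscore;
  foreach my $pair ( keys %$bigram_count ) {
  
    my ( $first, $second ) = split / /, $pair;
    my $observed = $$bigram_count{ $pair };
    my $expected = $$word_count{ $first } * $$word_count{ $second } / $n;
    $tscore{ $pair } = ( $observed - $expected ) / sqrt( $observed );
  
  }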

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help “digital humanists” apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of a given bigram, but the tscore method can order them in a more finely tuned manner.

Consider using T-Score-weighted bigrams as classification terms to supplement the “aboutness” of texts. Concatenate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

Each bigram includes punctuation. This is intentional. Developers may want to remove bigrams containing such values from the output. Similarly, no effort has been made to remove commonly used words — stop words — from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution comes with a script (bin/bigrams.pl) demonstrating how to remove punctuation and stop words from the displayed output.
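For example, a filter along these lines — a sketch, not a copy of bin/bigrams.pl — uses Lingua::StopWords to weed out noisy bigrams before display, assuming the $tscore hash reference from the synopsis:

  # list t-scores, skipping punctuation and stop words
  use Lingua::StopWords qw( getStopWords );
  my $stopwords = getStopWords( 'en' );
  foreach my $pair ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {
  
    next if ( $pair =~ /[^a-z ]/ );    # assumes lower-cased tokens; skip punctuation and digits
    my ( $first, $second ) = split / /, $pair;
    next if ( $$stopwords{ $first } or $$stopwords{ $second } );
    print "$$tscore{ $pair }\t$pair\n";
  
  }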

Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.

TODO

There are probably a number of ways the module can be improved:

  • the constructor method could take a scalar as input, thus reducing the need for the text method
  • the distribution’s license should probably be changed to the Perl Artistic License
  • the addition of alternative T-Score calculations would be nice
  • it would be nice to support n-grams
  • make sure the module works with character sets beyond ASCII

ACKNOWLEDGEMENTS

T-Score is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. Page 109.

AUTHOR

Eric Lease Morgan <eric_morgan@infomotions.com>

Lingua::Concordance (version 0.01)

Wednesday, June 10th, 2009

Below is a man page describing a Perl module I recently wrote called Lingua::Concordance (version 0.01).

Given the increasing availability of full text books and journals, I think it behooves the library profession to aggressively explore the possibilities of providing services against text as a means of making the proverbial fire hose of information more useful. Providing concordance-like functions against texts is just one example.

The distribution is available from this blog as well as CPAN.

NAME

Lingua::Concordance – Keyword-in-context (KWIC) search interface

SYNOPSIS

  use Lingua::Concordance;
  $concordance = Lingua::Concordance->new;
  $concordance->text( 'A long time ago, in a galaxy far far away...' );
  $concordance->query( 'far' );
  foreach ( $concordance->lines ) { print "$_\n" }

DESCRIPTION

Given a scalar (such as the content of a plain text electronic book or journal article) and a regular expression, this module implements a simple keyword-in-context (KWIC) search interface — a concordance. Its purpose is to return lists of lines from a text containing the given expression. See the Discussion section, below, for more detail.

METHODS

new

Create a new, empty concordance object:

  $concordance = Lingua::Concordance->new;

text

Set or get the value of the concordance’s text attribute where the input is expected to be a scalar containing some large amount of content, like an electronic book or journal article:

  # set text attribute
  $concordance->text( 'Call me Ishmael. Some years ago- never mind how long...' );

  # get the text attribute
  $text = $concordance->text;

Note: The scalar passed to this method gets internally normalized, specifically, all carriage returns are changed to spaces, and multiple spaces are changed to single spaces.
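In other words, the normalization amounts to a couple of substitutions along these lines (a sketch; the module may do a bit more):

  # normalize whitespace, as described above
  $text =~ s/\r/ /g;    # carriage returns become spaces
  $text =~ s/ +/ /g;    # runs of spaces become single spaces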

query

Set or get the value of the concordance’s query attribute. The input is expected to be a regular expression but a simple word or phrase will work just fine:

  # set query attribute
  $concordance->query( 'Ishmael' );

  # get query attribute
  $query = $concordance->query;

See the Discussion section, below, for ways to make the most of this method through the use of powerful regular expressions. This is where the fun is.

radius

Set or get the length of each line returned from the lines method, below. Each line will be padded on the left and the right of the query with the number of characters necessary to equal the value of radius. This makes it easier to sort the lines:

  # set radius attribute
  $concordance->radius( $integer );

  # get radius attribute
  $integer = $concordance->radius;

For terminal-based applications it is usually not reasonable to set this value to greater than 30. Web-based applications can use arbitrarily large numbers. The internally set default value is 20.

sort

Set or get the type of line sorting:

  # set sort attribute
  $concordance->sort( 'left' );

  # get sort attribute
  $sort = $concordance->sort;

Valid values include:

  • none – the default value; sorts lines in the order they appear in the text — no sorting
  • left – sorts lines by the (ordinal) word to the left of the query, as defined by the ordinal method, below
  • right – sorts lines by the (ordinal) word to the right of the query, as defined by the ordinal method, below
  • match – sorts lines by the value of the query (mostly)

This is good for looking for patterns in texts, such as collocations (phrases, bi-grams, and n-grams). Again, see the Discussion section for hints.

ordinal

Set or get the number of words to the left or right of the query to be used for sorting purposes. The internally set default value is 1:

  # set ordinal attribute
  $concordance->ordinal( 2 );

  # get ordinal attribute
  $integer = $concordance->ordinal;

Used in combination with the sort method, above, this is good for looking for textual patterns. See the Discussion section for more information.

lines

Return a list of lines from the text matching the query. Our raison d’être:

  @lines = $concordance->lines;
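Putting the methods together, a small script for browsing the words around a query might look like the sketch below; the file name, query, radius, and sort settings are only illustrative:

  # display a sorted keyword-in-context list
  use Lingua::Concordance;
  open my $fh, '<', 'walden.txt' or die $!;
  my $text = do { local $/; <$fh> };
  my $concordance = Lingua::Concordance->new;
  $concordance->text( $text );
  $concordance->query( 'nature' );
  $concordance->radius( 30 );
  $concordance->sort( 'right' );
  $concordance->ordinal( 1 );
  print "$_\n" foreach $concordance->lines;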

DISCUSSION

[Elaborate upon a number of things here such as but not limited to: 1) the history of concordances and concordance systems, 2) the usefulness of concordances in the study of linguistics, 3) how to exploit regular expressions to get the most out of a text and find interesting snippets, and 4) how the module might be implemented in scripts and programs.]

BUGS

The internal _by_match subroutine, the one used to sort results by the matching regular expression, does not work exactly as expected. Instead of sorting by the matching regular expression, it sorts by the string exactly to the right of the matched regular expression. Consequently, for queries such as ‘human’, it correctly matches and sorts on human, humanity, and humans, but matches such as Humanity do not necessarily come before humanity.
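One possible fix, sketched here without reference to the module’s actual internals, is to capture the matched string itself and sort on a lower-cased copy of it:

  # hypothetical comparator: sort lines by the matched string, case-insensitively
  my $query  = $concordance->query;
  my @sorted = sort {
  
    my ( $match_a ) = ( $a =~ /($query\w*)/i );
    my ( $match_b ) = ( $b =~ /($query\w*)/i );
    lc( $match_a ) cmp lc( $match_b );
  
  } @lines;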

TODO

  • Write Discussion section.
  • Implement error checking.
  • Fix the _by_match bug.
  • Enable all of the configuration methods (text, query, radius, sort, and ordinal) to be specified in the constructor.
  • Require the text and query attributes to be specified as a part of the constructor, maybe.
  • Remove line-feed characters while normalizing text to accommodate Windows-based text streams, maybe.
  • Write an example CGI script, to accompany the distribution’s terminal-based script, demonstrating how the module can be implemented in a Web interface.
  • Write a full-featured terminal-based script enhancing the one found in the distribution.

ACKNOWLEDGEMENTS

The module implements, almost verbatim, the concordance programs and subroutines described in Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. pgs: 169-185. “Thanks Roger. I couldn’t have done it without your book!”

EAD2MARC

Friday, June 5th, 2009

This posting simply shares three hacks I’ve written to enable me to convert EAD files to MARC records, and ultimately add them to my “discovery” layer — VuFind — for the Catholic Portal:

  • ead2marcxml.sh – Using xsltproc and a modified version of Terry Reese’s XSL stylesheet, converts all the EAD (.xml) files in the current directory into MARCXML files. “Thanks Terry!”
  • marcxml2marc.sh – Using yaz-marcdump, converts all .marcxml files in the current directory into “real” MARC records.
  • add-001.pl – A hack to add 001 fields to MARC records. Sometimes necessary since the EAD files do not always have unique identifiers.
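The gist of the third hack is simply to loop over a batch of records and insert a control field. A minimal sketch of the idea using MARC::Batch — not necessarily the distributed add-001.pl, and with a made-up file name and identifier scheme — might look like this:

  # add a sequential 001 control field to each record in a batch
  use MARC::Batch;
  use MARC::Field;
  my $batch = MARC::Batch->new( 'USMARC', 'ead.marc' );
  my $i     = 0;
  while ( my $record = $batch->next ) {
  
    $record->insert_fields_ordered( MARC::Field->new( '001', 'ead-' . ++$i ) );
    print $record->as_usmarc;
  
  }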

The distribution is available in the archives, and distributed under the GNU Public License.

Now, off to go fishing.

Text mining: Books and Perl modules

Wednesday, June 3rd, 2009

This posting simply lists some of the books I’ve read and Perl modules I’ve explored in regards to the field of text mining.

Through my explorations of term frequency/inverse document frequency (TFIDF) I became aware of a relatively new field of study called text mining. In many ways, text mining is similar to data mining only applied to unstructured texts instead of database rows and columns. Think plain text books such as items from Project Gutenberg or the Open Content Alliance. Text mining is a process including automatic classification, clustering (similar but distinct from classification), indexing and searching, entity extraction (names, places, organizations, dates, etc.), statistically significant keyword and phrase extraction, parts of speech tagging, and summarization.

As a librarian, I found the whole thing extremely fascinating, consequently I read more.

Books

I have found the following four books helpful. They have enabled me to learn about the principles of text mining.

  • Bilisoly, R. (2008). Practical text mining with Perl. Wiley series on methods and applications in data mining. Hoboken, N.J.: Wiley. – Of all the books listed here, this one includes the most Perl programming examples, and it is not as scholarly as the balance of the list. Much of the book surrounds the description of regular expressions against texts. Its strongest suit is the creation of terminal-based concordance scripts. Very nice. Lots of fun. The concordances return very interesting results. The book does describe clustering techniques too, but on the overall topic of automatic metadata generation the book is not very strong.
  • Konchady, M. (2006). Text mining application programming. Charles River Media programming series. Boston, Mass: Charles River Media. – This book is a readable survey of text mining covering parts of speech (POS) tagging, information extraction, search engines, clustering, classification, summarization, and question/answer processing. Many models for each aspect of text mining are described, compared, and contrasted. To put the author’s knowledge into practice, the book comes with a CD containing a Perl library for text mining, sample applications, and CGI scripts. This library is freely available on the Web.
  • Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer. – Of the four books listed here, this one is probably the most dense. I found its Perl scripts used to parse text more useful than the ones in Bilisoly, but this one included no concordance applications. I also found the description of n-grams to be very interesting — the extraction of multi-word phrases. I suspect the model they describe can be extended to any number of words. This book also discusses parts of speech (POS) processing but it is the only one that describes how to really parse language. Think semantics, lexicons, discourse, and dialog. After the first couple of chapters the Perl examples disappear and give way to exclusively Prolog examples.
  • Weiss, S. M. (2005). Text mining: Predictive methods for analyzing unstructured information. New York: Springer. – The complexity of this book lies between Konchady and Nugues; it includes a greater number of mathematical models than Konchady, but it is easier to read than Nugues. Broad topics include textual documents as numeric vectors, using text for prediction, information retrieval, clustering & classification, and looking for information in documents. Each chapter includes a section called “Historical and Bibliographical Remarks” which has proved to be very interesting reading.

When it comes to the process of text mining I found each of these books useful in its own right. Each provided me with ways of reading texts, parsing texts, counting words, counting phrases, and, through the application of statistical analysis, creating lists and readable summaries denoting the “aboutness” of given documents.

Perl modules

As a Perl hacker I am interested in writing scripts putting into practice some of the things I learn. Listed here are a number of modules that have gotten me further along in regard to text mining:

  • Lingua::EN::Fathom – This library outputs interesting statistics regarding a given document: number of words and the number of times each occurs, number of sentences, complexity of words, number of paragraphs, etc. Of greatest interest are numbers (Fog, Flesch, and Flesch-Kincaid) denoting the readability of the text. Quick. Easy. Useful. (See the short sketch after this list.)
  • Lingua::EN::Keywords – Given a text, this library outputs a list of what it thinks are the most significant individual words in a document, sans stop words. Not fancy.
  • Lingua::EN::NamedEntity – Given a text, I believe this library comes pre-trained to extract names, places, and organizations from texts. It returns a Perl data structure listing the probabilities of a word or phrase being any particular entity. It may need to be re-trained to work for your corpus.
  • Lingua::EN::Semtags::Engine – Given text this module will return words and phrases in a relevancy ranked order. Initially, I have had some problems using this module because it seems to take a long time to return. On the other hand, it looks promising since it returns both individual words as well as phrases.
  • Lingua::EN::Summarize – Given a text this library returns sentences it thinks encapsulates the essence of the document. The result is readable — grammatically correct. The process it uses to accomplish its task is self-proclaimed as unscientific.
  • Lingua::EN::Tagger – This library marks up a document in pseudo XML with tags denoting parts of speech in a given document. To do this work it also can extract words, noun phrases, and sentences from a text. Zippy. Probability-based. Developers are expected to parse the tagged output and do analysis against it, such as count the number of times particular parts of speech occur.
  • Lingua::StopWords – Returns a simple list of stop words. Easy, but I can’t figure out how customizable it is. “One person’s stop word list is another person’s research topic.”
  • Net::Dict – A network interface to DICT (dictionary) servers. While the DICT protocol is a bit long in the tooth, and not quite as cool as Web interfaces to things like Google or Wikipedia, this module does provide a handy way to look up definitions, a complementary functionality to WordNet.
  • Text::Aspell – A Perl interface to GNU Aspell which is great for spell-checking applications.
  • TextMine – This is a set of modules written by Manu Konchady, the author of Text Mining Application Programming. It includes submodules named Cluster, Entity, Index, Pos, Quanda (Q & A), Summary, Tokens, and WordNet. While this set of modules is the most comprehensive I’ve seen, and while they are probably the most theoretically grounded, interfacing with things like WordNet to be thorough, my initial experience has been a bit frustrating since scripts written against the libraries do not return results very quickly. Maybe I’m feeding them documents that are too large, and if so, then the libraries are not necessarily scalable.
  • WordNet – There are a bevy of modules providing functionality against WordNet — a “lexical database of English… Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.” Any truly thorough text mining application of English will take advantage of WordNet.
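As a small example of how painless some of these modules are to use, here is a sketch against Lingua::EN::Fathom, the first module in the list above; the file name is illustrative:

  # report a few readability statistics for a plain text file
  use Lingua::EN::Fathom;
  my $fathom = Lingua::EN::Fathom->new;
  $fathom->analyse_file( 'walden.txt' );
  print 'words:   ', $fathom->num_words, "\n";
  print 'fog:     ', $fathom->fog, "\n";
  print 'flesch:  ', $fathom->flesch, "\n";
  print 'kincaid: ', $fathom->kincaid, "\n";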

Text mining and librarianship

Given the volume of “born digital” material being created, it is not possible to apply traditional library methods against it. The hand-crafted, heavy human touch process is not scalable. Given the amounts of mass digitized text being generated from the Google Books Project and the Open Content Alliance, new opportunities for literary analysis make themselves evident. Again, the traditional library processes cannot fill the bill in these regards.

Text mining techniques offer possible solutions to these problems. Count words. Count phrases. Compare these words, phrases, and counts to other texts. Determine their statistical significance. Assign them to documents in the form of subject headings, keywords, author names, and other added entries in our metadata formats. Given large numbers of books, articles, and other “wordy” documents, learn how to “save the time of the reader” by summarizing these documents and ranking them in some sort of order in addition to alphabetical or date. Compare and contrast full text works by learning what words and types of words are used in documents. Are the words religious in nature? Mathematical and scientific? Poetic? Such things will provide additional means for understanding and interpreting everything from scholarly journal articles to works of classic fiction and philosophy. These techniques are not intended to replace existing methods of understanding and organization, but rather to supplement and build upon them. This is an evolutionary process.

If libraries and librarians desire to remain relevant in the evolving information environment, then they will need to do the good work they do differently. The problem to be solved nowadays is less about access and more about use. Text mining is one way of making the content of libraries more useful.

Internet Archive content in “discovery” systems

Tuesday, June 2nd, 2009

This quick posting describes how Internet Archive content, specifically, content from the Open Content Alliance can be quickly and easily incorporated into local library “discovery” systems. VuFind is used here as the particular example:

  1. Get keys – The first step is to get a set of keys describing the content you desire. This can be acquired through the Internet Archive’s advanced search interface.
  2. Convert keys – The next step is to convert the keys into sets of URLs pointing to the content you want to download. Fortunately, all the URLs have a similar shape: http://www.archive.org/download/KEY/KEY.pdf, http://www.archive.org/download/KEY/KEY_meta.mrc, or http://www.archive.org/download/KEY/KEY__djvu.txt. (A sketch of this conversion appears after this list.)
  3. Download – Feed the resulting URLs to your favorite spidering/mirroring application. I use wget.
  4. Update – Enhance the downloaded MARC records with 856$u values denoting the location of your local PDF copy as well as the original (canonical) version.
  5. Index – Add the resulting MARC records to your “discovery” system.
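For what it is worth, the key-to-URL conversion in Step #2 amounts to little more than wrapping each key in the URL patterns above. A sketch of the idea, not necessarily the distributed keys2urls.pl:

  # read Internet Archive keys, one per line, and print download URLs
  while ( my $key = <> ) {
  
    chomp $key;
    print "http://www.archive.org/download/$key/$key.pdf\n";
    print "http://www.archive.org/download/$key/${key}_meta.mrc\n";
    print "http://www.archive.org/download/$key/${key}__djvu.txt\n";
  
  }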

Linked here is a small distribution of shell and Perl scripts that do this work for me and incorporate the content into VuFind. Here is how they can be used:

  $ getkeys.sh > catholic.keys
  $ keys2urls.pl catholic.keys > catholic.urls
  $ mirror.sh catholic.urls
  $ updatemarc.pl
  $ find /usr/var/html/etexts -name '*.marc' \
  -exec cat {} >> /usr/local/vufind/marc/archive.marc \;
  $ cd /usr/local/vufind
  $ ./import.sh marc/archive.marc
  $ sudo ./vufind.sh restart

Cool next steps would be to use text mining techniques against the downloaded plain text versions of the documents to create summaries, extract named entities, and identify possible subjects. These items could then be inserted into the MARC records to enhance retrieval. Ideally the full text would be indexed, but alas, MARC does not accommodate that. “MARC must die.”
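For instance, a first pass at the summarization idea might marry Lingua::EN::Summarize (mentioned above) to MARC::Record, something like the sketch below; the choice of a 520 note, the $fulltext and $record variables, and the whole approach are assumptions of mine rather than a tested recipe:

  # hypothetical: add an automatically generated summary to a record as a 520 note
  use Lingua::EN::Summarize;
  use MARC::Field;
  my $summary = summarize( $fulltext );    # $fulltext is the downloaded plain text
  $record->insert_fields_ordered( MARC::Field->new( '520', '', '', a => $summary ) );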