Archive for June, 2010

ALA 2010

Wednesday, June 30th, 2010

This is the briefest of travelogues describing my experience at the 2010 ALA Annual Meeting in Washington (DC).

Pat Lawton and I gave a presentation at the White House Four Points Hotel on the “Catholic Portal”. Essentially it was a status report. We shared the podium with Jon Miller (University of Southern California) who described the International Mission Photography Archive — an extensive collection of photographs taken by missionaries from many denominations.

I then took the opportunity to visit my mother in Pennsylvania, but the significant point is the way I got out of town. I had lost my maps, and my iPad came to the rescue. The Google Maps application was very, very useful.

On Monday I shared a podium with John Blyberg (Darien Library) and Tim Spalding (LibraryThing) as a part of a Next-Generation Library Catalog Special Interest Group presentation. John provided an overview of the latest and greatest features of SOPAC, with a heavy emphasis on user-centered design. Tim described library content and services as not (really) being a part of the Web, and in many ways I agree with him. I outlined how a few digital humanities computing techniques could be incorporated into library collections and services in a presentation I called “The Next Next-Generation Library Catalog”. That afternoon I participated in a VUFind users-group meeting, and I learned that I am pretty much on target with regard to the features of this “discovery system”. Afterwards a number of us from the Catholic Research Resources Alliance (CRRA) listened to folks from Crivella West describe their vision of librarianship. The presentation was very interesting because they described how they have taken many collections of content and mined them for answers to questions. This is digital humanities to the extreme. Their software — the Knowledge Kiosk — is being used to analyze the writings of John Henry Newman at the Newman Institute.

Tuesday morning was spent with the CRRA again. We ratified next year’s strategic plan. In the afternoon I visited a few of my friends at the Library of Congress (LOC). There I learned a bit about how the LOC may be storing and archiving Twitter feeds. Interesting.

Text mining against NGC4Lib

Friday, June 25th, 2010

I “own” a mailing list called NGC4Lib. Its purpose is to provide a forum for the discussion of all things “next generation library catalog”. As of this writing, there are about 2,000 subscribers.

Lately I have been asking myself, “What sorts of things get discussed on the list and who participates in the discussion?” I thought I’d try to answer this question with a bit of text mining. This analysis only covers the current year to date, 2010.

Author names

Even though there are as many as 2,000 subscribers, only a tiny few actually post comments. The following pie and line charts illustrate the point without naming any names. As you can see, eleven (11) people contribute 50% of the postings.

[pie chart: 11 people post 50% of the messages]

The line chart illustrates the same point differently; a few people post a lot. We definitely have a long tail going on here.

[line chart: they definitely represent a long tail]
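Counting the posters requires very little code. Here is a minimal sketch in Perl, assuming the list archive has been saved as a single Unix mbox file; the file name is hypothetical, and counting raw From: headers will slightly overcount anybody whose header appears in quoted replies:

  #!/usr/bin/perl
  # tally the number of postings per author in an mbox archive
  use strict;
  use warnings;

  my %count;
  open my $mbox, '<', 'ngc4lib.mbox' or die "Can't open archive: $!";
  while ( my $line = <$mbox> ) {
      # the author of each message is named in its From: header
      $count{$1}++ if $line =~ /^From:\s*(.+)/;
  }
  close $mbox;

  # output each author and their number of postings, most prolific first
  foreach my $author ( sort { $count{$b} <=> $count{$a} } keys %count ) {
      print "$count{$author}\t$author\n";
  }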

Subject lines

The most frequently used individual subject line words more or less reflect traditional library cataloging practices. MARC. MODS. Cataloging. OCLC. But also notice how the word “impasse” is included. This may reflect something about the list.

[chart: the subject words look “traditional”]

I’m not quite sure what to make of the most commonly used subject word bigrams.

[chart: don’t know what to make of these bigrams]

Body words

The most frequently used individual words in the body of the postings tell a nice story. Library. Information. Data. HTTP. But notice what is not there — books. I also don’t see words like collections, acquisitions, public, services, value, or evaluation. Hmm…

[chart: these tell a nice story]

The most frequently used bigrams in the body of the messages tell an even more interesting story because they are dominated by the names of people and things.

[chart: names of people and things]

The phrases “information services” and “technical services” do not necessarily fit that description. Using a concordance to see how these words were being used, I discovered they were overwhelmingly a part of one or more persons’ email signatures or job descriptions. Not what I was hoping for. (Sigh.)

Conclusions

Based on these observations, as well as my personal experience, I believe the NGC4Lib mailing list needs more balance. It needs more balance in a couple of ways:

  1. Too few people post the majority of the content. The opinions of eleven people do not, IMHO, represent the ideas and beliefs of the more than 2,000 subscribers. I am hoping these few people understand this and will moderate themselves accordingly.
  2. The discussion is too focused, IMHO, on traditional library cataloging. There is so much more to the catalog than metadata. We need to be asking questions about what it contains, how that stuff is selected and how it gets in there, what the stuff is used for, and how all of this fits into the broader, worldwide information environment. We need to be discussing issues of collection and dissemination, not just organization. Put another way, I wish I had not used the word “catalog” in the name of the list because I think the word brings along too many connotations and preconceived ideas.

As the owner of the list, what will I do? Frankly, I don’t know. Your thoughts and comments are welcome.

The Next Next-Generation Library Catalog

Thursday, June 24th, 2010

With the advent of the Internet and wide-scale availability of full-text content, people are overwhelmed with the amount of accessible data and information. Library catalogs can only go so far when it comes to delimiting what is relevant and what is not. Even when the most exact searches return hundreds of hits, what is a person to do? Services against texts — digital humanities computing techniques — represent a possible answer. Whether the content is novels, works of literature, or scholarly journal articles, the methods of the digital humanities can provide ways to compare & contrast, analyze, and make more useful any type of content. This essay elaborates on these ideas and describes how they can be integrated into the “next, next-generation library catalog”.

(Because this essay is the foundation for a presentation at the 2010 ALA Annual Meeting, this presentation is also available as a one-page handout designed for printing as well as a bloated set of slides.)

Find is not the problem

Find is not the problem to be solved. At most, find is a means to an end and not the end itself. Instead, the problem to solve surrounds use. The profession needs to implement automated ways to make it easier for users to do things against content.

The library profession spends an inordinate amount of time and effort creating catalogs — essentially inventory lists of things a library owns (or licenses). The profession then puts a layer on top of this inventory list — complete with authority lists, controlled vocabularies, and ever-cryptic administrative data — to facilitate discovery. When poorly implemented, this discovery layer is seen by the library user as an impediment to their real goal. Read a book or article. Verify a fact. Learn a procedure. Compare & contrast one idea with another idea. Etc.

In just the past few years the library profession has learned that indexers (as opposed to databases) are the tools to facilitate find. This is true for two reasons. First, indexers reduce the need for users to know how the underlying data is structured. Second, indexers employ statistical analysis to rank their output by relevance. Databases are great for creating and maintaining content. Indexers are great for search. Both are needed in equal measures in order to implement the sort of information retrieval systems people have come to expect. For example, many of the profession’s current crop of “discovery” systems (VUFind, Blacklight, Summon, Primo, etc.) use an open source indexer called Lucene to drive search.

This being the case, we can more or less call the problem of find solved. True, software is never done, and things can always be improved, but improvements in the realm of search will only be incremental.

Instead of focusing on find, the profession needs to focus on the next steps in the process. After a person does a search and gets back a list of results, what do they want to do? First, they will want to peruse the items in the list. After identifying items of interest, they will want to acquire them. Once the selected items are in hand users may want to print, but at the very least they will want to read. During the course of this reading the user may be doing any number of things. Ranking. Reviewing. Annotating. Summarizing. Evaluating. Looking for a specific fact. Extracting the essence of the author’s message. Comparing & contrasting the text to other texts. Looking for sets of themes. Tracing ideas both inside and outside the texts. In other words, find and acquire are just a means to greater ends. Find and acquire are library goals, not the goals of users.

People want to perform actions against the content they acquire. They want to use the content. They want to do stuff with it. By expanding our definition of “information literacy” to include things beyond metadata and bibliography, and by combining it with the power of computers, librarianship can further “save the time of the reader” and thus remain relevant in the current information environment. Focusing on the use and evaluation of information represents a growth opportunity for librarianship.

It starts with counting

The availability of full text content in the form of plain text files, combined with the power of computing, empowers one to do statistical analysis against corpora. Put another way, computers are great at counting words, and once sets of words are counted there are many things one can do with the results (a small sketch of the counting step follows this list), such as but not limited to:

  • measuring length
  • measuring readability, “greatness”, or any other index
  • measuring frequency of unigrams, n-grams, parts-of-speech, etc.
  • charting & graphing analysis (word clouds, scatter plots, histograms, etc.)
  • analyzing measurements and looking for patterns
  • drawing conclusions and making hypotheses
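Here is a minimal sketch of the counting step in Perl. It is a toy, not one of the programs described below, and the stop word list is abbreviated for the sake of illustration:

  #!/usr/bin/perl
  # count the most frequently used words in a plain text file
  use strict;
  use warnings;

  # a tiny, illustrative stop word list; a real list would be much longer
  my %stopwords = map { $_ => 1 } qw( a an and but in is it of on or that the to );

  my %frequency;
  while ( my $line = <> ) {
      foreach my $word ( split /\W+/, lc $line ) {
          next if $word eq '' or exists $stopwords{$word};
          $frequency{$word}++;
      }
  }

  # print the top twenty-five words and their frequencies
  my @words = sort { $frequency{$b} <=> $frequency{$a} } keys %frequency;
  foreach my $word ( @words > 25 ? @words[ 0 .. 24 ] : @words ) {
      print "$word\t$frequency{$word}\n";
  }

Feed a program like this the complete works of an author, and lists like the ones in Table 2 fall right out.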

For example, suppose you did the perfect search and identified all of the works of Plato, Aristotle, and Shakespeare. Then, if you had the full text, you could compute a simple table such as Table 1.

Author       Works  Words      Average  Grade  Flesch
Plato        25     1,162,475  46,499   12-15  54
Aristotle    19     950,078    50,004   13-17  50
Shakespeare  36     856,594    23,794   7-10   72

The table lists who wrote how many works. It lists the number of words in each set of works and the average number of words per work. Finally, based on things like sentence length, it estimates grade and reading levels for the works. Given such information, a library “catalog” could help the patron answer questions such as:

  • Which author has the most works?
  • Which author has the shortest works?
  • Which author is the most verbose?
  • Is the author of most works also the author who is the most verbose?
  • In general, which set of works requires the higher grade level?
  • Does the estimated grade/reading level of each author’s works coincide with one’s expectations?
  • Are there any authors whose works are more or less similar in reading level?

Given the full text, a trivial program can then be written to count the number of words existing in a corpus as well as the number of times each word occurs, as shown in Table 2.

Plato Aristotle Shakespeare
will one thou
one will will
socrates must thy
may also shall
good things lord
said man thee
man may sir
say animals king
true thing good
shall two now
like time come
can can well
must another enter
another part love
men first let
now either hath
also like man
things good like
first case one
let nature upon
nature motion know
many since say
state others make
knowledge now may
two way yet

Table 2, sans a set of stop words, lists the most frequently used words in the complete works of Plato, Aristotle, and Shakespeare. The patron can then ask and answer questions like:

  • Are there words in one column that appear frequently in all columns?
  • Are there words that appear in only one column?
  • Are the rankings of the words similar between columns?
  • To what degree are the words in each column a part of larger groups such as: nouns, verbs, adjectives, etc.?
  • Are there many synonyms or antonyms shared inside or between the columns?

Notice how the words “one”, “good” and “man” appear in all three columns. Does that represent some sort of shared quality between the works?

If one word contains some meaning, then do two words contain twice as much meaning? Here is a list of the most common two-word phrases (bigrams) in each author’s corpus (Table 3).

Plato Aristotle Shakespeare
let us one another king henry
one another something else thou art
young socrates let us thou hast
just now takes place king richard
first place one thing mark antony
every one without qualification prince henry
like manner middle term let us
every man first figure king lear
quite true b belongs thou shalt
two kinds take place duke vincentio
human life essential nature dost thou
one thing every one sir toby
will make practical wisdom art thou
human nature will belong henry v
human mind general rule richard iii
quite right anything else toby belch
modern times one might scene ii
young men first principle act iv
can hardly good man iv scene
will never two things exeunt king
will tell two kinds don pedro
dare say first place mistress quickly
will say like manner act iii
false opinion one kind thou dost
one else scientific knowledge sir john

Notice how the names of people appear frequently in Shakespeare’s works, but very few names appear in the lists of Plato and Aristotle. Notice how the word “thou” appears a lot in Shakespeare’s works. Ask yourself the meaning of the word “thou”, and decide whether or not to update the stop word list. Notice how the common phrases of Plato and Aristotle are akin to ideas, not tangible things. Examples include: human nature, practical wisdom, first principle, false opinion, etc. Is there a pattern here?
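Counting bigrams is hardly more difficult than counting individual words. The lists above were generated with the Lingua::EN::Bigram module described later; a module-free sketch of the idea looks something like this:

  #!/usr/bin/perl
  # list the most common two-word phrases in a plain text file
  use strict;
  use warnings;

  # slurp the file and reduce it to a list of lower-case words
  my $text  = do { local $/; <> };
  my @words = grep { $_ ne '' } split /\W+/, lc $text;

  # count each adjacent pair of words
  my %bigrams;
  $bigrams{"$words[$_] $words[$_ + 1]"}++ for 0 .. $#words - 1;

  # print the top twenty-five phrases and their frequencies
  my @phrases = sort { $bigrams{$b} <=> $bigrams{$a} } keys %bigrams;
  foreach my $phrase ( @phrases > 25 ? @phrases[ 0 .. 24 ] : @phrases ) {
      print "$phrase\t$bigrams{$phrase}\n";
  }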

If “a picture is worth a thousand words”, then there are about six thousand words represented by Figures 1 through 6.

Figure 1 – Words used by Plato
Figure 2 – Phrases used by Plato
Figure 3 – Words used by Aristotle
Figure 4 – Phrases used by Aristotle
Figure 5 – Words used by Shakespeare
Figure 6 – Phrases used by Shakespeare

Word clouds — “tag clouds” — are an increasingly popular way to illustrate the frequency of words or phrases in a corpus. Because a few of the phrases in a couple of the corpora were considered outliers, phrases such as “let us”, “one another”, and “something else” are not depicted.

Even without the use of statistics, it appears the use of the phrase “good man” by each author might be fruitfully compared & contrasted. A concordance is an excellent tool for such a purpose, and below are a few of the meatier uses of “good man” by each author.

List 1 – “good man” as used by Plato
  ngth or mere cleverness. To the good man, education is of all things the most pr
   Nothing evil can happen to the good man either in life or death, and his own de
  but one reply: 'The rule of one good man is better than the rule of all the rest
   SOCRATES: A just and pious and good man is the friend of the gods; is he not? P
  ry wise man who happens to be a good man is more than human (daimonion) both in 
List 2 – “good man” as used by Aristotle
  ons that shame is felt, and the good man will never voluntarily do bad actions. 
  reatest of goods. Therefore the good man should be a lover of self (for he will 
  hat is best for itself, and the good man obeys his reason. It is true of the goo
  theme If, as I said before, the good man has a right to rule because he is bette
  d prove that in some states the good man and the good citizen are the same, and 
List 3 – “good man” as used by Shakespeare
  r to that. SHYLOCK Antonio is a good man. BASSANIO Have you heard any imputation
  p out, the rest I'll whistle. A good man's fortune may grow out at heels: Give y
  t it, Thou canst not hit it, my good man. BOYET An I cannot, cannot, cannot, An 
  hy, look where he comes; and my good man too: he's as far from jealousy as I am 
   mean, that married her, alack, good man! And therefore banish'd -- is a creatur

What sorts of judgements might the patron be able to make based on the snippets listed above? Are Plato, Aristotle, and Shakespeare all defining the meaning of a “good man”? If so, then what are some of the definitions? Are there qualitative similarities and/or differences between the definitions?
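Generating keyword-in-context lines like those above takes surprisingly little code. The lists here were made with the Lingua::Concordance module described later; a minimal module-free sketch, with an arbitrary 40-character radius, might read:

  #!/usr/bin/perl
  # print keyword-in-context snippets for a query against a text file
  # usage: ./concordance-sketch.pl <file> <query>
  use strict;
  use warnings;

  my ( $file, $query ) = @ARGV;
  my $radius = 40;    # characters of context on either side of the query

  my $text = do { local $/; open my $fh, '<', $file or die "$file: $!"; <$fh> };
  $text =~ s/\s+/ /g; # normalize whitespace so each snippet is one line

  while ( $text =~ /\Q$query\E/gi ) {
      my $start = pos( $text ) - length( $query ) - $radius;
      $start = 0 if $start < 0;
      print substr( $text, $start, length( $query ) + 2 * $radius ), "\n";
  }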

Sometimes asking a question as blunt as “What is a man?” can be useful. Lists 4 through 6 try to answer it.

List 4 – “man is” as used by Plato
  stice, he is met by the fact that man is a social being, and he tries to harmoni
  ption of Not-being to difference. Man is a rational animal, and is not -- as man
  ss them. Or, as others have said: Man is man because he has the gift of speech; 
  wise man who happens to be a good man is more than human (daimonion) both in lif
  ied with the Protagorean saying, 'Man is the measure of all things;' and of this
List 5 – “man is” as used by Aristotle
  ronounced by the judgement 'every man is unjust', the same must needs hold good 
  ts are formed from a residue that man is the most naked in body of all animals a
  ated piece at draughts. Now, that man is more of a political animal than bees or
  hese vices later. The magnificent man is like an artist; for he can see what is 
  lement in the essential nature of man is knowledge; the apprehension of animal a
List 6 – “man is” as used by Shakespeare
   what I have said against it; for man is a giddy thing, and this is my conclusio
   of man to say what dream it was: man is but an ass, if he go about to expound t
  e a raven for a dove? The will of man is by his reason sway'd; And reason says y
  n you: let me ask you a question. Man is enemy to virginity; how may we barricad
  er, let us dine and never fret: A man is master of his liberty: Time is their ma

In the 1950s Mortimer Adler and a set of colleagues created a set of works they called The Great Books of the Western World. This 60-volume set included all the works of Plato, Aristotle, and Shakespeare as well as some of the works of Augustine, Aquinas, Milton, Kepler, Galileo, Newton, Melville, Kant, James, and Freud. Prior to the set’s creation, Adler and colleagues enumerated 102 “great ideas” including concepts such as: angel, art, beauty, honor, justice, science, truth, wisdom, war, etc. Each book in the series was selected for inclusion by the committee because of the way it elaborated on the meaning of the “great ideas”.

Given the full text of each of the Great Books as well as a set of keywords (the “great ideas”), it is relatively simple to calculate a relevancy ranking score for each item in a corpus. Love is one of the “great ideas”, and it just so happens it is used more significantly by Shakespeare than by the other authors in the set. If Shakespeare has the highest “love quotient”, then what does Shakespeare have to say about love? List 7 is a brute force answer to such a question.

List 7 – “love is” as used by Shakespeare
  y attempted? Love is a familiar; Love is a devil: there is no evil angel but Lov
  er. VALENTINE Why? SPEED Because Love is blind. O, that you had mine eyes; or yo
   that. DUKE This very night; for Love is like a child, That longs for every thin
  n can express how much. ROSALIND Love is merely a madness, and, I tell you, dese
  of true minds Admit impediments. Love is not love Which alters when it alteratio

Do these definitions coincide with expectations? Maybe further reading is necessary.

Digital humanities, library science, and “catalogs”

The previous section is just about the most gentle introduction to digital humanities computing possible, but it can also be an introduction to a new breed of library science and library catalogs.

It began by assuming the existence of full text content in plain text form — an increasingly reasonable assumption. After denoting a subset of content, it compared & contrasted the sizes and reading levels of the content. By counting individual words and phrases, patterns were discovered in the texts and a particular idea was loosely followed — specifically, the definition of a good man. Finally, the works of a particular author were compared to the works of a larger whole to learn how the author defined a particular “great idea”.

The fundamental tools used in this analysis were a set of rudimentary Perl modules: Lingua::EN::Fathom for calculating the total number of words in a document as well as a document’s reading level, Lingua::EN::Bigram for listing the most frequently occurring words and phrases, and Lingua::Concordance for listing sentence snippets. The Perl programs built on top of these modules are relatively short and include: fathom.pl, words.pl, bigrams.pl and concordance.pl. (If you really want to, you can also download the full text versions of Plato’s, Aristotle’s, and Shakespeare’s works used in this analysis.) While the programs themselves are really toys, the potential they represent is not. It would not be too difficult to integrate their functionality into a library “catalog”. Assume the existence of a significant amount of full text content in a library collection. Do a search against the collection. Create a subset of content. Click a few buttons to implement statistical analysis against the result. Enable the user to “browse” the content and follow a line of thought.

The process outlined in the previous section is not intended to replace rigorous reading, but rather to supplement it. It enables a person to identify trends quickly and easily. It enables a person to read at “Web scale”. Again, find is not the problem to be solved. People can find more information than they require. Instead, people need to use and analyze the content they find. This content can be anything from novels to textbooks, scholarly journal articles to blog postings, data sets to collections of images, etc. The process outlined above is an example of services against texts, a way to “Save the time of the reader” and empower them to make better and more informed decisions. The fundamental processes of librarianship (collection, preservation, organization, and dissemination) need to be expanded to fit the current digital environment. The services described above are examples of how processes can be expanded.

The next “next generation library catalog” is not about find; instead, it is about use. Integrating digital humanities computing techniques into library collections and services is just one example of how this can be done.

Measuring the Great Books

Tuesday, June 15th, 2010

This posting describes how I am assigning quantitative characteristics to texts in an effort to answer the question, “How ‘great’ are the Great Books?” In the end I make a plea for library science.

Background

With the advent of copious amounts of freely available plain text on the ‘Net comes the ability to “read” entire corpora with a computer and apply statistical processes against the result. In an effort to explore the feasibility of this idea, I am spending time answering the question, “How ‘great’ are the Great Books?”

More specifically, I want to assign quantitative characteristics to each of the “books” in the Great Books set, look for patterns in the result, and see whether or not I can draw any conclusions about the corpus. If such processes are proven effective, then the same processes may be applicable to other corpora such as collections of scholarly journal articles, blog postings, mailing list archives, etc. If I get this far, then I hope to integrate these processes into traditional library collections and services in an effort to support their continued relevancy.

On my mark. Get set. Go.

Assigning quantitative characteristics to texts

The Great Books set posits 102 “great ideas” — basic, foundational themes running through the heart of Western civilization. Each of the books in the set was selected for inclusion by the way it expressed the essence of these great ideas. The ideas are grand and ambiguous. They include words such as angel, art, beauty, courage, desire, eternity, god, government, honor, idea, physics, religion, science, space, time, wisdom, etc. (See Appendix B of “How ‘great’ are the Great Books?” for the complete list.)

In a previous posting, “Great Ideas Coefficient”, I outlined the measure I propose to use to determine the books’ “greatness” — essentially a sum of all TFIDF (term frequency / inverse document frequency) scores as calculated against the list of great ideas. TFIDF is defined as:

( c / t ) * log( d / f )

where:

  • c = number of times a given word appears in a document
  • t = total number of words in a document
  • d = total number of documents in a corpus
  • f = total number of documents containing a given word

Thus, the problem boils down to 1) determining the values for c, t, d, and f for a given great idea, 2) summing the resulting TFIDF scores, 3) saving the results, and 4) repeating the process for each book in the corpus. Here, more exactly, is how I am initially doing such a thing:

  1. Build corpus – In a previous posting, “Collecting the Great Books”, I described how I first collected 223 of the roughly 250 Great Books.
  2. Index corpus – The processes used to calculate the TFIDF values c and t are trivial because any number of computer programs do such things quickly and readily. In our case, the value of d is a constant — 223. On the other hand, trivial methods for determining the number of documents containing a given word (f) are not scalable as the size of a corpus increases. Because an index is essentially a list of words combined with pointers to where the words can be found, an index proves to be a useful tool for determining the value of f. Index a corpus. Search the index for a word. Get back the number of hits and use it as the value for f. Lucene is currently the gold standard when it comes to open source indexers. Solr — an enhanced and Web Services-based interface to Lucene — is the indexer used in this process. The structure of the local index is rudimentary: id, author, title, URL, and full text. Each of the metadata values is pulled out of a previously created metadata file — great-books.xml — while the full text is read from the file system. The whole lot is then stuffed into Solr. A program called index.pl does this work. Another program called search.pl was created simply for testing the validity of the index.
  3. Count words and determine readability – A Perl module called Lingua::EN::Fathom does a nice job of counting the number of words in a file, thus providing me with a value for t. Along the way it also calculates a number of “readability” scores — values used to estimate the education level needed to understand a given text. While I had “opened the patient” I figured it would be a good idea to take note of this information. (A sketch of this step, in miniature, appears after this list.) Given the length of a book as well as its readability scores, I can answer questions such as, “Are longer books more difficult to read?” Later on, given my Great Ideas Coefficient, I will be able to answer questions such as “Is the length of a book a determining factor in ‘greatness’?” or “Are ‘great’ books more difficult to read?”
  4. Calculate TFIDF – This is the fuzziest and most difficult part of the measurement process. Using Lingua::EN::Fathom again I find all of the unique words in a document, stem them with Lingua::Stem::Snowball, and calculate the number of times each stem occurs. This gives me a value for c. I then loop through each great idea, stem it, and search the index for the stem, thus returning a value for f. For each idea I now have values for c, t, d, and f, enabling me to calculate TFIDF — ( c / t ) * log( d / f ).
  5. Calculate the Great Ideas Coefficient – This is trivial. Keep a running sum of all the great idea TFIDF scores.
  6. Go to Step #4 – Repeat this process for each of the 102 great ideas.
  7. Save – After all the various scores (number of words, readability scores, TFIDF scores, and Great Ideas Coefficient) have been calculated, I save each to my pseudo database file called great-books.xml. Each is stored as an attribute associated with a book’s unique identifier. Later I will use the contents of this file as the basis of my statistical analysis.
  8. Go to Step #3 – Repeat this process for each book in the corpus — in this case, 223 times.
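In miniature, and assuming Lingua::EN::Fathom’s documented interface, Step #3 looks something like the following; the file name is hypothetical:

  #!/usr/bin/perl
  # count words and estimate readability for a single text
  use strict;
  use warnings;
  use Lingua::EN::Fathom;

  my $fathom = Lingua::EN::Fathom->new();
  $fathom->analyse_file( 'corpus/aristotle-ethics.txt' );

  printf "words:   %d\n",   $fathom->num_words;    # a value for t
  printf "Fog:     %.1f\n", $fathom->fog;          # a rough grade level
  printf "Flesch:  %.1f\n", $fathom->flesch;       # higher scores read easier
  printf "Kincaid: %.1f\n", $fathom->kincaid;      # another grade-level estimate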

Of course I didn’t do all of this by hand, and the program I wrote to do the work is called measure.pl.
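The heart of measure.pl might look something like the sketch below. To stay self-contained it sidesteps Solr and stemming, computing document frequencies with a simple hash instead; the list of ideas is abbreviated and the directory layout is hypothetical:

  #!/usr/bin/perl
  # compute a Great Ideas Coefficient for each book in a corpus, where
  # TFIDF = ( c / t ) * log( d / f ); Perl's log() is the natural logarithm
  use strict;
  use warnings;

  my @ideas = qw( angel art beauty being love truth wisdom );    # abbreviated
  my @files = glob 'corpus/*.txt';
  my $d     = scalar @files;    # total number of documents in the corpus

  # first pass: word counts per document (c, t) and document frequencies (f)
  my ( %count, %total, %df );
  foreach my $file ( @files ) {
      my $text = do { local $/; open my $fh, '<', $file or die "$file: $!"; <$fh> };
      foreach my $word ( grep { $_ ne '' } split /\W+/, lc $text ) {
          $count{$file}{$word}++;
          $total{$file}++;
      }
      $df{$_}++ foreach keys %{ $count{$file} };
  }

  # second pass: sum the TFIDF scores of the great ideas for each book
  foreach my $file ( @files ) {
      next unless $total{$file};
      my $coefficient = 0;
      foreach my $idea ( @ideas ) {
          my $c = $count{$file}{$idea} || 0;
          my $f = $df{$idea}           || 0;
          $coefficient += ( $c / $total{$file} ) * log( $d / $f ) if $c and $f;
      }
      print "$file\t$coefficient\n";
  }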

The result is my pseudo database file — great-books.xml. This is my data set. It keeps track of all of my information in a human-readable, application- and operating system-independent manner. Very nice. If there is only one file you download from this blog posting, then it should be this file. Using it you will be able to create your own corpus and do your own analysis.

The process outlined above is far from perfect. First, there are a few false negatives. For example, the great idea “universe” returned a TFIDF value of zero (0) for every document. Obviously this is incorrect, and I think the error has something to do with the stemming and/or indexing subprocesses. Second, the word “being”, as calculated by TFIDF, is by far and away the “greatest” idea. I believe this is true because the word “being” is… being counted as both a noun and a verb. This points to a different problem — the ambiguity of the English language. While all of these issues will admittedly skew the final results, I do not think they negate the possibility of meaningful statistical investigation. At the same time, it will be necessary to refine the measurement process to reduce the number of “errors”.

Measurement, the humanities, and library science

Measurement is one of the fundamental qualities of science. The work of Archimedes is the prototypical example. Kepler and Galileo took the process to another level. Newton brought it to full flower. Since Newton, the use of measurement — the assignment of mathematical values — applied against observations of the natural world and human interactions has given rise to the physical and social sciences. Unlike studies in the humanities, science is repeatable and independently verifiable. It is objective. Such is not a value judgment, merely a statement of fact. While the sciences seem cold, hard, and dry, the humanities are subjective, appeal to our spirit, give us a sense of purpose, and tend to synthesize our experiences into a meaningful whole. Both scientific and humanistic thinking processes are necessary for us to make sense of the world around us. I call these combined processes “arscience”.

The library profession could benefit from the greater application of measurement. In my opinion, too many of the profession’s day-to-day as well as strategic decisions are based on anecdotal evidence and gut feelings. Instead of basing our actions on data, actions are based on tradition. “This is the way we have always done it.” This is medieval, and consequently, change comes very slowly. I sincerely believe libraries are not going away any time soon, but I do think the profession will remain relevant longer if librarians were to do two things: 1) truly exploit the use of computers, and 2) base a greater number of their decisions on data — measurement — as opposed to opinion. Let’s call this library science.

Collecting the Great Books

Sunday, June 13th, 2010

In an effort to answer the question, “How ‘great’ are the Great Books?”, I need to mirror the full texts of the Great Books. This posting describes the initial process I am using to do such a thing, but the important thing to note is that this process is more about librarianship than it is about software.

Background

The Great Books is/was a 60-volume set of content intended to further a person’s liberal arts education. About 250 “books” in all, it consists of works by Homer, Aristotle, Augustine, Chaucer, Cervantes, Locke, Gibbon, Goethe, Marx, James, Freud, etc. There are a few places on the ‘Net where the complete list of authors/titles can be read. One such place is a previous blog posting of mine. My goal is to use digital humanities computing techniques to statistically describe the works and use these descriptions to supplement a person’s understanding of the texts. I then hope to apply these same techniques to other corpora. To accomplish this goal I first need to acquire full text versions of the Great Books. This posting describes how I am initially going about it.

Mirroring and caching the Great Books

All of the books of the Great Books were written by “old dead white men”. It is safe to assume the texts have been translated into a myriad of languages, including English, and it is safe to assume the majority exist in the public domain. Moreover, with the advent of the Web and various digitizing projects, it is safe to assume quality information gets copied forward and will be available for downloading. All of this has proven to be true. Through the use of Google and a relatively small number of repositories (Project Gutenberg, Alex Catalogue of Electronic Texts, Internet Classics Archive, Christian Classics Ethereal Library, Internet Archive, etc.), I have been able to locate and mirror 223 of the roughly 250 Great Books. Here’s how:

  1. Bookmark texts – Trawl the Web for the Great Books and use Delicious to bookmark links to plain text versions translated into English. Firefox combined with the Delicious extension has proven to be very helpful in this regard. My bookmarks should be located at http://delicious.com/ericmorgan/gb.
  2. Save and edit bookmarks file – Delicious gives you the option to save your bookmarks file locally. The result is a bogus HTML file intended to be imported into Web browsers. It contains the metadata used to describe your bookmarks such as title, notes, and URLs. After exporting my bookmarks to the local file system, I contorted the bogus HTML into rudimentary XML so I could systematically read it for subsequent processing.
  3. Extract URLs – Using a 7-line program called bookmarks2urls.pl, I loop through the edited bookmarks file and output all the URLs.
  4. Mirror content – Because I want/need to retain a pristine version of the original texts, I feed the URLs to wget and copy the texts to a local directory. This use of wget is combined with the output of Step #3 through a brain-dead shell script called mirror.sh. (A sketch combining these two steps appears after this list.)
  5. Create corpus – The mirrored files are poorly named; using just the mirror it is difficult to know what “great book” hides inside files named annals.mb.txt, pg2600.txt, or whatever. Moreover, no metadata is associated with the collection. Consequently I wrote a program — build-corpus.pl — that loops through my edited bookmarks file, extracts the necessary metadata (author, title, and URL), downloads the remote texts, saves them locally with a human-readable filename, creates a rudimentary XHTML page listing each title, and creates an XML file containing all of the metadata generated to date.
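For the sake of illustration, here is a sketch in the spirit of bookmarks2urls.pl and mirror.sh rolled into one. It assumes the edited bookmarks file still carries its URLs in HREF attributes; the file and directory names are hypothetical:

  #!/usr/bin/perl
  # extract URLs from the edited bookmarks file and mirror each one with wget
  use strict;
  use warnings;

  my $bookmarks = do {
      local $/;
      open my $fh, '<', 'bookmarks.xml' or die "bookmarks.xml: $!";
      <$fh>;
  };

  # pull out the value of every HREF attribute
  my @urls = $bookmarks =~ /HREF="([^"]+)"/gi;

  # copy each remote text to a local directory, pausing politely between gets
  foreach my $url ( @urls ) {
      system 'wget', '--no-clobber', '--directory-prefix=mirror', $url;
      sleep 1;
  }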

The results of this 5-step process include the mirrored originals, a set of plain text files with human-readable names, a rudimentary XHTML page listing each title, and an XML file containing all of the metadata generated to date.

The most important file, by far, is the metadata file. It is intended to be a sort of application- and operating system-independent database. Given this file, anybody ought to be able to duplicate the analysis I propose to do later. If there is only one file you download from this blog posting, it should be the metadata file — great-books.xml.

The collection process is not perfect. I was unable to find many of the works of Archimedes, Copernicus, Kepler, Newton, Galileo, or Freud. For all but Freud, I attribute this to the lack of translations, but I suppose I could stoop to the use of poorly OCR’ed texts from Google Books. I attribute the unavailability of Freud to copyright issues. There’s no getting around that one. A few times I located HTML versions of desired texts, but HTML will ultimately skew my analysis. Consequently I used a terminal-based program called lynx to convert and locally save the remote HTML as a plain text file. I then included that file in my corpus. Alas, there are always ways to refine collections. Like software, they are never done.
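For what it’s worth, the lynx incantation is a one-liner; the URL below is, of course, hypothetical:

  lynx -dump -nolist http://example.org/some-great-book.html > some-great-book.txt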

Summary — Collection development, acquisitions, and cataloging

The process outlined above is really about librarianship and not software. Specifically, it is about collection development, acquisitions, and cataloging. I first needed to articulate a development policy. While I did not explicitly describe the policy, I did outline why I wanted to create the collection as well as a few of each item’s necessary qualities. The process above implemented a way to actually get the content — acquisitions. Finally, I described — “cataloged” — my content, albeit in a very rudimentary form.

It is an understatement to say the Internet has changed the way data, information, and knowledge are collected, preserved, organized, and disseminated. By extension, librarianship needs to change in order to remain relevant with the times. Our profession spends much of its time trying to refine old processes. It is like trying to figure out how to improve the workings of a radio when people have moved on to the use of televisions instead. While traditional library processes are still important, they are not as important as they used to be.

The processes outlined above illustrate one possible way librarianship can change the “hows” of its work while retaining its “whats”.

Inaugural Code4Lib “Midwest” Regional Meeting

Saturday, June 12th, 2010

I believe the Inaugural Code4Lib “Midwest” Regional Meeting (June 11 & 12, 2010 at the University of Notre Dame) was a qualified success.

About twenty-six people attended. (At least that was the number of people who went to lunch.) They came from Michigan, Ohio, Iowa, Indiana, and Illinois. Julia Bauder won the prize for traveling the greatest distance — from Grinnell, Iowa.


Day #1

We began with Lightning Talks:

  • ePub files by Michael Kreyche
  • FRBR and MARC data by Kelley McGrath
  • Great Books by myself
  • jQuery and the OPAC by Ken Irwin
  • Notre Dame and the Big Ten by Michael Witt
  • Solr & Drupal by Rob Casson
  • Subject headings via a Web Service by Michael Kreyche
  • Taverna by Rick Johnson and Banu Lakshminarayanan
  • VUFind on a hard disk by Julia Bauder

We dined in the University’s South Dining Hall, and toured a bit of the campus on the way back taking in the “giant marble”, the Architecture Library, and the Dome.

In the afternoon we broke up into smaller groups and discussed things including institutional repositories, mobile devices & interfaces, ePub files, and FRBR. In the evening we enjoyed varieties of North Carolina barbecue, and then retreated to the campus bar (Legend’s) for a few beers.

I’m sorry to say the Code4Lib Challenge was not successful. We hackers were either too engrossed to notice whether or not anybody came to the event, or nobody showed up to challenge us. Maybe next time.


Day #2

There were fewer participants on Day #2. We spent the time listening to Ken elaborate on the uses and benefits of jQuery. I hacked at something I’m calling “The Great Books Survey”.

The event was successful in that it provided plenty of opportunity to discuss shared problems and solutions. Personally, I learned I need to explore statistical correlations, regressions, multivariate analysis, and principal component analysis to a greater degree.

A good time was had by all, and it is quite possible the next “Midwest” Regional Meeting will be hosted by the good folks in Chicago.

For more detail about Code4Lib “Midwest”, see the wiki: http://wiki.code4lib.org/index.php/Midwest.

How “great” are the Great Books?

Thursday, June 10th, 2010

In 1952 a set of books called the Great Books of the Western World was published. It was supposed to represent the best of Western literature and enable the reader to further their liberal arts education. Sixty volumes in all, it included works by Plato, Aristotle, Shakespeare, Milton, Galileo, Kepler, Melville, Darwin, etc. (See Appendix A.) These great books were selected based on the way they discussed a set of 102 “great ideas” such as art, astronomy, beauty, evil, evolution, mind, nature, poetry, revolution, science, will, wisdom, etc. (See Appendix B.) How “great” are these books, and how “great” are the ideas expressed in them?

Given full text versions of these books it would be almost trivial to use the “great ideas” as input and apply relevancy ranking algorithms against the texts thus creating a sort of score — a “Great Ideas Coefficient”. Term Frequency/Inverse Document Frequency is a well-established algorithm for computing just this sort of thing:

relevancy = ( c / t ) * log( d / f )

where:

  • c = number of times a given word appears in a document
  • t = total number of words in a document
  • d = total number of documents in a corpus
  • f = total number of documents containing a given word
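To make the arithmetic concrete, suppose (hypothetically) the word “love” appears 25 times (c) in a 10,000-word book (t), the corpus contains 223 documents (d), and 100 of them use the word (f). Then, using base-10 logarithms, relevancy = ( 25 / 10000 ) * log( 223 / 100 ) ≈ 0.0025 * 0.35 ≈ 0.0009. Summing such scores across all 102 great ideas yields the book’s coefficient.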

Thus, to calculate our Great Ideas Coefficient we would sum the relevancy score of each “great idea” for each “great book”. Plato’s Republic might have a cumulative score of 525 while Aristotle’s On The History Of Animals might have a cumulative score of 251. Books with a larger Coefficient could be considered greater. Given such a score a person could measure a book’s “greatness”. We could then compare the score to the scores of other books. Which book is the “greatest”? We could compare the score to other measurable things such as a book’s length or date to see if there were correlations. Are “great books” longer or shorter than others? Do longer books contain more “great ideas”? Are there other books that were not included in the set that maybe should have been included? Instead of summing each relevancy score, maybe the “great ideas” can be grouped into gross categories such as humanities or sciences, and we can sum those scores instead. Thus we may be able to say one set of books is “great” when it comes to expressing the human condition while others are better at describing the natural world. We could ask ourselves which books represent the best mixture of art and science because their humanities scores are almost equal to their sciences scores. Expanding the scope beyond general education, we could create an alternative set of “great ideas”, say for biology or mathematics or literature, and apply the same techniques to other content such as full text scholarly journal literatures.

The initial goal of this study is to examine the “greatness” of the Great Books, but the ultimate goal is to learn whether or not this quantitative process can be applied to other bodies of literature and ultimately assist the student/scholar in their studies/research.

Wish me luck.

Appendix A – Authors and titles in the Great Books series

  • Aeschylus – Prometheus Bound; Seven Against Thebes; The Oresteia; The Persians; The Suppliant Maidens
  • American State Papers – Articles of Confederation; Declaration of Independence; The Constitution of the United States of America
  • Apollonius – On Conic Sections
  • Aquinas – Summa Theologica
  • Archimedes – Book of Lemmas; Measurement of a Circle; On Conoids and Spheroids; On Floating Bodies; On Spirals; On the Equilibrium of Planes; On the Sphere and Cylinder; The Method Treating of Mechanical Problems; The Quadrature of the Parabola; The Sand-Reckoner
  • Aristophanes – Ecclesiazousae; Lysistrata; Peace; Plutus; The Acharnians; The Birds; The Clouds; The Frogs; The Knights; The Wasps; Thesmophoriazusae
  • Aristotle – Categories; History of Animals; Metaphysics; Meteorology; Minor biological works; Nicomachean Ethics; On Generation and Corruption; On Interpretation; On Sophistical Refutations; On the Gait of Animals; On the Generation of Animals; On the Motion of Animals; On the Parts of Animals; On the Soul; Physics; Poetics; Politics; Posterior Analytics; Prior Analytics; Rhetoric; The Athenian Constitution; Topics
  • Augustine – On Christian Doctrine; The City of God; The Confessions
  • Aurelius – The Meditations
  • Bacon – Advancement of Learning; New Atlantis; Novum Organum
  • Berkeley – The Principles of Human Knowledge
  • Boswell – The Life of Samuel Johnson, LL.D.
  • Cervantes – The History of Don Quixote de la Mancha
  • Chaucer – Troilus and Criseyde; The Canterbury Tales
  • Copernicus – On the Revolutions of Heavenly Spheres
  • Dante – The Divine Comedy
  • Darwin – The Descent of Man and Selection in Relation to Sex; The Origin of Species by Means of Natural Selection
  • Descartes – Discourse on the Method; Meditations on First Philosophy; Objections Against the Meditations and Replies; Rules for the Direction of the Mind; The Geometry
  • Dostoevsky – The Brothers Karamazov
  • Epictetus – The Discourses
  • Euclid – The Thirteen Books of Euclid’s Elements
  • Euripides – Alcestis; Andromache; Bacchantes; Cyclops; Electra; Hecuba; Helen; Heracleidae; Heracles Mad; Hippolytus; Ion; Iphigeneia at Aulis; Iphigeneia in Tauris; Medea; Orestes; Phoenician Women; Rhesus; The Suppliants; Trojan Women
  • Faraday – Experimental Researches in Electricity
  • Fielding – The History of Tom Jones, a Foundling
  • Fourier – Analytical Theory of Heat
  • Freud – A General Introduction to Psycho-Analysis; Beyond the Pleasure Principle; Civilization and Its Discontents; Group Psychology and the Analysis of the Ego; Inhibitions, Symptoms, and Anxiety; Instincts and Their Vicissitudes; New Introductory Lectures on Psycho-Analysis; Observations on “Wild” Psycho-Analysis; On Narcissism; Repression; Selected Papers on Hysteria; The Ego and the Id; The Future Prospects of Psycho-Analytic Therapy; The Interpretation of Dreams; The Origin and Development of Psycho-Analysis; The Sexual Enlightenment of Children; The Unconscious; Thoughts for the Times on War and Death
  • Galen – On the Natural Faculties
  • Galileo – Dialogues Concerning the Two New Sciences
  • Gibbon – The Decline and Fall of the Roman Empire
  • Gilbert – On the Loadstone and Magnetic Bodies
  • Goethe – Faust
  • Hamilton – The Federalist
  • Harvey – On the Circulation of Blood; On the Generation of Animals; On the Motion of the Heart and Blood in Animals
  • Hegel – The Philosophy of History; The Philosophy of Right
  • Herodotus – The History
  • Hippocrates – Works
  • Hobbes – Leviathan
  • Homer – The Iliad; The Odyssey
  • Hume – An Enquiry Concerning Human Understanding
  • James – The Principles of Psychology
  • Kant – Excerpts from The Metaphysics of Morals; Fundamental Principles of the Metaphysic of Morals; General Introduction to the Metaphysic of Morals; Preface and Introduction to the Metaphysical Elements of Ethics with a note on Conscience; The Critique of Judgement; The Critique of Practical Reason; The Critique of Pure Reason; The Science of Right
  • Kepler – Epitome of Copernican Astronomy; The Harmonies of the World
  • Lavoisier – Elements of Chemistry
  • Locke – A Letter Concerning Toleration; An Essay Concerning Human Understanding; Concerning Civil Government, Second Essay
  • Lucretius – On the Nature of Things
  • Machiavelli – The Prince
  • Marx – Capital
  • Marx and Engels – Manifesto of the Communist Party
  • Melville – Moby Dick; or, The Whale
  • Mill – Considerations on Representative Government; On Liberty; Utilitarianism
  • Milton – Areopagitica; English Minor Poems; Paradise Lost; Samson Agonistes
  • Montaigne – Essays
  • Montesquieu – The Spirit of the Laws
  • Newton – Mathematical Principles of Natural Philosophy; Optics
  • Christian Huygens – Treatise on Light
  • Nicomachus – Introduction to Arithmetic
  • Pascal – Pensées; Scientific and mathematical essays; The Provincial Letters
  • Plato – Apology; Charmides; Cratylus; Critias; Crito; Euthydemus; Euthyphro; Gorgias; Ion; Laches; Laws; Lysis; Meno; Parmenides; Phaedo; Phaedrus; Philebus; Protagoras; Sophist; Statesman; Symposium; The Republic; The Seventh Letter; Theaetetus; Timaeus
  • Plotinus – The Six Enneads
  • Plutarch – The Lives of the Noble Grecians and Romans
  • Ptolemy – The Almagest
  • Rabelais – Gargantua and Pantagruel
  • Rousseau – A Discourse on Political Economy; A Discourse on the Origin of Inequality; The Social Contract
  • Shakespeare – A Midsummer-Night’s Dream; All’s Well That Ends Well; Antony and Cleopatra; As You Like It; Coriolanus; Cymbeline; Julius Caesar; King Lear; Love’s Labour’s Lost; Macbeth; Measure For Measure; Much Ado About Nothing; Othello, the Moor of Venice; Pericles, Prince of Tyre; Romeo and Juliet; Sonnets; The Comedy of Errors; The Famous History of the Life of King Henry the Eighth; The First Part of King Henry the Fourth; The First Part of King Henry the Sixth; The Life and Death of King John; The Life of King Henry the Fifth; The Merchant of Venice; The Merry Wives of Windsor; The Second Part of King Henry the Fourth; The Second Part of King Henry the Sixth; The Taming of the Shrew; The Tempest; The Third Part of King Henry the Sixth; The Tragedy of Hamlet, Prince of Denmark; The Tragedy of King Richard the Second; The Tragedy of Richard the Third; The Two Gentlemen of Verona; The Winter’s Tale; Timon of Athens; Titus Andronicus; Troilus and Cressida; Twelfth Night; or, What You Will
  • Smith – An Inquiry into the Nature and Causes of the Wealth of Nations
  • Sophocles – Ajax; Electra; Philoctetes; The Oedipus Cycle; The Trachiniae
  • Spinoza – Ethics
  • Sterne – The Life and Opinions of Tristram Shandy, Gentleman
  • Swift – Gulliver’s Travels
  • Tacitus – The Annals; The Histories
  • Thucydides – The History of the Peloponnesian War
  • Tolstoy – War and Peace
  • Virgil – The Aeneid; The Eclogues; The Georgics

Appendix B – The “great” ideas

angel • animal • aristocracy • art • astronomy • beauty • being • cause • chance • change • citizen • constitution • courage • custom & convention • definition • democracy • desire • dialectic • duty • education • element • emotion • eternity • evolution • experience • family • fate • form • god • good & evil • government • habit • happiness • history • honor • hypothesis • idea • immortality • induction • infinity • judgment • justice • knowledge • labor • language • law • liberty • life & death • logic • love • man • mathematics • matter • mechanics • medicine • memory & imagination • metaphysics • mind • monarchy • nature • necessity & contingency • oligarchy • one & many • opinion • opposition • philosophy • physics • pleasure & pain • poetry • principle • progress • prophecy • prudence • punishment • quality • quantity • reasoning • relation • religion • revolution • rhetoric • same & other • science • sense • sign & symbol • sin • slavery • soul • space • state • temperance • theology • time • truth • tyranny • universal & particular • virtue & vice • war & peace • wealth • will • wisdom • world

Not really reading

Wednesday, June 9th, 2010

Using a number of rudimentary digital humanities computing techniques, I tried to practice what I preach and extract the essence from a set of journal articles. I feel like the process met with some success, but I was not really reading.

The problem

A set of twenty-one (21) essays on the future of academic librarianship was recently brought to my attention:

Leaders Look Toward the Future – This site compiled by Camila A. Alire and G. Edward Evans offers 21 essays on the future of academic librarianship written by individuals who represent a cross-section of the field from the largest institutions to specialized libraries.

Since I was too lazy to print and read all of the articles mentioned above, I used this as an opportunity to test out some of my “services against text” ideas.

The solution

Specifically, I used a few rudimentary digital humanities computing techniques to glean highlights from the corpus. Here’s how:

  1. First I converted all of the PDF files to plain text files using a program called pdftotext — a part of xpdf. I then concatenated the whole lot together, thus creating my corpus. (A small sketch of this step appears after this list.) The rest of the process is left up to you — the reader — as an exercise because I don’t have copyright chutzpah.
  2. Next, I used Wordle to create a word cloud. Not a whole lot of new news here, but look how big the word “information” is compared to the word “collections”.

  3. Using a program of my own design, I then created a textual version of the word cloud listing the top fifty most frequently used words and the number of times they appeared in the corpus. Again, not a whole lot of new news. The articles are obviously about academic libraries, but notice how the word “electronic” is listed and not the word “book”.
  4. Things got interesting when I created a list of the most significant two-word phrases (bigrams). Most of them are nouns, but I was struck by “will continue” and “libraries will”, so I applied a concordance application to these phrases and got lists of snippets. Some of the more interesting ones include: libraries will be “under the gun” financially, libraries will be successful only if they adapt, libraries will continue to be strapped for staffing, libraries will continue to have a role to play, will continue their major role in helping, will continue to be important, will continue to shift toward digital information, will continue to seek new opportunities.
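Here is a sketch of the conversion in Step #1; the file names are hypothetical, and pdftotext must be on your path:

  #!/usr/bin/perl
  # convert a directory of PDF files to plain text and build a corpus
  use strict;
  use warnings;

  foreach my $pdf ( glob '*.pdf' ) {
      ( my $txt = $pdf ) =~ s/\.pdf$/.txt/;
      system 'pdftotext', $pdf, $txt;    # pdftotext is a part of xpdf
  }

  # concatenate the results into a single corpus file
  system 'cat *.txt > corpus.all';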

Yes, there may very well be some subtle facts I missed by not reading the full texts, but I think I got a sense of what the articles discussed. It would be interesting to sit a number of people down, have them read the articles, and then have them list out a few salient sentences. To what degree would their result be the same or different from mine?

I was able to write the programs from scratch, do the analysis, and write the post in about two hours, total. It would have taken me that long to read the articles. Just think what a number of librarians could do, and how much time could be saved if this system were expanded to support just about any plain text data.