Archive for December, 2009

Good and best open source software

Monday, December 28th, 2009

What qualities and characteristics make for a “good” piece of open source software? And once that question is answered, then what pieces of library-related open source software can be considered “best”?

I do not believe there is any single, most important characteristic of open source software that qualifies it to be denoted as “best”. Instead, a number of characteristics need to be considered. For example, a program might do one thing and do it well, but if it is a bear to install then that counts against it. Similarly, some software might work wonders but be built on a proprietary infrastructure such as a closed source compiler. Can that software really be considered “open”?

For my own education and cogitation, I have begun to list questions to help me address what I think is the “best” library-related open source software. Your comments would be greatly appreciated. I have listed the questions in (more or less) priority order:

  • Does the software work as advertised? – If the program says it can do a thing but never does, then this may be a non-starter. On the other hand, accomplishing a particular goal is sometimes relative. In most cases the software might perform excellently, but in others less so. It is unrealistic to expect any software to be all things to all people.
  • To what degree is the software supported? – Support can mean many things. Most obviously, users of the software want to know whether or not there are one or more people behind the software who can answer questions about it. Where is the developer, and how can I get in touch with them? Are they approachable? If the developer is not available, then can support be purchased? Do I get what I pay for when I make this purchase? How expensive is it? Is their website easy to use? Support can also allude to software updates. “Software is never done. If it were, then it would be called hardware.” For example, my favorite XSL processor (xsltproc) and some of its friends work great, but I recommend them with hesitation because I wonder about ongoing maintenance and upgrades to newer versions of the API. Support also means user community. While open source is about “free” software, it relies on communities for sustainability. Do such communities exist? Are there searchable mailing lists with browsable archives? Are there wikis, virtual and real meetings, and/or IRC channels?
  • Is the documentation thorough? – Is there a man page? A POD? Something that can be printed and annotated? Is there an introduction? FAQ? Glossary of terms? Is there a different guide/section for different types of readers such as systems administrators, programmers, implementors, and/or users? Is the documentation well-written? While I have used plenty of pieces of software and never read the manual, documentation is essential if the software is expected to be exploited to the highest degree. Few things in life are truly intuitive. Software is certainly not one of them. Documentation is a form of writing, and writing is something that literally transcends space and time. It is an alternative to having a person give you instructions.
  • What are the license terms? – Personally I place a higher value on the viral nature of a GNU-like license, but BSD-like licenses enable commercial enterprise to a greater degree, and whether I like it or not, commercial enterprises are all but necessary in the world I live in. (After all, it enabled the creation of my favorite personal computer’s operating system.) At the same time, if the licensing is not GNU-like or BSD-like, then the software is not really open source anyway. Right?
  • To what degree is the software easy to install? – Since installing software is usually not a process that needs to be repeated, a difficult installation can be overlooked. On the other hand, if installation requires tweaking kernels, installing a huge number of dependencies, or relying on a second piece of obscure, unsupported software, then all of this counts against an open source software distribution.
  • To what degree is the software implemented using the “standard” LAMP stack? – LAMP is an acronym for Linux, Apache, MySQL, and Perl (or PHP, or Python, or just about any other computer language), and the LAMP stack is/was the basis for many open source applications. The combination is well-supported, well-documented, and easily transportable to different hardware platforms. If the software application is built on LAMP, then the application has a lot going for it.
  • Is the distribution in question an application/system or a library/module? – It is possible to divide software into two groups: 1) software that is designed to build other software — libraries/modules, and 2) software that is an end in itself — applications/systems. The former is akin to a tool in a toolbox used to build applications. The latter is something intended for an end user. The former requires a computer programmer to truly exploit. The latter usually does not require as much specific expertise. Both the module and the application have their place. Each has its own advantages and disadvantages. Depending on the implementor’s environment, one might be better suited than the other.
  • To what degree does the software satisfy some sort of real library need? – This question is specific to my particular audience, and is dependent on a definition of librarianship. Collection. Preservation. Organization. Dissemination. Books? Catalogs? Circulation? Reading and information literacy? Physical place fostering community? Etc. For example, librarians love to create lists, and in a digital environment lists are well managed through the use of relational databases (see the sketch after this list). Therefore, does MySQL qualify as a piece of library-related software? Similarly, as Roy Tennant was told one time, “Librarians like to search. Everybody else likes to find.” Does this mean indexers like Solr/Lucene ought to qualify? Maybe the question ought to be rephrased. “To what degree does the software satisfy your or your institution’s needs?”
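For what it is worth, the list-keeping itch is easily scratched with a relational database. Below is a minimal sketch in Python using SQLite as a stand-in for the MySQL mentioned above; the reading-list table and its contents are invented for illustration:

    import sqlite3

    # A toy reading list kept in a relational database; SQLite here
    # stands in for MySQL, and the table is invented for illustration.
    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE reading_list (title TEXT, author TEXT, year INTEGER)')
    db.executemany('INSERT INTO reading_list VALUES (?, ?, ?)',
                   [('Moby Dick', 'Melville', 1851), ('Walden', 'Thoreau', 1854)])
    for row in db.execute('SELECT title, author FROM reading_list ORDER BY year'):
        print(row)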

What sorts of things have I left out? Is there anything here that is measurable, or is everything left to subjective judgment? Just as importantly, can we as a community apply these questions to a list of specific software distributions to come up with the “best” in class?

More questions than answers.

Valencia and Madrid: A Travelogue

Saturday, December 5th, 2009

I recently had the opportunity to visit Valencia and Madrid (Spain) to share some of my ideas about librarianship. This posting describes some of the things I saw and learned along the way.

La Capilla de San Francisco de Borja
Capilla del Santo Cáliz

LIS-EPI Meeting

In Valencia I was honored to give the opening remarks at the 4th International LIS-EPI Meeting. Hosted by the Universidad Politécnica de Valencia and organized by Fernanda Mancebo and Antonia Ferrer, the Meeting provided an opportunity for librarians to come together and share their experiences in relation to computer technology. My presentation, “A few possibilities for librarianship by 2015”, outlined a few near-term futures for the profession. From the introduction:

The library profession is at a crossroads. Computer technology coupled with the Internet has changed the way content is created, maintained, evaluated, and distributed. While the core principles of librarianship (collection, organization, preservation, and dissemination) are still very much apropos to the current milieu, the exact tasks of the profession are not as necessary as they once were. What is a librarian to do? In my opinion, there are three choices: 1) creating services against content as opposed to simply providing access to it, 2) curating collections that are unique to our local institutions, or 3) providing sets of services that are a combination of #1 and #2.

And from the conclusion:

If libraries are playing a smaller and smaller role in the existing information universe, then two choices present themselves. First, the profession can accept this fact, extend it out to its logical conclusion, and see that libraries will eventually play an insignificant role in society. Libraries will not be libraries at all but more like purchasing agents and middlemen. Alternatively, we can embrace the changes in our environment, learn how to take advantage of them, exploit them, and change the direction of the profession. This second choice requires a period of transition and change. It requires resources spent on innovation and experimentation with the understanding that innovation and experimentation more often generate failures than successes. The second option carries with it greater risk but also greater rewards.

toro
robot sculpture

Josef Hergert

Providing a vision similar to but different from my own, Josef Hergert (University of Applied Sciences HTW Chur) described how librarianship ought to be embracing Web 2.0 techniques in a presentation called “Learning and Working in Time of Web 2.0: Reconstructing Information and Knowledge”. To say Hergert was advocating information literacy would be to over-simplify his remarks, yet if you broaden the definition of information literacy to include the use of blogs, wikis, and social bookmarking sites — Web 2.0 technologies — then the phrase information literacy is right on target. Notable quotes included:

  • We are experiencing many changes in the environment: non-commercial sharing of content, legislative overkill, and “pirate parties”… The definition of “authorship” is changing.
  • The teaching of information literacy courses will help overcome some of the problems.
  • The process of learning is changing because of the Internet… We are now experiencing a greater degree of informal learning as opposed to formal learning… We need as librarians to figure out how to exploit the environment to support learning both formal and informal.
  • The current environment is more than paper, but also about a network of people, and the librarian can help create these networks with [Web 2.0 tools].
  • Provide not only the book but the environment and tools to do the work.

As an aside, I have been using networked computer technologies for more than twenty years. Throughout that time a number of truisms have become apparent. “If you don’t want it copied, then don’t put it on the ‘Net; give back to the ‘Net”, “On the Internet nobody knows that you are a dog”, and “It is like trying to drink from a fire hose” are just a few. Hergert used the newest one, “If it is not on the Internet, then it doesn’t exist.” For better or for worse, I think this is true. Convenience is a very powerful elixir. The ease of acquiring networked data and information is so great compared to the time and energy needed to get data and information in analog format that people will settle for whatever is “good enough”. In order to remain relevant, libraries must put their (full text) content on the ‘Net or be seen as an impediment to learning as opposed to learning’s facilitator.

While I would have enjoyed learning what the other Meeting presenters had to say, it was unrealistic for me to attend the balance of the conference. The translators were going back to Switzerland, and I would not have been able to understand what the presenters were saying. In this regard I sort of felt like the Ugly American, but I have come to realize that the use of English is a purely practical matter. It has nothing to do with a desire to understand American culture.

Biblioteca Valenciana

The next day a few others and I had the extraordinary opportunity to get an inside tour of the Biblioteca Valenciana (Valencia Library). Starting out as a monastery, the building was transformed into quite a number of other things, such as a prison, before it became a library. We got to go into the archives, see some of their treasures, and learn about the library’s history. They were very proud of their Don Quixote collection, and we saw their oldest book — a treatise on the Black Death which included receipts for treatments.

Biblioteca Nacional de España

In Madrid I visited the Biblioteca Nacional de España (National Library of Spain) and went to its museum. Admission was free, and I saw an exhibition of original Copernicus, Galileo, Brahe, Kepler, and Newton editions embodying Western scientific progress. Very impressive and very well done, especially considering the price of admission.

Biblioteca Nacional de España
statue

International Institute

Finally, I shared the presentation from the LIS-EPI Meeting at the International Institute. While I advocated changes in the way our profession does its work, the attendees at both venues wondered how to go about these changes. “We are expected to provide a certain set of services to our patrons here and now. What do we do to learn these new skills?” My answer was grounded in applied research & development. Time must be spent experimenting and “playing” with the new technologies. This should be considered an investment in the profession and its personnel, an investment that will pay off later in new skills and greater flexibility. We work in academia. It behooves us to work academically. This includes explorations into applying our knowledge in new and different ways.

Acknowledgements

Many thanks go to many people for making this professional adventure possible. I am indebted to Monica Pareja from the United States Embassy in Madrid. She kept me out of trouble. I thank Fernanda Mancebo and Antonia Ferrer who invited me to the Meeting. Last and certainly not least, I thank my family for allowing me to go to Spain in the first place, since the event happened over the Thanksgiving holiday. “Thank you, one and all.”

alley
fountain

Colloquium on Digital Humanities and Computer Science: A Travelogue

Friday, December 4th, 2009

On November 14-16, 2009 I attended the 4th Annual Chicago Colloquium on Digital Humanities and Computer Science at the Illinois Institute of Technology in Chicago. This posting outlines my experiences there, but in a phrase, I found the event to be very stimulating. In my opinion, libraries ought to be embracing the techniques described here and integrating them into their collections and services.

IIT
Paul Galvin Library

Day #0 – A pre-conference workshop

Upon arrival I made my way directly to a pre-conference workshop entitled “Machine Learning, Sequence Alignment, and Topic Modeling at ARTFL” presented by Mark Olsen and Clovis Gladstone. In the workshop they described at least two applications they were using to discover common phrases between texts. The first was called Philomine and the second was called Text::Pair. Both work similarly, but Philomine needs to be integrated with Philologic, while Text::Pair is a stand-alone Perl module. Using these tools, n-grams are extracted from texts, indexed to the file system, and made available for searching. Entering a phrase into a local search engine returns hits that include the phrase and the works where it was found. I believe Text::Pair could be successfully integrated into my Alex Catalogue.
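Text::Pair is surely more sophisticated than this, but the core idea, as I understood it, fits in a few lines. Here is a minimal sketch in Python (not Perl); the function names and sample sentences are my own and merely illustrative:

    def ngrams(text, n=3):
        "Return the set of word n-grams found in a text."
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def shared_phrases(text_a, text_b, n=3):
        "Return the n-grams common to two texts."
        return ngrams(text_a, n) & ngrams(text_b, n)

    a = 'It was the best of times, it was the worst of times.'
    b = 'He said it was the best of bad bargains.'
    for phrase in shared_phrases(a, b):
        print(' '.join(phrase))   # prints "it was the", "was the best", etc.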

orange, green, and gray
orange and green

Day #1

The Colloquium formally began the next day with an introduction by Russell Betts (Illinois Institute of Technology). His most notable quote was, “We have infinite computer power at our fingertips, and without much thought you can create an infinite amount of nonsense.” Too true.

Marco Büchler (University of Leipzig) demonstrated textual reuse techniques in a presentation called “Citation Detection and Textual Reuse on Ancient Greek Texts”. More specifically, he used textual reuse to highlight differences between texts, graph ancient history, and explore computer science algorithms. Try www.eaqua.net for more.

Patrick Juola’s (Duquesne University) “conjecturator” was the heart of the next presentation called “Mapping Genre Spaces via Random Conjectures”. In short, Juola generated thousands and thousands of “facts” in the form of [subject1] uses [subject2] more or less than [subject3]. He then tested each of these facts for truth against a corpus. Ironically, he was doing much of what Betts alluded to in the introduction — creating nonsense. On the other hand, the approach was innovative.
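Juola’s conjecturator is far more elaborate, but a toy version of the generate-and-test loop might look like the following Python; the corpora and vocabulary are invented for illustration:

    import random
    from collections import Counter

    def frequency(word, corpus):
        "Relative frequency of a word in a corpus (a list of words)."
        return Counter(corpus)[word] / len(corpus)

    def conjecture(corpus_a, corpus_b, vocabulary):
        "Generate a random 'fact' and test it for truth."
        word = random.choice(vocabulary)
        claim = 'corpus A uses "%s" more than corpus B' % word
        return claim, frequency(word, corpus_a) > frequency(word, corpus_b)

    corpus_a = 'the whale the sea the ship the harpoon whale sea'.split()
    corpus_b = 'the garden the party the dress garden tea party'.split()
    for _ in range(3):
        print(conjecture(corpus_a, corpus_b, ['the', 'whale', 'garden', 'tea']))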

By exploiting a parts-of-speech (POS) parser, Devin Griffiths (Rutgers University) sought the use of analogies as described in “On the Origin of Theories: The Semantic Analysis of Analogy in Scientific Corpus”. Assuming that an analogy can be defined as a noun-verb-noun-conjunction-noun-verb-noun phrase, Griffiths looked for analogies in Darwin’s Origin of Species, graphed the number of analogies against locations in the text, and made conclusions accordingly. He asserted that the use of analogy was very important during the Victorian Age, and he tried to demonstrate this assertion through a digital humanities approach.
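Assuming the noun-verb-noun-conjunction-noun-verb-noun definition above, a naive matcher can be sketched with NLTK’s off-the-shelf part-of-speech tagger. This is my own simplification, not Griffiths’s code; note that taggers label a word like “as” as IN rather than CC, so the conjunction test is loosened accordingly:

    import nltk   # requires the 'punkt' and 'averaged_perceptron_tagger' data

    def is_noun(tag): return tag.startswith('NN')
    def is_verb(tag): return tag.startswith('VB')
    def is_conj(tag): return tag in ('CC', 'IN')   # 'and', 'as', 'like', ...

    def find_analogies(sentence):
        "Yield word windows matching noun-verb-noun-conj-noun-verb-noun."
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tests = (is_noun, is_verb, is_noun, is_conj, is_noun, is_verb, is_noun)
        for i in range(len(tagged) - len(tests) + 1):
            window = tagged[i:i + len(tests)]
            if all(test(tag) for test, (word, tag) in zip(tests, window)):
                yield ' '.join(word for word, tag in window)

    for hit in find_analogies('Selection shapes species as breeding shapes varieties.'):
        print(hit)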

The use of LSIDs (large screen information displays) was discussed by Geoffrey Rockwell (McMaster University). While I did not take a whole lot of notes from this presentation, I did get a couple of ideas: 1) figure out a way for a person to “step into” a book, or 2) display a graphic representation of a text on a planetarium ceiling. Hmm…

Kurt Fendt (MIT) described a number of ways timelines could be used in the humanities in his presentation called “New Insights: Dynamic Timelines in Digital Humanities”. Through the process I became aware of the SIMILE timeline application/widget. Very nice.

I learned of the existence of a number of digital humanities grants as described by Michael Hall (NEH). There are both start-up grants as well as grants on advanced topics. See: neh.gov/odh/.

The first keynote speech, “Humanities as Information Sciences”, was given by Vasant Honavar (Iowa State University) in the afternoon. Honavar began with a brief history of thinking and philosophy, which he believes led to computer science. “The heart of information processing is taking one string and transforming it into another.” (Again, think of the introductory remarks.) He advocated the creation of symbols, feeding them into a processor, and coming up with solutions out the other end. Language, he posited, is an information-rich artifact and therefore something that can be analyzed with computing techniques. I liked how he compared science with the humanities. Science observes physical objects, and the humanities observe human creations. Honavar was a bit arscient, and therefore someone to be admired.

subway tunnel
skyscraper predecessor

Day #2

In “Computational Phonostylistics: Computing the Sounds of Poetry” Marc Plamondon (Nipissing University) described how he was counting phonemes in both Tennyson’s and Browning’s poetry to test whether Tennyson’s poetry is “musical” or plosive sounding and Browning’s poetry is “harsh” or fricative. To do this he assumed one set of characters is soft and another set is hard. He then counted the number of times each of these sets of characters occurred in each of the respective poets’ works. The result was a graph illustrating the “musicality” or “harshness” of the poetry. One of the more interesting quotes from Plamondon’s presentation was, “I am interested in quantifying aesthetics.”
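The mechanics reduce to counting character classes. A crude sketch follows, with letter sets I invented for illustration; Plamondon counted phonemes proper, not raw letters:

    def harshness(text, hard=set('kptgdbc'), soft=set('lmnrswyv')):
        "Score a text by its proportion of 'hard' letters."
        letters = [c for c in text.lower() if c.isalpha()]
        hard_count = sum(c in hard for c in letters)
        soft_count = sum(c in soft for c in letters)
        return hard_count / ((hard_count + soft_count) or 1)

    tennyson = 'The splendour falls on castle walls'         # The Princess
    browning = "Gr-r-r--there go, my heart's abhorrence"     # Spanish Cloister
    print(harshness(tennyson), harshness(browning))

Plotting such scores across a poet’s collected works, poem by poem, yields the sort of graph Plamondon presented.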

In C.W. Forstall’s (SUNY Buffalo) presentation “Features from Frequency: Authorship and Stylistic Analysis Using Repetitive Sound” we learned how he too is counting sound n-grams to denote style. He applied the technique to D.H. Lawrence as well as to the Iliad and Odyssey, and the technique works to his satisfaction.
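Forstall’s features are sound-based, but the skeleton of any such stylometric test is an n-gram frequency profile plus a similarity measure. Here is a sketch using character n-grams as a stand-in for phonetic ones; the sample sentences are invented:

    from collections import Counter
    from math import sqrt

    def profile(text, n=2):
        "Frequency profile of character n-grams, a rough proxy for sound."
        letters = ''.join(c for c in text.lower() if c.isalpha())
        return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))

    def similarity(p, q):
        "Cosine similarity between two profiles; 1.0 means identical."
        dot = sum(p[g] * q[g] for g in p)
        norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    known    = profile('She heard the snow against the window panes')
    disputed = profile('The snow whispered against the glass all night')
    print(similarity(known, disputed))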

The second keynote presentation was given by Stephen Wolfram (Wolfram Research) via teleconference. It was called “What Can Be Made Computable in the Humanities?” He began by describing Mathematica as a tool he uses to explore the world around him. All of this assumes that the world consists of patterns, and these patterns can be described through the use of numbers. He elaborated through something he called the Principle of Computational Equivalence — once systems reach a certain threshold, they create a level of complexity. Such a principle puts pressure on having as simple a descriptive model as possible. (Such things are standard scientific/philosophic principles. Nothing new here.) Looking for patterns was the name of his game, and one such game was applied to music. Discover the patterns in a type of music. Feed the patterns to a computer. Have the computer generate the music. Most of the time the output works pretty well. He called this WolframTones.

He went on to describe WolframAlpha as an attempt to make the world’s knowledge computable. Essentially a front-end to Mathematica, WolframAlpha is a vast collection of content associated with numbers: people and their birth dates, the agricultural output of countries, the price of gold over time, temperatures from across the world, etc. Queries are accepted into the system. Searches are done against its content. Results are returned in the form of best-guess answers complete with graphs and charts. WolframAlpha exposes mathematical processing to the general public in ways that have not been done previously. Wolfram described two particular challenges in the creation of WolframAlpha. The first was the collection of content. Unlike Google, Wolfram Research does not necessarily crawl the Internet. Rather, it selectively collects the content of a “reference library” and integrates it into the system. The second, and more challenging, has been the design of the user interface. People do not enter structured queries, but structured output is expected. Interpreting people’s input is a difficult task in and of itself. From my point of view, he is probably learning more about human thought processes than the natural world.
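The Principle of Computational Equivalence is usually illustrated with elementary cellular automata, where a one-line rule (rule 30 is Wolfram’s canonical example) generates endless apparent complexity. A minimal sketch:

    def step(cells, rule=30):
        "Apply an elementary cellular automaton rule to a row of 0/1 cells."
        padded = [0] + cells + [0]
        return [(rule >> (padded[i - 1] * 4 + padded[i] * 2 + padded[i + 1])) & 1
                for i in range(1, len(padded) - 1)]

    row = [0] * 30 + [1] + [0] * 30   # start with a single live cell
    for _ in range(20):
        print(''.join('#' if cell else ' ' for cell in row))
        row = step(row)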

red girder sculpture
gray sculpture

Some thoughts

This meeting was worth every single penny, especially considering the fact that there was absolutely no registration fee. Free, except for my travel costs, hotel, and the price of the banquet. Unbelievable!

Just as importantly, the presentations given at this meeting demonstrate the maturity of the digital humanities. These things are not just toys but practical tools for evaluating (mostly) texts. Given the increasing amount of full text available in library collections, I see very little reason why these sorts of digital humanities applications could not be incorporated into library collections and services. Collect full text content. Index it. Provide access to the index. Get back a set of search results. Select one or more items. Read them. Select one or more items again, and then select an option such as graph analogies, graph phonemes, or list common phrases between texts. People need to do more than read the texts. People need to use the texts, to analyze them, to compare & contrast them with other texts. The tools described in this conference demonstrate that such things are more than possible. All that has to be done is to integrate them into our current (library) systems.
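To make the workflow concrete, here is a toy sketch of the collect-index-search-analyze loop in Python. Every name and document below is invented, and a production system would use an indexer like Solr/Lucene rather than a dictionary:

    def build_index(documents):
        "Map each word to the set of document ids containing it."
        index = {}
        for doc_id, text in documents.items():
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(doc_id)
        return index

    documents = {                             # 1. collect full text content
        'origin': 'selection acts on variation over vast stretches of time',
        'walden': 'time is but the stream i go a-fishing in',
    }
    index = build_index(documents)            # 2. index it
    hits = index.get('time', set())           # 3. search the index
    selected = [documents[d] for d in hits]   # 4. select one or more items
    # 5. analyze: e.g., feed `selected` to the shared-phrase sketch above
    print(sorted(hits))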

So many opportunities. So little time.