This week, we are exploring databases in RCDH. This is an area of great interest to me to wonder/wander about, but to situate myself, I focused on reading “The Notion of the Textbase: Design and Use of Textbases in the Humanities” by Charles Cooney, Glenn Roe and Mark Olsen. I thought this reading would help me understand more about the role of metadata in textbase design and its effect on searchability. When I say I’m interested in databases, my knowledge of them is limited. I understand, at a very surface level, that the word-searching feature of databases works somehow through an algorithm that calls upon available text metadata, such as bibliographic and keyword information. I don’t understand the complexity of the algorithm at work, nor can I see “behind the scenes” of the database to know what text collection it is calling from and based on what criteria. My interest in databases is in better understanding these invisible operations and features, so that my searching isn’t so limited.
I also wonder how the tags are created, and what information from the text is being utilized for researching. Additionally, I had questions about whether there exist algorithms or schemes that are more exploratory in research; that is, ones that don’t seem to necessitate having search results or knowledge available before research begins.
To begin: textbases are coherent collections of partly structured or unstructured digital documents, assembled around unifying principles such as thematic or generic similarity. Cooney et al. set out to survey a selection of humanities databases built for textual analysis, with particular attention to the design principles that underlie their structure and inform their use. They argue that as textbases continue to develop, tools can offer alternatives beyond word searching, which might allow the discovery of unseen connections across and between texts, the capability to trace ideas over a large collection or period, and the identification of contextual and intertextual relationships between an individual text and any number of other texts. They also describe textbases, as utilized by digital humanities and electronic publishing, as hybrids of publishing and scholarly models that take up both the presentation of texts and the capability to support computer search through encoding. This design comes in three varieties: heavily encoded internal notation (which works to preserve texts as historical documents), reliance on relational databases to manage metadata (which works to understand texts and their creation and dissemination), and minimal levels of markup (which supports examining word use over large collections). The markup, or metadata, provides leverage, a handle for refining searches and permitting more complex retrieval tasks beyond a word search: author, title, date, genre, geographic area, historical context, etc. The amount of markup, in my understanding, can vary on a spectrum from retrieval to representation; this variance is due to the intended use of the texts, whether electronic publication or digital humanities research projects. Textbases created for textual analysis are constructed to be used with computer assistance; the needs of “mundane” humanities research that looks across information at a massive scale, quickly, mean less metadata is encoded.
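To make the idea of metadata “leverage” concrete for myself, here is a toy sketch (my own, not from the article) of how bibliographic fields might refine a plain word search. All the documents, authors, dates, and genres below are invented examples:

```python
# A toy textbase: each document pairs a text with bibliographic metadata.
# The same word search can then be narrowed by any metadata field.
textbase = [
    {"author": "A", "date": 1751, "genre": "essay",
     "text": "liberty and the nature of government"},
    {"author": "B", "date": 1762, "genre": "novel",
     "text": "a story of liberty and love"},
    {"author": "C", "date": 1820, "genre": "essay",
     "text": "reflections on government and industry"},
]

def search(word, **filters):
    """Return documents containing `word`, narrowed by metadata filters."""
    hits = [doc for doc in textbase if word in doc["text"].split()]
    for field, value in filters.items():
        hits = [doc for doc in hits if doc[field] == value]
    return hits

# A bare word search returns two documents...
print(len(search("liberty")))                  # 2
# ...while adding a genre filter narrows it to one.
print(len(search("liberty", genre="essay")))   # 1
```

This is only the retrieval end of the spectrum, of course; the representational end (heavy internal encoding) would mark up structure inside each text rather than just attaching fields to it.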
Mixed-mode textbases focus on word-based data retrieval; indexing metadata is handled externally and associated with complex retrieval tasks to locate texts. Both modes operate on the automatic discovery of textual patterns based on word-centered search, which is becoming increasingly limiting as text digitization and textbases grow; it reaches a threshold in navigability. Cooney et al. describe developing heuristic tools (machine learning, text mining, similarity algorithms, and clustering) that take the form of one of three machine learning techniques: predictive, comparative, and clustering or similarity. Predictive and comparative techniques are systems designed to identify patterns or features associated with predetermined classes in order to distinguish categories in texts. They provide the example, though limited in its binary, of the algorithm that sorts spam email from non-spam email by distinguishing categories in a collection of texts. Clustering and similarity techniques work to identify documents, or parts of documents, that are most similar to one another, rather than beginning from predetermined classes. They give the example of algorithms that identify broad topics, like Amazon recommendations. These are sensitive to smaller classes and function with less orderly schemes, which are described as more “suited to the human” at the scale of a more defined collection. Cooney et al. go into specific examples of intertextuality and similarity at work, which remain over my head for the time being in terms of vector space models and n-gram word sequences, but they end with the burdens that textbases will require attention to as they continue to grow: specialized language, specific encoding, and dedicated technical support for their creation and maintenance.
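The clustering/similarity idea made more sense to me once I sketched it out. Below is a minimal illustration (again my own, not the authors’ code) of a bag-of-words vector space model: each document becomes a vector of word counts, and documents are ranked by cosine similarity, with no predetermined classes involved. The documents are invented toy examples:

```python
# Clustering/similarity sketch: represent each document as a
# bag-of-words vector, then rank other documents by cosine similarity.
import math
from collections import Counter

docs = {
    "d1": "the king and the court of france",
    "d2": "the court of the king in paris",
    "d3": "planting wheat and barley in spring",
}

def vectorize(text):
    """Turn a text into a word-count vector (a bag of words)."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def most_similar(name):
    """Find the document closest to `name` in the vector space."""
    target = vectorize(docs[name])
    scores = {k: cosine(target, vectorize(t))
              for k, t in docs.items() if k != name}
    return max(scores, key=scores.get)

print(most_similar("d1"))  # d2: shared vocabulary, no predefined classes
```

The article’s examples go well beyond this (n-gram word sequences instead of single words, for instance, would use `Counter` over consecutive word pairs), but the core move of measuring closeness in a vector space is the same.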
They state that textbases will continue to improve, becoming better suited for humanities research in terms of sufficient tools, narrower collections, and more focused research capabilities within text collections across entire networks.
While I understand better some of the language needed to talk about textbases and more intricate details about their scales of function, I’m still left wondering:
? if there are more textbase designs than those emphasized here
? how the collections are formed, and what access collections have to networks
? how exploratory and heuristic searching functions
? the relationship between metadata and patterns in textbase research
? if there are available (open-source) textbases to explore
? whether these are at work in our scholarly publication and library databases