Archive for the ‘Archives’ Category

Create and Manage Data: Training Resources

Thursday, May 2nd, 2013

Create and Manage Data: Training Resources

From the webpage:

Our Managing and Sharing Data: Training Resources present a suite of flexible training materials for people who are charged with training researchers and research support staff in how to look after research data.

The Training Resources complement the UK Data Archive’s popular guide on ‘Managing and Sharing Data: best practice for researchers’, the most recent version published in May 2011.

They have been designed and used as part of the Archive’s daily work in supporting ESRC applicants and award holders and have been made possible by a grant from the ESRC Researcher Development Initiative (RDI).

The Training Resources are modularised following the UK Data Archive’s seven key areas of managing and sharing data:

  • sharing data – why and how
  • data management planning for researchers and research centres
  • documenting data
  • formatting data
  • storing data, including data security, data transfer, encryption, and file sharing
  • ethics and consent
  • data copyright

Each section contains:

  • introductory powerpoint(s)
  • presenter’s guide – where necessary
  • exercises and introduction to exercises
  • quizzes
  • answers

The materials are presented as used in our own training courses and are mostly geared towards social scientists. We anticipate trainers will create their own personalised and more context-relevant examples, for example by discipline, country, relevant laws and regulations.

You can download individual modules from the relevant sections or download the whole resource in pdf format. Updates to pages were last made on 20 June 2012.

Download all resources.

Quite an impressive set of materials that will introduce you to some aspects of research data in the UK. Not all but some aspects.

What you don’t learn here you will pick up from interaction with people actively engaged with research data.

But it will give you a head start on understanding the research data community.

Unlike some technologies, topic maps are more about a community’s world view than the world view of topic maps.

Springer Book Archives [Proposal for Access]

Tuesday, April 9th, 2013

The Springer Book Archives now contain 72,000 titles

From the post:

Today at the British UKSG Conference in Bournemouth, Springer announced that the Springer Book Archives (SBA) now contain 72,000 eBooks. This news represents the latest developments in a project that seeks to digitize nearly every Springer book ever published, dating back to 1842 when the publishing company was founded. The titles are being digitized and made available again for the scientific community through SpringerLink (link.springer.com), Springer’s online platform.

By the end of 2013 an unprecedented collection of around 100,000 historic, scholarly eBooks, in both English and German, will be available through the SBA. Researchers, students and librarians will be able to access the full text of these books free of any digital rights management. Springer also offers a print-on-demand option for most of the books.

Notable authors whose works Springer has published include high-level researchers and Nobel laureates, such as Werner von Siemens, Rudolf Diesel, Emil Fischer and Marie Curie. Their publications will be a valuable addition to this historic online archive.

SBA section at Springer: http://www.springer.com/bookarchives

A truly remarkable achievement but access will remain problematic for a number of potential users.

I would like to see the United States government purchase (as in pay an annual fee) unlimited access to SpringerLink for any U.S. based IP address.

Springer would get more revenue than it does now from U.S. domains and reduce its licensing costs, all colleges and universities would benefit, and everyone in the U.S. would have access to first-rate technical publications.

Not to mention that Springer gets the revenue from selling the print-on-demand paperback editions.

Seems like a no-brainer if you are looking to jump start a knowledge economy.

PS: Forward this to your Senator/Representative. Could be a viable model to satisfy the needs of publishers and readers.

I first saw this at: Springer Book Archives Now Contain 72,000 eBooks by Africa S. Hands.

… Preservation and Stewardship of Scholarly Works, 2012 Supplement

Tuesday, March 19th, 2013

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement by Charles W. Bailey, Jr.

From the webpage:

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement presents over 130 English-language articles, books, and technical reports published in 2012 that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., media, e-journals, research data), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns.

It is a supplement to the Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which covers over 650 works published from 2000 through 2011. All included works are in English. The bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings.

The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

The bibliography is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Supplement to “the” starting point for research on digital curation.

Research Data Symposium – Columbia

Saturday, March 9th, 2013

Research Data Symposium – Columbia.

Posters from the Research Data Symposium, held at Columbia University, February 27, 2013.

Subject to the limitations of the poster genre but useful as a quick overview of current projects and directions.

usenet-legend

Sunday, February 24th, 2013

usenet-legend by Zach Beane

From the description:

This is Usenet Legend, an application for producing a searchable archive of an author’s comp.lang.lisp history from Ron Garrett’s large archive dump.

Zach mentions this in his post The Rob Warnock Lisp Usenet Archive but I thought it needed a separate post.

Making content more navigable is always a step in the right direction.

Research Data Curation Bibliography

Wednesday, January 16th, 2013

Research Data Curation Bibliography (version 2) by Charles W. Bailey, Jr.

From the introduction:

The Research Data Curation Bibliography includes selected English-language articles, books, and technical reports that are useful in understanding the curation of digital research data in academic and other research institutions. For broader coverage of the digital curation literature, see the author's Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which presents over 650 English-language articles, books, and technical reports.

The "digital curation" concept is still evolving. In "Digital Curation and Trusted Repositories: Steps toward Success," Christopher A. Lee and Helen R. Tibbo define digital curation as follows:

Digital curation involves selection and appraisal by creators and archivists; evolving provision of intellectual access; redundant storage; data transformations; and, for some materials, a commitment to long-term preservation. Digital curation is stewardship that provides for the reproducibility and re-use of authentic digital data and other digital assets. Development of trustworthy and durable digital repositories; principles of sound metadata creation and capture; use of open standards for file formats and data encoding; and the promotion of information management literacy are all essential to the longevity of digital resources and the success of curation efforts.

This bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, interviews, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings. Coverage of technical reports is very selective.

Most sources have been published from 2000 through 2012; however, a limited number of earlier key sources are also included. The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Such links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

An archive of prior versions of the bibliography is available.

If you are a beginning library student, take the time to know the work of Charles Bailey. He has consistently made a positive contribution for researchers from very early in the so-called digital revolution.

To the extent that you want to design topic maps for data curation, long or short term, the 200+ items in this bibliography will introduce you to some of the issues you will be facing.

Stop Hosting Data and Code on your Lab Website

Thursday, January 10th, 2013

Stop Hosting Data and Code on your Lab Website by Stephen Turner.

From the post:

It’s happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://instititution.edu/~home/professorX/lab/data, there’s no trace of what you were looking for.

THE PROBLEM

This isn’t an uncommon problem. See the following two articles:

Schultheiss, Sebastian J., et al. “Persistence and availability of web services in computational biology.” PLoS one 6.9 (2011): e24914. 

Wren, Jonathan D. “404 not found: the stability and persistence of URLs published in MEDLINE.” Bioinformatics 20.5 (2004): 668-672.

The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:

  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.

The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).

Is this a problem for published data in the topic map community?

What data should we be archiving? Discussion lists? Blogs? Public topic maps?

What do you think of Stephen’s solution?

D-Lib

Monday, November 19th, 2012

D-Lib

From the about page:

D-Lib Magazine is an electronic publication with a focus on digital library research and development, including new technologies, applications, and contextual social and economic issues. D-Lib Magazine appeals to a broad technical and professional audience. The primary goal of the magazine is timely and efficient information exchange for the digital library community to help digital libraries be a broad interdisciplinary field, and not a set of specialties that know little of each other.

I am about to post concerning an article in D-Lib and realized I don’t have a blog entry on D-Lib!

Not that it is topic map specific but it is digital library specific, with all the issues that entails. Remarkably similar to the issues any topic map author or software will face.

D-Lib has proven what many of us suspected:

The quality of content is not related to the medium of delivery.

Enjoy!

R mailing lists archive

Friday, November 2nd, 2012

R mailing lists archive

From the webpage:

This is an archive of the four main R mailing lists, R-announce, R-packages, R-help and R-devel, as well as the New Zealand and Australian list R-downunder. The archive is automatically updated multiple times a day, so anything posted to the list should be in the archive within about 2 hours….

I saw this on a Twitter feed from One R Tip a Day.

A good example of current email archiving practice. A baseline that you need to exceed (at least) in order to interest users.

What would you do differently?

How would you create a topic map for such a resource?

In what ways would your topic map exceed the capabilities seen here?

Europeana opens up data on 20 million cultural items

Thursday, September 13th, 2012

Europeana opens up data on 20 million cultural items by Jonathan Gray (Open Knowledge Foundation):

From the post:

Europe’s digital library Europeana has been described as the ‘jewel in the crown’ of the sprawling web estate of EU institutions.

It aggregates digitised books, paintings, photographs, recordings and films from over 2,200 contributing cultural heritage organisations across Europe – including major national bodies such as the British Library, the Louvre and the Rijksmuseum.

Today [Wednesday, 12 September 2012] Europeana is opening up data about all 20 million of the items it holds under the CC0 rights waiver. This means that anyone can reuse the data for any purpose – whether using it to build applications to bring cultural content to new audiences in new ways, or analysing it to improve our understanding of Europe’s cultural and intellectual history.

This is a coup d’etat for advocates of open cultural data. The data is being released after a grueling and unenviable internal negotiation process that has lasted over a year – involving countless meetings, workshops, and white papers presenting arguments and evidence for the benefits of openness.

That is good news!

A familiar issue that it overcomes:

To complicate things even further, many public institutions actively prohibit the redistribution of information in their catalogues (as they sell it to – or are locked into restrictive agreements with – third party companies). This means it is not easy to join the dots to see which items live where across multiple online and offline collections.

Oh, yeah! That was one of Google’s reasons for pulling the plug on the Open Knowledge Graph. Google had restrictive agreements so you can only connect the dots with Google products. (I think there is a name for that, let me think about it. Maybe an EU prosecutor might know it. You could always ask.)

What are you going to be mapping from this collection?

Linked Data in Libraries, Archives, and Museums

Tuesday, September 11th, 2012

Linked Data in Libraries, Archives, and Museums. Information Standards Quarterly (ISQ), Spring/Summer 2012, Volume 24, no. 2/3. http://dx.doi.org/10.3789/isqv24n2-3.2012

Interesting reading on linked data.

I have some comments on the “discovery” of the need to manage “diverse, heterogeneous metadata” but will save them for another post.

From the “flyer” that landed in my inbox:

The National Information Standards Organization (NISO) announces the publication of a special themed issue of the Information Standards Quarterly (ISQ) magazine on Linked Data for Libraries, Archives, and Museums. ISQ Guest Content Editor, Corey Harper, Metadata Services Librarian, New York University has pulled together a broad range of perspectives on what is happening today with linked data in cultural institutions. He states in his introductory letter, “As the Linked Data Web continues to expand, significant challenges remain around integrating such diverse data sources. As the variance of the data becomes increasingly clear, there is an emerging need for an infrastructure to manage the diverse vocabularies used throughout the Web-wide network of distributed metadata. Development and change in this area has been rapidly increasing; this is particularly exciting, as it gives a broad overview on the scope and breadth of developments happening in the world of Linked Open Data for Libraries, Archives, and Museums.”

The feature article by Gordon Dunsire, Corey Harper, Diane Hillmann, and Jon Phipps on Linked Data Vocabulary Management describes the shift in popular approaches to large-scale metadata management and interoperability to the increasing use of the Resource Description Framework to link bibliographic data into the larger web community. The authors also identify areas where best practices and standards are needed to ensure a common and effective linked data vocabulary infrastructure.

Four “in practice” articles illustrate the growth in the implementation of linked data in the cultural sector. Jane Stevenson in Linking Lives describes the work to enable structured and linked data from the Archives Hub in the UK. In Joining the Linked Data Cloud in a Cost-Effective Manner, Seth van Hooland, Ruben Verborgh, and Rik Van de Walle show how general purpose Interactive Data Transformation tools, such as Google Refine, can be used to efficiently perform the necessary task of data cleaning and reconciliation that precedes the opening up of linked data. Ted Fons, Jeff Penka, and Richard Wallis discuss OCLC’s Linked Data Initiative and the use of Schema.org in WorldCat to make library data relevant on the web. In Europeana: Moving to Linked Open Data, Antoine Isaac, Robina Clayphan, and Bernhard Haslhofer explain how the metadata for over 23 million objects are being converted to an RDF-based linked data model in the European Union’s flagship digital cultural heritage initiative.

Jon Voss provides a status on Linked Open Data for Libraries, Archives, and Museums (LODLAM) State of Affairs and the annual summit to advance this work. Thomas Elliott, Sebastian Heath, and John Muccigrosso report on the Linked Ancient World Data Institute, a workshop to further the availability of linked open data to create reusable digital resources with the classical studies disciplines.

Kevin Ford wraps up the contributed articles with a standard spotlight article on LC’s Bibliographic Framework Initiative and the Attractiveness of Linked Data. This Library of Congress-led community effort aims to transition from MARC 21 to a linked data model. “The move to a linked data model in libraries and other cultural institutions represents one of the most profound changes that our community is confronting,” stated Todd Carpenter, NISO Executive Director. “While it completely alters the way we have always described and cataloged bibliographic information, it offers tremendous opportunities for making this data accessible and usable in the larger, global web community. This special issue of ISQ demonstrates the great strides that libraries, archives, and museums have already made in this arena and illustrates the future world that awaits us.”

….Comparing Digital Preservation Glossaries [Why Do We Need Common Vocabularies?]

Friday, August 10th, 2012

From AIP to Zettabyte: Comparing Digital Preservation Glossaries

Emily Reynolds (2012 Junior Fellow) writes:

As we mentioned in our introductory post last month, the OSI Junior Fellows are working on a project involving a draft digital preservation policy framework. One component of our work is revising a glossary that accompanies the framework. We’ve spent the last two weeks poring through more than two dozen glossaries relating to digital preservation concepts to locate and refine definitions to fit the terms used in the document.

We looked at dictionaries from well-established archival entities like the Society of American Archivists, as well as more strictly technical organizations like the Internet Engineering Task Force. While some glossaries take a traditional archival approach, others were more technical; we consulted documents primarily focusing on electronic records, archives, digital storage and other relevant fields. Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary. Based on what we found, that vocabulary will have to be broadly drawn and flexible to meet different kinds of requirements.

OSI = Office of Strategic Initiatives (Library of Congress)

Not to be overly critical, but I stumble over:

Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary.

Why does a “variety in the definitions for other terms…highlight[s] the need for a common vocabulary?”

I take it as a given that we have diverse vocabularies.

And that attempts at “common” vocabularies succeed in creating yet another “diverse” vocabulary.

So, why would anyone looking at “diverse” vocabularies jump to the conclusion that a “common” vocabulary is required?

Perhaps what is missing is the definition of the problem presented by “diverse” vocabularies.

Hard to solve a problem if you don’t know what it is. (Hasn’t stopped some people that I know but that is a story for another day.)

I put it to you (and in your absence I will answer, so answer quickly):

What is the problem (or problems) presented by diverse vocabularies? (Feel free to use examples.)

Or if you prefer, Why do we need common vocabularies?

Citizen Archivist Dashboard ["...help the next person discover that record"]

Sunday, June 10th, 2012

Citizen Archivist Dashboard

What’s the common theme of these interfaces from the National Archives (United States)?

  • Tag – Tagging is a fun and easy way for you to help make National Archives records found more easily online. By adding keywords, terms, and labels to a record, you can do your part to help the next person discover that record. For more information about tagging National Archives records, follow “Tag It Tuesdays,” a weekly feature on the NARAtions Blog. [includes "missions" (sets of materials for tagging), rated as "beginner," "intermediate," and "advanced." Or you can create your own mission.]
  • Transcribe – By contributing to transcriptions, you can help the National Archives make historical documents more accessible. Transcriptions help in searching for the document as well as in reading and understanding the document. The work you do transcribing a handwritten or typed document will help the next person discover and use that record.

    The transcription tool features over 300 documents ranging from the late 18th century through the 20th century for citizen archivists to transcribe. Documents include letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files.

    [A pilot project with 300 documents but one you should follow. Public transcription (crowd-sourced if you want the popular term) of documents has the potential to open up vast archives of materials.]

  • Edit Articles – Our Archives Wiki is an online space for researchers, educators, genealogists, and Archives staff to share information and knowledge about the records of the National Archives and about their research.

    Here are just a few of the ways you may want to participate:

    • Create new pages and edit pre-existing pages
    • Share your research tips
    • Store useful information discovered during research
    • Expand upon a description in our online catalog

    Check out the “Getting Started” page. When you’re ready to edit, you’ll need to log in by creating a username and password.

  • Upload & Share – Calling all researchers! Start sharing your digital copies of National Archives records on the Citizen Archivist Research group on Flickr today.

    Researchers scan and photograph National Archives records every day in our research rooms across the country — that’s a lot of digital images for records that are not yet available online. If you have taken scans or photographs of records you can help make them accessible to the public and other researchers by sharing your images with the National Archives Citizen Archivist Research Group on Flickr.

  • Index the Census – Citizen Archivists, you can help index the 1940 census!

    The National Archives is supporting the 1940 census community indexing project along with other archives, societies, and genealogical organizations. The release of the decennial census is one of the most eagerly awaited record openings. The 1940 census is available to search and browse, free of charge, on the National Archives 1940 Census web site. But, the 1940 census is not yet indexed by name.

    You can help index the 1940 census by joining the 1940 census community indexing project. To get started you will need to download and install the indexing software, register as an indexing volunteer, and download a batch of images to transcribe. When the index is completed, the National Archives will make the named index available for free.

The common theme?

The tagging entry sums it up with: “…you can do your part to help the next person discover that record.”

That’s the “trick” of topic maps. Once a fact about a subject is found, you can preserve your “finding” for the next person.

ArcSpread for analyzing web archives

Friday, April 27th, 2012

ArcSpread for analyzing web archives

Pete Warden writes:

Stanford runs a fantastic project for capturing important web pages as they change over time, and then presenting the results in a form that future historians will be able to use. This paper talks about some of the techniques they use for removing boilerplate navigation and ad content, so that researchers can work with the meat of the page.

I was relieved to read:

We did not excise any advertising images from the presented pages, but asked participants to disregard advertising related images.

Poorly done digital newspaper archives remove advertising content on a “meat of the page” theory.

Researchers cannot notice what was advertised, how and at what prices. Ads may not interest us, but may interest others.

At one time thousands if not hundreds of thousands of people knew how Egyptian pyramids were built.

So commonly known it was not written down.

Perhaps there is a lesson there for us.

Einstein Archives Online

Thursday, March 22nd, 2012

Einstein Archives Online

From the “about” page:

The Einstein Archives Online Website provides the first online access to Albert Einstein’s scientific and non-scientific manuscripts held by the Albert Einstein Archives at the Hebrew University of Jerusalem, constituting the material record of one of the most influential intellects in the modern era. It also enables access to the Einstein Archive Database, a comprehensive source of information on all items in the Albert Einstein Archives.

DIGITIZED MANUSCRIPTS

From 2003 to 2011, the site included approximately 3,000 high-quality digitized images of Einstein’s writings. This digitization of more than 900 documents written by Einstein was made possible by generous grants from the David and Fela Shapell family of Los Angeles. As of 2012, the site will enable free viewing and browsing of approximately 7,000 high-quality digitized images of Einstein’s writings. The digitization of close to 2,000 documents written by Einstein was produced by the Albert Einstein Archives Digitization Project and was made possible by the generous contribution of the Polonsky foundation. The digitization project will continue throughout 2012.

FINDING AID

The site enables access to the online version of the Albert Einstein Archives Finding Aid, a comprehensive description of the entire repository of Albert Einstein’s personal papers held at the Hebrew University. The Finding Aid, presented in Encoded Archival Description (EAD) format, provides the following information on the Einstein Archives: its identity, context, content, structure, conditions of access and use. It also contains a list of the folders in the Archives which will enable access to the Archival Database and to the Digitized Manuscripts.

ARCHIVAL DATABASE

From 2003 to 2011, the Archival Database included approximately 43,000 records of Einstein and Einstein-related documents. Supplementary archival holdings and databases pertaining to Einstein documents have been established at both the Einstein Papers Project and the Albert Einstein Archives for scholarly research. As of 2012 the Archival Database allows direct access to all 80,000 records of Einstein and Einstein-related documents in the original and the supplementary archive. The records published in this online version pertain to Albert Einstein’s scientific and non-scientific writings, his professional and personal correspondence, notebooks, travel diaries, personal documents, and third-party items contained in both the original collection of Einstein’s personal papers and in the supplementary archive.

Unless you are a professional archivist, I suspect you will want to start with the Gallery. Which for some UI design reason appears at the bottom of the homepage in small type. (Hint: It really should be a logo at top left, to interest the average visitor.)

When you do reach mss. images, the zoom/navigation is quite responsive, although a slightly larger image to clue the reader in on location would be better. In fact, one that is readable and yet subject to zoom would be ideal.

Another improvement would be to display a URL to allow exchange of links to particular images, along with X/Y coordinates to the images. As presented, every reader has to re-find information in images for themselves.
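To make the idea concrete, here is a minimal Python sketch of what such a shareable link could look like. Everything in it is an assumption for illustration: the viewer address, the `image`, `x`, `y` and `zoom` parameter names, and the two helper functions are hypothetical and are not part of the Einstein Archives site.

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def image_deep_link(base_url: str, image_id: str, x: int, y: int, zoom: int = 1) -> str:
    """Build a link to a point in an image (parameter names are hypothetical)."""
    query = urlencode({"image": image_id, "x": x, "y": y, "zoom": zoom})
    parts = urlparse(base_url)
    # Replace any existing query string with the image coordinates.
    return urlunparse(parts._replace(query=query))


def parse_deep_link(url: str) -> dict:
    """Recover the image id and coordinates from a link built above."""
    params = parse_qs(urlparse(url).query)
    return {
        "image": params["image"][0],
        "x": int(params["x"][0]),
        "y": int(params["y"][0]),
        "zoom": int(params["zoom"][0]),
    }
```

A reader who receives such a link lands on the same spot in the same image, instead of hunting for it again.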

Archiving material is good. Digital archives that enable wider access are better. Being able to reliably point into digital archives for commentary, comparison and other purposes is great.

Social Networks and Archival Context Project (SNAC)

Wednesday, January 11th, 2012

Social Networks and Archival Context Project (SNAC)

From the homepage:

The Social Networks and Archival Context Project (SNAC) will address the ongoing challenge of transforming description of and improving access to primary humanities resources through the use of advanced technologies. The project will test the feasibility of using existing archival descriptions in new ways, in order to enhance access and understanding of cultural resources in archives, libraries, and museums.

Archivists have a long history of describing the people who—acting individually, in families, or in formally organized groups—create and collect primary sources. They research and describe the people who create and are represented in the materials comprising our shared cultural legacy. However, because archivists have traditionally described records and their creators together, this information is tied to specific resources and institutions. Currently there is no system in place that aggregates and interrelates those descriptions.

Leveraging the new standard Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), the SNAC Project will use digital technology to “unlock” descriptions of people from finding aids and link them together in exciting new ways.

On the Prototype page you will find the following description:

While many of the names found in finding aids have been carefully constructed, frequently in consultation with LCNAF, many other names present extraction and matching challenges. For example, many personal names are in direct rather than indirect (or catalog entry) order. Life dates, if present, some times appear in parentheses or brackets. Numerous names some times appear in the same <persname>, <corpname>, or <famname>. Many names are incorrectly tagged, for example, a personal name tagged as a .

We will continue to refine the extraction and matching algorithms over the course of the project, but it is anticipated that it will only be possible to address some problems through manual editing, perhaps using “professional crowd sourcing.”
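
To get a feel for the extraction problems described, here is a rough Python sketch that strips trailing life dates (in parentheses, brackets, or after a comma) and turns a direct-order name into catalog-entry order. The regular expression and the function name are my own assumptions, not the SNAC project's algorithms, and the "last token is the surname" rule is deliberately naive.

```python
import re

# Trailing life dates such as "(1890-1974)", "[1890-1974]" or ", 1890-".
_DATES = re.compile(r"[\s,]*[\(\[]?\s*(\d{4})\s*-\s*(\d{4})?\s*[\)\]]?\s*$")


def normalize_name(raw):
    """Return (catalog_order_name, birth, death) for a personal name string."""
    birth = death = None
    match = _DATES.search(raw)
    if match:
        birth, death = match.group(1), match.group(2)
        raw = raw[:match.start()]  # keep only the part before the dates
    name = raw.strip().strip(",").strip()
    if "," not in name:
        # Direct order ("Vannevar Bush"): assume the last token is the surname.
        parts = name.split()
        if len(parts) > 1:
            name = f"{parts[-1]}, {' '.join(parts[:-1])}"
    return name, birth, death


for sample in ("Vannevar Bush (1890-1974)", "Bush, Vannevar, 1890-", "Thomas Jefferson"):
    print(normalize_name(sample))
```

Names with suffixes, multiple names in one element, or mis-tagged corporate names would defeat this quickly, which is presumably why manual editing is anticipated.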

While the project is still a prototype, it occurs to me that it would make a handy source of identifiers.

Try:

Or one of the many others you will find at: Find Corporate, Personal, and Family Archival Context Records.

OK, now I have a question for you: All of the foregoing also appear in Wikipedia.

For your comparison:

If you could choose only one identifier for a subject, would you choose the SNAC or the Wikipedia links?

I ask because some semantic approaches take a “one ring” approach to identification, ignoring the existence of multiple identifiers, even multiple URL identifiers, for the same subject.

Of course, you already know that with topic maps you can have multiple identifiers for any subject.

In CTM syntax:

bush-vannevar
http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml ;
http://en.wikipedia.org/wiki/Vannevar_Bush ;
- "Vannevar Bush" ;
- varname: "Bush, Vannevar, 1890-1974" ;
- varname: "Bush, Vannevar, 1890-" .

Which of course means that if I want to make a statement about the webpage for Vannevar Bush at Wikipedia, I can do so without any confusion:

wikipedia-vannevar-bush
= http://en.wikipedia.org/wiki/Vannevar_Bush ;
descr: "URL as subject locator." .

Or I can comment on a page at SNAC and map additional information to it. And you will always know if I am using the URL as an identifier or to point you towards a subject.
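
For those who think better in code, here is a simplified Python sketch of the same idea: a topic can carry several subject identifiers, topics that share one are merged, and a URL used as a subject locator (the page itself) is kept apart from the same URL used as a subject identifier (the person). The Topic class and merge_topics function are illustrative assumptions, a small subset of the merging rules in the Topic Maps Data Model and not any particular topic map engine.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    names: set = field(default_factory=set)
    subject_identifiers: set = field(default_factory=set)  # URLs that indicate the subject
    subject_locators: set = field(default_factory=set)     # URLs that are the subject


def merge_topics(topics):
    """Merge topics that share a subject identifier or a subject locator."""
    merged = []
    for topic in topics:
        current = Topic(set(topic.names),
                        set(topic.subject_identifiers),
                        set(topic.subject_locators))
        remaining = []
        for other in merged:
            if (current.subject_identifiers & other.subject_identifiers
                    or current.subject_locators & other.subject_locators):
                # Same subject: fold the earlier topic into the current one.
                current.names |= other.names
                current.subject_identifiers |= other.subject_identifiers
                current.subject_locators |= other.subject_locators
            else:
                remaining.append(other)
        merged = remaining + [current]
    return merged


SNAC = "http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml"
WIKI = "http://en.wikipedia.org/wiki/Vannevar_Bush"

result = merge_topics([
    Topic({"Vannevar Bush"}, {SNAC}),
    Topic({"Bush, Vannevar, 1890-"}, {WIKI}),
    Topic({"Bush, Vannevar, 1890-1974"}, {SNAC, WIKI}),  # bridges the two above
    Topic({"Wikipedia page on Vannevar Bush"}, set(), {WIKI}),  # the page, not the person
])
print(len(result))
```

The three topics about the person collapse into one with both identifiers, while the topic for the Wikipedia page stays separate because it uses the URL as a locator.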

Digging into Data Challenge

Thursday, January 5th, 2012

Digging into Data Challenge

From the homepage:

What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

I would be interested to hear your comments on the projects, as I am sure the project teams would be as well.

Tiered Storage Approaches to Big Data:…

Tuesday, December 13th, 2011

Tiered Storage Approaches to Big Data: Why look to the Cloud when you’re working with Galaxies?

Event Date: 12/15/2011 02:00 PM Eastern Standard Time

From the email:

The ability for organizations to keep up with the growth of Big Data in industries like satellite imagery, genomics, oil and gas, and media and entertainment has strained many storage environments. Though storage device costs continue to be driven down, corporations and research institutions have to look to setting up tiered storage environments to deal with increasing power and cooling costs and shrinking data center footprint of storing all this big data.

NASA’s Earth Observing System Data and Information Management (EOSDIS) is arguably a poster child when looking at large image file ingest and archive. Responsible for processing, archiving, and distributing Earth science satellite data (e.g., land, ocean and atmosphere data products), NASA EOSDIS handles hundreds of millions of satellite image data files averaging roughly from 7 MB to 40 MB in size and totaling over 3PB of data.

Discover long-term data tiering, archival, and data protection strategies for handling large files using a product like Quantum’s StorNext data management solution and similar solutions from a panel of three experts. Hear how NASA EOSDIS handles its data workflow and long term archival across four sites in North America and makes this data freely available to scientists.

Think of this as a starting point to learn some of the “lingo” in this area and perhaps hear some good stories about data and NASA.

Some questions to think about during the presentation/discussion:

How do you effectively access information after not only the terminology but the world view of a discipline has changed?

What do you have to know about the data and its storage?

How do the products discussed address those questions?

National Archives Digitization Tools Now on GitHub

Saturday, October 22nd, 2011

National Archives Digitization Tools Now on GitHub

From the post:

As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. GitHub is a service used by software developers to share and collaborate on software development projects and many open source development projects.

Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.

We shared our experiences with these applications with colleagues at other institutions such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying these applications within their own digitization workflows. We have made two digitization applications, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer” available on GitHub, and they are now available for use by other institutions and the public.

I suspect many government departments (U.S. and otherwise) have similar digitization workflow efforts underway. Perhaps greater publicity about these efforts will cause other departments to step forward.

Towards georeferencing archival collections

Friday, October 21st, 2011

Towards georeferencing archival collections

From the post:

One of the most effective ways to associate objects in archival collections with related objects is with controlled access terms: personal, corporate, and family names; places; subjects. These associations are meaningless if chosen arbitrarily. With respect to machine processing, Thomas Jefferson and Jefferson, Thomas are not seen as the same individual when judging by the textual string alone. While EADitor has incorporated authorized headings from LCSH and local vocabulary (scraped from terms found in EAD files currently in the eXist database) almost since its inception, it has not until recently interacted with other controlled vocabulary services. Interacting with EAC-CPF and geographical services is high on the development priority list.

geonames.org

Over the last week, I have been working on incorporating geonames.org queries into the XForms application. Geonames provides stable URIs for more than 7.5 million place names internationally. XML representations of each place are accessible through various REST APIs. These XML datastreams also include the latitude and longitude, which will make it possible to georeference archival collections as a whole or individual items within collections (an item-level indexing strategy will be offered in EADitor as an alternative to traditional, collection-based indexing soon).

This looks very interesting.
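
To make the georeferencing idea concrete, here is a minimal Python sketch that pulls a name, a URI, and coordinates out of a GeoNames-style XML response. The element names and the http://sws.geonames.org/ URI pattern are my assumptions about the GeoNames services, the sample record is a made-up placeholder rather than real data, and none of this is EADitor's XForms code.

```python
import xml.etree.ElementTree as ET

# Placeholder document shaped like a GeoNames XML search response.
# The place, identifier and coordinates are invented for illustration.
SAMPLE = """<geonames>
  <totalResultsCount>1</totalResultsCount>
  <geoname>
    <name>Exampleville</name>
    <lat>38.0</lat>
    <lng>-78.5</lng>
    <geonameId>1234567</geonameId>
    <countryCode>US</countryCode>
  </geoname>
</geonames>"""


def places_from_geonames(xml_text):
    """Return one dict per <geoname> element with its name, URI and coordinates."""
    root = ET.fromstring(xml_text)
    places = []
    for node in root.findall("geoname"):
        places.append({
            "name": node.findtext("name"),
            # Assumed URI pattern built from the numeric identifier.
            "uri": f"http://sws.geonames.org/{node.findtext('geonameId')}/",
            "lat": float(node.findtext("lat")),
            "lng": float(node.findtext("lng")),
        })
    return places


places = places_from_geonames(SAMPLE)
print(places)
```

Storing the URI alongside the place name in a finding aid is what would let two collections agree that they mention the same place, whatever string each one uses for it.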

Details:

EADitor project site (Google Code): http://code.google.com/p/eaditor/
Installation instructions (specific to Ubuntu but broadly applicable to all Unix-based systems): http://code.google.com/p/eaditor/wiki/UbuntuInstallation
Google Group: http://groups.google.com/group/eaditor