Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 7, 2013

Hot Topics: The DuraSpace Community Webinar Series

Filed under: Archives,Data Preservation,DSpace,Preservation — Patrick Durusau @ 7:14 pm

Hot Topics: The DuraSpace Community Webinar Series

From the DuraSpace about page:

DuraSpace supported open technology projects provide long-term, durable access to and discovery of digital assets. We put together global, strategic collaborations to sustain DSpace and Fedora, two of the most widely-used repository solutions in the world. More than fifteen hundred institutions use and help develop these open source software repository platforms. DSpace and Fedora are directly supported with in-kind contributions of development resources and financial donations through the DuraSpace community sponsorship program.

Like most of you, I’m familiar with DSpace and Fedora, but I wasn’t familiar with the “Hot Topics” webinar series. I encountered the “Hot Topics” page while following a link from Recommended! “Metadata and Repository Services for Research Data Curation” Webinar by Imma Subirats.

  • Series Six: Research Data in Repositories
  • Series Five: VIVO–Research Discovery and Networking
  • Series Four: Research Data Management Support
  • Series Three: Get a Head on Your Repository with Hydra End-to-End Solutions
  • Series Two: Managing and Preserving Audio and Video in your Digital Repository
  • Series One: Knowledge Futures: Digital Preservation Planning

Each series consists of three (3) webinars, all with recordings, most with slides as well.

Warning: Data curation doesn’t focus on the latest and coolest GPU processing techniques.

But in ten to fifteen years, when GPU techniques are as dated as COBOL is now, good data curation will enable future students to access those techniques.

I think that is worthwhile.

You?

November 5, 2013

The Shelley-Godwin Archive

The Shelley-Godwin Archive

From the homepage:

The Shelley-Godwin Archive will provide the digitized manuscripts of Percy Bysshe Shelley, Mary Wollstonecraft Shelley, William Godwin, and Mary Wollstonecraft, bringing together online for the first time ever the widely dispersed handwritten legacy of this uniquely gifted family of writers. The result of a partnership between the New York Public Library and the Maryland Institute for Technology in the Humanities, in cooperation with Oxford’s Bodleian Library, the S-GA also includes key contributions from the Huntington Library, the British Library, and the Houghton Library. In total, these partner libraries contain over 90% of all known relevant manuscripts.

In case you don’t recognize the names: Mary Shelley wrote Frankenstein; or, The Modern Prometheus; William Godwin was a philosopher and early modern (unfortunately theoretical) anarchist; Percy Bysshe Shelley was an English Romantic poet; and Mary Wollstonecraft was a writer and feminist. Quite a group for the time, or even now.

From the About page on Technological Infrastructure:

The technical infrastructure of the Shelley-Godwin Archive builds on linked data principles and emerging standards such as the Shared Canvas data model and the Text Encoding Initiative’s Genetic Editions vocabulary. It is designed to support a participatory platform where scholars, students, and the general public will be able to engage in the curation and annotation of the Archive’s contents.

The Archive’s transcriptions and software applications and libraries are currently published on GitHub, a popular commercial host for projects that use the Git version control system.

  • TEI transcriptions and other data
  • Shared Canvas viewer and search service
  • Shared Canvas manifest generation

All content and code in these repositories is available under open licenses (the Apache License, Version 2.0 and the Creative Commons Attribution license). Please see the licensing information in each individual repository for additional details.

Shared Canvas and Linked Open Data

Shared Canvas is a new data model designed to facilitate the description and presentation of physical artifacts—usually textual—in the emerging linked open data ecosystem. The model is based on the concept of annotation, which it uses both to associate media files with an abstract canvas representing an artifact, and to enable anyone on the web to describe, discuss, and reuse suitably licensed archival materials and digital facsimile editions. By allowing visitors to create connections to secondary scholarship, social media, or even scenes in movies, projects built on Shared Canvas attempt to break down the walls that have traditionally enclosed digital archives and editions.

Linked open data or content is published and licensed so that “anyone is free to use, reuse, and redistribute it—subject only, at most, to the requirement to attribute and/or share-alike,” (from http://opendefinition.org/) with the additional requirement that when an entity such as a person, a place, or thing that has a recognizable identity is referenced in the data, the reference is made using a well-known identifier—called a universal resource identifier, or “URI”—that can be shared between projects. Together, the linking and openness allow conformant sets of data to be combined into new data sets that work together, allowing anyone to publish their own data as an augmentation of an existing published data set without requiring extensive reformulation of the information before it can be used by anyone else.

The Shared Canvas data model was developed within the context of the study of medieval manuscripts to provide a way for all of the representations of a manuscript to co-exist in an openly addressable and shareable form. A relatively well-known example of this is the tenth-century Archimedes Palimpsest. Each of the pages in the palimpsest was imaged using a number of different wavelengths of light to bring out different characteristics of the parchment and ink. For example, some inks are visible under one set of wavelengths while other inks are visible under a different set. Because the original writing and the newer writing in the palimpsest used different inks, the images made using different wavelengths allow the scholar to see each ink without having to consciously ignore the other ink. In some cases, the ink has faded so much that it is no longer visible to the naked eye. The Shared Canvas data model brings together all of these different images of a single page by considering each image to be an annotation about the page instead of a surrogate for the page. The Shared Canvas website has a viewer that demonstrates how the imaging wavelengths can be selected for a page.

One important bit, at least for topic maps, is the view of the Shared Canvas data model that:

each image [is considered] to be an annotation about the page instead of a surrogate for the page.

If I tried to say that or even re-say it, it would be much more obscure. 😉

Whether “annotation about” versus “surrogate for” will catch on beyond manuscript studies is hard to say.

It’s not the way it is usually said in topic maps, but if other terminology is better understood, why not?
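The “annotation about” view is easy to make concrete. Below is a minimal sketch, in the spirit of the Shared Canvas model, of two wavelength captures attached to one abstract canvas; all URIs and property names are illustrative placeholders, not the archive’s actual data.

```python
import json

# Sketch: each image is an annotation *about* an abstract canvas,
# not a surrogate for the page. All URIs here are placeholders.

def image_annotation(image_uri, canvas_uri, label):
    """Attach a media file to a canvas as an annotation."""
    return {
        "@type": "oa:Annotation",
        "motivation": "sc:painting",
        "body": {"@id": image_uri, "@type": "dctypes:Image", "label": label},
        "target": canvas_uri,
    }

# Two wavelength captures of the same palimpsest page coexist
# as independent annotations of one canvas.
canvas = "http://example.org/canvas/page-57r"
annotations = [
    image_annotation("http://example.org/img/57r-visible.jpg", canvas, "visible light"),
    image_annotation("http://example.org/img/57r-uv.jpg", canvas, "ultraviolet"),
]

print(json.dumps(annotations[0], indent=2))
```

Because each image is an annotation rather than the page itself, adding a new capture is just adding another annotation; nothing about the canvas or the existing images has to change.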

July 10, 2013

Data Sharing and Management Snafu in 3 Short Acts

Filed under: Archives,Astroinformatics,Open Access,Open Data — Patrick Durusau @ 1:43 pm

As you may suspect, my concerns are focused on the preservation of the semantics of the field names, Sam1, Sam2, Sam3, but also with the field names that will be generated by the requesting researcher.

I found this video embedded in: A call for open access to all data used in AJ and ApJ articles by Kelle Cruz.

From the post:

I don’t fully understand it, but I know the Astronomical Journal (AJ) and Astrophysical Journal (ApJ) are different than many other journals: They are run by the American Astronomical Society (AAS) and not by a for-profit publisher. That means that the AAS Council and the members (the people actually producing and reading the science) have a lot of control over how the journals are run. In a recent President’s Column, the AAS President, David Helfand proposed a radical, yet obvious, idea for propelling our field into the realm of data sharing and open access: require all journal articles to be accompanied by the data on which the conclusions are based.

We are a data-rich—and data-driven—field [and] I am advocating [that authors provide] a link in articles to the data that underlies a paper’s conclusions…In my view, the time has come—and the technological resources are available—to make the conclusion of every ApJ or AJ article fully reproducible by publishing the data that underlie that conclusion. It would be an important step toward enhancing and sharing our scientific understanding of the universe.

Kelle points out several reasons why existing efforts are insufficient to meet the sharing and archiving needs of the astronomical community.

Suggested reading if you are concerned with astronomical data or archives more generally.

June 30, 2013

Preservation Vocabularies [3 types of magnetic storage medium?]

Filed under: Archives,Library,Linked Data,Vocabularies — Patrick Durusau @ 12:30 pm

Preservation Datasets

From the webpage:

The Linked Data Service is to provide access to commonly found standards and vocabularies promulgated by the Library of Congress. This includes data values and the controlled vocabularies that house them. Below are descriptions of each preservation vocabulary derived from the PREMIS standard. Inside each, a search box allows you to search the vocabularies individually.

New preservation vocabularies from the Library of Congress.

Your mileage will vary with these vocabularies.

Take storage for example.

As we all learned in school, there are only three kinds of magnetic “storage medium:”

  • hard disk
  • magnetic tape
  • TSM

😉

In case you don’t recognize TSM, it stands for IBM Tivoli Storage Manager.

Hmmmm, what about the twenty (20) types of optical disks?

Or other forms of magnetic media? Such as thumb drives, floppy disks, etc.

I pick “storage medium” at random.

Take a look at some of the other vocabularies and let me know what you think.

Please include links to more information in case the LOC decides to add more entries to its vocabularies.

I first saw this at: 21 New Preservation Vocabularies available at id.loc.gov.
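If you want to poke at these vocabularies programmatically, the labels live in SKOS properties. The sketch below extracts preferred labels from a trimmed, hand-made stand-in for an id.loc.gov record; the real response shape and the concept URIs are assumptions, so check against what the service actually returns.

```python
# Pull preferred labels out of a SKOS-style vocabulary record.
# The sample is a fabricated stand-in for an id.loc.gov response;
# the actual JSON shape and URIs may differ.

PREF = "http://www.w3.org/2004/02/skos/core#prefLabel"

sample = [
    {"@id": "http://id.loc.gov/vocabulary/preservation/storageMedium/hdd",
     PREF: [{"@value": "hard disk"}]},
    {"@id": "http://id.loc.gov/vocabulary/preservation/storageMedium/mt",
     PREF: [{"@value": "magnetic tape"}]},
    {"@id": "http://id.loc.gov/vocabulary/preservation/storageMedium/tsm",
     PREF: [{"@value": "TSM"}]},
]

def labels(records):
    """Map each concept URI to its first preferred label."""
    return {r["@id"]: r[PREF][0]["@value"] for r in records if PREF in r}

terms = labels(sample)
print(sorted(terms.values()))
```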

June 24, 2013

OpenGLAM

Filed under: Archives,Library,Museums,Open Data — Patrick Durusau @ 9:14 am

OpenGLAM

From the FAQ:

What is OpenGLAM?

OpenGLAM (Galleries, Libraries, Archives and Museum) is an initiative coordinated by the Open Knowledge Foundation that is committed to building a global cultural commons for everyone to use, access and enjoy.

OpenGLAM helps cultural institutions to open up their content and data through hands-on workshops, documentation and guidance and it supports a network of open culture evangelists through its Working Group.

What do we mean by “open”?

“Open” is a term you hear a lot these days. We’ve tried to get some clarity around this important issue by developing a clear and succinct definition of openness – see Open Definition.

The Open Definition says that a piece of content or data is open if “anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.”

There are a number of Open Definition compliant licenses that GLAMs are increasingly using to license digital content and data that they hold. Popular ones for data include CC0; for content, CC-BY or CC-BY-SA are often used.

Open access to cultural heritage materials will grow the need for better indexing/organization. As if you needed another reason to support it. 😉

May 26, 2013

SecLists.Org Security Mailing List Archive

Filed under: Archives,Cybersecurity,Security — Patrick Durusau @ 3:09 pm

SecLists.Org Security Mailing List Archive

Speaking of reading material for the summer, how do you keep up with hacking news?

From the webpage:

Any hacker will tell you that the latest news and exploits are not found on any web site—not even Insecure.Org. No, the cutting edge in security research is and will continue to be the full disclosure mailing lists such as Bugtraq. Here we provide web archives and RSS feeds (now including message extracts), updated in real-time, for many of our favorite lists. Browse the individual lists below, or search them all:

Site includes one of those “hit or miss” search boxes that doesn’t learn from the successes of other users.

It’s better than reading each post separately, but only just.

With every search, you still have to read the posts, over and over again.

SIAM Archives

Filed under: Algorithms,Archives,Computer Science — Patrick Durusau @ 2:05 pm

I saw an announcement for SDM 2014: SIAM International Conference on Data Mining, Philadelphia, Pennsylvania, USA, April 24 – 26, 2014, today, but the call for papers hasn’t appeared yet.

While visiting the conference site I followed the proceedings link to discover:

Workshop on Algorithm Engineering and Experiments (ALENEX) 2006 – 2013

Workshop on Analytic Algorithmics and Combinatorics (ANALCO) 2006 – 2013

ACM-SIAM Symposium on Discrete Algorithms (SODA) 2009 – 2013

Data Mining 2001 – 2013

Mathematics for Industry 2009

Just in case you are short on reading material for the summer. 😉

May 2, 2013

Create and Manage Data: Training Resources

Filed under: Archives,Data,Preservation — Patrick Durusau @ 2:07 pm

Create and Manage Data: Training Resources

From the webpage:

Our Managing and Sharing Data: Training Resources present a suite of flexible training materials for people who are charged with training researchers and research support staff in how to look after research data.

The Training Resources complement the UK Data Archive’s popular guide on ‘Managing and Sharing Data: best practice for researchers’, the most recent version published in May 2011.

They have been designed and used as part of the Archive’s daily work in supporting ESRC applicants and award holders and have been made possible by a grant from the ESRC Researcher Development Initiative (RDI).

The Training Resources are modularised following the UK Data Archive’s seven key areas of managing and sharing data:

  • sharing data – why and how
  • data management planning for researchers and research centres
  • documenting data
  • formatting data
  • storing data, including data security, data transfer, encryption, and file sharing
  • ethics and consent
  • data copyright

Each section contains:

  • introductory powerpoint(s)
  • presenter’s guide – where necessary
  • exercises and introduction to exercises
  • quizzes
  • answers

The materials are presented as used in our own training courses and are mostly geared towards social scientists. We anticipate trainers will create their own personalised and more context-relevant examples, for example by discipline, country, or relevant laws and regulations.

You can download individual modules from the relevant sections or download the whole resource in pdf format. Updates to pages were last made on 20 June 2012.

Download all resources.

Quite an impressive set of materials that will introduce you to some aspects of research data in the UK. Not all but some aspects.

What you don’t learn here you will pickup from interaction with people actively engaged with research data.

But it will give you a head start on understanding the research data community.

Unlike some technologies, topic maps are more about a community’s world view than the world view of topic maps.

April 9, 2013

Springer Book Archives [Proposal for Access]

Filed under: Archives,Books — Patrick Durusau @ 11:16 am

The Springer Book Archives now contain 72,000 titles

From the post:

Today at the British UKSG Conference in Bournemouth, Springer announced that the Springer Book Archives (SBA) now contain 72,000 eBooks. This news represents the latest developments in a project that seeks to digitize nearly every Springer book ever published, dating back to 1842 when the publishing company was founded. The titles are being digitized and made available again for the scientific community through SpringerLink (link.springer.com), Springer’s online platform.

By the end of 2013 an unprecedented collection of around 100,000 historic, scholarly eBooks, in both English and German, will be available through the SBA. Researchers, students and librarians will be able to access the full text of these books free of any digital rights management. Springer also offers a print-on-demand option for most of the books.

Notable authors whose works Springer has published include high-level researchers and Nobel laureates, such as Werner von Siemens, Rudolf Diesel, Emil Fischer and Marie Curie. Their publications will be a valuable addition to this historic online archive.

SBA section at Springer: http://www.springer.com/bookarchives

A truly remarkable achievement but access will remain problematic for a number of potential users.

I would like to see the United States government purchase (as in pay an annual fee) unlimited access to SpringerLink for any U.S. based IP address.

Springer gets more revenue than it does now from U.S. domains, reduces Springer’s licensing costs, benefits all colleges and universities, and provides everyone in the U.S. with access to first rate technical publications.

Not to mention that Springer gets the revenue from selling the print-on-demand paperback editions.

Seems like a no-brainer if you are looking to jump start a knowledge economy.

PS: Forward this to your Senator/Representative. Could be a viable model to satisfy the needs of publishers and readers.

I first saw this at: Springer Book Archives Now Contain 72,000 eBooks by Africa S. Hands.

March 19, 2013

… Preservation and Stewardship of Scholarly Works, 2012 Supplement

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement by Charles W. Bailey, Jr.

From the webpage:

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement presents over 130 English-language articles, books, and technical reports published in 2012 that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., media, e-journals, research data), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns.

It is a supplement to the Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which covers over 650 works published from 2000 through 2011. All included works are in English. The bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings.

The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

The bibliography is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Supplement to “the” starting point for research on digital curation.

March 9, 2013

Research Data Symposium – Columbia

Research Data Symposium – Columbia.

Posters from the Research Data Symposium, held at Columbia University, February 27, 2013.

Subject to the limitations of the poster genre but useful as a quick overview of current projects and directions.

February 24, 2013

usenet-legend

Filed under: Archives,Software — Patrick Durusau @ 7:56 pm

usenet-legend by Zach Beane

From the description:

This is Usenet Legend, an application for producing a searchable archive of an author’s comp.lang.lisp history from Ron Garrett’s large archive dump.

Zach mentions this in his post The Rob Warnock Lisp Usenet Archive but I thought it needed a separate post.

Making content more navigable is always a step in the right direction.

January 16, 2013

Research Data Curation Bibliography

Filed under: Archives,Curation,Data Preservation,Librarian/Expert Searchers,Library — Patrick Durusau @ 7:56 pm

Research Data Curation Bibliography (version 2) by Charles W. Bailey.

From the introduction:

The Research Data Curation Bibliography includes selected English-language articles, books, and technical reports that are useful in understanding the curation of digital research data in academic and other research institutions. For broader coverage of the digital curation literature, see the author's Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which presents over 650 English-language articles, books, and technical reports.

The "digital curation" concept is still evolving. In "Digital Curation and Trusted Repositories: Steps toward Success," Christopher A. Lee and Helen R. Tibbo define digital curation as follows:

Digital curation involves selection and appraisal by creators and archivists; evolving provision of intellectual access; redundant storage; data transformations; and, for some materials, a commitment to long-term preservation. Digital curation is stewardship that provides for the reproducibility and re-use of authentic digital data and other digital assets. Development of trustworthy and durable digital repositories; principles of sound metadata creation and capture; use of open standards for file formats and data encoding; and the promotion of information management literacy are all essential to the longevity of digital resources and the success of curation efforts.

This bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, interviews, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings. Coverage of technical reports is very selective.

Most sources have been published from 2000 through 2012; however, a limited number of earlier key sources are also included. The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Such links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

An archive of prior versions of the bibliography is available.

If you are a beginning library student, take the time to know the work of Charles Bailey. He has consistently made a positive contribution for researchers from very early in the so-called digital revolution.

To the extent that you want to design topic maps for data curation, long or short term, the 200+ items in this bibliography will introduce you to some of the issues you will be facing.

January 10, 2013

Stop Hosting Data and Code on your Lab Website

Filed under: Archives,Data — Patrick Durusau @ 1:45 pm

Stop Hosting Data and Code on your Lab Website by Stephen Turner.

From the post:

It’s happened to all of us. You read about a new tool, database, webservice, software, or some interesting and useful data, but when you browse to http://institution.edu/~home/professorX/lab/data, there’s no trace of what you were looking for.

THE PROBLEM

This isn’t an uncommon problem. See the following two articles:

Schultheiss, Sebastian J., et al. “Persistence and availability of web services in computational biology.” PLoS one 6.9 (2011): e24914. 

Wren, Jonathan D. “404 not found: the stability and persistence of URLs published in MEDLINE.” Bioinformatics 20.5 (2004): 668-672.

The first gives us some alarming statistics. In a survey of nearly 1000 web services published in the Nucleic Acids Web Server Issue between 2003 and 2009:

  • Only 72% were still available at the published address.
  • The authors could not test the functionality for 33% because there was no example data, and 13% no longer worked as expected.
  • The authors could only confirm positive functionality for 45%.
  • Only 274 of the 872 corresponding authors answered an email.
  • Of these 78% said a service was developed by a student or temporary researcher, and many had no plan for maintenance after the researcher had moved on to a permanent position.

The Wren et al. paper found that of 1630 URLs identified in Pubmed abstracts, only 63% were consistently available. That rate was far worse for anonymous login FTP sites (33%).
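The surveys above boil down to a simple computation: issue a request per published URL and tally what still answers. A toy sketch of the tallying step, with made-up results standing in for real HTTP checks:

```python
# Tally availability from (url, status) pairs. Statuses are made up;
# a real survey would fetch each URL (e.g. with urllib.request) and
# record the HTTP status, or None on a failed connection.

def availability(results):
    """Fraction of URLs that answered with a non-error status."""
    ok = sum(1 for _, status in results if status is not None and status < 400)
    return ok / len(results)

sample = [
    ("http://example.org/lab/data", 200),
    ("http://example.org/~professorX/tool", 404),   # page gone
    ("http://example.org/ftp/archive", None),       # connection failed
    ("http://example.org/service", 200),
]

print(f"{availability(sample):.0%} still available")
```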

Is this a problem for published data in the topic map community?

What data should we be archiving? Discussion lists? Blogs? Public topic maps?

What do you think of Stephen’s solution?

November 19, 2012

D-Lib

Filed under: Archives,Digital Library,Library — Patrick Durusau @ 4:26 pm

D-Lib

From the about page:

D-Lib Magazine is an electronic publication with a focus on digital library research and development, including new technologies, applications, and contextual social and economic issues. D-Lib Magazine appeals to a broad technical and professional audience. The primary goal of the magazine is timely and efficient information exchange for the digital library community to help digital libraries be a broad interdisciplinary field, and not a set of specialties that know little of each other.

I am about to post concerning an article in D-Lib and realized I don’t have a blog entry on D-Lib!

Not that it is topic map specific, but it is digital library specific, with all the issues that entails. Remarkably similar to the issues any topic map author or software will face.

D-Lib has proven what many of us suspected:

The quality of content is not related to the medium of delivery.

Enjoy!

November 2, 2012

R mailing lists archive

Filed under: Archives,R — Patrick Durusau @ 4:29 pm

R mailing lists archive

From the webpage:

This is an archive of the four main R mailing lists, R-announce, R-packages, R-help and R-devel, as well as the New Zealand and Australian list R-downunder. The archive is automatically updated multiple times a day, so anything posted to the list should be in the archive within about 2 hours….

I saw this on a Twitter feed from One R Tip a Day.

A good example of current email archiving practice. A baseline that you need to exceed (at least) in order to interest users.

What would you do differently?

How would you create a topic map for such a resource?

In what ways would your topic map exceed the capabilities seen here?
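One modest answer to the last question: merge posts whose subject lines identify the same thread, so “Re:” and list-tag variants collapse into one topic rather than scattering across search results. The normalization rules and sample posts below are illustrative only:

```python
import re

# Merge mailing list posts that share a normalized subject line,
# so "Re: [R] lm question" and "[R] lm question" land in one topic.
# The rules and sample posts are illustrative, not a real archive.

def subject_key(subject):
    """Strip reply and list-tag prefixes so variant subjects merge."""
    s = re.sub(r"^(re:|fwd:)\s*", "", subject.strip(), flags=re.I)
    s = re.sub(r"^\[[^\]]+\]\s*", "", s)  # drop list tags like [R]
    return s.lower()

def merge(posts):
    """Group (subject, author) pairs into topics by normalized subject."""
    topics = {}
    for subject, author in posts:
        topics.setdefault(subject_key(subject), []).append(author)
    return topics

posts = [
    ("[R] lm question", "alice"),
    ("Re: [R] lm question", "bob"),
    ("[R-help] plotting dates", "carol"),
]

topics = merge(posts)
print(len(topics), sorted(topics["lm question"]))
```

A topic map would go further, merging threads split across the five lists and binding posts to topics for packages, functions, and authors, but subject normalization is the baseline step.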

September 13, 2012

Europeana opens up data on 20 million cultural items

Filed under: Archives,Data,Dataset,Europeana,Library,Museums — Patrick Durusau @ 3:25 pm

Europeana opens up data on 20 million cultural items by Jonathan Gray (Open Knowledge Foundation):

From the post:

Europe’s digital library Europeana has been described as the ‘jewel in the crown’ of the sprawling web estate of EU institutions.

It aggregates digitised books, paintings, photographs, recordings and films from over 2,200 contributing cultural heritage organisations across Europe – including major national bodies such as the British Library, the Louvre and the Rijksmuseum.

Today [Wednesday, 12 September 2012] Europeana is opening up data about all 20 million of the items it holds under the CC0 rights waiver. This means that anyone can reuse the data for any purpose – whether using it to build applications to bring cultural content to new audiences in new ways, or analysing it to improve our understanding of Europe’s cultural and intellectual history.

This is a coup d’etat for advocates of open cultural data. The data is being released after a grueling and unenviable internal negotiation process that has lasted over a year – involving countless meetings, workshops, and white papers presenting arguments and evidence for the benefits of openness.

That is good news!

A familiar issue that it overcomes:

To complicate things even further, many public institutions actively prohibit the redistribution of information in their catalogues (as they sell it to – or are locked into restrictive agreements with – third party companies). This means it is not easy to join the dots to see which items live where across multiple online and offline collections.

Oh, yeah! That was one of Google’s reasons for pulling the plug on the Open Knowledge Graph. Google had restrictive agreements so you can only connect the dots with Google products. (I think there is a name for that, let me think about it. Maybe an EU prosecutor might know it. You could always ask.)

What are you going to be mapping from this collection?
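“Joining the dots” is exactly what shared identifiers buy you. A toy sketch of how two independently published descriptions of the same item combine once both key their statements to one well-known URI (the URIs and facts here are invented for illustration):

```python
# Two independently published data sets describe the same item.
# Because both key their statements to the same URI, they combine
# without any reformulation. URIs and facts are invented.

dataset_a = {
    "http://example.org/id/frankenstein-ms": {
        "title": "Frankenstein draft notebook",
        "held_by": "Bodleian Library",
    },
}
dataset_b = {
    "http://example.org/id/frankenstein-ms": {
        "digitized": True,
    },
}

def combine(*datasets):
    """Union of facts, grouped by shared identifier."""
    merged = {}
    for ds in datasets:
        for uri, facts in ds.items():
            merged.setdefault(uri, {}).update(facts)
    return merged

merged = combine(dataset_a, dataset_b)
print(merged["http://example.org/id/frankenstein-ms"])
```

This is also where the restrictive catalogue agreements quoted above bite: if one institution’s identifiers cannot be redistributed, the `setdefault` step has nothing shared to merge on.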

September 11, 2012

Linked Data in Libraries, Archives, and Museums

Filed under: Archives,Library,Linked Data,Museums — Patrick Durusau @ 2:23 pm

Linked Data in Libraries, Archives, and Museums Information Standards Quarterly (ISQ) Spring/Summer 2012, Volume 24, no. 2/3 http://dx.doi.org/10.3789/isqv24n2-3.2012.

Interesting reading on linked data.

I have some comments on the “discovery” of the need to manage “diverse, heterogeneous metadata” but will save them for another post.

From the “flyer” that landed in my inbox:

The National Information Standards Organization (NISO) announces the publication of a special themed issue of the Information Standards Quarterly (ISQ) magazine on Linked Data for Libraries, Archives, and Museums. ISQ Guest Content Editor, Corey Harper, Metadata Services Librarian, New York University has pulled together a broad range of perspectives on what is happening today with linked data in cultural institutions. He states in his introductory letter, “As the Linked Data Web continues to expand, significant challenges remain around integrating such diverse data sources. As the variance of the data becomes increasingly clear, there is an emerging need for an infrastructure to manage the diverse vocabularies used throughout the Web-wide network of distributed metadata. Development and change in this area has been rapidly increasing; this is particularly exciting, as it gives a broad overview on the scope and breadth of developments happening in the world of Linked Open Data for Libraries, Archives, and Museums.”

The feature article by Gordon Dunsire, Corey Harper, Diane Hillmann, and Jon Phipps on Linked Data Vocabulary Management describes the shift in popular approaches to large-scale metadata management and interoperability to the increasing use of the Resource Description Framework to link bibliographic data into the larger web community. The authors also identify areas where best practices and standards are needed to ensure a common and effective linked data vocabulary infrastructure.

Four “in practice” articles illustrate the growth in the implementation of linked data in the cultural sector. Jane Stevenson in Linking Lives describes the work to enable structured and linked data from the Archives Hub in the UK. In Joining the Linked Data Cloud in a Cost-Effective Manner, Seth van Hooland, Ruben Verborgh, and Rik Van de Walle show how general purpose Interactive Data Transformation tools, such as Google Refine, can be used to efficiently perform the necessary task of data cleaning and reconciliation that precedes the opening up of linked data. Ted Fons, Jeff Penka, and Richard Wallis discuss OCLC’s Linked Data Initiative and the use of Schema.org in WorldCat to make library data relevant on the web. In Europeana: Moving to Linked Open Data, Antoine Isaac, Robina Clayphan, and Bernhard Haslhofer explain how the metadata for over 23 million objects are being converted to an RDF-based linked data model in the European Union’s flagship digital cultural heritage initiative.

Jon Voss provides a status report on the Linked Open Data for Libraries, Archives, and Museums (LODLAM) State of Affairs and the annual summit to advance this work. Thomas Elliott, Sebastian Heath, and John Muccigrosso report on the Linked Ancient World Data Institute, a workshop to further the availability of linked open data for creating reusable digital resources within the classical studies disciplines.

Kevin Ford wraps up the contributed articles with a standard spotlight article on LC’s Bibliographic Framework Initiative and the Attractiveness of Linked Data. This Library of Congress-led community effort aims to transition from MARC 21 to a linked data model. “The move to a linked data model in libraries and other cultural institutions represents one of the most profound changes that our community is confronting,” stated Todd Carpenter, NISO Executive Director. “While it completely alters the way we have always described and cataloged bibliographic information, it offers tremendous opportunities for making this data accessible and usable in the larger, global web community. This special issue of ISQ demonstrates the great strides that libraries, archives, and museums have already made in this arena and illustrates the future world that awaits us.”

August 10, 2012

Comparing Digital Preservation Glossaries [Why Do We Need Common Vocabularies?]

Filed under: Archives,Digital Library,Glossary,Preservation — Patrick Durusau @ 8:28 am

From AIP to Zettabyte: Comparing Digital Preservation Glossaries

Emily Reynolds (2012 Junior Fellow) writes:

As we mentioned in our introductory post last month, the OSI Junior Fellows are working on a project involving a draft digital preservation policy framework. One component of our work is revising a glossary that accompanies the framework. We’ve spent the last two weeks poring through more than two dozen glossaries relating to digital preservation concepts to locate and refine definitions to fit the terms used in the document.

We looked at dictionaries from well-established archival entities like the Society of American Archivists, as well as more strictly technical organizations like the Internet Engineering Task Force. While some glossaries take a traditional archival approach, others were more technical; we consulted documents primarily focusing on electronic records, archives, digital storage and other relevant fields. Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary. Based on what we found, that vocabulary will have to be broadly drawn and flexible to meet different kinds of requirements.

OSI = Office of Strategic Initiatives (Library of Congress)

Not to be overly critical, but I stumble over:

Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary.

Why does “variety in the definitions for other terms…highlight the need for a common vocabulary”?

I take it as a given that we have diverse vocabularies.

And that attempts at “common” vocabularies succeed in creating yet another “diverse” vocabulary.

So, why would anyone looking at “diverse” vocabularies jump to the conclusion that a “common” vocabulary is required?

Perhaps what is missing is the definition of the problem presented by “diverse” vocabularies.

Hard to solve a problem if you don’t know what it is. (Hasn’t stopped some people that I know, but that is a story for another day.)

I put it to you (and in your absence I will answer, so answer quickly):

What is the problem (or problems) presented by diverse vocabularies? (Feel free to use examples.)

Or if you prefer, Why do we need common vocabularies?
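To make the “problem presented by diverse vocabularies” concrete: the same concept defined under different terms defeats naive lookup, and the usual fix is a synonym mapping, which is itself just one more vocabulary. A small sketch (the glossary entries below are illustrative paraphrases, not quotations from any real glossary):

```python
# Two digital-preservation glossaries defining the same concept
# under different terms (entries are illustrative, not quotations).
glossary_a = {"AIP": "A package of content and metadata accepted for preservation."}
glossary_b = {"Archival Information Package": "Content plus metadata held for the long term."}

def lookup(term, *glossaries):
    """Naive string lookup: fails across vocabularies."""
    for g in glossaries:
        if term in g:
            return g[term]
    return None

# The naive lookup misses glossary_b's entry for the same concept.
assert lookup("AIP", glossary_b) is None

# A "common vocabulary" is, in practice, one more mapping layer:
synonyms = {"AIP": ["AIP", "Archival Information Package"]}

def lookup_mapped(term, synonyms, *glossaries):
    """Resolve a term through its known aliases before looking it up."""
    hits = []
    for alias in synonyms.get(term, [term]):
        for g in glossaries:
            if alias in g:
                hits.append(g[alias])
    return hits

# With the mapping, both glossaries' definitions surface.
assert len(lookup_mapped("AIP", synonyms, glossary_a, glossary_b)) == 2
```

Note that `synonyms` does not replace the diverse vocabularies; it records a mapping between them, which is closer to the topic map position than to a single “common vocabulary.”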

June 10, 2012

Citizen Archivist Dashboard [“…help the next person discover that record”]

Filed under: Archives,Crowd Sourcing,Indexing,Tagging — Patrick Durusau @ 8:15 pm

Citizen Archivist Dashboard

What’s the common theme of these interfaces from the National Archives (United States)?

  • Tag – Tagging is a fun and easy way for you to help make National Archives records more easily found online. By adding keywords, terms, and labels to a record, you can do your part to help the next person discover that record. For more information about tagging National Archives records, follow “Tag It Tuesdays,” a weekly feature on the NARAtions Blog. [includes “missions” (sets of materials for tagging), rated as “beginner,” “intermediate,” and “advanced.” Or you can create your own mission.]
  • Transcribe – By contributing to transcriptions, you can help the National Archives make historical documents more accessible. Transcriptions help in searching for the document as well as in reading and understanding the document. The work you do transcribing a handwritten or typed document will help the next person discover and use that record.

    The transcription tool features over 300 documents ranging from the late 18th century through the 20th century for citizen archivists to transcribe. Documents include letters to a civil war spy, presidential records, suffrage petitions, and fugitive slave case files.

    [A pilot project with 300 documents but one you should follow. Public transcription (crowd-sourced if you want the popular term) of documents has the potential to open up vast archives of materials.]

  • Edit Articles – Our Archives Wiki is an online space for researchers, educators, genealogists, and Archives staff to share information and knowledge about the records of the National Archives and about their research.

    Here are just a few of the ways you may want to participate:

    • Create new pages and edit pre-existing pages
    • Share your research tips
    • Store useful information discovered during research
    • Expand upon a description in our online catalog

    Check out the “Getting Started” page. When you’re ready to edit, you’ll need to log in by creating a username and password.

  • Upload & Share – Calling all researchers! Start sharing your digital copies of National Archives records on the Citizen Archivist Research group on Flickr today.

    Researchers scan and photograph National Archives records every day in our research rooms across the country — that’s a lot of digital images for records that are not yet available online. If you have taken scans or photographs of records you can help make them accessible to the public and other researchers by sharing your images with the National Archives Citizen Archivist Research Group on Flickr.

  • Index the Census – Citizen Archivists, you can help index the 1940 census!

    The National Archives is supporting the 1940 census community indexing project along with other archives, societies, and genealogical organizations. The release of the decennial census is one of the most eagerly awaited record openings. The 1940 census is available to search and browse, free of charge, on the National Archives 1940 Census web site. But, the 1940 census is not yet indexed by name.

    You can help index the 1940 census by joining the 1940 census community indexing project. To get started you will need to download and install the indexing software, register as an indexing volunteer, and download a batch of images to transcribe. When the index is completed, the National Archives will make the named index available for free.

The common theme?

The tagging entry sums it up with: “…you can do your part to help the next person discover that record.”

That’s the “trick” of topic maps. Once a fact about a subject is found, you can preserve your “finding” for the next person.
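In code, “preserve your finding for the next person” is nothing more than a shared index that outlives the individual searcher. A minimal sketch, with made-up record identifiers (the function names and store are hypothetical, not NARA’s actual tagging backend):

```python
from collections import defaultdict

# Hypothetical tag store: record identifier -> set of crowd-contributed tags.
tags = defaultdict(set)

def tag_record(record_id, *keywords):
    """One researcher's finding, preserved for everyone who comes after."""
    tags[record_id].update(k.lower() for k in keywords)

def find_records(keyword):
    """Later searchers discover records through earlier taggers' work."""
    kw = keyword.lower()
    return {rid for rid, kws in tags.items() if kw in kws}

tag_record("NARA-12345", "Civil War", "espionage")
tag_record("NARA-67890", "suffrage", "petition")

# A search benefits from a tag someone else contributed earlier.
assert find_records("Espionage") == {"NARA-12345"}
```

The topic map move is the same, except the preserved finding is an identity statement about a subject rather than a bare keyword.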

April 27, 2012

ArcSpread for analyzing web archives

Filed under: Archives — Patrick Durusau @ 6:10 pm

ArcSpread for analyzing web archives

Pete Warden writes:

Stanford runs a fantastic project for capturing important web pages as they change over time, and then presenting the results in a form that future historians will be able to use. This paper talks about some of the techniques they use for removing boilerplate navigation and ad content, so that researchers can work with the meat of the page.

I was relieved to read:

We did not excise any advertising images from the presented pages, but asked participants to disregard advertising related images.

Poorly done digital newspaper archives remove advertising content on a “meat of the page” theory.

Researchers then cannot see what was advertised, how, and at what prices. Ads may not interest us, but may interest others.

At one time thousands if not hundreds of thousands of people knew how Egyptian pyramids were built.

So commonly known it was not written down.

Perhaps there is a lesson there for us.

March 22, 2012

Einstein Archives Online

Filed under: Archives,Dataset — Patrick Durusau @ 7:41 pm

Einstein Archives Online

From the “about” page:

The Einstein Archives Online Website provides the first online access to Albert Einstein’s scientific and non-scientific manuscripts held by the Albert Einstein Archives at the Hebrew University of Jerusalem, constituting the material record of one of the most influential intellects in the modern era. It also enables access to the Einstein Archive Database, a comprehensive source of information on all items in the Albert Einstein Archives.

DIGITIZED MANUSCRIPTS

From 2003 to 2011, the site included approximately 3,000 high-quality digitized images of Einstein’s writings. This digitization of more than 900 documents written by Einstein was made possible by generous grants from the David and Fela Shapell family of Los Angeles. As of 2012, the site will enable free viewing and browsing of approximately 7,000 high-quality digitized images of Einstein’s writings. The digitization of close to 2,000 documents written by Einstein was produced by the Albert Einstein Archives Digitization Project and was made possible by the generous contribution of the Polonsky foundation. The digitization project will continue throughout 2012.

FINDING AID

The site enables access to the online version of the Albert Einstein Archives Finding Aid, a comprehensive description of the entire repository of Albert Einstein’s personal papers held at the Hebrew University. The Finding Aid, presented in Encoded Archival Description (EAD) format, provides the following information on the Einstein Archives: its identity, context, content, structure, conditions of access and use. It also contains a list of the folders in the Archives which will enable access to the Archival Database and to the Digitized Manuscripts.

ARCHIVAL DATABASE

From 2003 to 2011, the Archival Database included approximately 43,000 records of Einstein and Einstein-related documents. Supplementary archival holdings and databases pertaining to Einstein documents have been established at both the Einstein Papers Project and the Albert Einstein Archives for scholarly research. As of 2012 the Archival Database allows direct access to all 80,000 records of Einstein and Einstein-related documents in the original and the supplementary archive. The records published in this online version pertain to Albert Einstein’s scientific and non-scientific writings, his professional and personal correspondence, notebooks, travel diaries, personal documents, and third-party items contained in both the original collection of Einstein’s personal papers and in the supplementary archive.

Unless you are a professional archivist, I suspect you will want to start with the Gallery, which for some UI design reason appears at the bottom of the homepage in small type. (Hint: It really should be a logo at top left, to interest the average visitor.)

When you do reach mss. images, the zoom/navigation is quite responsive, although a slightly larger image to clue the reader in on location would be better. In fact, one that is readable and yet subject to zoom would be ideal.

Another improvement would be to display a URL to allow exchange of links to particular images, along with X/Y coordinates to the images. As presented, every reader has to re-find information in images for themselves.

Archiving material is good. Digital archives that enable wider access is better. Being able to reliably point into digital archives for commentary, comparison and other purposes is great.

January 11, 2012

Social Networks and Archival Context Project (SNAC)

Filed under: Archives,Networks,Social Graphs,Social Networks — Patrick Durusau @ 8:03 pm

Social Networks and Archival Context Project (SNAC)

From the homepage:

The Social Networks and Archival Context Project (SNAC) will address the ongoing challenge of transforming description of and improving access to primary humanities resources through the use of advanced technologies. The project will test the feasibility of using existing archival descriptions in new ways, in order to enhance access and understanding of cultural resources in archives, libraries, and museums.

Archivists have a long history of describing the people who—acting individually, in families, or in formally organized groups—create and collect primary sources. They research and describe the people who create and are represented in the materials comprising our shared cultural legacy. However, because archivists have traditionally described records and their creators together, this information is tied to specific resources and institutions. Currently there is no system in place that aggregates and interrelates those descriptions.

Leveraging the new standard Encoded Archival Context-Corporate Bodies, Persons, and Families (EAC-CPF), the SNAC Project will use digital technology to “unlock” descriptions of people from finding aids and link them together in exciting new ways.

On the Prototype page you will find the following description:

While many of the names found in finding aids have been carefully constructed, frequently in consultation with LCNAF, many other names present extraction and matching challenges. For example, many personal names are in direct rather than indirect (or catalog entry) order. Life dates, if present, sometimes appear in parentheses or brackets. Numerous names sometimes appear in the same <persname>, <corpname>, or <famname>. Many names are incorrectly tagged, for example, a personal name tagged as a .

We will continue to refine the extraction and matching algorithms over the course of the project, but it is anticipated that it will only be possible to address some problems through manual editing, perhaps using “professional crowd sourcing.”
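The extraction and matching challenges the prototype describes — direct versus indirect name order, life dates in parentheses or brackets — can be sketched in a few lines. This is an illustrative normalizer, not SNAC’s actual algorithm, and `normalize_name` is a hypothetical function name:

```python
import re

def normalize_name(raw):
    """Reduce a finding-aid name string to (surname, forename, dates).

    Handles the cases the SNAC prototype describes: direct vs indirect
    order and life dates in parentheses or brackets. Illustrative only.
    """
    # Pull out life dates like (1890-1974), [1890-1974], a bare 1890-1974,
    # or an open-ended 1890- .
    m = re.search(r"[\(\[]?(\d{4})\s*-\s*(\d{4})?[\)\]]?", raw)
    dates = (m.group(1), m.group(2)) if m else (None, None)
    name = re.sub(r"[\(\[]?\d{4}\s*-\s*(\d{4})?[\)\]]?", "", raw).strip(" ,")
    if "," in name:                      # indirect order: "Bush, Vannevar"
        surname, forename = [p.strip() for p in name.split(",", 1)]
    else:                                # direct order: "Vannevar Bush"
        parts = name.split()
        surname, forename = parts[-1], " ".join(parts[:-1])
    return surname, forename, dates

# Direct and indirect forms of the same person match after normalization.
assert normalize_name("Bush, Vannevar, 1890-1974")[:2] == ("Bush", "Vannevar")
assert normalize_name("Vannevar Bush")[:2] == ("Bush", "Vannevar")
```

Even this toy version shows why manual editing remains necessary: compound surnames, honorifics, and mis-tagged corporate names all break the simple heuristics.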

While the project is still a prototype, it occurs to me that it would make a handy source of identifiers.

Try:

Or one of the many others you will find at: Find Corporate, Personal, and Family Archival Context Records.

OK, now I have a question for you: All of the foregoing also appear in Wikipedia.

For your comparison:

If you could choose only one identifier for a subject, would you choose the SNAC or the Wikipedia links?

I ask because some semantic approaches take a “one ring” approach to identification. Ignoring the existence of multiple identifiers, even URL identifiers for the same subjects.

Of course, you already know that with topic maps you can have multiple identifiers for any subject.

In CTM syntax:

bush-vannevar
    http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml ;
    http://en.wikipedia.org/wiki/Vannevar_Bush ;
    - "Vannevar Bush" ;
    - varname: "Bush, Vannevar, 1890-1974" ;
    - varname: "Bush, Vannevar, 1890-" .

Which of course means that if I want to make a statement about the webpage for Vannevar Bush at Wikipedia, I can do so without any confusion:

wikipedia-vannevar-bush
    = http://en.wikipedia.org/wiki/Vannevar_Bush ;
    descr: "URL as subject locator." .

Or I can comment on a page at SNAC and map additional information to it. And you will always know if I am using the URL as an identifier or to point you towards a subject.
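The merging behavior that multiple identifiers buy you can be sketched in a few lines of Python: topics that share any subject identifier are treated as the same subject. This is a deliberately simplified model (it ignores the subject locator vs. subject identifier distinction, and merges only pairwise):

```python
def merge_topics(topics):
    """Union topics that share at least one subject identifier."""
    merged = []
    for topic in topics:
        ids = set(topic["identifiers"])
        names = set(topic["names"])
        for m in merged:
            if m["identifiers"] & ids:   # any shared identifier -> same subject
                m["identifiers"] |= ids
                m["names"] |= names
                break
        else:
            merged.append({"identifiers": ids, "names": names})
    return merged

snac = {"identifiers": {"http://socialarchive.iath.virginia.edu/xtf/view?docId=bush-vannevar-1890-1974-cr.xml",
                        "http://en.wikipedia.org/wiki/Vannevar_Bush"},
        "names": {"Vannevar Bush"}}
wiki = {"identifiers": {"http://en.wikipedia.org/wiki/Vannevar_Bush"},
        "names": {"Bush, Vannevar, 1890-1974"}}

merged = merge_topics([snac, wiki])
assert len(merged) == 1
assert merged[0]["names"] == {"Vannevar Bush", "Bush, Vannevar, 1890-1974"}
```

A “one ring” approach would have forced a choice between the SNAC and Wikipedia URLs up front; here both survive, and anyone else’s statements keyed to either URL still find the merged topic.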

January 5, 2012

Digging into Data Challenge

Filed under: Archives,Contest,Data Mining,Library,Preservation — Patrick Durusau @ 4:09 pm

Digging into Data Challenge

From the homepage:

What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

I’d be interested to hear your comments on the projects, as I am sure the project teams would be as well.

December 13, 2011

Tiered Storage Approaches to Big Data:…

Filed under: Archives,Data,Storage — Patrick Durusau @ 9:47 pm

Tiered Storage Approaches to Big Data: Why look to the Cloud when you’re working with Galaxies?

Event Date: 12/15/2011 02:00 PM Eastern Standard Time

From the email:

The ability for organizations to keep up with the growth of Big Data in industries like satellite imagery, genomics, oil and gas, and media and entertainment has strained many storage environments. Though storage device costs continue to be driven down, corporations and research institutions have to look to setting up tiered storage environments to deal with increasing power and cooling costs and shrinking data center footprint of storing all this big data.

NASA’s Earth Observing System Data and Information Management (EOSDIS) is arguably a poster child when looking at large image file ingest and archive. Responsible for processing, archiving, and distributing Earth science satellite data (e.g., land, ocean and atmosphere data products), NASA EOSDIS handles hundreds of millions of satellite image data files averaging roughly from 7 MB to 40 MB in size and totaling over 3PB of data.

Discover long-term data tiering, archival, and data protection strategies for handling large files using a product like Quantum’s StorNext data management solution and similar solutions from a panel of three experts. Hear how NASA EOSDIS handles its data workflow and long term archival across four sites in North America and makes this data freely available to scientists.
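As a quick sanity check on the quoted EOSDIS figures (hundreds of millions of files, 7–40 MB average, over 3 PB total), a few lines of arithmetic confirm the numbers hang together. This uses decimal petabytes and megabytes; binary units would shift the results slightly:

```python
# Sanity-check the quoted EOSDIS figures: "hundreds of millions" of files
# averaging 7-40 MB, "over 3 PB" total.
PB = 10**15   # decimal petabyte, bytes
MB = 10**6    # decimal megabyte, bytes

total_bytes = 3 * PB
for avg_mb in (7, 20, 40):
    files = total_bytes // (avg_mb * MB)
    print(f"{avg_mb} MB average -> ~{files / 1e6:.0f} million files")

# 7 MB average  -> ~429 million files
# 40 MB average -> ~75 million files
```

At the small end of the stated average, 3 PB does indeed work out to “hundreds of millions” of files.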

Think of this as a starting point to learn some of the “lingo” in this area and perhaps hear some good stories about data and NASA.

Some questions to think about during the presentation/discussion:

How do you effectively access information after not only the terminology but the world view of a discipline has changed?

What do you have to know about the data and its storage?

How do the products discussed address those questions?

October 22, 2011

National Archives Digitization Tools Now on GitHub

Filed under: Archives,Files,Video — Patrick Durusau @ 3:18 pm

National Archives Digitization Tools Now on GitHub

From the post:

As part of our open government initiatives, the National Archives has begun to share applications developed in-house on GitHub, a social coding platform. GitHub is a service used by software developers to share and collaborate on software development projects and many open source development projects.

Over the last year and a half, our Digitization Services Branch has developed a number of software applications to facilitate digitization workflows. These applications have significantly increased our productivity and improved the accuracy and completeness of our digitization work.

We shared our experiences with these applications with colleagues at other institutions such as the Library of Congress and the Smithsonian Institution, and they expressed interest in trying these applications within their own digitization workflows. We have made two digitization applications, “File Analyzer and Metadata Harvester” and “Video Frame Analyzer” available on GitHub, and they are now available for use by other institutions and the public.

I suspect many government departments (U.S. and otherwise) have similar digitization workflow efforts underway. Perhaps greater publicity about these efforts will cause other departments to step forward.

October 21, 2011

Towards georeferencing archival collections

Towards georeferencing archival collections

From the post:

One of the most effective ways to associate objects in archival collections with related objects is with controlled access terms: personal, corporate, and family names; places; subjects. These associations are meaningless if chosen arbitrarily. With respect to machine processing, Thomas Jefferson and Jefferson, Thomas are not seen as the same individual when judging by the textual string alone. While EADitor has incorporated authorized headings from LCSH and local vocabulary (scraped from terms found in EAD files currently in the eXist database) almost since its inception, it has not until recently interacted with other controlled vocabulary services. Interacting with EAC-CPF and geographical services is high on the development priority list.

geonames.org

Over the last week, I have been working on incorporating geonames.org queries into the XForms application. Geonames provides stable URIs for more than 7.5 million place names internationally. XML representations of each place are accessible through various REST APIs. These XML datastreams also include the latitude and longitude, which will make it possible to georeference archival collections as a whole or individual items within collections (an item-level indexing strategy will be offered in EADitor as an alternative to traditional, collection-based indexing soon).
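The GeoNames XML datastreams mentioned above are straightforward to consume. A minimal sketch, using a trimmed, hypothetical response (real GeoNames responses carry many more fields, and the place data below is illustrative, not an actual API result):

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical GeoNames-style datastream; real responses from
# the REST API carry more fields, but lat/lng extraction looks like this.
sample = """<geonames>
  <geoname>
    <name>Charlottesville</name>
    <lat>38.02931</lat>
    <lng>-78.47668</lng>
    <geonameId>4752031</geonameId>
  </geoname>
</geonames>"""

def georeference(xml_text):
    """Extract (name, lat, lng, uri) tuples for linking into EAD records."""
    out = []
    for g in ET.fromstring(xml_text).iter("geoname"):
        gid = g.findtext("geonameId")
        out.append((g.findtext("name"),
                    float(g.findtext("lat")),
                    float(g.findtext("lng")),
                    f"http://sws.geonames.org/{gid}/"))  # GeoNames' stable URI form
    return out

places = georeference(sample)
assert places[0][0] == "Charlottesville"
```

Tuples like these are exactly what an item- or collection-level georeferencing pass over EAD files would emit: a stable URI for identity, plus coordinates for display.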

This looks very interesting.

Details:

EADitor project site (Google Code): http://code.google.com/p/eaditor/
Installation instructions (specific for Ubuntu but broadly applies to all Unix-based systems): http://code.google.com/p/eaditor/wiki/UbuntuInstallation
Google Group: http://groups.google.com/group/eaditor

« Newer Posts

Powered by WordPress