Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

February 14, 2018

Don’t Delete Evil Data [But Remember the Downside of “Evidence”]

Filed under: Archives,Preservation,Social Media — Patrick Durusau @ 8:56 pm

Don’t Delete Evil Data by Lam Thuy Vo.

From the post:

The web needs to be a friendlier place. It needs to be more truthful, less fake. It definitely needs to be less hateful. Most people agree with these notions.

There have been a number of efforts recently to enforce this idea: the Facebook groups and pages operated by Russian actors during the 2016 election have been deleted. None of the Twitter accounts listed in connection to the investigation of the Russian interference with the last presidential election are online anymore. Reddit announced late last fall that it was banning Nazi, white supremacist, and other hate groups.

But even though much harm has been done on these platforms, is the right course of action to erase all these interactions without a trace? So much of what constitutes our information universe is captured online—if foreign actors are manipulating political information we receive and if trolls turn our online existence into hell, there is a case to be made for us to be able to trace back malicious information to its source, rather than simply removing it from public view.

In other words, there is a case to be made to preserve some of this information, to archive it, structure it, and make it accessible to the public. It’s unreasonable to expect social media companies to sidestep consumer privacy protections and to release data attached to online misconduct willy-nilly. But to stop abuse, we need to understand it. We should consider archiving malicious content and related data in responsible ways that allow for researchers, sociologists, and journalists to understand its mechanisms better and, potentially, to demand more accountability from trolls whose actions may forever be deleted without a trace.

By some as-yet-unspecified mechanism, I would support preservation of all social media, as well as making it publicly available if it was publicly posted originally. Any restriction or permission scheme governing who can see or use the data will lead to the same abuses we see now.

Twitter, among others, talks about abuse but no one can prove or disprove whatever Twitter cares to say.

There is a downside to preserving social media. You have probably seen the NBC News story presenting 200,000 tweets as the smoking gun on Russian interference with the 2016 elections.

Well, except that if you look at the tweets, that’s about as far from a smoking gun on Russian interference as anything you can imagine.

By analogy, that’s why intelligence analysts always say they have evidence and give you their conclusions, but not the evidence itself. There is too much danger you will discover their report is completely fictional.

Or, when not wholly fictional, that it serves their or their agency’s interests.

Keeping evidence is risky business. Just so you are aware.

August 17, 2017

If You See Something, Save Something (Poke A Censor In The Eye)

Filed under: Archives,Data Preservation,Data Replication,Preservation,Web History — Patrick Durusau @ 8:48 pm

If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine by Alexis Rossi.

From the post:

In recent days many people have shown interest in making sure the Wayback Machine has copies of the web pages they care about most. These saved pages can be cited, shared, linked to – and they will continue to exist even after the original page changes or is removed from the web.

There are several ways to save pages and whole sites so that they appear in the Wayback Machine. Here are 6 of them.

In the comments, Ellen Spertus mentions a 7th way: Donate to the Internet Archive!

It’s the age of censorship: by governments, the DMCA, the EU (the right to be forgotten), Facebook, Google, Twitter and others.

Poke a censor in the eye, see something, save something to the Wayback Machine.

The Wayback Machine can’t stop all censorship, so save local and remote copies as well.
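If you want to script the “save local and remote copies” advice, here is a minimal sketch in Python. It assumes the `requests` library and the Wayback Machine’s public Save Page Now endpoint (https://web.archive.org/save/ followed by the page URL); the URL and filename below are placeholders.

```python
import requests

def save_remote_and_local(url, local_path):
    """Ask the Wayback Machine to snapshot a page, then keep a local copy too."""
    # Remote copy: Save Page Now takes the target URL appended to /save/
    wayback = requests.get("https://web.archive.org/save/" + url, timeout=60)
    print("Wayback Machine responded:", wayback.status_code)

    # Local copy: fetch the page directly and write it to disk
    page = requests.get(url, timeout=60)
    page.raise_for_status()
    with open(local_path, "wb") as f:
        f.write(page.content)
    print("Saved local copy to", local_path)

if __name__ == "__main__":
    save_remote_and_local("https://example.com/", "example.html")
```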

Keep poking until all censors go blind.

July 24, 2016

Software Heritage – Universal Software Archive – Indexing/Semantic Challenges

Filed under: Archives,Preservation,Software,Software Preservation — Patrick Durusau @ 7:49 pm

Software Heritage

From the homepage:

We collect and preserve software in source code form, because software embodies our technical and scientific knowledge and humanity cannot afford the risk of losing it.

Software is a precious part of our cultural heritage. We curate and make accessible all the software we collect, because only by sharing it we can guarantee its preservation in the very long term.
(emphasis in original)

The project has already collected:

Even though we just got started, we have already ingested in the Software Heritage archive a significant amount of source code, possibly assembling the largest source code archive in the world. The archive currently includes:

  • public, non-fork repositories from GitHub
  • source packages from the Debian distribution (as of August 2015, via the snapshot service)
  • tarball releases from the GNU project (as of August 2015)

We currently keep up with changes happening on GitHub, and are in the process of automating syncing with all the above source code origins. In the future we will add many more origins and ingest into the archive software that we have salvaged from recently disappeared forges. The figures below allow to peek into the archive and its evolution over time.

The charters of the planned working groups:

Extending the archive

Evolving the archive

Connecting the archive

Using the archive

did not, on a quick review, seem to me to address the indexing/semantic challenges that searching such an archive will pose.

If you are familiar with the differences in metacharacters between different Unix programs, that is only a taste of the differences that will be faced when searching such an archive.
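To make the metacharacter point concrete with a toy example of my own (nothing to do with Software Heritage’s actual tooling): the same pattern string selects different files depending on whether it is read as a shell-style glob or as a regular expression.

```python
import fnmatch
import re

pattern = "lib*.c"   # reads naturally as a shell glob
names = ["libfoo.c", "lib.c", "lixc", "liberty.cpp"]

# Glob semantics: '*' matches any run of characters and '.' is a literal dot
glob_hits = [n for n in names if fnmatch.fnmatch(n, pattern)]

# Regex semantics: '*' means "zero or more of the preceding atom" ('b' here)
# and '.' matches any single character, so the same string selects differently
regex_hits = [n for n in names if re.fullmatch(pattern, n)]

print("glob matches: ", glob_hits)   # ['libfoo.c', 'lib.c']
print("regex matches:", regex_hits)  # ['lib.c', 'lixc']
```

Multiply that divergence across every programming language, build system and packaging convention in the archive and the scale of the semantic problem becomes clearer.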

Looking forward to learning more about this project!

October 31, 2015

What is Scholarly HTML?

Filed under: HTML,Preservation,Publishing — Patrick Durusau @ 11:03 am

What is Scholarly HTML? by Robin Berjon and Sébastien Ballesteros.

Abstract:

Scholarly HTML is a domain-specific data format built entirely on open standards that enables the interoperable exchange of scholarly articles in a manner that is compatible with off-the-shelf browsers. This document describes how Scholarly HTML works and how it is encoded as a document. It is, itself, written in Scholarly HTML.

The abstract is accurate enough but the “Motivation” section provides a better sense of this project:

Scholarly articles are still primarily encoded as unstructured graphics formats in which most of the information initially created by research, or even just in the text, is lost. This was an acceptable, if deplorable, condition when viable alternatives did not seem possible, but document technology has today reached a level of maturity and universality that makes this situation no longer tenable. Information cannot be disseminated if it is destroyed before even having left its creator’s laptop.

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

This is not solely a loss for the high principles of knowledge sharing in science, it also has very immediate pragmatic consequences. Any tool, any service that tries to integrate with scholarly publishing has to spend the brunt of its complexity (or budget) extracting data the author would have willingly shared out of antiquated formats. This places stringent limits on the improvement of the scholarly toolbox, on the discoverability of scientific knowledge, and particularly on processes of meta-analysis.

To address these issues, we have followed an approach rooted in established best practices for the reuse of open, standard formats. The «HTML Vernacular» body of practice provides guidelines for the creation of domain-specific data formats that make use of HTML’s inherent extensibility (Science.AI, 2015b). Using the vernacular foundation overlaid with «schema.org» metadata we have produced a format for the interchange of scholarly articles built on open standards, ready for all to use.

Our high-level goals were:

  • Uncompromisingly enabling structured metadata, accessibility, and internationalisation.
  • Pragmatically working in Web browsers, even if it occasionally incurs some markup overhead.
  • Powerfully customisable for inclusion in arbitrary Web sites, while remaining easy to process and interoperable.
  • Entirely built on top of open, royalty-free standards.
  • Long-term viability as a data format.

Additionally, in view of the specific problem we addressed, in the creation of this vernacular we have favoured the reliability of interchange over ease of authoring; but have nevertheless attempted to cater to the latter as much as possible. A decent boilerplate template file can certainly make authoring relatively simple, but not as radically simple as it can be. For such use cases, Scholarly HTML provides a great output target and overview of the data model required to support scholarly publishing at the document level.

An example of an authoring format that was designed to target Scholarly HTML as an output is the DOCX Standard Scientific Style which enables authors who are comfortable with Microsoft Word to author documents that have a direct upgrade path to semantic, standard content.

Where semantic modelling is concerned, our approach is to stick as much as possible to schema.org. Beyond the obvious advantages there are in reusing a vocabulary that is supported by all the major search engines and is actively being developed towards enabling a shared understanding of many useful concepts, it also provides a protection against «ontological drift» whereby a new vocabulary is defined by a small group with insufficient input from a broader community of practice. A language that solely a single participant understands is of limited value.

In a small, circumscribed number of cases we have had to depart from schema.org, using the https://ns.science.ai/ (prefixed with sa:) vocabulary instead (Science.AI, 2015a). Our goal is to work with schema.org in order to extend their vocabulary, and we will align our usage with the outcome of these discussions.

I especially enjoyed the observation:

According to the New York Times, adding structured information to their recipes (instead of exposing simply as plain text) improved their discoverability to the point of producing an immediate rise of 52 percent in traffic (NYT, 2014). At this point in time, cupcake recipes are reaping greater benefits from modern data format practices than the whole scientific endeavour.

I don’t doubt the truth of that story, but then a large number of people are interested in baking cupcakes. In many cases, no more than three people are interested in reading any particular academic paper.

The use of schema.org will provide advantages for common concepts but to be truly useful for scholarly writing, it will require serious extension.
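As a rough illustration of what that “common concepts” coverage looks like, here is a sketch (my own, not taken from the Scholarly HTML specification) that emits schema.org metadata for an article as JSON-LD; the title, authors and DOI are placeholder values.

```python
import json

def scholarly_article_jsonld(title, authors, doi):
    """Build a minimal schema.org description of a scholarly article."""
    return {
        "@context": "http://schema.org",
        "@type": "ScholarlyArticle",
        "name": title,
        "author": [{"@type": "Person", "name": a} for a in authors],
        "sameAs": "https://doi.org/" + doi,
    }

# Placeholder metadata, purely for illustration
doc = scholarly_article_jsonld(
    "An Example Scholarly Article",
    ["Example Author One", "Example Author Two"],
    "10.0000/placeholder",
)
print(json.dumps(doc, indent=2))
```

Title, author and identifier are exactly the “common concepts” search engines already understand; the interesting scholarly relationships are the part that needs extension.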

Take for example my post yesterday, Deep Feature Synthesis:… [Replacing Human Intuition?, Calling Bull Shit]. What microdata from schema.org would help readers find Propositionalisation and Aggregates, 2001, which describes substantially the same technique, without claims of surpassing human intuition? (It goes uncited by the authors of the deep feature synthesis paper.)

Or the 161 papers on propositionalisation that you can find at CiteSeer?

A crude classification that can be used by search engines is very useful but falls far short of the mark in terms of finding and retrieving scholarly writing.

Semantic uniformity for classifying scholarly content hasn’t been reached by scholars or librarians despite centuries of effort. Rather than taking up that Sisyphean task, let’s map across the ever increasing universe of semantic diversity.

April 30, 2015

What Should Remain After Retraction?

Filed under: Archives,Citation Practices,Preservation — Patrick Durusau @ 3:39 pm

Antony Williams asks in a tweet:

If a paper is retracted shouldn’t it remain up but watermarked PDF as retracted? More than this? http://pubs.acs.org/doi/abs/10.1021/ja910615z

Here is what you get instead of the front page:

[Image: JACS retraction notice]

A retraction should appear in bibliographic records maintained by the publisher as well as on any online version maintained by the publisher.

The Journal of the American Chemical Society (JACS) method of retraction, removal of the retracted content:

  • Presents a false view of the then-current scientific context. Prior to retraction, such an article is part of the overall scientific context in a field. Editing that context post-publication is historical revisionism at its worst.
  • Interrupts the citation chain of publications cited in the retracted publication.
  • Leaves dangling citations of the retracted publication in later publications.
  • Places authors who cited the retracted publication in an untenable position. Their citations of a retracted work become suspect, with no opportunity to defend them.
  • Falsifies the memories of every reader who read the retracted publication. They cannot search for and retrieve that paper in order to revisit an idea, process or result sparked by the retracted publication.
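To make the record-level alternative concrete, here is a toy sketch (invented DOI, title, field names and date, nothing to do with JACS’s actual systems) where the entry stays put and the retraction is layered on top as metadata.

```python
from datetime import date

# A toy bibliographic record; the DOI, title and field names are invented
record = {
    "doi": "10.0000/placeholder",
    "title": "An Example Retracted Article",
    "status": "published",
    "cites": ["10.0000/earlier-work"],
}

def retract(record, reason, retracted_on):
    """Flag a record as retracted without deleting it or breaking its citation chain."""
    record["status"] = "retracted"
    record["retraction"] = {
        "reason": reason,
        "date": retracted_on.isoformat(),
        "original_retained": True,   # the text stays retrievable, just watermarked
    }
    return record

retract(record, "placeholder reason", date(2010, 6, 1))
print(record["status"], "-", record["retraction"]["date"])
```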

Sound off to: Antony Williams (@ChemConnector) and @RetractionWatch

Let’s leave the creation of false histories to professionals, such as politicians.

October 28, 2014

On Excess: Susan Sontag’s Born-Digital Archive

Filed under: Archives,Library,Open Access,Preservation — Patrick Durusau @ 6:23 pm

On Excess: Susan Sontag’s Born-Digital Archive by Jeremy Schmidt & Jacquelyn Ardam.

From the post:


In the case of the Sontag materials, the end result of Deep Freeze and a series of other processing procedures is a single IBM laptop, which researchers can request at the Special Collections desk at UCLA’s Research Library. That laptop has some funky features. You can’t read its content from home, even with a VPN, because the files aren’t online. You can’t live-Tweet your research progress from the laptop — or access the internet at all — because the machine’s connectivity features have been disabled. You can’t copy Annie Leibovitz’s first-ever email — “Mat and I just wanted to let you know we really are working at this. See you at dinner. xxxxxannie” (subject line: “My first Email”) — onto your thumb drive because the USB port is locked. And, clearly, you can’t save a new document, even if your desire to type yourself into recent intellectual history is formidable. Every time it logs out or reboots, the laptop goes back to ground zero. The folders you’ve opened slam shut. The files you’ve explored don’t change their “Last Accessed” dates. The notes you’ve typed disappear. It’s like you were never there.

Despite these measures, real limitations to our ability to harness digital archives remain. The born-digital portion of the Sontag collection was donated as a pair of external hard drives, and that portion is composed of documents that began their lives electronically and in most cases exist only in digital form. While preparing those digital files for use, UCLA archivists accidentally allowed certain dates to refresh while the materials were in “thaw” mode; the metadata then had to be painstakingly un-revised. More problematically, a significant number of files open as unreadable strings of symbols because the software with which they were created is long out of date. Even the fully accessible materials, meanwhile, exist in so many versions that the hapless researcher not trained in computer forensics is quickly overwhelmed.

No one would dispute the need for an authoritative copy of Sontag‘s archive, or at least as close to authoritative as humanly possible. The heavily protected laptop makes sense to me, assuming that the archive considers that to be the authoritative copy.

What has me puzzled, particularly since there are binary formats not recognized in the archive, is why a non-authoritative copy of the archive isn’t online. Any number of people may still possess the software necessary to read the files and/or be able to decipher the file formats. If recovery could be practiced on a non-authoritative copy, that would be a net gain to the archive, which may well encounter such files in the future.

After searching the Online Archive of California, I did encounter Finding Aid for the Susan Sontag papers, ca. 1939-2004 which reports:

Restrictions Property rights to the physical object belong to the UCLA Library, Department of Special Collections. Literary rights, including copyright, are retained by the creators and their heirs. It is the responsibility of the researcher to determine who holds the copyright and pursue the copyright owner or his or her heir for permission to publish where The UC Regents do not hold the copyright.

Availability Open for research, with following exceptions: Boxes 136 and 137 of journals are restricted until 25 years after Susan Sontag’s death (December 28, 2029), though the journals may become available once they are published.

Unfortunately, this finding aid does not mention Sontag’s computer or the transfer of the files to a laptop. A search of Melvyl (library catalog) finds only one archival collection and that is the one mentioned above.

I have written to the special collections library for clarification and will update this post when an answer arrives.

I mention this collection because of Sontag’s importance for a generation and because digital archives will soon be the majority of cases. One hopes the standard practice will be to donate all rights to an archival repository to ensure its availability to future generations of scholars.

October 6, 2014

Bioinformatics tools extracted from a typical mammalian genome project

Filed under: Archives,Bioinformatics,Documentation,Preservation — Patrick Durusau @ 7:55 pm

Bioinformatics tools extracted from a typical mammalian genome project

From the post:

In this extended blog post, I describe my efforts to extract the information about bioinformatics-related items from a recent genome sequencing paper, and the larger issues this raises in the field. It’s long, and it’s something of a hybrid between a blog post and a paper format, just to give it some structure for my own organization. A copy of this will also be posted at FigShare with the full data set. Huge thanks to the gibbon genome project team for a terrific paper and extensively-documented collection of their processes and resources. The issues I wanted to highlight are about the access to bioinformatics tools in general and are not specific to this project at all, but are about the field.

A must-read if you are interested in the useful preservation of research and data. The paper focuses on needed improvements in bioinformatics, but the issues raised are common to all fields.

How well does your field perform when compared to bioinformatics?

March 10, 2014

The Books of Remarkable Women

Filed under: History,Preservation,Topic Maps — Patrick Durusau @ 8:32 am

The Books of Remarkable Women by Sarah J. Biggs.

From the post:

In 2011, when we blogged about the Shaftesbury Psalter (which may have belonged to Adeliza of Louvain; see below), we wrote that medieval manuscripts which had belonged to women were relatively rare survivals. This still remains true, but as we have reviewed our blog over the past few years, it has become clear that we must emphasize the relative nature of the rarity – we have posted literally dozens of times about manuscripts that were produced for, owned, or created by a number of medieval women.

A good example of why I think topic maps have so much to offer for preservation of cultural legacy.

While the books covered in this post are important historical artifacts in their own right, their value is enhanced by the context of their production, ownership, contemporary practices, etc.

All of which lies outside the books proper. Just as data about data, the so-called “metadata,” usually lies outside its information artifact.

If future generations are going to have better historical context than we do for many items, we had best get started writing it down.

January 16, 2014

…Digital Asset Sustainability…

Filed under: Archives,Digital Library,Library,Preservation — Patrick Durusau @ 5:14 pm

A National Agenda Bibliography for Digital Asset Sustainability and Preservation Cost Modeling by Butch Lazorchak.

From the post:

The 2014 National Digital Stewardship Agenda, released in July 2013, is still a must-read (have you read it yet?). It integrates the perspective of dozens of experts to provide funders and decision-makers with insight into emerging technological trends, gaps in digital stewardship capacity and key areas for development.

The Agenda suggests a number of important research areas for the digital stewardship community to consider, but the need for more coordinated applied research in cost modeling and sustainability is high on the list of areas prime for research and scholarship.

The section in the Agenda on “Applied Research for Cost Modeling and Audit Modeling” suggests some areas for exploration:

“Currently there are limited models for cost estimation for ongoing storage of digital content; cost estimation models need to be robust and flexible. Furthermore, as discussed below…there are virtually no models available to systematically and reliably predict the future value of preserved content. Different approaches to cost estimation should be explored and compared to existing models with emphasis on reproducibility of results. The development of a cost calculator would benefit organizations in making estimates of the long‐term storage costs for their digital content.”

In June of 2012 I put together a bibliography of resources touching on the economic sustainability of digital resources. I’m pleasantly surprised at all the new work that’s been done in the meantime, but as the Agenda suggests, there’s more room for directed research in this area. Or perhaps, as Paul Wheatley suggests in this blog post, what’s really needed are coordinated responses to sustainability challenges that build directly on this rich body of work, and that effectively communicate the results out to a wide audience.

I’ve updated the bibliography, hoping that researchers and funders will explore the existing body of projects, approaches and research, note the gaps in coverage suggested by the Agenda and make efforts to address the gaps in the near future through new research or funding.

I count some seventy-one (71) items in this bibliography.
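The cost calculator the Agenda calls for is concrete enough to sketch. The model below (a fixed number of replicated copies, with the per-terabyte price falling by a constant fraction each year) and all of the numbers are invented placeholders of mine, not drawn from any model in the bibliography.

```python
def storage_cost_estimate(terabytes, years, cost_per_tb_year=50.0,
                          annual_decline=0.15, copies=3):
    """Rough long-term storage cost: replicated copies, per-TB price falling each year."""
    total = 0.0
    price = cost_per_tb_year
    for _ in range(years):
        total += terabytes * copies * price
        price *= 1.0 - annual_decline   # assume storage prices keep falling
    return total

# e.g. keep 10 TB, in triplicate, for 20 years (all numbers invented)
print(f"Estimated cost: ${storage_cost_estimate(10, 20):,.2f}")
```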

Digital preservation is an area where topic maps can help maintain access over changing customs and vocabularies, but just like migrating from one form of media to another, it doesn’t happen by itself.

Nor is there any “free lunch.” Because the data is culturally important, rare, etc., someone has to pay the bill for preserving it.

Having the cost of semantic access included in digital preservation cost models would not hurt the cause of topic maps.

Yes?

November 7, 2013

Hot Topics: The DuraSpace Community Webinar Series

Filed under: Archives,Data Preservation,DSpace,Preservation — Patrick Durusau @ 7:14 pm

Hot Topics: The DuraSpace Community Webinar Series

From the DuraSpace about page:

DuraSpace supported open technology projects provide long-term, durable access to and discovery of digital assets. We put together global, strategic collaborations to sustain DSpace and Fedora, two of the most widely-used repository solutions in the world. More than fifteen hundred institutions use and help develop these open source software repository platforms. DSpace and Fedora are directly supported with in-kind contributions of development resources and financial donations through the DuraSpace community sponsorship program.

Like most of you, I’m familiar with DSpace and Fedora, but I wasn’t familiar with the “Hot Topics” webinar series. I was following a link from Recommended! “Metadata and Repository Services for Research Data Curation” Webinar by Imma Subirats when I encountered the “Hot Topics” page.

  • Series Six: Research Data in Repositories
  • Series Five: VIVO–Research Discovery and Networking
  • Series Four: Research Data Management Support
  • Series Three: Get a Head on Your Repository with Hydra End-to-End Solutions
  • Series Two: Managing and Preserving Audio and Video in your Digital Repository
  • Series One: Knowledge Futures: Digital Preservation Planning

Each series consists of three (3) webinars, all with recordings, most with slides as well.

Warning: Data curation doesn’t focus on the latest and coolest GPU processing techniques.

But in ten to fifteen years, when today’s GPU techniques are as dated as COBOL is now, good data curation will enable future students to access them.

I think that is worthwhile.

You?

May 2, 2013

Create and Manage Data: Training Resources

Filed under: Archives,Data,Preservation — Patrick Durusau @ 2:07 pm

Create and Manage Data: Training Resources

From the webpage:

Our Managing and Sharing Data: Training Resources present a suite of flexible training materials for people who are charged with training researchers and research support staff in how to look after research data.

The Training Resources complement the UK Data Archive’s popular guide on ‘Managing and Sharing Data: best practice for researchers’, the most recent version published in May 2011.

They have been designed and used as part of the Archive’s daily work in supporting ESRC applicants and award holders and have been made possible by a grant from the ESRC Researcher Development Initiative (RDI).

The Training Resources are modularised following the UK Data Archive’s seven key areas of managing and sharing data:

  • sharing data – why and how
  • data management planning for researchers and research centres
  • documenting data
  • formatting data
  • storing data, including data security, data transfer, encryption, and file sharing
  • ethics and consent
  • data copyright

Each section contains:

  • introductory powerpoint(s)
  • presenter’s guide – where necessary
  • exercises and introduction to exercises
  • quizzes
  • answers

The materials are presented as used in our own training courses and are mostly geared towards social scientists. We anticipate trainers will create their own personalised and more context-relevant example, for example by discipline, country, relevant laws and regulations.

You can download individual modules from the relevant sections or download the whole resource in pdf format. Updates to pages were last made on 20 June 2012.

Download all resources.

Quite an impressive set of materials that will introduce you to some aspects of research data in the UK. Not all but some aspects.

What you don’t learn here you will pick up from interaction with people actively engaged with research data.

But it will give you a head start on understanding the research data community.

Unlike some technologies, topic maps are more about a community’s world view than the world view of topic maps.

April 6, 2013

Ultimate library challenge: taming the internet

Filed under: Data,Indexing,Preservation,Search Data,WWW — Patrick Durusau @ 3:40 pm

Ultimate library challenge: taming the internet by Jill Lawless.

From the post:

Capturing the unruly, ever-changing internet is like trying to pin down a raging river. But the British Library is going to try.

For centuries, the library has kept a copy of every book, pamphlet, magazine and newspaper published in Britain. Starting on Saturday, it will also be bound to record every British website, e-book, online newsletter and blog in a bid to preserve the nation’s “digital memory”.

As if that’s not a big enough task, the library also has to make this digital archive available to future researchers – come time, tide or technological change.

The library says the work is urgent. Ever since people began switching from paper and ink to computers and mobile phones, material that would fascinate future historians has been disappearing into a digital black hole. The library says firsthand accounts of everything from the 2005 London transit bombings to Britain’s 2010 election campaign have already vanished.

“Stuff out there on the web is ephemeral,” said Lucie Burgess, the library’s head of content strategy. “The average life of a web page is only 75 days, because websites change, the contents get taken down.

“If we don’t capture this material, a critical piece of the jigsaw puzzle of our understanding of the 21st century will be lost.”

For more details, see Jill’s post, or Click to save the nation’s digital memory (British Library press release), or 100 websites: Capturing the digital universe (a sample of archiving results with only 100 sites).

A welcome venture, particularly since the content gathered by the project will be made available to the public.

It’s an unanswerable question, but I do wonder how we would view Greek drama if all of it had been preserved.

Hundreds if not thousands of plays were written and performed every year.

The Complete Greek Drama lists only forty-seven (47) that have survived to this day.

If wholesale preservation is the first step, how do we preserve paths to what’s worth reading in a data labyrinth as a second step?

I first saw this in a tweet by Jason Ronallo.

March 22, 2013

Making It Happen:…

Filed under: Data,Data Preservation,Preservation — Patrick Durusau @ 9:45 am

Making It Happen: Sustainable Data Preservation and Use by Anita de Waard.

Great set of overview slides on why research data should be preserved.

Not to mention making the case that semantic diversity (in systems for capturing research data, between researchers, etc.) needs to be addressed by any proffered solution.

If you don’t know Anita de Waard’s work, search for “Anita de Waard” on Slideshare.

As of today, that search returns one hundred and forty (140) presentations.

All of which you will find useful on a variety of data-related topics.

August 10, 2012

….Comparing Digital Preservation Glossaries [Why Do We Need Common Vocabularies?]

Filed under: Archives,Digital Library,Glossary,Preservation — Patrick Durusau @ 8:28 am

From AIP to Zettabyte: Comparing Digital Preservation Glossaries

Emily Reynolds (2012 Junior Fellow) writes:

As we mentioned in our introductory post last month, the OSI Junior Fellows are working on a project involving a draft digital preservation policy framework. One component of our work is revising a glossary that accompanies the framework. We’ve spent the last two weeks poring through more than two dozen glossaries relating to digital preservation concepts to locate and refine definitions to fit the terms used in the document.

We looked at dictionaries from well-established archival entities like the Society of American Archivists, as well as more strictly technical organizations like the Internet Engineering Task Force. While some glossaries take a traditional archival approach, others were more technical; we consulted documents primarily focusing on electronic records, archives, digital storage and other relevant fields. Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary. Based on what we found, that vocabulary will have to be broadly drawn and flexible to meet different kinds of requirements.

OSI = Office of Strategic Initiatives (Library of Congress)

Not to be overly critical, but I stumble over:

Because of influential frameworks like the OAIS Reference Model, some terms were defined similarly across the glossaries that we looked at. But the variety in the definitions for other terms points to the range of practitioners discussing digital preservation issues, and highlights the need for a common vocabulary.

Why does a “variety in the definitions for other terms…highlight[s] the need for a common vocabulary?”

I take it as a given that we have diverse vocabularies.

And that attempts at “common” vocabularies succeed in creating yet another “diverse” vocabulary.

So, why would anyone looking at “diverse” vocabularies jump to the conclusion that a “common” vocabulary is required?

Perhaps what is missing is the definition of the problem presented by “diverse” vocabularies.

Hard to solve a problem if you don’t know what it is. (That hasn’t stopped some people I know, but that is a story for another day.)

I put it to you (and in your absence I will answer, so answer quickly):

What is the problem (or problems) presented by diverse vocabularies? (Feel free to use examples.)

Or if you prefer, Why do we need common vocabularies?
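Since I promised to answer in your absence, here is a toy sketch of my preferred answer: the problem is not diversity itself but the lack of a mapping between vocabularies. The glossaries, terms and subject identifiers below are invented for illustration, not taken from the glossaries the Fellows reviewed.

```python
# Two invented glossaries that name the same concepts differently
archive_glossary = {
    "AIP": "Archival Information Package, as defined by OAIS",
    "fixity check": "Verification that a file has not changed",
}
engineering_glossary = {
    "preservation object": "Bundle of content plus metadata kept for the long term",
    "checksum verification": "Verification that a file has not changed",
}

# A mapping from each community's local term to a shared subject identifier,
# rather than forcing both communities onto one "common" term
subject_map = {
    "AIP": "subject:archival-package",
    "preservation object": "subject:archival-package",
    "fixity check": "subject:fixity",
    "checksum verification": "subject:fixity",
}

def same_subject(term_a, term_b):
    """Do two locally different terms identify the same subject?"""
    a, b = subject_map.get(term_a), subject_map.get(term_b)
    return a is not None and a == b

print(same_subject("AIP", "preservation object"))           # True
print(same_subject("fixity check", "preservation object"))  # False
```

Both communities keep their own terms and definitions; the mapping is what makes cross-vocabulary search possible.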

January 5, 2012

Digging into Data Challenge

Filed under: Archives,Contest,Data Mining,Library,Preservation — Patrick Durusau @ 4:09 pm

Digging into Data Challenge

From the homepage:

What is the “challenge” we speak of? The idea behind the Digging into Data Challenge is to address how “big data” changes the research landscape for the humanities and social sciences. Now that we have massive databases of materials used by scholars in the humanities and social sciences — ranging from digitized books, newspapers, and music to transactional data like web searches, sensor data or cell phone records — what new, computationally-based research methods might we apply? As the world becomes increasingly digital, new techniques will be needed to search, analyze, and understand these everyday materials. Digging into Data challenges the research community to help create the new research infrastructure for 21st century scholarship.

Winners for Round 2, some 14 projects out of 67, were announced on 3 January 2012.

I’m interested to hear your comments on the projects, as I am sure the project teams would be as well.

January 1, 2011

Lots of Copies – Keeps Stuff Safe / Is Insecure – LOCKSS / LOCII

Filed under: Data Reduction,Preservation — Patrick Durusau @ 8:46 pm

Is your enterprise accidentally practicing Lots of Copies Keeps Stuff Safe – LOCKSS?

Gartner analyst Drue Reeves says:

Use document management to make sure you don’t have copies everywhere, and purge nonrelevant material.

If you fall into the lots of copies category your slogan should be: Lots of Copies Is Insecure or LOCII (pronounced “lossee”).

Not all document preservation solutions depend upon being insecure.

Topic maps can help develop strategies to make your document management solution less LOCII.

One way they can help is by mapping out all the duplicate copies (see the sketch at the end of this post). Are they really necessary?

Another way they can help is by showing who has access to each of those copies.

If you trust someone with access, that means you trust everyone they trust.

Check their Facebook or LinkedIn pages to see how many other people you are trusting, just by trusting the first person.

Ask yourself: how bad would a WikiLeaks-style disclosure be?

Then get serious about information security and topic maps.
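For the “map out the duplicate copies” step mentioned above, here is a minimal sketch that groups files by content hash so redundant copies become visible. It assumes nothing more exotic than file system access; the root path is a placeholder.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_copies(root):
    """Group files under `root` by the SHA-256 hash of their contents."""
    by_hash = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            by_hash[digest.hexdigest()].append(path)
    # Only hashes with more than one path represent duplicate copies
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for h, paths in find_duplicate_copies("/path/to/documents").items():  # placeholder path
    print(h[:12], "->", paths)
```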

October 23, 2010

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval)

Filed under: Cataloging,Digital Library,Information Retrieval,Preservation — Patrick Durusau @ 7:58 am

CASPAR (Cultural, Artistic, and Scientific Knowledge for Preservation, Access and Retrieval).

From the website:

CASPAR methodological and technological solution:

  • is compliant to the OAIS Reference Model – the main standard of reference in digital preservation
  • is technology-neutral: the preservation environment could be implemented using any kind of emerging technology
  • adopts a distributed, asynchronous, loosely coupled architecture and each key component is self-contained and portable: it may be deployed without dependencies on different platform and framework
  • is domain independent: it could be applied with low additional effort to multiple domains/contexts.
  • preserves knowledge and intelligibility, not just the “bits”
  • guarantees the integrity and identity of the information preserved as well as the protection of digital rights

FYI: OAIS Reference Model

As a librarian, you will be confronted with claims similar to these in vendor literature, grant applications and other marketing materials.

Questions:

  1. Pick one of these claims. What documentation/software produced by the project would you review to evaluate the claim you have chosen?
  2. What other materials do you think would be relevant to your review?
  3. Perform the actual review (10 – 15 pages, with citations, project)
