Archive for the ‘Metadata’ Category
Thursday, May 16th, 2013
Metadata Collection Strategies by Maish Nichani and Patrick Lambe.
From the post:
Metadata can be collected in many ways—from the information environment, work activities and from people. The problem arises when metadata that could be effectively collected from the environment is delegated to be collected from people. People who are in the middle of work tasks do not see direct benefits from completing numerous metadata fields. When coerced into doing unnatural things, they usually revolt or find workarounds thereby undermining the entire initiative.
In this article we share strategies to collect metadata that lower the reliance on people in supplying metadata. We cannot completely remove people from the equation but we can prevent them from doing additional work, and focus the role of people on the value added metadata that machines and environment cannot automatically supply.
Maish and Patrick suggest several places where metadata can be collected without asking users.
I would go a step further and create a topic template for collecting metadata.
For a blog, having collected the author and other information once, there really isn’t a reason to collect it for every post that appears.
The same would be true for journals, where a topic template could assist with creating domains for vocabulary usage.
For example, when searching for a genome, limiting the search to genomic research archives avoids part numbers and other overloaded uses of a genome identifier.
Our machines don’t have to solve searching problems without human assistance, particularly when a small assist can pay such high dividends in search results.
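A minimal sketch of what such a topic template might look like, assuming metadata is held as plain dictionaries. The field names (`author`, `language`, `search_domain`) and their values are hypothetical, chosen only to illustrate the idea of collecting shared values once.

```python
# Hypothetical template: values collected once for the whole blog.
BLOG_TEMPLATE = {
    "author": "Example Author",
    "language": "en",
    "search_domain": "topic maps",
}


def apply_template(template, post_metadata):
    """Return post metadata with template defaults filled in.

    Values supplied for an individual post override the template,
    so people only enter what the template cannot supply.
    """
    merged = dict(template)       # copy, so the template itself is never modified
    merged.update(post_metadata)  # per-post values win over defaults
    return merged


post = apply_template(BLOG_TEMPLATE, {"title": "Metadata Collection Strategies"})
guest = apply_template(BLOG_TEMPLATE, {"title": "A Guest Post", "author": "Guest Author"})
```

Only the title is entered for an ordinary post; a guest post overrides the author without touching the template.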
Posted in Data Collection, Metadata | No Comments »
Thursday, April 11th, 2013
LODE-BD Recommendations 2.0
From the post:
LODE-BD aims to support the selection of appropriate encoding strategies for producing meaningful Linked Open Data (LOD)-enabled bibliographical data (directly or indirectly). The LODE-BD recommendations are applicable for structured data describing bibliographic resources such as articles, monographs, theses, conference papers, presentation materials, research reports, learning objects, etc. – in print or electronic format.
The core component of LODE-BD contains a set of recommended decision trees for common properties used in describing a bibliographic resource instance. Each decision tree is delivered with various acting points and the matching encoding suggestions. The full range of options presented by LODE-BD will enable data providers to make their choices according to their development stages, internal data structures, and the reality of their practices.
What's new in LODE-BD 2.0
- Background information and references are moved into appendixes.
- Metadata terms recommended by LODE-BD 2.0 are not limited to subject-specific domains. Agricultural-related namespaces and vocabularies are removed from the 2.0 version. LODE-BD is now appropriate for use by any data providers and repositories.
- A road-map is added to guide the navigation of LODE-BD sections.
- A crosswalk is added which maps the metadata terms used in the LODE-BD 2.0 with schema.org properties. It is attached as Appendix 4.
The post also breaks down the report into individual sections.
Of particular interest will be:
Appendix 4. Crosswalk of Metadata Term used in LODE-BD and schema.org terms
As with most crosswalks, the mapping does not enumerate the properties that compelled the crosswalk author to make the connections they did.
If it had, it would be easier to maintain and to merge with other crosswalks.
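A rough sketch of a crosswalk that records why each connection was made, and how that record could help when merging. The term pairs and the `basis` strings below are illustrative assumptions, not entries taken from Appendix 4, and the `Mapping` and `merge` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str   # term in the source vocabulary
    target: str   # term in the target vocabulary
    basis: tuple  # properties that justified the mapping (hypothetical)


def merge(crosswalk_a, crosswalk_b):
    """Merge two crosswalks, returning (merged, conflicts).

    A conflict is a source term mapped to different targets. Both sides
    keep their recorded basis so a person can decide between them.
    """
    merged = {m.source: m for m in crosswalk_a}
    conflicts = []
    for m in crosswalk_b:
        existing = merged.get(m.source)
        if existing is None:
            merged[m.source] = m
        elif existing.target != m.target:
            conflicts.append((existing, m))
        else:
            # Same target: keep the union of the recorded bases, in order.
            basis = tuple(dict.fromkeys(existing.basis + m.basis))
            merged[m.source] = Mapping(m.source, m.target, basis)
    return merged, conflicts


ours = [Mapping("dcterms:title", "schema:name", ("names the resource",))]
theirs = [
    Mapping("dcterms:title", "schema:headline", ("displayed as a heading",)),
    Mapping("dcterms:creator", "schema:author", ("identifies a responsible agent",)),
]
merged, conflicts = merge(ours, theirs)
```

With the basis recorded, the disagreement over `dcterms:title` arrives with both authors' reasons attached rather than as two bare term pairs.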
Posted in Crosswalk, LOD, Linked Data, Metadata | No Comments »
Wednesday, April 10th, 2013
Tag you’re it – but is your tag the same as my tag? by Fran Alexander.
From the post:
Lots of people talk about tags, and they all tend to assume they mean the same thing. However, there are lots of different types of tag from HTML tags for marking up web pages to labels in databases and this can lead to all sorts of confusion and problems in projects.
Here are some definitions of “tag” that I’ve heard and that are different in significant ways. If you think my definitions can be improved, please comment, and please let me know of any other usages of that tricksy little word “tag” that you’ve happened upon.
1) A tag is a free text keyword you add as part of the metadata of something to help search
…
2) A tag is a keyword that is selected from a controlled vocabulary or authority list
….
3) A tag is a keyword that is selected from a taxonomy
…
4) A tag is a type of Uniform Resource Identifier (URI)
….
5) A tag is metadata added to a web page for search engines to index
….
6) A tag is a label used to mark up content within a web page that can be used for display purposes and for indexing
Does your use of “tag” appear in this list?
If not, how would you define it?
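To make the difference between the first three definitions concrete, here is a small sketch, assuming a hypothetical three-term vocabulary. The same input string is treated as a free-text keyword, as a term from a controlled list, and as a term with a place in a hierarchy.

```python
# Hypothetical taxonomy: each term points to its parent (None marks the root).
TAXONOMY_PARENT = {"metadata": None, "taxonomy": "metadata", "ontology": "metadata"}

# A controlled vocabulary is just the flat list of permitted terms.
CONTROLLED_VOCABULARY = set(TAXONOMY_PARENT)


def free_text_tag(value):
    """Definition 1: any keyword is accepted, only tidied up."""
    return value.strip().lower()


def controlled_tag(value):
    """Definition 2: the keyword must appear in the controlled list."""
    tag = free_text_tag(value)
    if tag not in CONTROLLED_VOCABULARY:
        raise ValueError(f"{tag!r} is not in the controlled vocabulary")
    return tag


def taxonomy_path(value):
    """Definition 3: the keyword also carries its position in a hierarchy."""
    tag = controlled_tag(value)
    path = []
    while tag is not None:
        path.append(tag)
        tag = TAXONOMY_PARENT[tag]
    return list(reversed(path))  # root first
```

A free-text tag such as "folksonomy" passes the first function and is rejected by the other two, which is the kind of mismatch that causes confusion when people assume one meaning of "tag".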
Posted in Metadata, Semantic Diversity, Tagging, Taxonomy, Vocabularies | No Comments »
Thursday, February 21st, 2013
MTSR 2013: 7th Metadata and Semantics Research Conference (November 19-22, 2013)
Dates:
March 31st 2013: Title and abstract (500 words) submission – not mandatory
June 20th 2013: Paper submission
July 31st 2013: Acceptance/rejection notification
August 20th 2013: Camera-ready papers due
November 19th – 22nd 2013: Conference at Alexander Technological Educational Institute, Thessaloniki, Greece
From the post:
Continuing the successful mission of previous MTSR Conferences (MTSR'05, MTSR'07, MTSR'09, MTSR'10, MTSR'11 and MTSR’12), the seventh International Conference on Metadata and Semantics Research (MTSR'13) aims to bring together scholars and practitioners that share a common interest in the interdisciplinary field of metadata, linked data and ontologies. Participants will share novel knowledge and best practice in the implementation of these semantic technologies across diverse types of Information Environments and applications. These include Cultural Informatics; Open Access Repositories & Digital Libraries; E-learning applications; Search Engine Optimisation & Information Retrieval; Research Information Systems and Infrastructures; e-Science and e-Social Science applications; Agriculture, Food and Environment; Bio-Health & Medical Information Systems.
Scope and topics
Contributions are welcome on every topic related to Metadata and their relationships with Ontologies, Semantic Web, Knowledge Management and Software Engineering, including but not limited to:
I. Foundations
- Typology of metadata and metadata uses
- The value and cost of metadata
- Quality evaluation in the use of Metadata
- Metadata reusability
- New or revised metadata schemas or application profiles
- Metadata standardization
- Empirical studies on metadata and/or ontologies usage
II. Languages and Frameworks for Metadata Management
- SGML, XML, UML in theory and practice
- Languages and Frameworks for Ontology Management
- Metadata and the Semantic Web
- Metadata and Knowledge Management
- Metadata and Software Engineering
- Metadata application of Semantic Web technologies
- Ontologies and Ontology-based Knowledge Management Systems
III. Case Studies
- Metadata and ontologies for librarianship, management of historical archives and archeological research
- Metadata and ontologies for the design of innovative products and processes
- Metadata and ontologies for health, biological and clinical information management
- Metadata and ontologies in finance, tourism and public administrations
- Metadata and ontologies in industry
- Metadata and ontologies in education
- Metadata and ontologies in agriculture, food and environment
IV. Technological Issues
- Technologies for Metadata and ontology storage
- Technologies for Metadata and ontology integration
- Technologies for Metadata extraction and navigation, querying and editing of ontologies
- Technologies for Learning Objects management
Conference website.
To get a better idea of the range of the conference, consult the prior proceedings:
Proceedings have been published in the Springer’s CCIS (Communications in Computer and Information Science) Series in previous events and will be published for MTSR 2013 as well.
Posted in Conferences, Metadata, Semantics | No Comments »
Monday, February 18th, 2013
Journal of Library Metadata, Special issue: Linked Data, Semantic Web and Libraries (Proposals by March 31, 2013; Full manuscripts by July 10, 2013.)
From the post:
Libraries are finding themselves on the threshold of a “new bibliographic universe.” The Semantic Web, Linked Data, and open access all promise to set library metadata free from its historical constraints. How are libraries preparing for and experimenting with sharing data in this new world of information-set-free? The general aim of this special issue of the Journal of Library Metadata is to access and present current practices, trends, and research in moving library metadata into this new environment.
Recommended topics include, but are not limited to the following:
- Libraries and Linked Data/Semantic Web
- Open access, library metadata and Linked Data/Semantic Web
- Institutional repository metadata and the Semantic Web
- Harvesting and sharing of metadata in the new environment
- Incorporating Linked Data into library information systems
- Authority control, vocabularies and Linked Data/Semantic Web
- Rights and license management in the Semantic Web
- Linked Data and MARC and non-MARC (EAD, Dublin Core, etc.) library metadata
- Migration of MARC and non-MARC library metadata to new systems and platforms
- Conversion and mapping of MARC and non-MARC library metadata to RDF and Linked Data
- Data clean-up in preparation for migration/conversion
Submission Procedure: Researchers and practitioners are invited to submit on or before, March 31, 2013, a proposal (up to 500 words) clearly explaining the objectives and concerns of his or her proposed article. Authors of accepted proposals will be notified shortly about the status of their proposals. Full manuscripts (4000-7000 words) are expected to be submitted by July 10, 2013. All submitted manuscripts will be reviewed on a double-blind review basis. Please forward submissions electronically (Word document) to the guest editor at sheila.bair@wmich.edu.
The journal itself: Journal of Library Metadata.
Posted in Librarian/Expert Searchers, Library, Metadata | No Comments »
Monday, February 11th, 2013
NISO Launches New Initiative to Develop Standard for Open Access Metadata and Indicators
NISO voting members have approved a new project to develop standardized bibliographic metadata and visual indicators to describe the accessibility of journal articles with respect to how “open” they are.
Many offerings are available from publishers under the banner of Open Access (OA), Increased Access, Public Access, or other names; the terms offered vary both between publishers and within publishers by journal, and in some cases, based on the funding organization of the author. Adding to the potential confusion, a number of publishers also offer hybrid options in which some articles are “open” while the rest of the journal’s content are available only by subscription or license. No standardized bibliographic metadata currently provides information on whether a specific article is openly accessible and what re-use rights might be available to readers. Visual indicators or icons indicating the openness of an article are inconsistent in both design and use across publishers or even across journals from the same publisher.
The project launched by NISO will focus initially on metadata elements that describe the readership rights associated with an OA article. Specifically, the NISO Working Group will determine the optimal mechanisms to describe and transmit the rights, if any, an arbitrary user has to access a specific article from any internet connection point.
Recommendations will include a means for distribution and aggregation of this metadata in machine-readable form. The group will also consider the feasibility of incorporating information on re-use rights and the feasibility of reaching agreement on transmission of that data.
An important standard for topic maps that have a choice in pointing towards online resources.
If a publisher allows open access only to an abstract, it could be useful to locate the author’s website and check for an “unofficial” copy of the full text.
Posted in Metadata, Open Access | No Comments »
Friday, November 30th, 2012
HTML metadata for journal articles by Alf Eaton.
From the post:
You’d think it would be easy to pin down an ontology for journal articles. There are basically just these properties:
- title
- authors[]
- datePublished
- abstract
But… some of those are shared with more generic classes higher up the tree, so abstract becomes description, title becomes name, author becomes creator. Each author can be a string or an object. Each author has one or more affiliations, which have addresses. The authors are in a specific order, and some of them have certain roles. There are several different dates: creation, review, update, publication.
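The renaming and the string-or-object author problem described in the quote can be sketched as below. This is an illustration under assumptions, not code from the post: the `to_generic` and `normalize_author` helpers and the record layout are hypothetical.

```python
# Renames taken from the quoted passage: specific property -> generic property.
GENERIC_NAMES = {"title": "name", "abstract": "description", "authors": "creator"}


def normalize_author(author, position):
    """Accept an author given as a string or as an object (dict).

    The position is stored explicitly because author order is significant.
    """
    if isinstance(author, str):
        author = {"name": author}
    return {
        "name": author["name"],
        "affiliations": list(author.get("affiliations", [])),
        "position": position,
    }


def to_generic(article):
    """Rewrite an article record using the more generic property names."""
    generic = {}
    for key, value in article.items():
        if key == "authors":
            value = [normalize_author(a, i) for i, a in enumerate(value, start=1)]
        generic[GENERIC_NAMES.get(key, key)] = value
    return generic


record = to_generic({
    "title": "An Example Article",
    "authors": ["A. Author", {"name": "B. Author", "affiliations": ["Example University"]}],
    "datePublished": "2012-11-30",
})
```

Even this small case has to decide what to do with order, affiliations, and mixed author representations, which is where the variety of schemes comes from.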
A good listing of all the various options for bibliographic metadata.
A variety that arose in the last ten (10) years. If we pushed that back twenty (20) or thirty (30) years, even more diversity.
All of the systems mentioned are useful in their original contexts and will be supplanted by other systems over time.
But unlike humans, the components of the so-called “Semantic Web” don’t adapt to change. Or should I say they don’t adapt without the assistance of their human authors?
Any change to an ontology forces more work onto its maintainers. Perhaps that accounts for static ontologies that don’t account for prior diversity or for the diversity that will surely follow.
I have always thought “X works with my software” a poor reason to adopt any particular approach. I much prefer approaches that meet my requirements, not those of a software vendor.
I first saw this in a tweet by Duncan Hull.
Posted in Bibliography, HTML, Metadata, Ontology | No Comments »
Thursday, September 27th, 2012
Hadoop and Metadata (Removing the Impedance Mis-match) by Alan Gates, Russell Jurney.
From the post:
Apache Hadoop enables a revolution in how organizations process data, with the freedom and scale Hadoop provides enabling new kinds of applications building new kinds of value and delivering results from big data on shorter timelines than ever before. The shift towards a Hadoop-centric mode of data processing in the enterprise has however posed a challenge: how do we collaborate in the context of the freedom that Hadoop provides us? How do we share data which can be stored and processed in any format the user desires? Furthermore, how do we integrate between different tools and with other systems that make-up data-center as computer?
As a Hadoop user, the need for a metadata directory is clear. Users don’t want to ‘reinvent the wheel’ and repeat the work of others. They want to share results and intermediate data-sets and collaborate with colleagues. Given the needs of users, the case for a generic metadata mechanism on top of Hadoop is easy to make: increased visibility into data assets by registering them with a metadata registry for discovery and sharing enables increased efficiency. Less work for the user.
Users also want to be able to use different tool-sets and systems together – Hadoop and non-Hadoop alike. As a Hadoop user, there is a clear need for interoperability among the diverse tools on today’s Hadoop cluster: Hive, Pig, Cascading, Java MapReduce and streaming Python, C/C++, perl, and ruby with data stored in formats from CSV, TSV, Thrift, Protobuf, Avro, SequenceFiles, Hive’s RCFile as well as proprietary formats.
Finally, raw data does not usually originate on the Hadoop Distributed Filesystem. There is a clear need for a central point to register resources from different kinds of systems for ETL onto HDFS, and to publish results of analyses on Hadoop onto other systems.
Sounds topic mappish doesn’t it?
Marketable HCatalog data products anyone?
I first saw this at Hortonworks.
Posted in HCatalog, Hadoop, Metadata | No Comments »
Tuesday, September 18th, 2012
Scholarly metadata from R
From the post:
Metadata! Metadata is very cool. It’s super hot right now – everybody is talking about it. Okay, maybe not everyone, but it’s an important part of archiving scholarly work.
We are working on a repo on GitHub rmetadata to be a one stop shop for querying metadata from around the web. Various repos on GitHub we have started – rpmc, rdatacite, rdryad, rpensoft, rhindawi – will at least in part be folded into rmetadata.
As a start we are writing functions to hit any metadata services that use the OAI-PMH: “Open Archives Initiative Protocol for Metadata Harvesting” framework. OAI-PMH has six methods (or verbs as they are called) for data harvesting that are the same across different metadata providers:
GetRecord
Identify
ListIdentifiers
ListMetadataFormats
ListRecords
ListSets
OAI-PMH provides an updating list of data providers, which we can easily use to get the base URLs for their data. Then we just use one of the six above methods to query their metadata.
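The post works in R; as a language-neutral illustration, here is a minimal Python sketch of how a request for one of the six verbs is formed. The base URL is a placeholder, and `oai_request_url` is a hypothetical helper, not part of rmetadata. The `verb` and `metadataPrefix` query arguments and the `oai_dc` prefix come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs listed above.
VERBS = {
    "GetRecord",
    "Identify",
    "ListIdentifiers",
    "ListMetadataFormats",
    "ListRecords",
    "ListSets",
}


def oai_request_url(base_url, verb, **arguments):
    """Build the URL for an OAI-PMH request (no network access here)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(query)}"


# Placeholder base URL; a real one would come from a data provider's listing.
BASE = "https://example.org/oai"
identify_url = oai_request_url(BASE, "Identify")
records_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
```

Because the verbs are the same across providers, only the base URL changes from one repository to the next.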
Re-using metadata is a lot easier than creating all new metadata.
Not to mention avoiding creating new metadata that is inconsistent with existing metadata.
Posted in Metadata, OAI, R | No Comments »
Saturday, August 11th, 2012
What kinds of metadata are important anyway? by Curt Monash.
From the post:
In today’s post about HCatalog, I noted that the Hadoop/HCatalog community didn’t necessarily understand all the kinds of metadata that enterprises need and want, especially in the context of data integration and ETL and ELT (Extract/Transform/Load/Transform). That raises a natural question — what kinds of metadata do users need or want? In the hope of spurring discussion, from vendors and users alike, I’m splitting this question out into a separate post.
Please comment with your thoughts about ETL-related metadata needs. The conversation needs to advance.
In the relational world, there are at least three kinds of metadata:
- Definitional information about data structures, without which you can’t have a relational database at all. That area seems binary; either you have enough to make sense of your data or you don’t.
- Statistics about columns and tables, such as the most frequent values and how often they occur, which are kept for the purpose of optimization. Those seem to be nice-to-haves more than must-haves. The more information of this kind you have, the more chances you have to save resources.
- Historical and security information about data. This is where things get really complicated. It’s also where Hadoop is still in the “So what exactly should we build?” stage of design.
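The first two kinds of metadata in the list can be seen in a few lines with SQLite from the Python standard library. This is only a sketch: the `article` table and its rows are invented for illustration, and the third kind (history and security) is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, journal TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO article (journal) VALUES (?)",
    [("JASIST",), ("JASIST",), ("JLM",)],  # invented sample rows
)

# 1. Definitional metadata: column names and declared types.
#    PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(article)")]

# 2. Statistical metadata: the most frequent value in a column and its count,
#    the kind of information an optimizer would keep.
most_common = conn.execute(
    "SELECT journal, COUNT(*) AS n FROM article "
    "GROUP BY journal ORDER BY n DESC, journal LIMIT 1"
).fetchone()
```

Neither result says which subjects the rows are about or how those subjects are identified.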
I would assume that data structures are meant to carry information, possibly even identification, of one or more subjects.
Seems odd to me that what subjects are meant to be identified, much less what identifies those subjects, goes unmentioned in Curt’s post.
Not mentioning subjects and their identifications works, obviously, because sys admins migrate data year in and year out, more or less successfully, at a fairly high cost, but it works.
If we knew which subjects’ data was being stored and how those subjects were identified, migration to other systems would be less uncertain and less costly.
Oh, that reminds me, have you decided what “key finding” Oracle left out of its summary? It comes up here as well. Monday is only a day or so away.
Posted in Database, Metadata | No Comments »
Friday, July 27th, 2012
New Standards for Language Studies
Important dates:
Registration (submission): September 30th, 2012
Notification of acceptance: October 10th, 2012
Workshop: Paris, November, 15-16, 2012
From the call for papers:
We are pleased to announce an international interdisciplinary workshop to be held at Paris-Sorbonne University on November 15th and 16th, 2012. The symposium is for the discussion of the Interactive Linguistics methods jointly with the Associative Semantics (AS) and Meta-Informative Centering Theory (MIC) currently developed at CELTA (Centre de Linguistique Théorique et Appliquée).
The 3rd Workshop aims at discussing foundational issues of language studies introducing interactive methods which are step by step being derived from research in the AI domain, especially Knowledge Discovery from Databases (KDD) and diverse programming paradigms.
Keywords: information – semantic roles – meta-information – attention-driven centring of utterance – subjecthood and topicality – predication – context – speech acts – modularity – multi-agent systems – distributed systems.
For further information, please visit the CELTA website pages: http://www.celta.paris-sorbonne.fr/forum/
Importantly, we sincerely encourage researchers (linguists, computer scientists, psychologists, neurologists, logicians and philosophers) who are interested in the research topics specified in the preliminary discussion forum to join us by sending their name and surname(s), affiliation and e-mail address to: celta@paris-sorbonne.fr
You will receive a free subscription account for posting your replies and proposals to the Forum.
Computationally rigorous but from what I read on the forum, definitely a European approach.
A reminder that computer science as experienced in the United States was at one time a healthier mix of humanists, linguists, mathematicians, and computer scientists. Before it became a “profession/discipline.”
Posted in Conferences, Information Science, Language, Metadata | No Comments »
Wednesday, July 11th, 2012
Webtracks
From the webpage:
This project will develop an approach and mechanism to address the construction and propagation of linked data in the context of research and academic endeavour. The proposed work will build experiments in previous projects (Claddier, StoreLink) to develop a peer-to-peer protocol to underpin the construction of a web of linked data. This set of semantically annotated links between data resources forms a graph of citation and provenance and the project will build value added services to exploit these features.
I ran across this in the following description of a presentation at a library conference:
Inter-repository Linking of Research Objects with Webtracks
This session being presented by Shirley Ying Crompton. Shirley describing how the research process leads to research data and outputs being stored in different places with no links between them. So decided to use RDF/linked data to add structured citation links between research objects (and people – e.g. creators).
However, different objects created in different systems – so how to make sure objects are linked as they are created? Looked at existing protocols for enabling links to be created:
- Trackbacks – use for blogs/comments
- Semantic pingback – an RPC protocol to form semantic links between objects
- Salmons – RSS protocol
- …
Decided to take ‘webtracks’ approach – this is an inter-repository communication protocol. The Webtracks InteRCom protocol – allows formation of links between objects in two different repositories. InteRCom is two stage protocol – first stage is ‘harvest’ to get links, then second stage ‘request’ a link between two objects.
InteRCom implementation has been done in Java, available as open source – available for download from http://sourceforge.net/projects/webtracks/.
Shirley says: Webtracks facilitates propagation of citation links to provide a linked web of data – uses emerging linked data environment and support linking between diverse types of digital research objects. There are no constraints on link semantics or metadata. Importantly (for the project) is that it does not rely on centralised service – it is peer-to-peer. [Repository Services]
The key insight is “…different objects created in different systems….” (emphasis added)
That condition is, has been and will be true, no matter what solution you decide upon.
Posted in Archives, Linked Data, Metadata, RDF | No Comments »
Friday, June 29th, 2012
Analysis and synthesis of metadata goals for scientific data by Craig Willis, Jane Greenberg and Hollie White. (Willis, C., Greenberg, J. and White, H. (2012), Analysis and synthesis of metadata goals for scientific data. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22683)
Abstract:
The proliferation of discipline-specific metadata schemes contributes to artificial barriers that can impede interdisciplinary and transdisciplinary research. The authors considered this problem by examining the domains, objectives, and architectures of nine metadata schemes used to document scientific data in the physical, life, and social sciences. They used a mixed-methods content analysis and Greenberg’s () metadata objectives, principles, domains, and architectural layout (MODAL) framework, and derived 22 metadata-related goals from textual content describing each metadata scheme. Relationships are identified between the domains (e.g., scientific discipline and type of data) and the categories of scheme objectives. For each strong correlation (>0.6), a Fisher’s exact test for nonparametric data was used to determine significance (p < .05).
Significant relationships were found between the domains and objectives of the schemes. Schemes describing observational data are more likely to have “scheme harmonization” (compatibility and interoperability with related schemes) as an objective; schemes with the objective “abstraction” (a conceptual model exists separate from the technical implementation) also have the objective “sufficiency” (the scheme defines a minimal amount of information to meet the needs of the community); and schemes with the objective “data publication” do not have the objective “element refinement.” The analysis indicates that many metadata-driven goals expressed by communities are independent of scientific discipline or the type of data, although they are constrained by historical community practices and workflows as well as the technological environment at the time of scheme creation. The analysis reveals 11 fundamental metadata goals for metadata documenting scientific data in support of sharing research data across disciplines and domains. The authors report these results and highlight the need for more metadata-related research, particularly in the context of recent funding agency policy changes.
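For readers unfamiliar with the test named in the abstract, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, using only the standard library. The counts are invented for nine hypothetical schemes and are not data from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denominator

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    return sum(
        p
        for p in (probability(k) for k in range(lowest, highest + 1))
        if p <= observed * tolerance
    )


# Invented counts: rows = describes observational data (yes / no),
# columns = has a given objective (yes / no).
p_value = fisher_exact_two_sided(4, 0, 1, 4)
```

With these invented counts the p-value is 6/126, about 0.048, which would fall just under the p < .05 threshold used in the paper.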
The authors remark on the scope of metadata:
Scope is a broad term, but is commonly used in the software requirements and metadata communities to identify what is included as part of a system or scheme. In the context of metadata for scientific data, it seems that each community has scoped their metadata based on discipline-specific needs and practices. This observation makes sense, given that the metadata efforts examined are initiated within silos, embedded in the scientific practice of the community. To extend this research, it seems that more questions are needed to address these fundamental requirements in the context of communities’ approaches to science and communication.
And later advocate the study of metadata in broader communities to break down the barriers created by silos.
The quest to avoid/abandon silos is a quixotic one.
A more realistic goal would be to build bigger silos that encompass related areas of science/data by providing trans-silo mappings from more specialized silos.
Any metadata that we establish today will be a “silo” when viewed ten years hence.
We can create mechanisms that ease our transition from one silo to the next or continue the pretense we can move beyond them.
Which will you choose?
Posted in Metadata, Science, Silos | No Comments »
Tuesday, March 20th, 2012
Metadata Management in Scientific Computing by Eric L. Seidel.
Abstract:
Complex scientific codes and the datasets they generate are in need of a sophisticated categorization environment that allows the community to store, search, and enhance metadata in an open, dynamic system. Currently, data is often presented in a read-only format, distilled and curated by a select group of researchers. We envision a more open and dynamic system, where authors can publish their data in a writeable format, allowing users to annotate the datasets with their own comments and data. This would enable the scientific community to collaborate on a higher level than before, where researchers could for example annotate a published dataset with their citations.
Such a system would require a complete set of permissions to ensure that any individual’s data cannot be altered by others unless they specifically allow it. For this reason datasets and codes are generally presented read-only, to protect the author’s data; however, this also prevents the type of social revolutions that the private sector has seen with Facebook and Twitter.
In this paper, we present an alternative method of publishing codes and datasets, based on Fluidinfo, which is an openly writeable and social metadata engine. We will use the specific example of the Einstein Toolkit, a shared scientific code built using the Cactus Framework, to illustrate how the code’s metadata may be published in writeable form via Fluidinfo.
There are a number of interesting aspects to the proposal, such as nodes that collect tags but have no ownership or semantics. Not to mention that metadata can be made not only readable but writeable by others. I would disagree with ever allowing recorded metadata to change but that is a debatable point.
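A toy model of the idea, assuming nothing about the real Fluidinfo API: the object has no owner, each tag lives in its owner's namespace, and only that owner may write it. The `SharedObject` class and the user names are hypothetical.

```python
class SharedObject:
    """An object with no owner; only the tags attached to it are owned."""

    def __init__(self, about):
        self.about = about
        self._tags = {}  # "owner/tag" -> value

    def write(self, user, owner, tag, value):
        """Set owner/tag, allowed only when the writer is the tag's owner."""
        if user != owner:
            raise PermissionError(f"{user} may not write {owner}/{tag}")
        self._tags[f"{owner}/{tag}"] = value

    def read(self, owner, tag):
        """Anyone may read any tag."""
        return self._tags[f"{owner}/{tag}"]


dataset = SharedObject(about="example simulation dataset")
dataset.write("alice", "alice", "resolution", "high")           # author's own tag
dataset.write("bob", "bob", "cited-in", "an example citation")  # another user's annotation
```

Both users annotate the same object, yet neither can alter the other's tags. Whether a tag's own value should ever change after it is recorded is the debatable point.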
This is a rare paper that concludes:
Scientific research is increasingly dependent on the simulation of complex processes and, by extension, on the ability to organize, search, and refer to the datasets generated by simulations. We propose using writable metadata to distribute and maintain scientific metadata, and have shown one possible method of implementing such a system. More work will be required to investigate alternative systems, schemas, and interfaces, as well as to determine what would be an optimal solution. We hope that the scientific community will take this opportunity to start a conversation about how to manage the large amounts of data currently being generated by our research on a daily basis.
A little spirit of continuing investigation goes a long way.
Posted in HPC, Metadata, Scientific Computing | No Comments »
Sunday, February 26th, 2012
WWW 2012 Metadata Challenge
Dates:
Submission deadline: March 5th, 2012
Selection of featured applications: 15th April, 2012
WWW Conference (attendees can test the applications): April 16th-20th, 2012
You need to register to get notice of the data dump.
In connection with the WWW 2012 conference in Lyon, France. Either one being a good reason to participate.
From the webpage:
The metadata committee is organising this year a challenge for developers in order to showcase the utility of the conference metadata. Since 2007, the WWW conference is providing its metadata about papers, authors, programme, location, committees in RDF. The datasets are made available at the Semantic Web Dog Food web portal, which offers various access means to Linked Data applications. While the mere publication of such data is useful to contribute to the Web of Data, this year we will provide more diverse ways of browsing the data by encouraging data hackers to develop applications that leverage the conference metadata. To spice this up a little bit, we challenge developers to make the most interesting, most original tool and we will reward the best one with a prize. The submitted software must be made available as Web applications that the attendees can try on their laptop or their mobile phones.
Challenge criteria
The selection of the metadata tools is made according to minimal criteria. All applications that match these criteria will be made available for the attendees to try, no matter how many and how good they are. However, only few of them will be selected as “featured applications” before the conference, and one will be selected as best metadata application and the developers will be rewarded with a prize.
In addition to the minimal requirements, we provide additional desirable features that submissions should exhibit.
Minimal requirements
- The application should be an end-user application, that is, an application that general Web users can interact with or, even better, that WWW Conference attendees can play with during the conference.
- The application must use the data provided by the WWW Conference and exploit as much of it as possible.
- The application must be either a Web application, accessible and usable via a Web browser, or a smart phone application. There should not be any configuration files to modify and no extra library to install.
Additional Desirable Features
In addition to the above minimal requirements, we note other desirable features that will be used as criteria to evaluate submissions.
- The application provides an attractive and functional interface (for human users).
- The application may use additional information sources that:
  - can be under diverse ownership or control;
  - should have some connection with the conference metadata (for instance, a dataset about points of interest in Lyon);
  - may be in different formats other than RDF.
- The application should provide features that are useful to conference attendees.
- Functionality is different from or goes beyond pure information retrieval.
- The application is useful beyond WWW 2012 and may be used with other types of data (not necessarily conference data).
- Multimedia documents are used in some way.
- There is a use of dynamic data (e.g., workflows, GPS location), in combination with static information.
- There is support for multiple languages and accessibility on a range of devices.
- The application is usable by and adapted for mobile devices.
Posted in Contest, Metadata, WWW | No Comments »
Wednesday, February 1st, 2012
Ookaboo RDF Dump
From the webpage:
The Ookaboo RDF dump contains metadata for nearly 1,000,000 public domain and Creative Commons images of more than 500,000 precise topics such as places, people and organism classifications linked to DBpedia and Freebase.
Expressed in industry standard RDF, the Ookaboo dump is available with a CC-BY-SA/3.0 license that is friendly for both academic and commercial use. Precision is in excess of 0.98, enabling a new range of applications for image search and classification. See the latest release for downloads and documentation.
Interesting source of metadata on images. Most recent dump was released January 23, 2012.
Posted in Dataset, Metadata, RDF | No Comments »
Saturday, January 21st, 2012
Mining information across multiple domains: A case study of application to patent laws and regulations in biotechnology by Hang Yu, Siddharth Taduri, Jay Kesan, Gloria Lau and Kincho H. Law.
Abstract:
In this paper, we present a framework that can process a user query for retrieval of information from documents of different properties across multiple domains, with specific application to patent laws and regulations. The framework has three basic components. The first component is ontology mapping and generation. What happens is that the keywords entered by users are mapped into a subset of relevant keywords. This step is performed by looking up those words in an ontology database. The second component is the joint and cross search in various document domains; in our case, they are patents and scientific publications. The last component is to modify the search results by applying user feedback statistics. The results of feedback will be saved as metadata for future uses.
A case example is given to demonstrate how results from multiple domain searches can be combined using ontology and cross referencing. We use an example of well-known biotechnology patents on erythropoietin (EPO) and give detailed analysis on each document domain with this keyword. Relationships between each domain are demonstrated.
A user feedback mechanism is also discussed in this paper. The ability to take user feedback into the framework is important. There is no doubt that domain knowledge from expert or experienced users could be a very good compliment to the proposed system. Both direct and indirect user feedbacks are discussed.
The full text of this article is available now so I suggest that you grab a copy. Apparently some content from the journal is freely available but older material is not.
This is a *must read* article.
I particularly liked the use of statistical user feedback to drive the feedback process. Not as exact as having experts curate every mention but a lot less expensive at the same time.
So, do all the NLP, statistics, probability, data mining, etc., posts seem a bit more relevant to topic maps now?
No one method or approach is going to produce as good a result as taking the strong parts from a number of approaches and being willing to consider both additions as well as deletions to your method matrix.
Posted in Data Mining, Interface Research/Design, Law, Metadata, Patents, Statistics | No Comments »
Tuesday, January 17th, 2012
Jim Harris has a trilogy of posts on metadata that I recently stumbled across:
The Metadata Crisis
The Metadata Continuum
You Say Potato and I Say Tater Tot
A very nuanced view of metadata written from a business perspective, with business examples.
Something you can recommend to business clients to prep them for discussions about topic maps and their information systems.
The second post in the trilogy had this quote which I reproduce in its entirety:
The Metadata of Babel
Another insightful comment came from Peter Benson, based on his work with the eOTD (ECCMA Open Technical Dictionary).
“Mention the word metadata,” Benson explained, “and you have immediately lost all but the hard core techies and they have neither the authority nor the budget to solve the problem. If you take a hard look at the financial crisis or cancer research you will indeed find the reason the challenges are so difficult to solve is in large part because of the limitations in our ability to communicate effectively and the lack of transparency that comes from poor data integration. So, metadata is really important.”
“The Babel approach of a single language to unite them all,” Benson continued, “has a very poor track history and there is good reason for this. Language is more about power and authority than it is about true communication. We have tried to come up with a solution that is solely focused on achieving unambiguous communication. It really does not matter what it is called as long as we agree on what it is. We do this by using terminology to define concepts and then assigning concept identifiers that are used as metadata. The separation of the terminology from the concept identifier, or rather linking terminology through a concept identifier, allows everyone to remain comfortably in their own space yet communicate with others.”
Question: Has anyone mentioned this to the W3C Semantic Web folks?
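Benson's separation of terminology from concept identifiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the identifiers (`C-0001`, `C-0002`), the community labels, and the sample terms are invented here and are not taken from the eOTD.

```python
from typing import Dict, Optional, Tuple

# Hypothetical concept registry: each (community, term) pair points to a
# shared concept identifier. All identifiers and terms are invented for
# illustration; none of this is eOTD data.
TERM_TO_CONCEPT: Dict[Tuple[str, str], str] = {
    ("en-US", "potato"): "C-0001",
    ("en-GB", "spud"): "C-0001",
    ("fr", "pomme de terre"): "C-0001",
    ("en-US", "tater tot"): "C-0002",
}


def concept_for(community: str, term: str) -> Optional[str]:
    """Return the concept identifier linked to a community's term, if any."""
    return TERM_TO_CONCEPT.get((community, term.lower()))


def same_concept(a: Tuple[str, str], b: Tuple[str, str]) -> bool:
    """Two terms mean the same thing when they share a concept identifier."""
    id_a, id_b = concept_for(*a), concept_for(*b)
    return id_a is not None and id_a == id_b
```

Each community keeps its own words; only the identifier has to be agreed on. For example, `same_concept(("en-US", "potato"), ("fr", "pomme de terre"))` is true, while a potato and a tater tot stay distinct.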
PS: Reading Jim’s blog, OCDQ Blog: Obsessive-Compulsive Data Quality is also recommended.
Posted in Business Intelligence, Data Management, Metadata | No Comments »
Sunday, January 8th, 2012
Adding metadata to variables
From the post:
There are only really two ways to preserve your statistical analyses. You either save the variables that you create, or you save the code that you used to create them. In general the latter is much preferred because at some point you’ll realise that your model was wrong, or your dataset has changed, and you need to re-run your analysis. If you only stored your variables then you are now stuck rewriting your code in order to create new versions, which is really not fun. On the other hand, if you saved your code, all you have to do is tweak it and run it.
Occasionally though, just keeping the code and rerunning an analysis isn’t practical. The most obvious case being when it takes a long time. If your model takes more than ten minutes to run, it can be really useful to save its variables as well as the source code.
The problem with saving variables is that when you come back and load them six months later, it isn’t always obvious what they are or where they came from. With code, we solve this by using comments to jog our memory, so it would be nice to have an equivalent for variables. In fact, in R, such a facility exists with the – you guessed it – comment function.
library(lattice)
comment(barley) <- "Immer's barley data, 1934. The data from the Morris site may have the wrong years."
comment(barley)
The comment function simply stores the string as an attribute of the variable, with some special rules on printing. Other common attributes that you may be familiar with are names for vectors and lists, and dim and dimnames for matrices.
It is used here to store information about variables, but there is no apparent barrier to storing information about other parts of a program.
With a little structure, this could become the "just enough" semantic data to make re-use and interchange possible.
Posted in Metadata, Semantics | No Comments »
Thursday, January 5th, 2012
Everlasting Metadata?
Cynthia Murrell writes of the concern of some groups who want permanent metadata on digital objects:
On the other hand, there’s the law of unintended consequences. There is also the question of “language drift.” If metadata are not up to date, the searcher of the future might not be able to locate the information object because the search term does not match the metadata’s lingo.
She also raises the more pragmatic concern of what happens when metadata needs to be corrected.
Curious, do institutions return to digital objects to update them to have the latest metadata?
My impression is that they don’t, but that is based on my experiences with mostly libraries and card catalog data.
Suggestions? Comments for where to look?
A topic map would enable the preservation of the original metadata and updates to that metadata.
Anyone know of digital preservation efforts that plan on preservation of metadata and its updating?
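To make the idea of keeping both the original metadata and its updates concrete, here is a minimal sketch in Python. It is not a topic map implementation; the `MetadataHistory` class, its method names, and the sample subject terms are all hypothetical, chosen only to show that nothing needs to be overwritten.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MetadataHistory:
    """Append-only metadata history for one digital object (hypothetical model)."""
    object_id: str
    versions: List[Dict[str, str]] = field(default_factory=list)

    def record(self, metadata: Dict[str, str]) -> None:
        # Store a copy so later changes by the caller cannot alter history.
        self.versions.append(dict(metadata))

    def original(self) -> Dict[str, str]:
        return self.versions[0]

    def current(self) -> Dict[str, str]:
        return self.versions[-1]

    def matches(self, key: str, value: str) -> bool:
        """True if any version, old or new, carried this value for the key."""
        return any(v.get(key) == value for v in self.versions)


# Invented example: the subject term drifts over time.
history = MetadataHistory("object-42")
history.record({"subject": "wireless telegraphy"})
history.record({"subject": "radio"})
```

Because every version is retained, a search on either the older term or the newer one finds the object, which speaks to the "language drift" concern quoted above.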
Posted in Library, Metadata, Museums, Preservation | No Comments »
Thursday, December 22nd, 2011
Textus: an open source platform for working with collections of texts and metadata by Jonathan Gray.
A proposal, based on WordPress as the main CMS, to work with digital texts.
Not to be critical, as all efforts have to start somewhere, but the TEI (Text Encoding Initiative), in 1987, was a late-comer to the digital text party, which had started in the late 1950s or early 1960s. Efforts starting now are even later late-comers to the digital text party. And there were efforts before and after the TEI project.
To be quite honest, I don’t know of any complete survey of all the digital text projects and/or repositories. Much less an inventory of the methods, encodings and texts that were the objects of such projects.
The use of topic maps in such a project would make the integration of prior and following efforts with digital texts much easier.
Posted in Metadata, Text Corpus | No Comments »
Wednesday, December 21st, 2011
Yahoo! Opens Content Analysis Technology to all Developers
From the post:
As the premier digital media company, Yahoo! publishes tons of content every day. In addition to publishing it, we do a lot of work behind the scenes to analyze and understand that content in a scalable and algorithmic way. Today we’re pleased to open up our content analysis technology to the world to help developers build their own fantastic experiences for their sites and users.
The newly launched Yahoo! Content Analysis service replaces Yahoo!’s popular Term Extraction service and now provides advanced content analysis on either text or a URL, leverages Yahoo!’s state of the art machine-learned ranking (MLR) technology to extract key terms from the content, and, more importantly, to rank them based on their overall importance to the content. The output you receive contains the keywords and their ranks along with other actionable metadata.
Our new service replaces the current Term Extraction Service, which is expected to end on March 31, 2012. We will continue to support the Term Extraction requests, but calls must be directed to our YQL table since we’ll be shutting down the non-YQL service. More details can be found on today’s YDN blog post.
The new features and MLR are supported only in the new request format. Give it a try today!
A very good demonstration of a post I am working on called: Metadata Without Markup (or not much). The premise is that a little intelligence on the front-end can yield a harvest of useful metadata on the backend, with little effort from users.
Which, to be honest, has been the sticking point of most semantic technologies. If you think programmers are lazy, you haven’t seen many users.
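As a rough illustration of harvesting metadata without asking users for markup, the sketch below ranks terms by plain word frequency. This is a deliberately naive stand-in and not Yahoo!'s machine-learned ranking; the function name `rank_terms`, the stop word list, and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter
from typing import List, Tuple

# A tiny, hand-picked stop word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "with", "that"}


def rank_terms(text: str, top_n: int = 3) -> List[Tuple[str, int]]:
    """Rank words by raw frequency, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


# Invented sample text; the user supplies content, not metadata fields.
sample = "Metadata about metadata is still metadata, and users rarely supply metadata for search."
ranked = rank_terms(sample, top_n=1)
```

Here `ranked` holds the single most frequent non-stop word with its count. Even something this crude yields candidate keywords with no extra effort from the person writing the text.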
Posted in Content Analysis, Metadata, Yahoo! | No Comments »
Thursday, December 15th, 2011
5 Simple Provenance Statements
From the webpage:
Providing easily processable information about the provenance or origins of Web pages and data is important. It lets us give credit where its due and it helps others trust the information we publish on the Web.
Here’s some simple provenance statements one can make using PROV-DM, the recently released working draft of a data model for provenance from the W3C.
Evaluate PROV-DM in light of two concerns:
1) Does it allow for the expression of different ways of expressing provenance? Consider the differing museum metadata standards for provenance. As just a tiny corner of that world, see: Introduction to Controlled Vocabularies by Patricia Harpring (online version).
2) On the other hand, is it too restrictive and complex for simple provenance statements by the average user?
Hard to fail by being too general (#1) and being too restrictive (#2) at the same time but odder things have happened in discussions of semantics.
Posted in Metadata, Provenance, Semantic Web, Semantics | No Comments »
Thursday, December 1st, 2011
DC-2012 Metadata for Meeting Global Challenges 3-7 September 2012, Kuching, Sarawak, Malaysia
DEADLINES & IMPORTANT DATES:
Submission Deadline: 23 March 2012
Author Notification: 25 May 2012
Final Copy: 29 June 2012
From the call for papers:
DC-2012 will explore the global, national and regional roles of metadata in addressing global challenges such as food security, the digital divide, and sustainable development. Metadata plays a significant role globally in information systems shaping how we know, monitor and change social and governmental systems affecting everything from the environment, human rights and justice to education and peace. DC-2012 will bring together in Kuching the community of metadata scholars and practitioners to engage in the exchange of knowledge and best practices in developing languages of description to meet these global challenges. Beyond the conference theme, papers, reports, and poster submissions are welcome on a wide range of metadata topics, such as:
- Metadata principles, guidelines, and best practices
- Metadata quality (methods, tools, and practices)
- Conceptual models and frameworks (e.g., RDF, DCAM, OAIS)
- Application profiles
- Metadata generation (methods, tools, and practices)
- Metadata interoperability across domains, languages, time, structures, and scales.
- Cross-domain metadata uses (e.g., recordkeeping, preservation, curation, institutional repositories, publishing)
- Domain metadata (e.g., for corporations, cultural memory institutions, education, government, and scientific fields)
- Bibliographic standards (e.g., RDA, FRBR, subject headings) as Semantic Web vocabularies
- Accessibility metadata
- Metadata for scientific data, e-Science and grid applications
- Social tagging and user participation in building metadata
- Usage data (paradata/attention metadata)
- Knowledge Organization Systems (e.g., ontologies, taxonomies, authority files, folksonomies, and thesauri) and Simple Knowledge Organization Systems (SKOS)
- Ontology design and development
- Integration of metadata and ontologies
- Search engines and metadata
- Linked data and the Semantic Web (metadata and applications)
- Vocabulary registries and registry services
Posted in Conferences, Dublin Core, Metadata, RDF, Semantic Web | No Comments »
Wednesday, November 9th, 2011
The Metadata Continuum by Jim Harris.
Deeply interesting post on metadata and the failure of the single language approach.
Harris quotes Peter Benson as saying:
“The Babel approach of a single language to unite them all,” Benson continued, “has a very poor track history and there is good reason for this. Language is more about power and authority than it is about true communication. We have tried to come up with a solution that is solely focused on achieving unambiguous communication. It really does not matter what it is called as long as we agree on what it is. We do this by using terminology to define concepts and then assigning concept identifiers that are used as metadata. The separation of the terminology from the concept identifier, or rather linking terminology through a concept identifier, allows everyone to remain comfortably in their own space yet communicate with others.”
Harris continues:
So it would appear that we face a daunting challenge, which we could call the Metadata Continuum, where at one end we have the uniformity of controlled vocabularies, and at the other end we have the flexibility of chaotic folksonomies. The daily business operations of most organizations are governed by a metadata strategy that falls somewhere in between, which begs the question: In which direction should the best practices of metadata management flow—toward flexibility or toward uniformity?
The question of multiple “controlled vocabularies” or “chaotic folksonomies,” all meaning the same thing, doesn’t seem to have come up.
And I would replace Harris’s closing question:
Where along the Metadata Continuum is your organization?
with:
Where along the Metadata Continuum does your organization need to be?
Or perhaps better:
Where along the Metadata Continuum does your organization need to be for particular activities?
It may be that there are substantial savings for a small group of organizations to agree on a uniform vocabulary for some purposes but remain diverse in other areas. Or for an organization to create mappings of vocabularies for news reports on raw materials and supplies. And to tolerate a lot of semantic noise in other areas. Depends on the anticipated semantic ROI.
Seems like that is a question that goes unanswered, even by the MDM (master data management) crowd. Oh, a lot of hand waving about the general case but I can’t write the general case savings on a deposit slip and take it to the bank. I need to know what my savings/ROI from MDM, Semantic Web, Topic Maps, etc., are going to be for my company.
Not an easy question to answer but I think customers deserve our best efforts at giving them a testable answer to such questions.
Posted in Metadata | No Comments »
Thursday, October 27th, 2011
Unsolicited advice for large governmental data providers
From the post:
We source data from a number of large national, and trans-national, statistical bodies, like the Office of National Statistics here in the UK, or Eurostat. Downloading useful data from organizations like this is sometimes a tricky job – although publishing data is usually part of their raison d’être, they’re not usually thinking of people like us – Big Data geeks – when making their data available. And often, their methods of making data available have been essentially unchanged for the past ten or fifteen years, and even then are probably based on processes predating the Internet.
One of the sources of value Timetric adds is simply making this data more widely available and accessible. But it’s also true that there’s so much more we could do if we could put our minds to using this data in new and exciting ways, rather than expending expertise on working out the best way to map old-fashioned data publication workflows to a web-centric way of working. So it’s an interesting question to ask – in an ideal world, how would a large statistical organization publish data for us?
There’s three aspects to this question:
- Data transfer and formats
- Metadata formats and reconciliation
- Update frequency and notifications
The advice seems UK/Euro centric to me, which works given their audience.
My question: How would you change this for governments located in other countries? Pointers to data format documentation should be included. 2-3 pages.
Posted in Data Mining, Metadata | No Comments »
Wednesday, September 28th, 2011
Camlistore
From the webpage:
Camlistore is:
- a way to store, sync, share, model and back up content
- a work in progress
- Open Source (Apache licensed)
- an acronym for “Content-Addressable Multi-Layer Indexed Storage”, hinting that Camlistore is about:
- content-addressable storage
- separate interoperable parts (storage, sync, sharing, modeling), with well-defined protocols and roles
- your “home directory for the web”
….
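For readers new to content-addressable storage, the core idea is that the key for a piece of content is computed from the content itself. The Python sketch below is a toy in-memory version; the `ContentStore` class, the `sha256-` prefix, and the choice of SHA-256 are assumptions for illustration and say nothing about how Camlistore actually names or stores blobs.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Toy in-memory content-addressable store (not Camlistore's design)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The reference is derived from the bytes; SHA-256 is an assumed choice.
        ref = "sha256-" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(ref, data)  # identical content is stored only once
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"hello, web")
second = store.put(b"hello, web")   # same bytes, same reference
other = store.put(b"hello, world")  # different bytes, different reference
```

Storing the same bytes twice returns the same reference and takes no extra space, and existing content is never modified in place: changed content simply gets a new reference.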
If I am reading the website correctly, the project hopes to deploy a “write-only” solution.
Any estimation of data storage capacity, even in the short-term is going to sound lame when the short-term arrives. I don’t monitor the storage literature closely but even I have heard of 3-D storage and crystalline storage projects. A “write-only” future may not be that far off.
Whether it arrives or not, my concern is what role topic maps can play in systems where storage is always increasing?
One role for which topic maps seem very well suited is to host the mappings between formats, even archival formats, in which data is stored. In the twenty or so years since digital preservation has become a topic among curators and museums, there have been more metadata and storage format proposals than I can easily remember. And data has been stored in all of them, some in multiple versions of the same proposals. I don’t know of any reason to expect the turnover in storage metadata and/or formats to slow in the future.
Topic maps can store a two-way migration between metadata and storage formats so that users of older or newer software/formats are not disadvantaged with regard to stored data. That will require active maintenance of the mapping topic maps but that will be the case with any robust solution.
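A two-way migration of this kind can be sketched, in a much reduced form, as a field mapping that is applied in either direction. The Python below is not a topic map; the field names in `OLD_TO_NEW` and the sample record are invented, and a real mapping would also have to deal with values, structure, and fields that have no counterpart.

```python
from typing import Dict

# Hypothetical field mapping between an "old" and a "new" metadata format.
# The field names are invented for illustration.
OLD_TO_NEW: Dict[str, str] = {
    "creator": "author",
    "date_made": "created",
    "caption": "title",
}
# Derive the reverse direction from the same table so the two cannot drift apart.
NEW_TO_OLD: Dict[str, str] = {new: old for old, new in OLD_TO_NEW.items()}


def migrate(record: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename mapped fields; fields without a mapping are carried over unchanged."""
    return {mapping.get(key, key): value for key, value in record.items()}


old_record = {"creator": "A. Curator", "date_made": "1998", "rights": "public domain"}
new_record = migrate(old_record, OLD_TO_NEW)
round_trip = migrate(new_record, NEW_TO_OLD)
```

Keeping a single table and deriving the reverse from it is the part that needs the "active maintenance" mentioned above: when a format changes, the table changes, and both directions follow.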
Posted in Camilstore, Metadata, NoSQL, Storage | No Comments »
Friday, September 23rd, 2011
Crossref’s gift of metadata by Jonathan A. Rees.
From the post:
I was delighted to learn of Crossref’s April 20 announcement (press release ; Geoff Bilder’s blog post) that they are making their DOI metadata available in RDF via HTTP. This is a significant development for scholarship on the Web and an important step toward a fully open and reliable scholarly edifice.
For those of you not familiar with this database, it has about 46 million records (and growing), keyed by strings called “digital object identifiers”. DOIs are similar to the ISBNs used for books, but are applied at a finer level of granularity – mainly for academic research articles published in the past 10 years, but with coverage steadily growing. Each record has basic bibliographic metadata for its “object” such as author, title, publisher, publication date.
It occurred to me to check if Crossref was participating in ORCID (Open Researcher & Contributor ID) and they are. That could well make a difference for articles currently being published as well as for those from the last ten (10) years. I am not sure who is going to fund mining the older material.
Posted in Bibliography, Dataset, Metadata | No Comments »
Thursday, September 1st, 2011
CKAN – the Data Hub Software
From the website:
We want to make easy for people to find, share and reuse data, be they research scientists or civil servants, data nerds or the average citizen. We aim to provide a platform that’s both simple and powerful and as easy to build on and extend as to use and interact with.
OK, but the core “metadata” is roughly:
- unique name
- title
- url + download url
- author/maintainer info
- license
- notes
- tags
- [extendable with “extra” fields]
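Written out as a record type, the list above is small enough to fit on one screen. The Python dataclass below is only a sketch of those fields: the attribute names, the example values, and the `example.org` URLs are my own, and it is not CKAN's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetRecord:
    """Sketch of the core fields listed above; not CKAN's real schema."""
    name: str            # unique name
    title: str
    url: str
    download_url: str
    author: str
    maintainer: str
    license: str
    notes: str = ""
    tags: List[str] = field(default_factory=list)
    extras: Dict[str, str] = field(default_factory=dict)  # the "extra" fields


# Invented example values.
record = DatasetRecord(
    name="example-rainfall",
    title="Example Rainfall Series",
    url="https://example.org/rainfall",
    download_url="https://example.org/rainfall.csv",
    author="A. Researcher",
    maintainer="B. Maintainer",
    license="CC-BY-SA",
    tags=["weather", "time-series"],
    extras={"coverage": "1990-2010"},
)
```

Anything beyond this, such as what the columns mean or how terms relate to other vocabularies, has to go into free-text notes, tags, or the open-ended extras.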
I suppose I should not be too critical as it wasn’t that many years ago that just obtaining data in electronic form was difficult and discussions of how to store/manipulate data stores were somewhat theoretical.
I mention it here so if you encounter one of these “data hubs” in the field you won’t be expecting too much.
Posted in CKAN, Information Reuse, Metadata | 1 Comment »