Metadata « Another Word For It

August 31, 2017

Introducing KSQL…

Filed under: Data Streams,ETL,Kafka,KSQL,Metadata,Topic Maps — Patrick Durusau @ 8:26 pm

Introducing KSQL: Open Source Streaming SQL for Apache Kafka by Neha Narkhede.

From the post:

I’m really excited to announce KSQL, a streaming SQL engine for Apache Kafka™. KSQL lowers the entry bar to the world of stream processing, providing a simple and completely interactive SQL interface for processing data in Kafka. You no longer need to write code in a programming language such as Java or Python! KSQL is open-source (Apache 2.0 licensed), distributed, scalable, reliable, and real-time. It supports a wide range of powerful stream processing operations including aggregations, joins, windowing, sessionization, and much more.

…

What does it even mean to query streaming data, and how does this compare to a SQL database?

Well, it’s actually quite different to a SQL database. Most databases are used for doing on-demand lookups and modifications to stored data. KSQL doesn’t do lookups (yet), what it does do is continuous transformations— that is, stream processing. For example, imagine that I have a stream of clicks from users and a table of account information about those users being continuously updated. KSQL allows me to model this stream of clicks, and table of users, and join the two together. Even though one of those two things is infinite.

So what KSQL runs are continuous queries — transformations that run continuously as new data passes through them — on streams of data in Kafka topics. In contrast, queries over a relational database are one-time queries — run once to completion over a data set—as in a SELECT statement on finite rows in a database.
… (emphasis in original)

And if that wasn’t tempting enough:

…

3. Online data integration
CREATE STREAM vip_users AS
SELECT userid, page, action 
FROM clickstream c 
LEFT JOIN users u ON c.userid = u.user_id
WHERE u.level = 'Platinum';
Much of the data processing done in companies falls in the domain of data enrichment: take data coming out of several databases, transform it, join it together, and store it into a key-value store, search index, cache, or other data serving system. For a long time, ETL — Extract, Transform, and Load — for data integration was performed as periodic batch jobs. For example, dump the raw data in real time, and then transform it every few hours to enable efficient queries. For many use cases, this delay is unacceptable. KSQL, when used with Kafka connectors, enables a move from batch data integration to online data integration. You can enrich streams of data with metadata stored in tables using stream-table joins, or do simple filtering of PII (personally identifiable information) data before loading the stream into another system.

…

Imagine adding subject identity information to subjects in your incoming stream and merging those subjects in a stream.

Any view of your topic map will be as of #time-mark and not a static file.
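
To make the stream-table join in the quoted example concrete, here is a minimal Python sketch. The function name `vip_users`, the dict-shaped events, and the in-memory `users_table` are assumptions for illustration only. It mimics the shape of the continuous query and says nothing about how KSQL is implemented.

```python
from typing import Dict, Iterable, Iterator


def vip_users(clicks: Iterable[dict], users: Dict[str, dict]) -> Iterator[dict]:
    """Enrich each click with the users table and keep Platinum users only."""
    for click in clicks:  # the stream may be unbounded; handle one event at a time
        user = users.get(click["userid"])  # table lookup keyed by user id
        # The WHERE clause on u.level drops clicks that have no matching user.
        if user is not None and user.get("level") == "Platinum":
            yield {
                "userid": click["userid"],
                "page": click["page"],
                "action": click["action"],
            }


# Hypothetical sample data.
users_table = {"u1": {"level": "Platinum"}, "u2": {"level": "Gold"}}
click_stream = [
    {"userid": "u1", "page": "/home", "action": "view"},
    {"userid": "u2", "page": "/cart", "action": "add"},
    {"userid": "u3", "page": "/home", "action": "view"},
]
vip = list(vip_users(click_stream, users_table))
```

Because the function is a generator, it emits results as events arrive rather than waiting for the input to end, which is the sense in which the query is "continuous."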

Enjoy!

April 18, 2016

Data Poverty At Youtube

Filed under: Conferences,Data,Dublin Core,Metadata — Patrick Durusau @ 3:26 pm

I was writing Clojure/west 2016 – Videos! [+ Unix Sort Trick] when the itch to use Youtube APIs to facilitate extraction and re-use of conference videos struck yet again!

It lasted long enough this time for me to discover the data poverty at Youtube, even using their APIs.

Here’s what little relevant information Youtube captures for a video resource for my purposes:

{
  "kind": "youtube#video",
  "etag": etag,
  "id": string,
  "snippet": {
    "publishedAt": datetime,
    "channelId": string,
    "title": string,
    "description": string,
    "thumbnails": {
      (key): {
        "url": string,
        "width": unsigned integer,
        "height": unsigned integer
      }
    },
    "channelTitle": string,
    "tags": [
      string
    ],
    "categoryId": string,
    "liveBroadcastContent": string,
    "defaultLanguage": string,
    "localized": {
      "title": string,
      "description": string
    },
    "defaultAudioLanguage": string
  },
...

"topicDetails": {
    "topicIds": [
      string
    ],
    "relevantTopicIds": [
      string
    ]
  },
...

Hmmmm, do you see author, date, location, followed by any number of other bits of data that even minimal retrieval would warrant?

The response that all of those fall under “description” is true, but it leaves users with an information resource that is prone to fail on search.

The really sad part of this tale is that Youtube has built up such a large legacy of data-impoverished video that any curation will have to be automated and only spot-checked.

Rather than dig this dark-data hole any deeper, YouTube should add additional metadata by some fixed date.

Let’s not gin up new metadata categories/values but call upon librarians to suggest existing metadata standards, such as Dublin Core or others.
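
As a rough sketch of what reusing an existing standard could look like, the following Python maps the snippet fields shown above onto the fifteen Dublin Core elements and reports which elements have no source field. The mapping itself is my assumption (for example, treating `channelTitle` as the publisher and `publishedAt` as the date, although it is an upload date and not an event date); it is not anything YouTube or Dublin Core prescribes.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]

# Hypothetical mapping from YouTube snippet fields to Dublin Core elements.
SNIPPET_TO_DC = {
    "title": "title",
    "description": "description",
    "publishedAt": "date",
    "channelTitle": "publisher",
    "tags": "subject",
    "defaultLanguage": "language",
}


def to_dublin_core(video: dict) -> dict:
    """Build a partial Dublin Core record from a video resource."""
    snippet = video.get("snippet", {})
    record = {"identifier": video.get("id")}
    for field, element in SNIPPET_TO_DC.items():
        if field in snippet:
            record[element] = snippet[field]
    return record


def missing_elements(record: dict) -> list:
    """List the Dublin Core elements the record cannot supply."""
    return [element for element in DC_ELEMENTS if element not in record]
```

Even with a generous mapping, elements such as creator and coverage have no structured source, which is the gap the post complains about.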

Librarians have labored at this task for centuries, and Youtube is a good example of what results from their absence: usable, but only just, and that only with the aid of powerful digital computers.

Let’s stop spreading data darkness in Youtube and make its data reusable.

October 16, 2015

Metadata that kills

Filed under: Free Speech,Metadata,Privacy — Patrick Durusau @ 2:56 pm

Metadata that kills by Peter Brantley.

From the post:

Recently I was advising a friend who works at a not-for-profit that distributes ebooks to underserved populations around the world. “Is there a way,” she asked, “to limit or select what portions of an ebook catalog are displayed to readers in one country versus another?” Was this an issue of rights, I wondered? “No,” she said. “In some countries, particularly for women, being caught reading the wrong book can have mortal consequences.”

It was a poignant point which caught me flat-footed. This is not a hard technical problem; depending on the distribution platform and reading system, title masking is not difficult. But it hadn’t occurred to me that metadata — or the lack of it — could quite literally translate into the death by stoning of a young girl brave or foolish enough to read a forbidden book. This is a world where there may need to be an ebook catalog for open societies, and another for countries in which the lives of women, for example, have no value.
…

A great read where the consequences of your choices may be paid by others. In full measure.

Puts a different spin on the knee-jerk insistence of no restrictions on free speech at all.

Certainly relevant to what content you would include in a topic map based upon its likely readership.

I’m in the no restrictions on publications whatsoever camp, but I’m not willing to have others pay the consequences of my choices.

You?

September 2, 2015

Metadata Munging with XQuery

Filed under: Library,MARC,Metadata,XQuery — Patrick Durusau @ 3:36 pm

Metadata Munging with XQuery (and freelib-marc4j-exist) by Kevin S. Clarke.

I searched and didn’t find a video to go along with these slides. If you know of one, please ping me.

Slides 5 through 20 are all XQueries that you can run with an XQuery 3.1 processor.

Slide 21 is a list of suggested resources on XQuery.

I would add Saxon to any list of XQuery tools. There are commercial and home (free) versions of Saxon.

Suffice it to say that Michael Kay, the primary author of Saxon, has played a major role in XSLT, XPath and XQuery for a number of years.

Nothing against other products but Saxon is the baseline against which XSLT, XPath and XQuery feature sets are measured.

June 29, 2015

Metadata is…

Filed under: Metadata,Topic Maps — Patrick Durusau @ 11:00 am

From a twitter post by Mike Olson that reads: “Swiped this photo from @nadaleen because, yeah.”

Useful graphic for a presentation on topic maps, metadata, documentation, etc.

Also consider: “Metadata is a love note to yourself” if you are addressing an audience of narcissists.

June 10, 2015

Metadictionary

Filed under: Dictionary,Metadata — Patrick Durusau @ 9:22 am

Metadictionary

From the homepage:

A crowdsourced metadata dictionary. Search for terms, upvote useful ones.

If you browse to association you will find:

A connection, asserted by an authority, between one thing and another.

If you follow the link to thing, you find:

Any entity, physical, digital, living, or abstract, or group thereof.

Whereas dictionary.reference.com gives us:

a material object without life or consciousness; an inanimate object

some entity, object, or creature that is not or cannot be specifically designated or precisely described

anything that is or may become an object of thought

Which definition have you used and how did you convey it to others? Or did you just “trust” they would understand the meaning you were using?

I first saw this in a tweet by Deborah A. Lapeyre.

March 13, 2015

For Context – Why Metadata Really Matters

Filed under: Context,Metadata — Patrick Durusau @ 10:23 am

For Context – Why Metadata Really Matters

March 26, 2015, 03:00PM ET

From the post:

For Context: Why Metadata Really Matters The creative geniuses behind Hitchhiker’s Guide to the Galaxy summed up the value of context with their brilliant machine-generated answer to the universe: 42! The dripping irony spoke of just how meaningless a number can be, absent its context. Today, with Big Data continuing to dominate the minds of enterprise architects and business analysts alike, the value of context is more important than ever. How can organizations keep their focus on the data that really matters? Metadata holds the key! Register for this episode of DM Radio to learn more! Host Eric Kavanagh will interview Roy Anjan of Deloitte, Dr. Robin Bloor of The Bloor Group, Dr. Geoffrey Malafsky of Phasic Systems Inc., and a special guest from Embarcadero.

May be something, may be nothing. No promises as I haven’t heard the podcast, yet.

It is interesting that context/metadata, etc. are making a comeback after being on the back burner for so many years.

My personal explanation is that data has gotten large enough that even the average IT person is encountering data they don’t understand more and more often. Which has led many of them to conclude that as data gets bigger, so will their ignorance of strange data. No kidding.

The ignorance of others about your data is amusing, if not job protecting, but your ignorance of strange data could be costly, if not job threatening. The more data impinges on your borders, the greater your need to understand it, even at the cost of losing some of the insular nature of your IT operations.

Company file clerks (Radar O’Reilly, for instance, who had a unique filing system) or COBOL programmers with their spaghetti code aren’t simply going to roll over and surrender secret knowledge it has taken them years to acquire. That’s what they call a “people problem.” (I have suggestions for incentives and disincentives for specific situations.)

Topic maps, being able to describe any subject, which includes subjects that are fields, terminology, processes, files, routines, anything you can imagine existing in IT, are a convenient way to capture “metadata” about IT and its processes. In part because they don’t have to burden your existing systems with changes or additions in order to make them more robust from a metadata perspective.

Think of it as “your present, metadata poor IT systems” versus “your present, metadata poor IT systems + topic maps.” That is, topic maps don’t have to be a rip and replace technology (one way graph technology is promoted) but rather an addition to your present infrastructure that makes it more sustainable and robust. So all the political alliances and decisions that led to your current IT structure can remain in place.

Something to think about as you wait for the podcast!

Enjoy!

November 23, 2014

Data Capture for the Real World

Filed under: Data Collection,Metadata,Science — Patrick Durusau @ 4:59 pm

Data Capture for the Real World by Cameron Neylon.

From the post:

Many efforts at building data infrastructures for the “average researcher” have been funded, designed and in some cases even built. Most of them have limited success. Part of the problem has always been building systems that solve problems that the “average researcher” doesn’t know that they have. Issues of curation and metadata are so far beyond the day to day issues that an experimental researcher is focussed on as to be incomprehensible. We clearly need better tools, but they need to be built to deal with the problems that researchers face. This post is my current thinking on a proposal to create a solution that directly faces the researcher, but offers the opportunity to address the broader needs of the community. What is more it is designed to allow that average researcher to gradually realise the potential of better practice and to create interfaces that will allow technical systems to build out better systems.

Solve the immediate problem – better backups

The average experimental lab consists of lab benches where “wet work” is done and instruments that are run off computers. Sometimes the instruments are in different rooms, sometimes they are shared. Sometimes they are connected to networks and backed up, often they are not. There is a general pattern of work – samples are created through some form of physical manipulation and then placed into instruments which generate digital data. That data is generally stored on a local hard disk. This is by no means comprehensive but it captures a large proportion of a lot of the work.

The problem a data manager or curator sees here is one of cataloguing the data created, creating a schema that represents where it came from and what it is. We build ontologies and data models and repositories to support them to solve the problem of how all these digital objects relate to each other.

The problem a researcher sees is that the data isn’t backed up. More than that, it’s hard to back up because institutional systems and charges make it hard to use the central provision (“it doesn’t fit our unique workflows/datatypes”) and block what appears to be the easiest solution (“why won’t central IT just let me buy a bunch of hard drives and keep them in my office?”). An additional problem is data transfer – the researcher wants the data in the right place, a problem generally solved with a USB drive. Networks are often flakey, or not under the control of the researcher so they use what is to hand to transfer data from instrument to their working computer.

The challenge therefore is to build systems under group/researcher control that meet the needs for backup and easy file transfer. At the same time they should at least start to solve the metadata capture problem and satisfy the requirements of institutional IT providers.
…

Cameron goes on to make a great plea for approaching data collection from labs starting with the most basic need: backups. Sure, data needs metadata, standard formats, etc. but those are secondary concerns (if that) to the researchers generating the data.

Only backed-up data is likely to persist long enough for us to be concerned about metadata and standard formats. Even there Cameron argues that researchers need to see the pay-off from metadata before expecting them to enter it. Formats are more a matter of interchange of data and not a problem for local data.
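
A minimal Python sketch of the "backup first, metadata as a by-product" idea: copy the instrument's files and record, without asking the researcher for anything, the metadata the file system already knows. The function name, the manifest layout, and the `manifest.json` filename are all assumptions for illustration, not part of Cameron's proposal.

```python
import hashlib
import json
import shutil
from pathlib import Path


def backup_with_manifest(source_dir: Path, backup_dir: Path) -> list:
    """Copy every file under source_dir to backup_dir and record basic metadata."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source_dir)
        target = backup_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also tries to preserve timestamps
        manifest.append({
            "file": relative.as_posix(),
            "bytes": path.stat().st_size,
            "modified": path.stat().st_mtime,
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    # Hypothetical manifest file stored next to the copies.
    (backup_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

The researcher gets a backup; the checksum, size, and timestamp come along at no extra effort, which is the kind of pay-off-first ordering the post argues for.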

Cameron’s payoff argument alludes to something that isn’t often discussed. From the perspective of a metadata person, metadata for data is extremely important, but they are not the person being asked to capture the metadata. From the perspective of a format person, an interchangeable format for data is extremely important, but they are not the person being asked to use the “correct” format.

The point is that we are all quite free with the time of others. That is, we have all manner of suggestions that increase the workload of others, and we not only expect them to use those suggestions but to be grateful we pointed out the error of their ways. That’s expecting a bit much.

As you know, metadata and formats are only two of many data issues that are very near and dear to me. But focusing on the failure of scientists to pay attention to such matters isn’t going to be as effective as creating tools that help scientists with their day to day work and return benefits to them. A much easier sell for issues that are of interest to others.

I first saw this in Nat Torkington’s Four short links: 19 November 2014.

October 22, 2014

HDFS Metadata Directories Explained

Filed under: HDFS,Metadata — Patrick Durusau @ 7:54 pm

HDFS Metadata Directories Explained by Chris Nauroth.

From the post:

HDFS metadata represents the structure of HDFS directories and files in a tree. It also includes the various attributes of directories and files, such as ownership, permissions, quotas, and replication factor. In this blog post, I’ll describe how HDFS persists its metadata in Hadoop 2 by exploring the underlying local storage directories and files. All examples shown are from testing a build of the soon-to-be-released Apache Hadoop 2.6.0.

WARNING: Do not attempt to modify metadata directories or files. Unexpected modifications can cause HDFS downtime, or even permanent data loss. This information is provided for educational purposes only.

Persistence of HDFS metadata broadly breaks down into 2 categories of files:

fsimage – An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.

edits – An edits file is a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.

Checkpointing is the process of merging the content of the most recent fsimage with all edits applied after that fsimage is merged in order to create a new fsimage. Checkpointing is triggered automatically by configuration policies or manually by HDFS administration commands.

…
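
The checkpointing described in the quote can be pictured with a toy Python model. Real HDFS uses its own on-disk formats for fsimage and edits files; the dict and list structures, the operation names, and the `checkpoint` function here are assumptions made purely to show the merge logic.

```python
def checkpoint(fsimage: dict, edits: list) -> dict:
    """Merge an fsimage with later edits to produce a new fsimage (toy model)."""
    namespace = dict(fsimage["namespace"])  # leave the old image untouched
    last_txid = fsimage["txid"]
    for edit in sorted(edits, key=lambda e: e["txid"]):
        if edit["txid"] <= last_txid:
            continue  # already reflected in the fsimage
        if edit["op"] == "delete":
            namespace.pop(edit["path"], None)
        else:  # "create" or "modify"
            namespace[edit["path"]] = edit["attrs"]
        last_txid = edit["txid"]
    return {"txid": last_txid, "namespace": namespace}


# Hypothetical state: one file as of transaction 10, then three logged edits.
old_image = {"txid": 10, "namespace": {"/a": {"replication": 3}}}
edit_log = [
    {"txid": 9, "op": "create", "path": "/old", "attrs": {"replication": 1}},
    {"txid": 11, "op": "create", "path": "/b", "attrs": {"replication": 2}},
    {"txid": 12, "op": "delete", "path": "/a"},
]
new_image = checkpoint(old_image, edit_log)
```

The transaction ID is what makes the merge safe to repeat: edits at or below the image's ID are skipped, so applying the same log twice changes nothing.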

When someone says “Do not attempt to modify metadata directories or files,” it is just like waving a red flag in front of a bull!

Translate that to mean that hackers will know how to modify metadata directories or files and the average HDFS developer won’t.

I’m not saying to modify HDFS metadata directories or files on a production system for practice!

Practice somewhere safe, like in a sandbox, but do practice.

Anything that can cause HDFS downtime or permanent data loss should be a matter of interest.

October 4, 2014

Code as a research object: (new phase) standardizing software metadata

Filed under: Metadata,Programming — Patrick Durusau @ 3:35 pm

Code as a research object: (new phase) standardizing software metadata by Abigail Cabunoc.

From the post:

At the Science Lab, we want to help research thrive on the open web. Part of this is working with other community members to build technical prototypes that move science on the web forward. Earlier this year we saw several prototypes come out of the ‘Code as a Research Object’ collaboration. Since then, there’s been more conversation and effort in this space and we wanted to share our progress and invite the community to give input.

First, a quick look at ‘Code as a Research Object’

Late last year, “Code as a Research Object” was first announced as a new collaboration between the Science Lab, GitHub, figshare and Zenodo to help explore how to better integrate code and scientific software into the scholarly workflow. Since then, we’ve seen community members come together to build prototypes allowing users to easily get a DOI for their code, making it citable and easier to incorporate into the existing credit system.

Next Steps: Standardizing Metadata

Coming into the conversation, there’s still room for best practices for code reuse and citation. In particular, some form of standardized metadata would help other repositories understand how they can integrate with current systems.

When I was at NCEAS Open Science CodeFest (OSCodeFest) last month, I led a discussion around the work being done here. I was joined by Matt Jones, Carly Strasser and Corinna Gries, and we agreed that some standards need to be set to help more groups store software in a citable and interoperable manner.

Building on the existing discussions and proposals in the community, we compared the existing schemas for code storage to help create a metadata standard that allows for discoverability, reuse and citation. You can see the notes from our discussion here.

This led to the creation of the codemeta GitHub repo to store a minimal metadata schema for science software in code in JSON-LD and XML. Since then, we’ve worked on refining the proposed metadata schema and creating mappings between some existing popular data stores. Coming soon: Matt Jones will be blogging on some of the more technical aspects of this project.

How to get involved

We’re looking for feedback on our current proposed metadata schema for code discovery, reuse and citation.

Codemeta GitHub repo and issue tracker: take a look and let us know what you think. Feel free to create issues and pull requests!

Join the discussion:

(JSON-LD) Metadata for software discovery

How should software be cited?

What information is needed to reuse code?
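
To give a feel for what JSON-LD metadata for software might look like, here is a short Python sketch. It deliberately does not reproduce the codemeta schema, which I have not examined field by field; it uses the schema.org `SoftwareSourceCode` type as a stand-in, and the function name and all sample values are hypothetical.

```python
import json


def software_record(name, repository, authors, license_url, doi=None):
    """Build a JSON-LD description of a piece of software (schema.org terms)."""
    record = {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "name": name,
        "codeRepository": repository,
        "license": license_url,
        "author": [{"@type": "Person", "name": author} for author in authors],
    }
    if doi:
        record["identifier"] = doi  # a DOI is what makes the code citable
    return record


# Hypothetical project details.
example = software_record(
    name="example-analysis",
    repository="https://example.org/lab/example-analysis",
    authors=["A. Researcher"],
    license_url="https://www.apache.org/licenses/LICENSE-2.0",
)
example_json = json.dumps(example, indent=2)
```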

Here is your chance to contribute to a metadata standard for some sub-set of all software.

I say a sub-set because one of the certainties of standards is that if an area needs standardization there are going to be multiple, evolving standards for it.

That’s not a criticism of standards (I actively work on several) but a statement about the reality of standards. They are useful even though very few ever become universal.

September 13, 2014

A schemaless computer database in 1965

Filed under: Computer Science,Metadata,NoSQL,Schema — Patrick Durusau @ 6:50 pm

A schemaless computer database in 1965 by Bob DuCharme.

From the post:

To enable flexible metadata aggregation, among other things.

I’ve been reading up on America’s post-war attempt to keep up the accelerated pace of R&D that began during World War II. This effort led to an infrastructure that made accomplishments such as the moon landing and the Internet possible; it also led to some very dry literature, and I’m mostly interested in what new metadata-related techniques were developed to track and share the products of the research as they led to development.
… (emphasis in original)

I won’t spoil the surprise. Go read Bob’s post to see the answer.

His post does prompt me to ask: What early computing “dry” literature have you read lately?

September 9, 2014

The Chemical Analysis Metadata Platform

Filed under: Cheminformatics,Chemistry,Metadata — Patrick Durusau @ 7:28 pm

The Chemical Analysis Metadata Platform

This project is focused on defining the important metadata (data about data) needed to describe a chemical analysis methodology. The idea is to evaluate the current and future needs for accurate representation of both classical (wet chemical) and instrumental analysis procedures and present a unified approach to metadata nomenclature, data types, data structures and semantic annotation.

So what does that really mean? Well, in the growing movement toward semantic annotation of science data there is a real need to provide descriptors (metadata) for all parts of science. With the exponential growth in raw data, having descriptors allows researchers a way to easily (we hope) provide context to the work they are doing. So, because the area of chemical analysis is so broad, and that it is likely that many groups will try and create their own standards for contextualizing the area, this project aims to provide an extensible platform that:

identifies key metadata for chemical analysis

outlines recommended practices for reporting the metadata

defines controlled vocabularies for important metadata (e.g. analysis technique, sample matrix)

defines an ontology for both metadata items and groups of metadata items

Note this project is about defining a platform. It is not, per se, about defining standards (i.e. defining what metadata must be used). However, standards are the application of the ChAMP platform in a particular area, and so we will also link to them once they are developed.

This project is very much a work in progress. It also needs to be defined and critically evaluated by the community and so we encourage you to be part of the process, via the discussion forums on this site, by participation in project conference calls (to be scheduled), through our Facebook page, or by email at the address on the right. And, please vote in our first poll!

For more information on the project look at the overview page linked top right. Stuart and Tony.
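
A controlled vocabulary, one of the items in the list above, is easy to picture in code. In this Python sketch the field names and the permitted terms are invented for illustration; they are not terms published by the ChAMP project.

```python
# Invented vocabularies for illustration; these are not ChAMP's terms.
CONTROLLED_VOCABULARIES = {
    "analysis_technique": {"titration", "gravimetry", "mass spectrometry"},
    "sample_matrix": {"water", "soil", "serum"},
}


def check_record(record: dict) -> list:
    """Return a list of problems found in one analysis metadata record."""
    problems = []
    for field, allowed in CONTROLLED_VOCABULARIES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing {field}")
        elif value.lower() not in allowed:
            problems.append(f"{field}: '{value}' is not a controlled term")
    return problems
```

A record that says "MS" fails the check even though a chemist would read it correctly, which is exactly the gap between a string and its context.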

The importance of metadata is that no string can stand on its own. Everyone “sees” an isolated string from their context, which may or may not match the context in which it was used by its author.

Hence, the opportunity for misunderstanding arises.

Not just in chemical analysis but in all other fields as well.

If you are involved or advising anyone involved in chemical analysis, pass this information along.

I first saw this in a tweet by Analyst.

July 27, 2014

Metadata: Organizing and Discovering Information

Filed under: Law,Metadata — Patrick Durusau @ 3:36 pm

Metadata: Organizing and Discovering Information by Jeffrey Pomerantz.

Coursera course described in part as follows:

If you use nearly any digital technology, you make use of metadata. Use an ATM today? You interacted with metadata about your account. Searched for songs in iTunes or Spotify? You used metadata about those songs. We use and even create metadata constantly, but we rarely realize it. Metadata — or data about data — describes real and digital objects, so that those objects may be organized now and found later.

Metadata is a tool that enables the information age functions performed by humans as well as those performed by computers. Metadata is important to many fields, particularly Computer Science; but this course is not purely a Computer Science course. This course approaches Metadata from the perspective of Information Science, which is a broad interdisciplinary field that studies how people create and manage information.

Course Syllabus

Unit 1: Organizing Information
Unit 2: Dublin Core
Unit 3: How to Build a Metadata Schema
Unit 4: Alphabet Soup: Metadata Schemas That You (Will) Know and Love
Unit 5: Metadata for the Web
Unit 6: Metadata for Networks
Unit 7: How to Create Metadata
Unit 8: How to Evaluate Metadata
…

An eight week course, July 14 – September 8, 2014, at 4 to 6 hours per week.

I first saw this in a tweet by Aaron Kirschenfeld that reads:

Every one of your legal hackers out there: where’s the metadata? Please learn from @jpom #metadatamooc on @coursera. My brain is crackling.

My follow-up question being: Where are the subject identifications to help map between heterogeneous metadata systems?

July 12, 2014

CSV on the Web: Metadata Vocabulary…

Filed under: CSV,Metadata,W3C — Patrick Durusau @ 6:11 pm

CSV on the Web: Metadata Vocabulary for Tabular Data and other updates by Dan Brickley.

From the post:

The CSV on the Web Working Group has published a First Public Working Draft of a Metadata Vocabulary for Tabular Data. This is accompanied by an update to the Model for Tabular Data and Metadata on the Web document, alongside the group’s recently updated Use Cases and Requirements document.

Validation, conversion, display and search of tabular data on the web requires additional metadata that describes how the data should be interpreted. The “Metadata vocabulary” document defines a vocabulary for metadata that annotates tabular data, at the cell, table or collection level, while the “Model” document describes a basic data model for such tabular data.

A large percentage of the data published on the Web is tabular data, commonly published as comma separated values (CSV) files. The Working Group welcomes comments on these documents and on their motivating use cases. The next phase of this work will involve exploring mappings from CSV into other popular representations. See the Working Group home page for more details or to get involved.
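
Why tabular data needs metadata to be interpreted is easy to show. In the Python sketch below, the metadata layout (a `columns` list with `name` and `datatype`) is a simplified assumption of mine and should not be read as the syntax of the Working Draft; the sample data is hypothetical.

```python
import csv
import io
from datetime import date

# Converters for the datatypes this sketch understands.
PARSERS = {"string": str, "integer": int, "date": date.fromisoformat}


def read_with_metadata(csv_text: str, metadata: dict) -> list:
    """Parse CSV text, converting each cell according to column metadata."""
    datatypes = {col["name"]: col["datatype"] for col in metadata["columns"]}
    rows = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        rows.append({
            # Columns the metadata does not mention stay as plain strings.
            name: PARSERS[datatypes.get(name, "string")](value)
            for name, value in raw.items()
        })
    return rows


# Hypothetical table and its column-level metadata.
table_text = "country,population,updated\nAndorra,77000,2014-07-12\n"
table_metadata = {
    "columns": [
        {"name": "population", "datatype": "integer"},
        {"name": "updated", "datatype": "date"},
    ]
}
rows = read_with_metadata(table_text, table_metadata)
```

Without the metadata every cell is just a string; with it, a validator or converter knows that `77000` is a number and `2014-07-12` is a date.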

You need to find the time to read this draft.

There are 22 issues listed in the draft inviting your comments.

See in particular:

ISSUE 6

We invite comment on whether there are other standard metadata vocabularies that should be reused within this specification.

On a quick read, it appears that terms in “…standard metadata vocabularies…” are not subject to further annotation.

That is troubling because if I had a standard metadata vocabulary of war crimes terminology, my usage would show George W. Bush as both a war criminal and a terrorist.

Your usage of those terms may vary from mine. We should be able to discover and manage those differences.

Take the time to read the drafts and comment.

May 15, 2014

Metadata Can Be Deadly

Filed under: Metadata,NSA,Security — Patrick Durusau @ 12:39 pm

‘We Kill People Based on Metadata’ by David Cole.

From the post:

Supporters of the National Security Agency inevitably defend its sweeping collection of phone and Internet records on the ground that it is only collecting so-called “metadata”—who you call, when you call, how long you talk. Since this does not include the actual content of the communications, the threat to privacy is said to be negligible. That argument is profoundly misleading.

Of course knowing the content of a call can be crucial to establishing a particular threat. But metadata alone can provide an extremely detailed picture of a person’s most intimate associations and interests, and it’s actually much easier as a technological matter to search huge amounts of metadata than to listen to millions of phone calls. As NSA General Counsel Stewart Baker has said, “metadata absolutely tells you everything about somebody’s life. If you have enough metadata, you don’t really need content.” When I quoted Baker at a recent debate at Johns Hopkins University, my opponent, General Michael Hayden, former director of the NSA and the CIA, called Baker’s comment “absolutely correct,” and raised him one, asserting, “We kill people based on metadata.”
…

I am sympathetic to many of David’s points, such as the illegitimacy of distinguishing between citizens of the United States and “others,” and his suspicions about the USA Freedom Act and its shortcomings.

However, the readiness of the NSA to break existing laws and then to lie to Congress about having done so gives me little faith in “reform” type legislation.

Moreover, the debate should not be on the proper limits of government surveillance of its citizens and the citizens of other countries to combat terrorism.

Once the debate is framed to presume some legitimate reason for surveillance, such as terrorism, those who oppose government surveillance have already lost the war. They may win some minor points to support future fund raising but government surveillance will be controlled only by those using it. Not a good outcome in my view.

Before reaching the debate on government surveillance, our representatives should be challenging the talisman of terrorism which is offered to justify anti-terrorism measures. How many people died from terrorism last year in the United States? How many similar deaths will be prevented by program X? We may well decide that deaths at lower than the Atlanta metro area traffic accident rate aren’t worth government surveillance.

Sounds to me like we need more metadata about government surveillance programs and their details, not glowing summaries by those most interested in them continuing.


March 25, 2014

Create a search engine with schema.org types

Filed under: Metadata,Schema.org — Patrick Durusau @ 6:37 pm

Create a search engine with schema.org types

From the post:

We are happy to announce the integration of Google Custom Search with the schema.org standard. Schema.org is a structured data markup schema including a shared markup schema vocabulary that is supported by major search providers. This integration enables you to create powerful and expressive topical search engines by simply specifying schema.org types in your Google Custom Search Engine definition.

How would you go about using this new feature? Say you are the webmaster of a site about movies. You might want to create a movie search engine that can search for pages about movies either from your website, your affiliated websites or from the millions of sites that use schema.org. Achieving this functionality is now only a click away thanks to the integration of Google Custom Search with schema.org. All you have to do is add the schema type “Movie” to your Custom Search Engine definition, as shown below, and you’re done! Users of your movie search engine will then only see result pages that have the “Movie” schema annotation.

Curious, what do you think it would take to support the use of schema.org or extensions to schema.org below the document level?

Finding documents is ok, if that’s the best you can do. But I would rather find specific portions of documents with relevant material.

More than two questions but to start with:

What would be required of a document syntax?
What would be required of indexing software to capture the schema.org data along with its “tagged” data?
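As a starting point on the second question, here is a minimal sketch of an indexer that records each schema.org item along with the text it tags, so a search could return a portion of a page rather than the whole page. It assumes microdata-style markup (itemscope, itemtype and itemprop attributes) and well-formed, non-nested items. The class name, the sample page and the output shape are my own inventions for illustration, not anything Google Custom Search exposes.

```python
from html.parser import HTMLParser


class MicrodataIndexer(HTMLParser):
    """Collect each schema.org item together with the text it tags.

    Hypothetical sketch: handles flat (non-nested) items in well-formed HTML.
    """

    def __init__(self):
        super().__init__()
        self.items = []       # finished items, in document order
        self._stack = []      # (tag, role) for every open element
        self._current = None  # item being built while inside an itemscope
        self._prop = None     # itemprop whose text is being collected
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        role = None
        if "itemscope" in attrs and self._current is None:
            self._current = {"type": attrs.get("itemtype") or "", "props": {}}
            role = "scope"
        elif attrs.get("itemprop") and self._current is not None and self._prop is None:
            self._prop = attrs["itemprop"]
            self._text = []
            role = "prop"
        self._stack.append((tag, role))

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, closing any property or item
        # opened by the elements being popped.
        while self._stack:
            open_tag, role = self._stack.pop()
            if role == "prop":
                self._current["props"][self._prop] = "".join(self._text).strip()
                self._prop = None
            elif role == "scope":
                self.items.append(self._current)
                self._current = None
            if open_tag == tag:
                break


# Sample page (made up): only the <div> carries the "Movie" annotation.
page = """
<p>Site news, not about any film.</p>
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span itemprop="genre">Science fiction</span>
</div>
"""

indexer = MicrodataIndexer()
indexer.feed(page)
print(indexer.items)
```

The point of the sketch is that the index entry is the annotated fragment, not the document: the untagged paragraph never reaches the index. A real indexer would also need to keep an address for each fragment (an id or offset), which is where the document syntax question comes in.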


March 5, 2014

MISC

Filed under: Homoiconic,Language,Lisp,Metadata — Patrick Durusau @ 8:40 pm

MISC

From the webpage:

MISC is a homoiconic, non-strict, metadata rich, language that uses maps as its base data structure.

MISC attempts to combine the simplicity of the syntax and design of LISP, the lazy semantics of a language like Haskell, the fundamental data type of Lua and Javascript and a syntax similar to that of Smalltalk. MISC could be thought of as a lazily evaluated LISP where the fundamental data type of the language is a mapping.

The design and concept of MISC originated from three distinct ideas. Firstly to create an alternative data language to XML that provided a similar but consistent structure. Secondly to create a LISP like language using associative arrays or maps instead of lists. And thirdly to support rich metadata within the programming language, and to provide consistent access to it through reflection.

The main interesting aspects of the design are:

Lazy homoiconicity

Deep integration of metadata

Focusing on maps rather than lists

You can get the source code here: MISC source.

This looks very interesting!

Do be aware that the page is targeted at IE on Windows. Why I cannot say.

I first saw this in a tweet by Daniel Higginbotham.


February 9, 2014

Querying my own MP3…

Filed under: Indexing,Metadata,SPARQL — Patrick Durusau @ 7:22 pm

Querying my own MP3, image, and other file metadata with SPARQL by Bob DuCharme.

From the post:

Ubuntu has a utility called Tracker that makes it easy to search your hard disk, a bit like the old Google Desktop with a few extra features. One extra feature ranks among the coolest SPARQL applications I’ve ever seen: the ability to execute SPARQL queries against data extracted from files on your hard disk.

To install it, I did a sudo apt-get install of tracker-gui to get the base parts of tracker and then did a similar installation of tracker-utils to get the SPARQL query utility. Next, I added the Ubuntu applications “Desktop search” and “search and indexing” as applications and used the latter to search and index 94 GB of MP3s and some image files. The indexing took a few hours. (tracker-control -S was a handy command for checking on the indexing progress.) The worldofgnome.org page Indexing preferences in GNOME 3.8 was helpful for understanding the indexing options.

Once the file metadata is indexed, the tracker-sparql command-line utility lets you query it. For example, the following runs the query stored in bea.spq against the metadata:
…

Interesting.

If you search for tracker with Aptitude, you may want to try: tracker-explorer, or at least that is how it was listed for Ubuntu 12.04.

Now you have an incentive to attach metadata to those image files on your hard drive.


February 8, 2014

Threaded Publications: one step closer [Disorderly People]

Filed under: Bioinformatics,Biomedical,Metadata — Patrick Durusau @ 2:37 pm

Threaded Publications: one step closer by Daniel Shanahan.

From the post:

“It is difficult to make informed decisions if publication bias and selective reporting are present” World Health Organization

For years, researchers have drawn attention to this, highlighting discrepancies between protocols submitted to research ethics committees and those reported in the results papers, issues concerning statistical power, and the difficulty in identifying unpublished studies. Indeed, it was concerns like these that lead to most major medical journals making registration of clinical trials a prerequisite for publication.

However, even for those clinical trials that have been registered, it can be difficult to track down related content. Not all journals publish the trial ID in the body of the article; therefore, although a results article may cite a published protocol, there is nothing to connect that article to subsequent publications. And nothing to link from the protocol to the results article.

In 1999, Altman and Chalmers envisioned a solution to this. In their article in The Lancet they wrote: “Electronic publication of a protocol could be simply the first element in a sequence of ‘threaded’ electronic publications, which continues with reports of the resulting research (published in sufficient detail to meet some of the criticisms of less detailed reports published in print journals), followed by deposition of the complete data set.” This was the first description of ‘Threaded Publications’.

….

The usual consequences of expecting disorderly people to act in an orderly manner.

Seriously, I don’t doubt for a moment that every author, every journal and every reader, would support consistent citation practices across medical literature.

Unfortunately, being “disorderly” as I said, people need information solutions that can tolerate our disorder.

Since topic maps don’t require (but can be improved by) general agreement and use of single identifiers, categories, or types, everyone can “tag” their articles and content without waiting for mass agreement.

If and when we all agree on terms, we need not replace the old terms but simply add the new terms. Which ensures anyone who was accustomed to the “old style” will find the new information, even when using dated references.
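To make "add, don't replace" concrete, here is a rough sketch of an index where a subject can carry any number of identifiers and every one of them keeps working after a merge. This is not a topic map engine or any standard API. The class, the method names and the identifiers ("protocol-2009", "TRIAL-0001") are all hypothetical.

```python
class TopicIndex:
    """Toy index: one topic may be reached through many identifiers."""

    def __init__(self):
        self._by_id = {}  # identifier -> shared topic dict

    def add(self, identifiers, occurrence=None):
        """Add identifiers (and optionally content) for a subject.

        Any existing topic that already carries one of the identifiers is
        merged in, so old identifiers are kept rather than replaced.
        """
        identifiers = set(identifiers)
        occurrences = []
        for identifier in sorted(identifiers):
            existing = self._by_id.get(identifier)
            if existing is not None:
                identifiers |= existing["identifiers"]
                for item in existing["occurrences"]:
                    if item not in occurrences:
                        occurrences.append(item)
        if occurrence is not None and occurrence not in occurrences:
            occurrences.append(occurrence)
        topic = {"identifiers": identifiers, "occurrences": occurrences}
        for identifier in identifiers:
            self._by_id[identifier] = topic
        return topic

    def lookup(self, identifier):
        return self._by_id.get(identifier)


index = TopicIndex()
index.add({"protocol-2009"}, "Published protocol")   # tagged the "old" way
index.add({"TRIAL-0001"}, "Results article")         # tagged the "new" way

# Later, someone records that both identifiers name the same trial.
index.add({"protocol-2009", "TRIAL-0001"})

found = index.lookup("protocol-2009")  # a dated reference still works
print(sorted(found["occurrences"]))
```

Nobody had to agree on an identifier before tagging, and nobody had to re-tag afterwards. The agreement, when it arrives, is just one more statement added to the index.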


February 1, 2014

Baloo [KDE drops RDF]

Filed under: Metadata,Microformats,RDF — Patrick Durusau @ 4:28 pm

Baloo

From the post:

Baloo is the next generation of the Nepomuk project. It’s responsible for handling user metadata such as tags, rating and comments. It also handles indexing and searching for files, emails, contacts, and so on. Baloo aims to be lighter on resources and more reliable than its parent project.

…

The Nepomuk project started as a research project in the European Union. The goal was to explore the use of relations between data for finding what you are looking for. It was build completely on top of RDF. While RDF is a great from a theoretical point of view, it is not the simplest tool to understand or optimize. The databases which currently exist for RDF are not suited for desktop use.

The Nepomuk developers have tried very hard over the last years to optimize the indexing and searching infrastructure, and they have now come to the conclusion that Nepomuk cannot be further optimized without migrating away from RDF.

RDF also heavily relied on ontologies. These ontologies are a way to describe how the data should be stored and represented. They used the ontologies from the original EU research project – Shared Desktop Ontologies. These ontologies were not designed in a time when it was not very clear how they would work and have sub-optimal performance and ease of use. They are quite vague in certain areas and often duplicate information. This leads to scenarios where it takes forever to figure out how the data should be stored. Additionally, since all the data needs to be stored in RDF, one cannot optimize for one specific data type.

Given these shortcomings and the many lessons learned over the last years the Nepomuk developers decided to drop RDF and rechristen the project under the name of Baloo. You can find more technical background and info on its architecture here.

I suggested to someone in synchronous time that authoring support for schema.org based metadata could be a win-win for users and document processing software.

For users, search appliances, local or even Google, can ingest “lite” schema definitions that provide immediate ROI on adding semantics to your documents. Well, I say immediate, as soon as they are indexed.

That should require no more skill than being able to type, assuming your document software can recognize the terms you use and annotate them properly.

Think of the number of people who can author XML using MS Office or Apache OpenOffice, etc. Now compare that to the number of people who natively author DocBook documents.

If you want a successful strategy, do you follow the one that has resulted in a user base measured in increments of hundreds of millions or do you prefer the righteous remnant approach with, say, fewer than 50,000?

I’m no marketing person but even I know the answer to that one.

PS: There are some ankle biters who complain about the MS Office user numbers. Let’s just say that between MS Office, Apache OpenOffice and the other ODF based word processors, DocBook users are outnumbered by at least 20,000 to 1. Who needs more accurate numbers than that?

PPS: Microformats don’t have the precision that RDF and/or Topic Maps have to offer. But precision without adoption can’t be very precise. With adoption of microformats, more precision can be added as required by particular use cases.

I first saw this in a tweet by Jan Schnasse.


January 13, 2014

The Information Master…

Filed under: Data,Indexing,Information Retrieval,Library,Metadata — Patrick Durusau @ 8:28 pm

The Information Master – Louis XIV’S Knowledge Manager

From the post:

I recently read The Information Master: Jean-Baptiste Colbert‘s Secret State Intelligence System by Jacob Soll. It is a very readable but scholarly book that tells the story of how Colbert used the accumulation of knowledge to build a highly efficient administrative system and to promote his own political career. He seems to have been the first person to seize upon the notion of “evidence-based” politics and that knowledge, information and data collection, and scholarship could be used to serve the interests of statecraft. In this way he is an ancestor of much of the thinking that is commonplace not only in today’s political administrations but also in all organizations that value the collection and management of information. The principle sits at the heart of what we mean by the “knowledge economy”.

If you aren’t curious about this book by the time you finish the review, you must have landed on this post by mistake.

Seriously, the information issues that bedevil us now are not new.

It is only our collective ignorance of the past that makes them seem new.

Very high on my personal reading list.


January 10, 2014

…Care and Feeding of Scientific Data

Filed under: Data,Documentation,Metadata,Science — Patrick Durusau @ 2:45 pm

10 Simple Rules for the Care and Feeding of Scientific Data by Alyssa Goodman, et al.

From the introduction:

In the early 1600s, Galileo Galilei turned a telescope toward Jupiter. In his log book each night, he drew to-scale schematic diagrams of Jupiter and some oddly-moving points of light near it. Galileo labeled each drawing with the date. Eventually he used his observations to conclude that the Earth orbits the Sun, just as the four Galilean moons orbit Jupiter. History shows Galileo to be much more than an astronomical hero, though. His clear and careful record keeping and publication style not only let Galileo understand the Solar System, it continues to let anyone understand how Galileo did it. Galileo’s notes directly integrated his data (drawings of Jupiter and its moons), key metadata (timing of each observation, weather, telescope properties), and text (descriptions of methods, analysis, and conclusions). Critically, when Galileo included the information from those notes in Siderius Nuncius [1], this integration of text, data and metadata was preserved, as shown in Figure 1. Galileo’s work advanced the “Scientific Revolution,” and his approach to observation and analysis contributed significantly to the shaping of today’s modern “Scientific Method” [2,3].

Goodman and co-authors from major research and educational institutions set forth ten (10) rules for the “care and feeding” of data:

Love your data, and help others love it too.

Share your data online, with a permanent identifier.

Conduct science with a particular level of reuse in mind.

Publish workflow as context.

Link your data to your publications as often as possible.

Publish your code (even the small bits).

Say how you want to get credit.

Foster and use data repositories.

Reward colleagues who share their data properly.

Be a booster for data science.

See the paper for the details but be aware the first seven (7) rules all mention documentation.


January 8, 2014

Our Peeping Tom Government

Filed under: Cybersecurity,Metadata,NSA,Security — Patrick Durusau @ 5:58 pm

immersion screencap

Graphs by MIT Students Show the Enormously Intrusive Nature of Metadata by Kade Crockford.

From the post (as is the image):

You’ve probably heard politicians or pundits say that “metadata doesn’t matter.” They argue that police and intelligence agencies shouldn’t need probable cause warrants to collect information about our communications. Metadata isn’t all that revealing, they say, it’s just numbers.

But the digital metadata trails you leave behind every day say more about you than you can imagine. Now, thanks to two MIT students, you don’t have to imagine—at least with respect to your email.

Deepak Jagdish and Daniel Smilkov’s Immersion program maps your life, using your email account. After you give the researchers access to your email metadata—not the content, just the time and date stamps, and “To” and “Cc” fields—they’ll return to you a series of maps and graphs that will blow your mind. The program will remind you of former loves, illustrate the changing dynamics of your professional and personal networks over time, mark deaths and transitions in your life, and more. You’ll probably learn something new about yourself, if you study it closely enough. (The students say they delete your data on your command.)
…

If you have any interest in privacy at all, you need to read Kade’s post in full and watch the video.

Personally I don’t think we can recover our privacy from the government. After all, it is already illegal for the NSA to be spying on U.S. citizens. Their excuse (now and in the future), “it was necessary.”

On the upside, we can and should deprive government toadies and others of privacy in the performance of official government functions. There is no right to privacy in order to loot the U.S. treasury or to use elected/appointed office for political or personal gain.

Keep the locations of nuclear weapons and codes secret. Throw the rest of it wide open.

As the Supreme Court stated in United States v. Nixon:

The expectation of a President to the confidentiality of his conversations and correspondence, like the claim of confidentiality of judicial deliberations, for example, has all the values to which we accord deference for the privacy of all citizens and, added to those values, is the necessity for protection of the public interest in candid, objective, and even blunt or harsh opinions in Presidential decision-making. A President and those who assist him must be free to explore alternatives in the process of shaping policies and making decisions and to do so in a way many would be unwilling to express except privately. These are the considerations justifying a presumptive privilege for Presidential communications. (emphasis added)

Since we as citizens have no privacy from the government, it only stands to reason that the government, including the President, has no privacy from citizens.

BTW, as a historical note, there has been no shortage of self-serving advisers for presidents following Nixon and the disclosure of the Watergate tapes.


November 15, 2013

High Accuracy Metadata and Machine Learning: A librarian’s success

Filed under: Artificial Intelligence,Clustering,Indexing,Machine Learning,Metadata — Patrick Durusau @ 7:36 pm

High Accuracy Metadata and Machine Learning: A librarian’s success by Ashleigh Faith.

Description:

Taxonomy is a field that spans both LIS and SIS studies. Ashleigh Faith will be presenting how a librarian can use information science to set up a taxonomy and machine learning process for complex content. Faith created a taxonomy, based in engineering mobility and science terminology, from scratch. Developing a cohesive taxonomy that would also facilitate automatic indexing on content for an engineering database that reaches more than 218,000 documents (and growing) across eight different content types was a challenge. The nature of scientific content makes automatic indexing difficult because it is considered complex –or outside the established standards of taxonomy. Faith discusses the process used to establish a taxonomy to capture content and create the bedrock in which the indexing software could be trained. Using traditionally linguistic techniques, Faith improved the database taxonomy metadata assignment accuracy to 89% accuracy, well above the typically accepted 75% accuracy rate of automatic indexing, and established a repeatable process that was also implemented successfully on NASA Tech Brief content, NATO Terminology Directives, and DOD content. Learn from concrete examples, lessons learned from a librarians perspective, and how to duplicate the process.

Slides: http://prezi.com/jr61rhjoqotb/?utm_campaign=share&utm_medium=copy

Just working from the slides, this was a presentation to see!

In a nutshell: Ashleigh’s approach netted 89% accuracy, as compared to human indexer accuracy of 91% and typical automated indexing at 75%.

Good illustration of the content finding rules:

Rule 1: If you don’t want to find content, don’t hire a librarian.

Rule 2: If you do want to find content, hire a librarian.

Clear enough?


October 11, 2013

…Share your metadata standards!

Filed under: Metadata,Standards,Topic Maps — Patrick Durusau @ 12:13 pm

RDA – Metadata Standards Directory WG: Share your metadata standards!

From the post:

The Metadata Standards Directory (MASDIR) Working Group of the Research Data Alliance (RDA) is working on developing a collaborative and open directory of metadata standards that are used in scientific data contexts. The goal is to contribute to addressing infrastructure challenges.

MASDIR has now started its work by asking the community for contributions. They are looking for information about metadata standards; the tools and use cases associated with them, and additional information that shows where and how scientists use them worldwide.

Contributions are submitted through a web form, available at Metadata Directory Information Collection.

Just collecting all the metadata standards and providing access would be a major step forward.

A moving target to be sure, but that’s why it will require a moving solution: topic maps!


September 10, 2013

What an Old Dictionary teaches us about Metadata

Filed under: Dictionary,Metadata,Reference — Patrick Durusau @ 3:25 am

What an Old Dictionary teaches us about Metadata by Jim Harris.

From the post:

Spelling, pronunciation, and examples of usage are included in the dictionary definition of a word, which is a good example of one of the many uses of metadata, namely to provide a definition, description, and context for data.

Pictured to the left is the dictionary that has been on my desk for over 15 years, which is a good metaphor for the challenges of metadata management.

When I first bought the dictionary, it was, as its front cover attested, “The Newest. The Best. A Trusted Authority. A brand-new dictionary of the 1990s, for the 1990s. Comprehensive coverage of current words and terms, with clear, understandable definitions and up-to-the-minute usage guidance.”

And its back cover boasted of “60,000 entries assembled by a state-of-the-art authority using the most modern sources of information, and prepared by lexicographic experts to provide the one-stop reference book to turn to for all of your word questions.” (However, if one of your word questions was about metadata you were out of luck because it didn’t have an entry for it.)

The multidimensionality of metadata is exemplified by how a dictionary rarely contains a single definition for a word, and an old dictionary exemplifies how constantly changing semantics further complicate metadata management.

Jim laments that he has never seen a “metadata dictionary” that provides a “…one-stop reference to turn to for all your data questions.”

Topic maps have the capability to be a “one-stop reference” but Jim’s observation is a fair one: Have you ever seen one?

Technical capability is a requirement, but it takes execution based on that capability to make such a “one-stop reference” a reality.

Or as Jim concludes:

An old dictionary reminds us that language — and especially its everyday usage — evolves. An old dictionary also teaches us that metadata — and especially the data it defines, describes, and provides a context for — evolves as well. Which is probably why doing metadata management well is not, well, something that just automagically happens.

Is the lack of “automagic” solutions holding back your topic map project?


June 15, 2013

Tagging, Peer Review & Journals
Tags + Author = Semantic Bullet?

Filed under: Metadata,Publishing,Tagging — Patrick Durusau @ 1:42 pm

John Baez writes about a new software package to address two problems of concern to academics: 1) expensive journals and 2) ineffective peer review. (I can write volumes about ineffective peer review in SDOs but will save that for another day.)

The Selected Papers Network (Part 1)
The Selected Papers Network (Part 2)

See John’s posts for details on both the problems and the solution developed by Christopher Lee.

In part the solution relies upon customary hash tags but there is a step in a new direction.

From the second post:

These tags are public; that is, everyone can see what topics the paper has been tagged with, and who tagged them.

Experience has shown that hash tags are no more or less ambiguous than our use of natural language. But with the Selected Papers Network we have the hashtag and its author.

Authors themselves are inconsistent but most authors are trying to communicate and that requires the consistency expected by a particular audience.

Having an author also eases the task of assigning a particular tag, by a defined group of users, to a particular domain. And for that domain, creating an explicit semantic for the tag.

An explicit semantic that could be displayed for users of a tag in a domain, creating a feedback loop on the semantics of the tag.

Any number of syntax proposals have been made in efforts to induce users to author machine readable semantic annotations. All have failed.

Is tag + author enough to distinguish the semantics of tags?

Do we need to require any more of authors to indicate their semantics?
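One way to picture the tag + author idea: the author places a tag in a domain, and the domain carries the explicit semantic. The sketch below is only that, a picture. The authors, domains and definitions are made up, and nothing here reflects how the Selected Papers Network actually stores its tags.

```python
# Hypothetical: which community each tag author writes for.
author_domain = {"alice": "physics", "bob": "physics", "carol": "politics"}

# Hypothetical: an explicit semantic per (tag, domain), written down once
# and displayed to anyone using the tag in that domain.
semantics = {
    ("#spin", "physics"): "intrinsic angular momentum of a particle",
    ("#spin", "politics"): "favorable presentation of events",
}


def meaning(tag, author):
    """Resolve a tag through its author's domain; None if nothing is recorded."""
    domain = author_domain.get(author)
    return semantics.get((tag, domain))


print(meaning("#spin", "alice"))
print(meaning("#spin", "carol"))
print(meaning("#spin", "dave"))  # unknown author: the tag stays ambiguous
```

The author supplies nothing beyond the tag itself. Whether that is enough depends on how reliably an author can be assigned to a single domain, which is exactly the open question.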


June 12, 2013

Using Metadata to Find Paul Revere [In a Perfect World]

Filed under: Metadata,NSA,Security — Patrick Durusau @ 4:20 pm

Using Metadata to Find Paul Revere by Kieran Healy.

From the post:

London, 1772.

I have been asked by my superiors to give a brief demonstration of the surprising effectiveness of even the simplest techniques of the new-fangled Social Networke Analysis in the pursuit of those who would seek to undermine the liberty enjoyed by His Majesty’s subjects. This is in connection with the discussion of the role of “metadata” in certain recent events and the assurances of various respectable parties that the government was merely “sifting through this so-called metadata” and that the “information acquired does not include the content of any communications”. I will show how we can use this “metadata” to find key persons involved in terrorist groups operating within the Colonies at the present time. I shall also endeavour to show how these methods work in what might be called a relational manner.
(…)

An extremely well-written and highly imaginative example of social network analysis.

With one flaw, a fatal one I’m afraid.

What is the first thing you notice about the data? The very first thing?

It’s clean!

Clean data is almost unknown in the real world.

Think about the last time you got into an argument with your credit card company. Or with the credit report bureau. Or with anyone else who collects data.

Dirty data is just a fact of life.

In a perfect world, the one software vendors/contractors imagine, yes, perfect matches come up every time. Particularly when mapping across data sets.

Because in their perfect world, Paul Revere is never P. Revere (or Revoire), never has two birth dates, December 21, 1734 (Old Style) and January 1, 1735 (modern calendar), and is never a silversmith in one record and a dentist in another. (Paul Revere)

To list just a few of the possible confusions.
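A small sketch of how quickly the "perfect match" falls apart, using only the confusions listed above. The record layout and the hand-built table of name variants are my assumptions. The 11-day calendar offset is the Julian/Gregorian difference for the 18th century, and the conversion ignores the Old Style habit of starting the year on March 25.

```python
from datetime import date, timedelta


def old_style_to_new_style(d):
    """Convert an Old Style (Julian) date to the modern calendar.

    Only valid from 1 March 1700 to 28 February 1800 (Julian), when the
    two calendars differed by exactly 11 days.
    """
    if not date(1700, 3, 1) <= d <= date(1800, 2, 28):
        raise ValueError("11-day offset only holds for 1700-03-01 to 1800-02-28")
    return d + timedelta(days=11)


# Two records for the same man, as two different collectors might hold them.
records = [
    {"name": "Paul Revere", "born": date(1735, 1, 1), "calendar": "NS", "trade": "silversmith"},
    {"name": "P. Revere", "born": date(1734, 12, 21), "calendar": "OS", "trade": "dentist"},
]

# The vendor's perfect world: compare the raw fields.
exact_match = (
    records[0]["name"] == records[1]["name"]
    and records[0]["born"] == records[1]["born"]
)

# Assumed, hand-built table of known variants. Someone has to maintain it.
NAME_VARIANTS = {"p. revere": "paul revere", "paul revoire": "paul revere"}


def match_key(record):
    """Normalize the name and put the birth date on the modern calendar."""
    name = record["name"].strip().lower()
    name = NAME_VARIANTS.get(name, name)
    born = record["born"]
    if record["calendar"] == "OS":
        born = old_style_to_new_style(born)
    return (name, born)


print(exact_match)
print(match_key(records[0]) == match_key(records[1]))
```

The records only line up after someone has supplied the variant table and known which calendar each date was written in. And the trade field still disagrees.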

Unfortunately, our leadership accepts uncritically claims based on data cleanliness that no practicing DBA has ever seen in raw data.

Considering the $billions at stake for agencies and contractors, their motives are clear.

What is puzzling is why our leadership doesn’t make that connection.


May 16, 2013

Metadata Collection Strategies

Filed under: Data Collection,Metadata — Patrick Durusau @ 12:49 pm

Metadata Collection Strategies by Maish Nichani and Patrick Lambe.

From the post:

Metadata can be collected in many ways—from the information environment, work activities and from people. The problem arises when metadata that could be effectively collected from the environment is delegated to be collected from people. People who are in the middle of work tasks do not see direct benefits from completing numerous metadata fields. When coerced into doing unnatural things, they usually revolt or find workarounds thereby undermining the entire initiative.

In this article we share strategies to collect metadata that lower the reliance on people in supplying metadata. We cannot completely remove people from the equation but we can prevent them from doing additional work, and focus the role of people on the value added metadata that machines and environment cannot automatically supply.

Maish and Patrick suggest several places where metadata can be collected without asking users.

I would go a step further and create a topic template for collecting metadata.

For a blog, having collected the author and other information once, there really isn’t a reason to collect it for every post that appears.

The same would be true for journals, where a topic template could assist with creating domains for vocabulary usage.

For example, when searching for a genome, limiting the search to genomic research archives avoids part numbers and other overloading of a genome identifier.

Our machines don’t have to solve searching problems without human assistance. Particularly when a small assist can pay such high dividends in search results.
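A minimal sketch of both ideas, assuming metadata is kept as plain key/value pairs: a template that is filled in once and merged into every post, and a search that takes a domain as a small human assist. The function names and the identifier "AB123" are hypothetical. The template values are simply this blog's byline.

```python
# Collected once for the whole blog; values taken from this page's byline.
BLOG_TEMPLATE = {"author": "Patrick Durusau", "blog": "Another Word For It"}


def describe(post, template=BLOG_TEMPLATE):
    """Fill in template metadata; anything the post supplies itself wins."""
    return {**template, **post}


def search(documents, term, domain=None):
    """Return ids of documents containing term, optionally limited to one domain."""
    return [
        doc["id"]
        for doc in documents
        if term in doc["text"] and (domain is None or doc["domain"] == domain)
    ]


# The author only supplies what is new about this post.
post = describe({"title": "Metadata Collection Strategies", "date": "2013-05-16"})

# Hypothetical archive: "AB123" stands in for an overloaded identifier.
documents = [
    {"id": 1, "domain": "genomics", "text": "assembly notes for AB123"},
    {"id": 2, "domain": "auto parts", "text": "replacement gasket AB123"},
]

everything = search(documents, "AB123")
genomic_only = search(documents, "AB123", domain="genomics")
print(post)
print(everything, genomic_only)
```

The domain label costs the searcher one word and removes the part numbers entirely, which is the sort of dividend I have in mind.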


April 11, 2013

LODE-BD Recommendations 2.0

Filed under: Crosswalk,Linked Data,LOD,Metadata — Patrick Durusau @ 9:08 am

LODE-BD Recommendations 2.0

From the post:

LODE-BD aims to support the selection of appropriate encoding strategies for producing meaningful Linked Open Data (LOD)-enabled bibliographical data (directly or indirectly). The LODE-BD recommendations are applicable for structured data describing bibliographic resources such as articles, monographs, theses, conference papers, presentation materials, research reports, learning objects, etc. – in print or electronic format.

The core component of LODE-BD contains a set of recommended decision trees for common properties used in describing a bibliographic resource instance. Each decision tree is delivered with various acting points and the matching encoding suggestions. The full range of options presented by LODE-BD will enable data providers to make their choices according to their development stages, internal data structures, and the reality of their practices.

What's new in LODE-BD 2.0

Background information and references are moved into appendixes.

Metadata terms recommended by LODE-BD 2.0 are not limited to subject-specific domains. Agricultural-related namespaces and vocabularies are removed from the 2.0 version. LODE-BD now are appropriate for use by any data providers and repositories.

A road-map is added to guide the navigation of LODE-BD sections.

A crosswalk is added which maps the metadata terms used in the LODE-BD 2.0 with schema.org properties. It is attached as Appendix 4.

The post also breaks down the report into individual sections.

Of particular interest will be:

Appendix 4. Crosswalk of Metadata Term used in LODE-BD and schema.org terms

As with most crosswalks, the mapping does not enumerate the properties that compelled the crosswalk author to make the connections they did.

If it had, it would be easier to maintain and to merge with other crosswalks.
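To show what I mean, here is a sketch of a crosswalk entry that carries the properties its author relied on, plus a merge that only chains two crosswalks where those recorded reasons overlap. The mappings are illustrative and are not taken from Appendix 4. The "ex:" vocabulary and the property phrases are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    source: str
    target: str
    basis: frozenset  # the properties that justified making this connection


# Illustrative only: not the content of the LODE-BD crosswalk.
crosswalk_a = [
    Mapping("dcterms:title", "schema:name",
            frozenset({"names the resource", "literal value"})),
]

# A second, hypothetical crosswalk from schema.org to an "ex:" vocabulary.
crosswalk_b = [
    Mapping("schema:name", "ex:label",
            frozenset({"names the resource", "language tagged"})),
    Mapping("schema:name", "ex:sortKey",
            frozenset({"used for ordering"})),
]


def chain(first, second):
    """Compose a->b and b->c into a->c, but only when the recorded bases
    share at least one property; the shared properties become the new basis."""
    result = []
    for m1 in first:
        for m2 in second:
            shared = m1.basis & m2.basis
            if m1.target == m2.source and shared:
                result.append(Mapping(m1.source, m2.target, shared))
    return result


merged = chain(crosswalk_a, crosswalk_b)
print(merged)
```

Without the basis, the merge would have to accept both chained mappings or neither. With it, the weaker connection (title as sort key) is rejected for a stated reason, and the next maintainer can see why.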

