Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 26, 2017

Alert! IAB workshop on Explicit Internet Naming Systems

Filed under: Names,WWW — Patrick Durusau @ 8:01 am

IAB workshop on Explicit Internet Naming Systems by Cindy Morgan.

From the post:

Internet namespaces rely on Internet connected systems sharing a common set of assumptions on the scope, method of resolution, and uniqueness of the names. That set of assumptions allowed the creation of URIs and other systems which presumed that you could authoritatively identify a service using an Internet name, a service port, and a set of locally-significant path elements.

There are now multiple challenges to maintaining that commonality of understanding.

  • Some naming systems wish to use URIs to identify both a service and the method of resolution used to map the name to a serving node. Because there is no common facility for varying the resolution method in the URI structure, those naming systems must either mint new URI schemes for each resolution service or infer the resolution method from a reserved name or pattern. Both methods are currently difficult and costly, and the effort thus scales poorly.
  • Users’ intentions to refer to specific names are now often expressed in voice input, gestures, and other methods which must be interpreted before being put into practice. The systems which carry on that interpretation often infer which intent a user is expressing, and thus what name is meant, by contextual elements. Those systems are linked to existing systems who have no access to that context and which may thus return results or create security expectations for an unintended name.
  • Unicode allows for both combining characters and composed characters when local language communities have different practices. When these do not have a single normalization, context is required to determine which to produce or assume in resolution. How can this context be maintained in Internet systems?

While any of these challenges could easily be the topic of a stand-alone effort, this workshop seeks to explore whether there is a common set of root problems in the explicitness of the resolution context, heuristic derivation of intent, or language matching. If so, it seeks to identify promising areas for the development of new, more explicit naming systems for the Internet.

We invite position papers on this topic to be submitted by July 28, 2017 to ename@iab.org. Decisions on accepted submissions will be made by August 11, 2017.

Proposed dates for the workshop are September 28th and 29th, 2017 and the proposed location is in the Pacific North West of North America. Finalized logistics will be announced prior to the deadline for submissions.

When I hear “naming” and “Internet” in the same sentence, I think of the line “Oh no, no, please God help me!” from Black Sabbath’s Black Sabbath:

https://youtu.be/qrVKmTPFYZ8?t=238

Well, except that the line needs to read:

Oh no, no, please God help us!

since any proposal is likely to impact users across the Internet.

The most frightening part of the call for proposals reads:

While any of these challenges could easily be the topic of a stand-alone effort, this workshop seeks to explore whether there is a common set of root problems in the explicitness of the resolution context, heuristic derivation of intent, or language matching. If so, it seeks to identify promising areas for the development of new, more explicit naming systems for the Internet.

Are we doing a clean reboot on the problem of naming? “…[A] common set of root problems….[?]”

Research on and design of “more” explicit naming systems for the Internet could result in proposals subject to metric evaluations. Looking for common “root problems” in naming systems is a recipe for navel gazing.
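The Unicode challenge in the call can be seen in miniature. The sketch below uses Python's standard unicodedata module and covers only the easy case, where a single normalization form does exist. The call is concerned with the harder cases where it does not and context has to decide.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# The two strings render identically but differ as raw code points.
raw_equal = composed == combining

# Normalizing both sides to the same form (NFC here) makes them comparable.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", combining))

print(raw_equal, nfc_equal)
```

A resolver that compares raw code points treats these as two different names; one that normalizes first treats them as the same name.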

May 4, 2016

No Label (read “name”) for Medical Error – Fear of Terror

Filed under: Names,Subject Identity,Topic Maps — Patrick Durusau @ 2:06 pm

Medical error is third biggest cause of death in the US, experts say by Amanda Holpuch.

From the post:

Medical error is the third leading cause of death in the US, accounting for 250,000 deaths every year, according to an analysis released on Tuesday.

There is no US system for coding these deaths, but Martin Makary and Michael Daniel, researchers at Johns Hopkins University’s school of medicine, used studies from 1999 onward to find that medical errors account for more than 9.5% of all fatalities in the US.

Only heart disease and cancer are more deadly, according to the Centers for Disease Control and Prevention (CDC).

The analysis, which was published in the British Medical Journal, said that the science behind medical errors would improve if data was shared internationally and nationally “in the same way as clinicians share research and innovation about coronary artery disease, melanoma, and influenza”.

But death by medical error is not captured by government reports because the US system for assigning a code to cause of death, the international classification of disease (ICD), does not have a label for medical error.

In contrast to topic maps, where you can talk about any subject you want, the international classification of disease (ICD) does not have a label for medical error.

Impact? Not having a label conceals approximately 250,000 deaths per year in the United States.

What if Fear of Terror press releases were broadcast along with “deaths due to medical error to date this year” as contextual information?

Medical errors result in approximately 685 deaths per day.

If you heard a report that 14 people were killed in the San Bernardino shootings of December 2, 2015, and the report also pointed out that approximately 230,160 people had died from medical errors in the year to date, which would you judge to be the more serious problem?
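The arithmetic behind those two figures can be checked in a few lines. This sketch assumes the 250,000 annual figure from the quoted article and a 365-day year; the variable names are only illustrative.

```python
from datetime import date

deaths_per_year = 250_000          # figure quoted from the article above
deaths_per_day = deaths_per_year / 365

# Day of the year for December 2, 2015 (January 1 counts as day 1).
day_of_year = date(2015, 12, 2).timetuple().tm_yday

# Round the daily rate to a whole number before multiplying.
deaths_to_date = round(deaths_per_day) * day_of_year

print(round(deaths_per_day), day_of_year, deaths_to_date)
# prints: 685 336 230160
```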

Lacking a label for medical error as cause of death prevents public discussion of the third leading cause of death in the United States.

Contrast that with the public discussion over the largely non-existent problem of terrorism in the United States.

February 26, 2016

Sticks and Stones: How Names Work & Why They Hurt

Filed under: Language,Names — Patrick Durusau @ 1:19 pm

Sticks and Stones (1): How Names Work & Why They Hurt by Michael Ramscar.

Sticks and Stones (2): How Names Work & Why They Hurt

Sticks and Stones (3): How Names Work & Why They Hurt

From part 1:

In 1781, Christian Wilhelm von Dohm, a civil servant, political writer and historian in what was then Prussia published a two volume work entitled Über die Bürgerliche Verbesserung der Juden (“On the Civic Improvement of Jews”). In it, von Dohm laid out the case for emancipation for a people systematically denied the rights granted to most other European citizens. At the heart of his treatise lay a simple observation: The universal principles of humanity and justice that framed the constitutions of the nation-states then establishing themselves across the continent could hardly be taken seriously until those principles were, in fact, applied universally. To all.

Von Dohm was inspired to write his treatise by his friend, the Jewish philosopher Moses Mendelssohn, who wisely supposed that even though basic and universal principles were involved, there were advantages to be gained in this context by having their implications articulated by a Christian. Mendelssohn’s wisdom is reflected in history: von Dohm’s treatise was widely circulated and praised, and is thought to have influenced the French National Assembly’s decision to emancipate Jews in France in 1791 (Mendelssohn was particularly concerned at the poor treatment of Jews in Alsace), as well as laying the groundwork for the an edict that was issued on behalf of the Prussian Government on the 11th of March 1812:

“We, Frederick William, King of Prussia by the Grace of God, etc. etc., having decided to establish a new constitution conforming to the public good of Jewish believers living in our kingdom, proclaim all the former laws and prescriptions not confirmed in this present edict to be abrogated.”

To gain the full rights due to a Prussian citizen, Jews were required to declare themselves to the police within six months of the promulgation of the edict. And following a proposal put forward in von Dohm’s treatise (and later approved by David Friedländer, another member of Mendelssohn’s circle who acted as a consultant in the drawing up of the edict), any Jews who wanted to take up full Prussian citizenship were further required to adopt a Prussian Nachname.

What we call in English, a ‘surname.’

From the vantage afforded by the present day, it is easy to assume that names as we now know them are an immutable part of human history. Since one’s name is ever-present in one’s own life, it might seem that fixed names are ever-present and universal, like mountains, or the sunrise. Yet in the Western world, the idea that everyone should have an official, hereditary identifier is a very recent one, and on examination, it turns out that the naming practices we take for granted in modern Western states are far from ancient.

A very deep dive into personal names across the centuries and the significance attached to them.

Not an easy read but definitely worth the time!

It may help you to understand why U.S.-centric name forms are so annoying to others.

November 18, 2015

Knowing the Name of Something vs. Knowing How To Identify Something

Filed under: Identification,Names,Subject Identity — Patrick Durusau @ 10:08 pm

Richard Feynman: The Difference Between Knowing the Name of Something and Knowing Something

From the post:


In this short clip (below), Feynman articulates the difference between knowing the name of something and understanding it.

See that bird? It’s a brown-throated thrush, but in Germany it’s called a halzenfugel, and in Chinese they call it a chung ling and even if you know all those names for it, you still know nothing about the bird. You only know something about people; what they call the bird. Now that thrush sings, and teaches its young to fly, and flies so many miles away during the summer across the country, and nobody knows how it finds its way.

Knowing the name of something doesn’t mean you understand it. We talk in fact-deficient, obfuscating generalities to cover up our lack of understanding.

You won’t get to see the Feynman quote live because it has been blocked by BBC Worldwide on copyright grounds. No doubt they make a bag full of money every week off that 179 second clip of Feynman.

The stronger point for Feynman would be to point out that you can’t recognize anything on the basis of knowing a name.

I may be sitting next to Cindy Lou Who on the bus but knowing her name isn’t going to help me to recognize her.

Knowing the name of someone or something isn’t useful unless you know something about the person or thing you associate with a name.

That is, you know when it is appropriate to use the name you have learned and when to say: “Sorry, I don’t know your name or the name of (indicating in some manner).” At which point you will learn a new name and store a new set of properties to know when to use that name, instead of any other name you know.

Everyone does that exercise, learning new names and the properties that establish when it is appropriate to use a particular name. And we do so seamlessly.

So seamlessly that when called upon to make explicit “how” we know which name to use, subject identification in other words, it takes a lot of effort.

It’s enough effort that it should be done only when necessary and when we can show the user an immediate semantic ROI for their effort.

More on this to follow.

November 16, 2015

Unpronounceable — why can’t people give bioinformatics tools sensible names?

Filed under: Bioinformatics,Humor,Names — Patrick Durusau @ 11:46 am

Unpronounceable — why can’t people give bioinformatics tools sensible names? by Keith Bradnam.

From the post:

Okay, so many of you know that I have a bit of an issue with bioinformatics tools with names that are formed from very tenuous acronyms or initialisms. I’ve handed out many JABBA awards for cases of ‘Just Another Bogus Bioinformatics Acronym’. But now there is another blight on the landscape of bioinformatics nomenclature…that of unpronounceable names.

If you develop bioinformatics tools, you would hopefully want to promote those tools to others. This could be in a formal publication, or at a conference presentation, or even over a cup of coffee with a colleague. In all of these situations, you would hope that the name of your bioinformatics tool should be memorable. One way of making it memorable is to make it pronounceable. Surely, that’s not asking that much? And yet…

The examples Keith recites are quite amusing and you can find more at the JABBA awards.

He also includes some helpful advice on naming:

There is a lot of bioinformatics software in this world. If you choose to add to this ever growing software catalog, then it will be in your interest to make your software easy to discover and easy to promote. For your own sake, and for the sake of any potential users of your software, I strongly urge you to ask yourself the following five questions:

  1. Is the name memorable?
  2. Does the name have one obvious pronunciation?
  3. Could I easily spell the name out to a journalist over the phone?
  4. Is the name of my database tool free from any needless mixed capitalization?
  5. Have I considered whether my software name is based on such a tenuous acronym or initialism that it will probably end up receiving a JABBA award?

To which I would add:

6. Have you searched the name in popular Internet search engines?

I read a fair amount of computer news and little is more annoying than to search for a new “name” only to find it has 10 million “hits.” Any hits relevant to the new usage are buried somewhere in the long set of results.

Two-word names do better, and three-word names better still. That is, if you want people to find your project, paper, or software.

If not, then by all means use one of the most popular child name lists. You will know where to find your work, but the rest of us won’t.

July 6, 2015

Which Functor Do You Mean?

Filed under: Homonymous,Names,Subject Identity — Patrick Durusau @ 8:34 pm

Peteris Krumins calls attention to the classic confusion of names that topic maps address in On Functors.

From the post:

It’s interesting how the term “functor” means completely different things in various programming languages. Take C++ for example. Everyone who has mastered C++ knows that you call a class that implements operator() a functor. Now take Standard ML. In ML functors are mappings from structures to structures. Now Haskell. In Haskell functors are just homomorphisms over containers. And in Prolog functor means the atom at the start of a structure. They all are different. Let’s take a closer look at each one.

Peteris has said twice in the first paragraph that each of these “functors” is different. Don’t rush to his 2010 post to point out they are different. That was the point of the post. Yes?

Exercise: All of these uses of functor could be scoped by language. What properties of each “functor” would you use to distinguish them besides their language of origin?
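As a starting point for the exercise, here is a minimal sketch of scoping one name by language. The descriptions come from the quoted paragraph; the field names ("name", "scope", "kind") and the lookup function are hypothetical, not part of any topic map standard.

```python
# Each entry pairs the shared name with a scope (the language) and one
# distinguishing property taken from the quoted paragraph.
functors = [
    {"name": "functor", "scope": "C++",
     "kind": "class that implements operator()"},
    {"name": "functor", "scope": "Standard ML",
     "kind": "mapping from structures to structures"},
    {"name": "functor", "scope": "Haskell",
     "kind": "homomorphism over containers"},
    {"name": "functor", "scope": "Prolog",
     "kind": "atom at the start of a structure"},
]

def lookup(name, scope=None):
    """Return every subject with this name, optionally narrowed by scope."""
    return [f for f in functors
            if f["name"] == name and (scope is None or f["scope"] == scope)]

print(len(lookup("functor")))                  # the name alone is ambiguous
print(lookup("functor", "Prolog")[0]["kind"])  # name plus scope picks one
```

Scope gets you to one subject here, but the exercise asks for more: properties that would still tell the four apart if the scope were missing.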

November 25, 2014

Falsehoods Programmers Believe About Names

Filed under: Names,Programming — Patrick Durusau @ 8:36 pm

Falsehoods Programmers Believe About Names by Patrick McKenzie.

From the post:

John Graham-Cumming wrote an article today complaining about how a computer system he was working with described his last name as having invalid characters. It of course does not, because anything someone tells you is their name is — by definition — an appropriate identifier for them. John was understandably vexed about this situation, and he has every right to be, because names are central to our identities, virtually by definition.

I have lived in Japan for several years, programming in a professional capacity, and I have broken many systems by the simple expedient of being introduced into them. (Most people call me Patrick McKenzie, but I’ll acknowledge as correct any of six different “full” names, and many systems I deal with will accept precisely none of them.) Similarly, I’ve worked with Big Freaking Enterprises which, by dint of doing business globally, have theoretically designed their systems to allow all names to work in them. I have never seen a computer system which handles names properly and doubt one exists, anywhere.

So, as a public service, I’m going to list assumptions your systems probably make about names. All of these assumptions are wrong. Try to make less of them next time you write a system which touches names.

McKenzie has an admittedly incomplete list of forty (40) myths for people’s names.

If there are that many for people’s names, I wonder what the count is for all other subjects?

Including things on the Internet of Things?
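To make the complaint concrete, here is a sketch of the kind of validator that produces “invalid characters” errors. The rule (“a name is one or more ASCII letters”) is a hypothetical example of a bad assumption, not code from any system mentioned in the post, and “José” is only an illustrative extra input.

```python
import re

# Hypothetical naive rule: a name is one or more ASCII letters.
NAIVE_NAME = re.compile(r"[A-Za-z]+")

def naive_is_valid(name):
    """Return True only if the whole name consists of ASCII letters."""
    return NAIVE_NAME.fullmatch(name) is not None

for name in ["McKenzie", "Graham-Cumming", "José"]:
    print(name, naive_is_valid(name))
```

Only the first name passes. The other two are perfectly good names that the rule rejects.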

I first saw this in a tweet by OnePaperPerDay.

September 10, 2014

What’s in a Name?

Filed under: Conferences,Names,Subject Identity — Patrick Durusau @ 10:56 am

What’s in a Name?

From the webpage:

What will be covered? The meeting will focus on the role of chemical nomenclature and terminology in open innovation and communication. A discussion of areas of nomenclature and terminology where there are fundamental issues, how computer software helps and hinders, the need for clarity and unambiguous definitions for application to software systems. How can you contribute? As well as the talks from expert speakers there will be plenty of opportunity for discussion and networking. A record will be made of the meeting, including the discussion, and will be made available initially to those attending the meeting. The detailed programme and names of speakers will be available closer to the date of the meeting.

Date: 21 October 2014

Event Subject(s): Industry & Technology

Venue

The Royal Society of Chemistry
Library
Burlington House
Piccadilly
London
W1J 0BA
United Kingdom

Contact for Event Information

Name: Prof Jeremy Frey

Address:
Chemistry
University of Southampton
United Kingdom

Email: j.g.frey@soton.ac.uk

Now there’s an event worth the hassle of overseas travel during these paranoid times! Alas, I will have to wait for the conference record to be released to non-attendees. The event is a good example of the work going on at the Royal Society of Chemistry.

I first saw this in a tweet by Open PHACTS.

July 18, 2014

Duplicate Tool Names

Filed under: Duplicates,Names — Patrick Durusau @ 9:25 am

You wait ages for somebody to develop a bioinformatics tool called ‘Kraken’ and then three come along at once by Keith Bradnam.

From the post:

So Kraken is either a universal genomic coordinate translator for comparative genomics, or a tool for ultrafast metagenomic sequence classification using exact alignments, or even a set of tools for quality control and analysis of high-throughput sequence data. The latter publication is from 2013, and the other two are from this year (2014).

Yet another illustration that names are not enough.

A URL identifier would not help unless you recognize the URL.

Identification with name/value plus other key/value pairs?

Leaves everyone free to choose whatever names they like.

It also enables the rest of us to tell apart tools (or other subjects) that share the same name.

Simple concept. Easy to apply. Disappoints people who want to be in charge of naming things.

Sounds like three good reasons to me, especially the last one.
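The name/value plus key/value idea can be sketched in a few lines. The descriptions and years come from the quoted post; the key names ("name", "year", "purpose") and the find function are hypothetical.

```python
# Three subjects that share one name but differ in their other properties.
tools = [
    {"name": "Kraken", "year": 2014,
     "purpose": "universal genomic coordinate translator for comparative genomics"},
    {"name": "Kraken", "year": 2014,
     "purpose": "ultrafast metagenomic sequence classification using exact alignments"},
    {"name": "Kraken", "year": 2013,
     "purpose": "quality control and analysis of high-throughput sequence data"},
]

def find(**properties):
    """Return the tools whose key/value pairs include all given properties."""
    return [t for t in tools
            if all(t.get(k) == v for k, v in properties.items())]

print(len(find(name="Kraken")))             # the name alone matches all three
print(len(find(name="Kraken", year=2014)))  # one extra pair narrows it to two
print(len(find(name="Kraken", year=2013)))  # here a single tool remains
```

Nobody had to rename anything. The extra pairs do the distinguishing.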

July 6, 2014

Finding needles in haystacks:…

Filed under: Bioinformatics,Biology,Names,Taxonomy — Patrick Durusau @ 4:54 pm

Finding needles in haystacks: linking scientific names, reference specimens and molecular data for Fungi by Conrad L. Schoch, et al. (Database (2014) 2014 : bau061 doi: 10.1093/database/bau061).

Abstract:

DNA phylogenetic comparisons have shown that morphology-based species recognition often underestimates fungal diversity. Therefore, the need for accurate DNA sequence data, tied to both correct taxonomic names and clearly annotated specimen data, has never been greater. Furthermore, the growing number of molecular ecology and microbiome projects using high-throughput sequencing require fast and effective methods for en masse species assignments. In this article, we focus on selecting and re-annotating a set of marker reference sequences that represent each currently accepted order of Fungi. The particular focus is on sequences from the internal transcribed spacer region in the nuclear ribosomal cistron, derived from type specimens and/or ex-type cultures. Re-annotated and verified sequences were deposited in a curated public database at the National Center for Biotechnology Information (NCBI), namely the RefSeq Targeted Loci (RTL) database, and will be visible during routine sequence similarity searches with NR_prefixed accession numbers. A set of standards and protocols is proposed to improve the data quality of new sequences, and we suggest how type and other reference sequences can be used to improve identification of Fungi.

Database URL: http://www.ncbi.nlm.nih.gov/bioproject/PRJNA177353

If you are interested in projects to update and correct existing databases, this is the article for you.

Fungi may not be on your regular reading list but consider one aspect of the problem described:

It is projected that there are ~400 000 fungal names already in existence. Although only 100 000 are accepted taxonomically, it still makes updates to the existing taxonomic structure a continuous task. It is also clear that these named fungi represent only a fraction of the estimated total, 1–6 million fungal species (93–95).

I would say that computer science isn’t the only discipline where “naming things” is hard.

You?

PS: The other lesson from this paper (and many others) is that semantic accuracy is not easy nor is it cheap. Anyone who says differently is lying.

May 6, 2014

The Strange Naming Conventions of Astronomy

Filed under: Astroinformatics,Names — Patrick Durusau @ 7:31 pm

The Strange Naming Conventions of Astronomy by Ben Montet.

From the post:

If you’ve spent time around the astronomical literature, you’ve probably heard at least one term that made you wonder “why did astronomers do that?” G-type stars, early/late type galaxies, magnitudes, population I/II stars, sodium “D” lines, and the various types of supernovae are all members of the large, proud family of astronomy terms that are seemingly backwards, unrelated to the underlying physics, or annoyingly complicated. While it may seem surprising now, the origins of these terms were logical at the time of their creation. Today, let’s look at the history of a couple of these terms, to figure out why astronomers did that.

Ben covers a couple of odd naming cases but has left thousands of others as an exercise for the reader!

Names that have been used in the astronomical literature for centuries.

The richness of names isn’t going away so long as we keep records of our past, whatever style of names, such as “cool URIs,” may come into or go out of fashion.

April 24, 2014

We have no “yellow curved fruit” today

Filed under: Humor,Names,Subject Identity — Patrick Durusau @ 8:18 pm

banana

Tweeted by Olivier Croisier with this comment:

Looks like naming things is hard not only in computer science…

Naming (read identity) problems are everywhere.

Our intellectual cocoons often prevent us from noticing such problems.

At least until something goes terribly wrong. Then the hunt is on for a scapegoat, not an explanation.

April 16, 2014

Regular expressions unleashed

Filed under: Names,Regexes,Topic Maps — Patrick Durusau @ 2:08 pm

Regular expressions unleashed by Hans-Juergen Schoenig.

From the post:

When cleaning up some old paperwork this weekend I stumbled over a very old tutorial. In fact, I have received this little handout during a UNIX course I attended voluntarily during my first year at university. It seems that those two days have really changed my life – the price tag: 100 Austrian Schillings which translates to something like 7 Euros in today’s money.

When looking at this old thing I noticed a nice example showing how to test regular expression support in grep. Over the years I had almost forgotten this little test. Here is the idea: There is no single way to print the name of Libya’s former dictator. According to this example there are around 30 ways to do it:…

Thirty (30) sounds a bit low to me but it’s sufficient to point out that mining all thirty (30) is going to give you a number of false positives when searching for news on the former dictator of Libya.
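A rough sketch of the false positive problem follows. The pattern covers only a handful of transliterations and is not the expression from the quoted tutorial; the headlines are made-up sample text.

```python
import re

# Illustrative pattern for a few spellings of the surname (an assumption,
# not the tutorial's regex).
SURNAME = re.compile(
    r"\b(?:Kh?|Gh?|Q)[au](?:dhdh|dh|dd|d|th|zz)a(?:ff|f)[iy]\b"
)

# Made-up sample headlines.
headlines = [
    "Muammar Gaddafi addresses the United Nations",
    "Moammar Qadhafi meets foreign ministers",
    "Saif al-Islam Gaddafi gives an interview",
    "Rebels advance on Tripoli",
]

hits = [h for h in headlines if SURNAME.search(h)]
print(len(hits))
```

The first three headlines match, and the third is about a different person with the same surname. The regex finds spellings; it cannot tell you who is meant.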

The regex to capture all thirty (30) variant forms in a PostgreSQL database is great but once you have it, now what?

Particularly if you have sorted out the dictator from the non-dictators and/or placed them in other categories.

Do you pass that sorting and classifying onto the next user or do you flush the knowledge toilet and all that hard work just drains away?

Learn regex the hard way

Filed under: Names,Regex,Regexes — Patrick Durusau @ 1:50 pm

Learn regex the hard way by Zed A. Shaw.

From the preface:

This is a rough in-progress dump of the book. The grammar will probably be bad, there will be sections missing, but you get to watch me write the book and see how I do things.

Finally, don’t forget that I have Learn Python The Hard Way, 2nd Edition (http://learnpythonthehardway.org) which you should read if you can’t code yet.

Exercises 1 – 16 have some content (out of 27) so it is incomplete but still a goodly amount of material.

Zed has other “hard way” titles on:

Regexes are useful in all contexts so you won’t regret learning or brushing up on them.

March 17, 2014

Peyote and the International Plant Names Index

Filed under: Agriculture,Data,Names,Open Access,Open Data,Science — Patrick Durusau @ 1:30 pm

International Plant Names Index

What a great resource to find as we near Spring!

From the webpage:

The International Plant Names Index (IPNI) is a database of the names and associated basic bibliographical details of seed plants, ferns and lycophytes. Its goal is to eliminate the need for repeated reference to primary sources for basic bibliographic information about plant names. The data are freely available and are gradually being standardized and checked. IPNI will be a dynamic resource, depending on direct contributions by all members of the botanical community.

I entered the first plant name that came to mind: Peyote.

No “hits.”

Wikipedia gives Peyote’s binomial name as: Lophophora williamsii (think synonym).*

Searching on Lophophora williamsii, I got three (3) “hits.”

Had I bothered to read the FAQ before searching:

10. Can I use IPNI to search by common (vernacular) name?

No. IPNI does not include vernacular names of plants as these are rarely formally published. If you are looking for information about a plant for which you only have a common name you may find the following resources useful. (Please note that these links are to external sites which are not maintained by IPNI)

I understand the need to specialize in one form of names but “formally published” means that without a useful synonyms list, the general public has an additional burden to access publicly funded research results.

Even with a synonym list there is an additional burden because you have to look up terms in the list, then read the text with that understanding and then back to the synonym list again.

What would dramatically increase public access to publicly funded research would be a specialized synonym list for publications that transposes the jargon in articles to selected sets of synonyms. The result would not be as precise or grammatical as the original, but it would allow the reading public to get a sense of even very technical research.
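A minimal sketch of such a synonym list, assuming a simple term-for-term substitution. Only the peyote entry comes from this post; the function names, the mechanism, and the sample sentence are hypothetical.

```python
import re

# Hypothetical synonym list: technical term -> reader-friendly term.
synonyms = {
    "Lophophora williamsii": "peyote",
}

def transpose(text, mapping):
    """Replace each technical term in text with its chosen synonym."""
    for term, plain in mapping.items():
        # The lambda keeps the replacement literal (no backslash escapes).
        text = re.sub(re.escape(term), lambda m: plain, text,
                      flags=re.IGNORECASE)
    return text

def to_technical(common_name, mapping):
    """Reverse lookup: find the formal names behind a common name."""
    return [term for term, plain in mapping.items()
            if plain.casefold() == common_name.casefold()]

print(transpose("Specimens of Lophophora williamsii were collected.", synonyms))
print(to_technical("Peyote", synonyms))
```

The same list works in both directions: it rewrites jargon for the reader, and it turns a vernacular search term into the formal name an index like IPNI expects.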

That could be a way to hitch topic maps to the access-to-publicly-funded-data bandwagon.

Thoughts?

I first saw this in a tweet by Bill Baker.

* A couple of other fun facts from Wikipedia on Peyote: 1. Its conservation status is listed as “apparently secure,” and 2. Wikipedia has photos of Peyote “in the wild.” I suppose saying “Peyote growing in a pot” would raise too many questions.

December 27, 2013

Naming Software?

Filed under: Names,Software — Patrick Durusau @ 4:20 pm

When you are naming software, please do not use UPPERCASE letters to distinguish your software from another name.

Why?

Because the income generating imitations of search engines regularize case, even if the terms are double quoted.

Thus, if I search for TWITter*, the first hit (drum roll) will be: “twitter.com”

Which, if I have gone to the trouble of double quoting the text, very likely isn’t what I am looking for.

Choose what you think is a good name for your software but if you want people to find it, don’t be clever with case as though it makes a difference.
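Why case tricks fail can be shown with a toy model. The normalization function and the one-entry index below are assumptions made for illustration; they do not describe how any real search engine works.

```python
def normalize_query(query):
    """Hypothetical search-side normalization: drop quotes, fold case."""
    return query.strip('"').casefold()

# Toy index keyed on case-folded names.
index = {"twitter": "twitter.com"}

print(normalize_query('"TWITter"') in index)
print(normalize_query('"TWITter"') == normalize_query("twitter"))
```

Once the case is folded, the clever capitalization is gone and the two names are one key.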

*TWITter: I don’t know if this is the name of a real project or not. If it is, my apologies.

July 10, 2013

Naming Conventions for Naming Things

Filed under: Names,Semantics — Patrick Durusau @ 3:36 pm

Naming Conventions for Naming Things by David Loshin.

From the post:

In a recent email exchange with a colleague, I have been discussing two aspects of metadata: naming conventions and taxonomies. Just as a reminder, “taxonomy” refers to the practice of organization and classification, and in this context it refers to the ways that concepts are defined and how the real-world things referred to by those concepts are logically grouped together. After pondering the email thread, which was in reference to documenting code lists and organizing the codes within particular classes, I was reminded of a selection from Lewis Carroll’s book Through the Looking Glass, at the point where the White Knight is leaving Alice in her continued journey to become a queen.

At that point, the White Knight proposes to sing Alice a song to comfort her as he leaves, and in this segment they discuss the song he plans to share:

Any of you who have been following the discussion of “default semantics” in the XTM group at LinkedIn should appreciate this post.

Your default semantics are very unlikely to be my default semantics.

What I find hard to believe is that prior different semantics are acknowledged in one breath and then a uniform semantic is proposed in the next.

Seems to me that prior semantic diversity is a good sign that today we have semantic diversity. A semantic diversity that will continue into an unlimited number of tomorrows.

Yes?

If so, shouldn’t we empower users to choose their own semantics? As opposed to ours?

June 12, 2013

How does name analysis work?

Filed under: Names,Natural Language Processing — Patrick Durusau @ 2:51 pm

How does name analysis work? by Pete Warden.

From the post:

Over the last few months, I’ve been doing a lot more work with name analysis, and I’ve made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else’s name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you’d picture an older white woman, whereas Juan Hernandez brings to mind an Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there) but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that’s been released by the US Census and the Social Security agency. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person’s characteristics, which varies for each property. I’ll give some estimates of how strong these relationships are below, and I link to some papers with more rigorous quantitative evaluations below.

Not 100% as Pete points out but an interesting starting point. Plus links to more formal analysis.
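To make the frequency idea concrete, here is a minimal sketch of guessing gender from name counts. It is not Pete's code: the counts are invented toy data, and `TOY_COUNTS` and `guess_gender` are hypothetical names of my own. A real analysis would load the Census and Social Security tables he describes.

```python
from collections import defaultdict

# Hypothetical toy counts: (first name, gender, birth year) -> number of births.
# These numbers are made up; real work would use the published tables.
TOY_COUNTS = {
    ("mildred", "F", 1920): 9000,
    ("mildred", "F", 1990): 100,
    ("mildred", "M", 1920): 30,
    ("jordan", "M", 1990): 5000,
    ("jordan", "F", 1990): 2500,
}

def guess_gender(name, counts=TOY_COUNTS):
    """Return (most likely gender, share of births with that gender) or None."""
    totals = defaultdict(int)
    for (n, gender, _year), births in counts.items():
        if n == name.lower():
            totals[gender] += births
    if not totals:
        return None  # name not in the table: no basis for a guess
    best = max(totals, key=totals.get)
    return best, totals[best] / sum(totals.values())
```

The share returned alongside the guess is the point: a name that splits two to one between genders is a much weaker signal than one that splits three hundred to one, which is the "how strong a correlation" caveat in the quoted text.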

March 11, 2013

Onomastics 2.0 – The Power of Social Co-Occurrences

Filed under: co-occurrence,Names,Onomastics,Subject Identity — Patrick Durusau @ 6:45 am

Onomastics 2.0 – The Power of Social Co-Occurrences by Folke Mitzlaff, Gerd Stumme.

Abstract:

Onomastics is “the science or study of the origin and forms of proper names of persons or places.” [“Onomastics”. Merriam-Webster.com, 2013. this http URL (11 February 2013)]. Especially personal names play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste.

With the rise of the Social Web and its applications, users more and more interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names.

The present work shows, how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, respectively for given names and city names. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph based similarities are observed.

The discovered relations among given names are the foundation of “nameling” [this http URL], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.

Interesting work on the co-occurrence of names.
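A rough sketch of what a co-occurrence graph for names might look like, assuming each "document" is simply a list of the names it mentions. The choice of cosine similarity is mine for illustration; the abstract only says "structural graph based similarities," so do not read this as the authors' method. The sample names are made up.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cooccurrence(documents):
    """Count how often each unordered pair of names shares a document."""
    graph = {}
    for names in documents:
        for a, b in combinations(sorted(set(names)), 2):
            graph.setdefault(a, Counter())[b] += 1
            graph.setdefault(b, Counter())[a] += 1
    return graph

def cosine(graph, a, b):
    """Cosine similarity between the co-occurrence vectors of two names."""
    va, vb = graph.get(a, Counter()), graph.get(b, Counter())
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Made-up sample: three documents and the names each one mentions.
docs = [["Anna", "Emma", "Paul"], ["Anna", "Emma"], ["Paul", "Karl"]]
graph = cooccurrence(docs)
```

Two names score as similar here when they keep the same company, not when they appear together, which is why the measure could in principle be applied to names that never co-occur.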

Chosen names in this case but I wonder if the same would be true for false names?

Are there patterns to false names chosen by actors who are attempting to conceal their identities?

I first saw this in a tweet by Stefano Bertolo.

September 5, 2012

Naming and the Curse of Dimensionality

Filed under: Dimension Reduction,Names — Patrick Durusau @ 3:23 pm

Wikipedia introduces its article on the Curse of Dimensionality with:

In numerical analysis the curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.

There are multiple phenomena referred to by this name in domains such as sampling, combinatorics, machine learning and data mining. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data however all objects appear to be sparse and dissimilar in many ways which prevents common data organization strategies from being efficient.

The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.[1][2]

The “curse of dimensionality” is often used as a blanket excuse for not dealing with high-dimensional data. However, the effects are not yet completely understood by the scientific community, and there is ongoing research. On one hand, the notion of intrinsic dimension refers to the fact that any low-dimensional data space can trivially be turned into a higher dimensional space by adding redundant (e.g. duplicate) or randomized dimensions, and in turn many high-dimensional data sets can be reduced to lower dimensional data without significant information loss. This is also reflected by the effectiveness of dimension reduction methods such as principal component analysis in many situations. For distance functions and nearest neighbor search, recent research also showed that data sets that exhibit the curse of dimensionality properties can still be processed unless there are too many irrelevant dimensions, while relevant dimensions can make some problems such as cluster analysis actually easier.[3][4] Secondly, methods such as Markov chain Monte Carlo or shared nearest neighbor methods[3] often work very well on data that were considered intractable by other methods due to high dimensionality.
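To put rough numbers on the sparsity the Wikipedia passage describes, here is a small sketch using two standard illustrations of my own choosing: the share of a unit hypercube that lies inside its inscribed ball, and the number of samples needed to keep a fixed number of points along every axis. The function names are hypothetical.

```python
from math import gamma, pi

def ball_fraction(dims):
    """Share of a unit hypercube's volume inside its inscribed ball (radius 0.5)."""
    return pi ** (dims / 2) * 0.5 ** dims / gamma(dims / 2 + 1)

def grid_points(per_axis, dims):
    """Samples needed to keep `per_axis` points along every dimension."""
    return per_axis ** dims

for d in (2, 3, 10):
    print(d, ball_fraction(d), grid_points(10, d))
```

The ball fraction shrinks quickly as dimensions are added, while the grid size grows exponentially. Both are ways of saying that a fixed amount of data covers less and less of the space.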

But dimensionality isn’t limited to numerical analysis. Nor is its reduction.

Think about the number of dimensions along which you have information about your significant other, friends or co-authors. Or any other subject, abstract or concrete, that you care to name.

However many dimensions you can name for any given subject, in human discourse we don’t refer to that dimensionality as the “curse of dimensionality.”

In fact, we don’t notice the dimensionality at all. Why?

We reduce all those dimensions into a name for the subject and that name is what we use in human discourse.

Dimensional reduction to names goes a long way to explaining why we get confused by names.

Another speaker has reduced a different set of dimensions (which are not shown as it were) to the same name that we use as the reduction of a different set of dimensions.

Sometimes the same name will expand into a different set of dimensions and sometimes different names expand into the same set of dimensions.

One of those dimensions being the context of usage, which when our expansion of the name doesn’t fit, prompts us to ask the speaker for one or more additional dimensions to identify the subject of discussion.
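A toy sketch of that mismatch, assuming each speaker's understanding of a name can be written down as a small set of dimension/value pairs. The speakers, the name and every value are made up for illustration, and `conflicts` is a hypothetical helper.

```python
# Each speaker's private expansion of the same name into dimensions.
# All values are invented for illustration.
ALICE = {"Mercury": {"kind": "planet", "orbits": "Sun"}}
BOB = {"Mercury": {"kind": "element", "symbol": "Hg"}}

def conflicts(name, mine, yours):
    """Dimensions both speakers hold for a name but with different values."""
    a, b = mine.get(name, {}), yours.get(name, {})
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}
```

A non-empty result is the programmatic version of stopping to ask "which Mercury do you mean?" An empty result proves little, since the two expansions may simply share no dimensions.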

We do that effortlessly, reducing and expanding dimensions to and from names in the course of a conversation. Or when reading or writing.

The number of dimensions for any name increases as we know more about any given subject. Not to mention being impacted by our interaction with others who use the same name as we adjust, repair or change the dimensions we expand or reduce for any particular name.

Dimensionality isn’t a curse. The difficulties we associate with dimensionality and numeric analysis are a consequence of using an underpowered tool, that’s all.

August 11, 2012

Confusing Statistical Term #7: GLM

Filed under: Names,Statistics — Patrick Durusau @ 3:43 pm

Confusing Statistical Term #7: GLM by Karen Grace-Martin.

From the post:

Like some of the other terms in our list–level and beta–GLM has two different meanings.

It’s a little different than the others, though, because it’s an abbreviation for two different terms:

General Linear Model and Generalized Linear Model.

It’s extra confusing because their names are so similar on top of having the same abbreviation.

And, oh yeah, Generalized Linear Models are an extension of General Linear Models.

And neither should be confused with Generalized Linear Mixed Models, abbreviated GLMM.

Naturally.

So what’s the difference? And does it really matter?

As you probably have guessed, yes.

You will need a reading knowledge of statistics to really appreciate the post. If you don’t have such knowledge, now would be a good time to pick it up.
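For readers who want the distinction in symbols, the usual textbook forms look roughly like this. The notation is mine, not Karen's: X is the matrix of predictors, β the coefficients and g a link function.

```latex
% General linear model: normally distributed errors, identity link
y = X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I)

% Generalized linear model: response drawn from an exponential-family
% distribution, with a link function g relating its mean to the predictors
g\bigl(\mathbb{E}[y \mid X]\bigr) = X\beta
```

Taking g to be the identity and the response to be normal recovers the first form, which is the sense in which one is an extension of the other.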

Statistics are a way of summarizing information about subjects. You can rely on the judgements of others on such summaries or you can have your own.

April 29, 2012

Semantically Diverse Christenings

Filed under: Identity,Names,Semantic Diversity — Patrick Durusau @ 12:09 pm

Mark Liberman in Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever at Language Log reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN.

Largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?
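One possible answer is to search on all of them at once. A minimal sketch, assuming a search interface that accepts quoted phrases joined by OR (an assumption, since syntax varies by system). The variants listed are only the three that appear in Liberman's title, and `or_query` is a hypothetical helper.

```python
def or_query(variants):
    """Join name variants into one quoted OR query, dropping duplicates."""
    seen = []
    for v in variants:
        v = v.strip()
        if v and v not in seen:
            seen.append(v)
    return " OR ".join(f'"{v}"' for v in seen)

# Only the spellings from the post title; the repeat is there to show de-duplication.
VARIANTS = ["Xi_b^star", "Xi(b)^{*0}", "Ξb*0", "Xi_b^star"]
print(or_query(VARIANTS))
```

The hard part is not building the query but knowing when the list of variants is complete, which is the question above.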

April 18, 2012

Bad Names, Renaming, …?

Filed under: Identifiers,Names — Patrick Durusau @ 6:06 pm

David Loshin has a series of posts going at the Data Roundtable:

The Perils of Bad Names

and

The Impact of Data Element Renaming…

In “Bad Names,” David cites this example:

An example of this might be a column named “STREET_ADDRESS,” but that instead of that field holding a street number and name, it contains a set of flags indicating the types of customer correspondences that are to be sent to a home address instead of an email address. From one perspective, our assumption about what was stored in that field were mistaken, but on the other hand, conventional wisdom might have suggested otherwise.

I would agree that at least looks like a bad name. Moreover, it’s one that is likely to trip up successors who have to deal with the data set.

David goes on to argue in “Renaming,” that finding and replacing all the uses of this name may lead to worse problems.

Ah, after thinking about it for a bit, I can see he has a point.

How about you?

December 27, 2011

scikits-image – Name Change

Filed under: Image Processing,Machine Learning,Names,Python — Patrick Durusau @ 7:13 pm

scikits-image – Name Change.

Speaking of naming issues, do note that scikits-image has become skimage, although as of 27 December 2011, PyPI – The Python Package Index isn’t aware of the change.

On the other hand, a search for sklearn (the new name for scikit-learn) resolves to the current package name scikit-learn-0.9.tar.gz.

I will drop the administrators a note because the text shifts between the two names without explanation on sklearn.

I got clued in about the change at: http://pythonvision.org/blog/2011/December/skimage04.

So, how do we deal with all the prior uses of the “scikits-image” and “scikit-learn” identifiers that are about to be disconnected from the software they once named?

Eventually the package pages will be innocent of either one, save perhaps in increasingly old change logs.

Assume I run across a blog post or article that is two or three years old with an interesting technique that uses the old names. Other than by chance, how do I find the package under its new name? And if I do find it, how can I save other people from the same investment of time and from depending on luck for the result?

To be sure, the package search mechanism puts me out at the right place but what if I am not expecting the resolution to another name? Will I think this is another package?
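One low-tech answer is to keep the old names around as explicit aliases and to report the path taken, so the reader is told that a rename happened. A minimal sketch: the two renames are the ones mentioned in this post, while the `RENAMES` table and `resolve` function are hypothetical and not part of any package index.

```python
# Old identifier -> new identifier, as described in this post.
RENAMES = {
    "scikits-image": "skimage",
    "scikit-learn": "sklearn",
}

def resolve(name, renames=RENAMES):
    """Follow renames to the current name and return the trail taken."""
    trail = [name]
    while trail[-1] in renames:
        nxt = renames[trail[-1]]
        if nxt in trail:  # guard against accidental rename cycles
            break
        trail.append(nxt)
    return trail[-1], trail
```

Returning the trail rather than just the final name is deliberate. It answers "is this another package?" by showing the old and new identifiers side by side.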

March 1, 2011

Indexing by Properties

Filed under: Identifiers,Names,Properties — Patrick Durusau @ 10:09 am

When I was researching the …grain of salt post I happened across the entry for sodium chloride at Wikipedia.

I don’t know how many times I have looked at Wikipedia pages but that day I noticed the headings in the sidebar that read:

IUPAC name (International Union of Pure and Applied Chemistry nomenclature)
Other names
Identifiers
Properties
Structure
Hazards
Related Compounds
Supplementary data page

Think about it for a minute.

Substances don’t arrive in labs (say, for example, the fictional labs seen on CSI) with IUPAC names, other names, or even identifiers.

How are they identified? Can you say by their properties?

Now there is an odd dis-connect between indexing and identification.

That is, indexing is by names and identifiers, both of which are known to be weak, rather than by properties.

Now there is an idea, an indexer that marshals properties for any index entry and can report why a particular entry was made.

We would not accept any less from a lab analysis, I wonder why we accept it from our indexers?
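What might such an indexer look like? A minimal sketch, assuming subjects are described by simple property/value pairs and that an entry is made when enough of them match. The reference properties, the threshold and every name here (`IndexEntry`, `KNOWN`, `index_sample`) are hypothetical and for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    heading: str
    evidence: dict = field(default_factory=dict)  # property -> observed value

    def why(self):
        """Explain which matching properties justified this entry."""
        return ", ".join(f"{k}={v}" for k, v in sorted(self.evidence.items()))

# Hypothetical reference properties; values are illustrative only.
KNOWN = {"sodium chloride": {"formula": "NaCl", "appearance": "white crystals"}}

def index_sample(observed, known=KNOWN, min_matches=2):
    """Create entries for subjects whose properties match the observation."""
    entries = []
    for heading, props in known.items():
        matched = {k: v for k, v in props.items() if observed.get(k) == v}
        if len(matched) >= min_matches:
            entries.append(IndexEntry(heading, matched))
    return entries
```

Every entry carries the properties that produced it, so calling `why()` gives the kind of account we would expect from a lab report.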

Subjects, other than substances, also have properties, including relationships to other subjects.

Identifiers and locators in topic maps are quick and convenient ways to navigate topic maps and the subjects represented therein.

We should not allow that convenience to blind us to the deeper complexity of reliable identification of subjects by their properties.

Indexing based upon more than names and identifiers looks like a largely unexplored landscape and one where topic maps could make an original contribution to the art of indexing.

Well, to be honest, topic maps would be making explicit what indexers have been doing for years. Which would make it even more valuable.

Indexing by Properties. Has a nice ring to it doesn’t it?

Has a number of implications for semantic web technologies, but more on that anon.

February 22, 2011

LingPipe Baseline for MITRE Name Matching Challenge

Filed under: LingPipe,Names — Patrick Durusau @ 1:36 pm

LingPipe Baseline for MITRE Name Matching Challenge.

Bob Carpenter walks through the use of LingPipe in connection with the MITRE Name Matching Challenge.

There are many complex issues in data mining but doing well on basic tasks is always a good starting place.

November 28, 2010

Names, Identifiers, LOD, and the Semantic Web

Filed under: LOD,Names,RDF,Semantic Web,Subject Identifiers — Patrick Durusau @ 5:28 pm

I have been watching the identifier debate in the LOD community with its revisionists, personal accounts and other takes on what the problem is, if there is a problem and how to solve the problem if there is one.

I have a slightly different question: What happens when we have a name/identifier?

Short of being present when someone points to or touches an object, themselves, or you (if the TSA) and says a name or identifier, what happens?

Try this experiment. Take a sheet of paper and write: George W. Bush.

Now write 10 facts about George W. Bush.

Please circle the ones that you think must match to identify George W. Bush.

So, even though you knew the name George W. Bush, isn’t it fair to say that the circled facts are what you would use to identify George W. Bush?

Here’s the fun part: Get a colleague or co-worker to do the same experiment. (Substitute Lady Gaga if your friends don’t know enough facts about George W. Bush.)

Now compare several sets of answers for the same person.

Working from the same name, you most likely listed different facts and different ones you would use to identify that subject.

Even though most of you would agree that some or all of the facts listed go with that person.

It sounds like even though we use identifiers/names, those just clue us in on facts, some of which we use to make the identification.

That’s the problem isn’t it?

A name or identifier can make us think of different facts (possibly identifying different subjects) and even if the same subject, we may use different facts to identify the subject.

Assuming we are at a set of facts (RDF graph, whatever) we need to know: What facts identify the subject?

And a subject may have different identifying properties, depending on the context of identification.
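A small sketch of context-dependent identification, assuming facts are plain key/value pairs and each context names the keys it treats as essential. The contexts, the choice of essential facts and the helper names are all hypothetical. This is not a proposal for how RDF should do it.

```python
# Facts about one subject; which ones are "identifying" depends on who is asking.
FACTS = {"name": "George W. Bush", "office": "43rd U.S. President", "born": 1946}

# Hypothetical contexts and the facts each one treats as essential.
IDENTIFYING = {
    "history class": {"office"},
    "passport check": {"name", "born"},
}

def identifies(candidate, context, facts=FACTS, rules=IDENTIFYING):
    """True if the candidate matches every fact this context treats as essential."""
    return all(candidate.get(k) == facts[k] for k in rules[context])
```

The same candidate description can succeed in one context and fail in another, which is the circled-facts experiment in miniature.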

Questions:

  1. How to specify essential facts for identification as opposed to the extra ones?
  2. How to answer #1 for an RDF graph?
  3. How do you make others aware of your answer in #2?

Comments/suggestions?
