Archive for the ‘Names’ Category

Onomastics 2.0 – The Power of Social Co-Occurrences

Monday, March 11th, 2013

Onomastics 2.0 – The Power of Social Co-Occurrences by Folke Mitzlaff, Gerd Stumme.

Abstract:

Onomastics is “the science or study of the origin and forms of proper names of persons or places.” ["Onomastics". Merriam-Webster.com, 2013. this http URL (11 February 2013)]. Especially personal names play an important role in daily life, as all over the world future parents are facing the task of finding a suitable given name for their child. This choice is influenced by different factors, such as the social context, language, cultural background and, in particular, personal taste.

With the rise of the Social Web and its applications, users more and more interact digitally and participate in the creation of heterogeneous, distributed, collaborative data collections. These sources of data also reflect current and new naming trends as well as new emerging interrelations among names.

The present work shows, how basic approaches from the field of social network analysis and information retrieval can be applied for discovering relations among names, thus extending Onomastics by data mining techniques. The considered approach starts with building co-occurrence graphs relative to data from the Social Web, respectively for given names and city names. As a main result, correlations between semantically grounded similarities among names (e.g., geographical distance for city names) and structural graph based similarities are observed.

The discovered relations among given names are the foundation of “nameling” [this http URL], a search engine and academic research platform for given names which attracted more than 30,000 users within four months, underpinning the relevance of the proposed methodology.

Interesting work on the co-occurrence of names.
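To make the co-occurrence idea concrete, here is a minimal sketch in Python. It builds a weighted co-occurrence graph from a handful of made-up "documents" and compares two names by the cosine similarity of their neighborhoods. The toy data, the function names and the choice of cosine similarity are my assumptions for illustration; I am not claiming this is the authors' exact pipeline.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import sqrt


def build_cooccurrence(documents):
    """Count how often each pair of names appears in the same document."""
    graph = defaultdict(Counter)
    for names in documents:
        # Deduplicate so a name repeated in one document counts once.
        for a, b in combinations(sorted(set(names)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def cosine_similarity(graph, a, b):
    """Cosine similarity between the co-occurrence vectors of two names."""
    va, vb = graph[a], graph[b]
    dot = sum(weight * vb[name] for name, weight in va.items())
    norm_a = sqrt(sum(w * w for w in va.values()))
    norm_b = sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: each list holds the given names mentioned together.
documents = [
    ["Anna", "Emma", "Lukas"],
    ["Anna", "Emma", "Jonas"],
    ["Emma", "Lukas", "Jonas"],
    ["Pierre", "Marie"],
    ["Pierre", "Marie", "Anna"],
]

graph = build_cooccurrence(documents)
print("Anna/Emma:  ", cosine_similarity(graph, "Anna", "Emma"))
print("Anna/Pierre:", cosine_similarity(graph, "Anna", "Pierre"))
```

Names that keep the same company score higher, which is the "structural" similarity the abstract compares against semantically grounded similarities.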

These are chosen names, but I wonder whether the same would hold for false names.

Are there patterns to false names chosen by actors who are attempting to conceal their identities?

I first saw this in a tweet by Stefano Bertolo.

Naming and the Curse of Dimensionality

Wednesday, September 5th, 2012

Wikipedia introduces its article on the Curse of Dimensionality with:

In numerical analysis the curse of dimensionality refers to various phenomena that arise when analyzing and organizing high-dimensional spaces (often with hundreds or thousands of dimensions) that do not occur in low-dimensional settings such as the physical space commonly modeled with just three dimensions.

There are multiple phenomena referred to by this name in domains such as sampling, combinatorics, machine learning and data mining. The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data becomes sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. Also organizing and searching data often relies on detecting areas where objects form groups with similar properties; in high dimensional data however all objects appear to be sparse and dissimilar in many ways which prevents common data organization strategies from being efficient.

The term curse of dimensionality was coined by Richard E. Bellman when considering problems in dynamic optimization.[1][2]

The “curse of dimensionality” is often used as a blanket excuse for not dealing with high-dimensional data. However, the effects are not yet completely understood by the scientific community, and there is ongoing research. On one hand, the notion of intrinsic dimension refers to the fact that any low-dimensional data space can trivially be turned into a higher dimensional space by adding redundant (e.g. duplicate) or randomized dimensions, and in turn many high-dimensional data sets can be reduced to lower dimensional data without significant information loss. This is also reflected by the effectiveness of dimension reduction methods such as principal component analysis in many situations. For distance functions and nearest neighbor search, recent research also showed that data sets that exhibit the curse of dimensionality properties can still be processed unless there are too many irrelevant dimensions, while relevant dimensions can make some problems such as cluster analysis actually easier.[3][4] Secondly, methods such as Markov chain Monte Carlo or shared nearest neighbor methods[3] often work very well on data that were considered intractable by other methods due to high dimensionality.
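One of the effects described above, that in high dimensions all objects start to look about equally far apart, is easy to see in a small experiment. The sketch below is mine, not Wikipedia's: it draws uniform random points in a unit hypercube and reports the relative contrast (farthest minus nearest distance, divided by nearest) from a random query point. The point count, seed and dimensions are arbitrary assumptions.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for distances from one random
    query point to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

As the dimension grows the contrast shrinks, so "nearest neighbor" carries less and less information unless some of the dimensions are actually relevant.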

But dimensionality isn’t limited to numerical analysis. Nor is its reduction.

Think about the number of dimensions along which you have information about your significant other, friends or co-authors. Or any other subject, abstract or concrete, that you care to name.

However many dimensions you can name for any given subject, in human discourse we don’t refer to that dimensionality as the “curse of dimensionality.”

In fact, we don’t notice the dimensionality at all. Why?

We reduce all those dimensions into a name for the subject and that name is what we use in human discourse.

Dimensional reduction to names goes a long way to explaining why we get confused by names.

Another speaker has reduced a different set of dimensions (which are not shown as it were) to the same name that we use as the reduction of a different set of dimensions.

Sometimes the same name will expand into a different set of dimensions and sometimes different names expand into the same set of dimensions.

One of those dimensions being the context of usage, which when our expansion of the name doesn’t fit, prompts us to ask the speaker for one or more additional dimensions to identify the subject of discussion.

We do that effortlessly, reducing and expanding dimensions to and from names in the course of a conversation. Or when reading or writing.

The number of dimensions for any name increases as we know more about any given subject. Not to mention being impacted by our interaction with others who use the same name as we adjust, repair or change the dimensions we expand or reduce for any particular name.

Dimensionality isn’t a curse. The difficulties we associate with dimensionality and numeric analysis are a consequence of using an underpowered tool, that’s all.

Confusing Statistical Term #7: GLM

Saturday, August 11th, 2012

Confusing Statistical Term #7: GLM by Karen Grace-Martin.

From the post:

Like some of the other terms in our list–level and beta–GLM has two different meanings.

It’s a little different than the others, though, because it’s an abbreviation for two different terms:

General Linear Model and Generalized Linear Model.

It’s extra confusing because their names are so similar on top of having the same abbreviation.

And, oh yeah, Generalized Linear Models are an extension of General Linear Models.

And neither should be confused with Generalized Linear Mixed Models, abbreviated GLMM.

Naturally.

So what’s the difference? And does it really matter?

As you probably have guessed, yes.
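For readers who want the distinction in symbols before following the link, here is the standard textbook formulation (my summary, not taken from the post):

```latex
% General linear model: normal errors, mean modeled directly
y_i = x_i^\top \beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

% Generalized linear model: y_i from an exponential-family distribution,
% with a link function g connecting the mean to the linear predictor
g\bigl(\mathbb{E}[y_i]\bigr) = x_i^\top \beta
```

Choosing the identity link and a normal distribution in the second form recovers the first, which is the sense in which generalized linear models extend general linear models.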

You will need a reading knowledge of statistics to really appreciate the post. If you don’t have such knowledge, now would be a good time to pick it up.

Statistics are a way of summarizing information about subjects. You can rely on the judgements of others on such summaries or you can have your own.

Semantically Diverse Christenings

Sunday, April 29th, 2012

Mark Liberman in Neutral Xi_b^star, Xi(b)^{*0}, Ξb*0, whatever at Language Log reports semantically diverse christenings of the same new subatomic particle.

I count eight or nine distinct names in Liberman’s report.

How many do you see?

This is just days after its discovery at CERN.

Largely in the scientific literature. (It will get far worse if you include non-technical literature. Is non-technical literature/discussion relevant?)

Question for science librarians:

How many names for this new subatomic particle will you use in searches?

Bad Names, Renaming, …?

Wednesday, April 18th, 2012

David Loshin has a series of posts going at the Data Roundtable:

The Perils of Bad Names

and

The Impact of Data Element Renaming…

In “Bad Names,” David cites this example:

An example of this might be a column named “STREET_ADDRESS,” but that instead of that field holding a street number and name, it contains a set of flags indicating the types of customer correspondences that are to be sent to a home address instead of an email address. From one perspective, our assumption about what was stored in that field were mistaken, but on the other hand, conventional wisdom might have suggested otherwise.

I would agree that it at least looks like a bad name. Moreover, it’s one that is likely to trip up successors who have to deal with the data set.

David goes on to argue in “Renaming,” that finding and replacing all the uses of this name may lead to worse problems.

Ah, after thinking about it for a bit, I can see he has a point.

How about you?

scikits-image – Name Change

Tuesday, December 27th, 2011

scikits-image – Name Change.

Speaking of naming issues, do note that scikits-image has become skimage, although as of 27 December 2011, PyPI – The Python Package Index isn’t aware of the change.

On the other hand, a search for sklearn (the new name for scikit-learn) resolves to the current package name scikit-learn-0.9.tar.gz.

I will drop the administrators a note because the text shifts between the two names without explanation on sklearn.

I got clued in about the change at: http://pythonvision.org/blog/2011/December/skimage04.

So, how do we deal with all the prior uses of the “scikits-image” and “scikit-learn” identifiers that are about to be disconnected from the software they once named?

Eventually the package pages will be innocent of either one, save perhaps in increasingly old change logs.

Assume I run across a blog post or article that is two or three years old with an interesting technique that uses the old names. Other than by chance, how do I find the package under its new name? And if I do find it, how can I save other people from the same time investment and from depending on luck for the result?

To be sure, the package search mechanism puts me out at the right place but what if I am not expecting the resolution to another name? Will I think this is another package?
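One way to avoid that surprise is for the resolver to say that it followed a rename, not just where it ended up. The following is a minimal sketch under my own assumptions: the alias table is hypothetical and filled in only from the renames mentioned in this post, and `resolve` and `Resolution` are illustrative names, not part of any package index API.

```python
from dataclasses import dataclass

# Hypothetical alias table built from the renames mentioned in this post.
RENAMES = {
    "scikits-image": "skimage",
    "scikit-learn": "sklearn",
}


@dataclass
class Resolution:
    requested: str
    current: str
    history: list  # chain of names followed, oldest first

    @property
    def renamed(self):
        return self.requested != self.current


def resolve(name, renames=RENAMES):
    """Follow rename records until reaching a name with no newer alias."""
    history = [name]
    while history[-1] in renames:
        newer = renames[history[-1]]
        if newer in history:  # guard against a cyclic rename table
            raise ValueError(f"rename cycle at {newer!r}")
        history.append(newer)
    return Resolution(requested=name, current=history[-1], history=history)


result = resolve("scikits-image")
if result.renamed:
    chain = " -> ".join(result.history)
    print(f"{result.requested!r} is now {result.current!r} ({chain})")
```

Keeping the history in the result means the old identifier stays connected to the software it once named, and the reader can see that this is the same package and not another one.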

Indexing by Properties

Tuesday, March 1st, 2011

When I was researching the …grain of salt post I happened across the entry for sodium chloride at Wikipedia.

I don’t know how many times I have looked at Wikipedia pages but that day I noticed the headings in the sidebar that read:

IUPAC name (International Union of Pure and Applied Chemistry nomenclature)
Other names
Identifiers
Properties
Structure
Hazards
Related Compounds
Supplementary data page

Think about it for a minute.

Substances don’t arrive in labs (say, the fictional labs seen on CSI) with IUPAC names, other names, or even identifiers.

How are they identified? Can you say by their properties?

Now there is an odd disconnect between indexing and identification.

That is, indexing is by names and identifiers, both of which are known to be weak, rather than by properties.

Now there is an idea, an indexer that marshals properties for any index entry and can report why a particular entry was made.

We would not accept any less from a lab analysis. I wonder why we accept it from our indexers?
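Here is a rough sketch of what such an indexer could look like. Everything in it is an assumption for illustration: the `Entry` class, the property keys, the 1% numeric tolerance, and the sample values, which should be treated as sample data and not as reference data.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    properties: dict = field(default_factory=dict)


# Illustrative entries only; the values are sample data, not reference data.
INDEX = [
    Entry("sodium chloride",
          {"molar_mass_g_mol": 58.44, "melting_point_c": 801, "flame_color": "yellow"}),
    Entry("potassium chloride",
          {"molar_mass_g_mol": 74.55, "melting_point_c": 770, "flame_color": "violet"}),
]


def identify(observed, index=INDEX, tolerance=0.01):
    """Rank entries against observed properties and report why each matched."""
    results = []
    for entry in index:
        matched, conflicting = [], []
        for key, value in observed.items():
            if key not in entry.properties:
                continue  # the index knows nothing about this property
            expected = entry.properties[key]
            if isinstance(expected, (int, float)) and isinstance(value, (int, float)):
                ok = abs(expected - value) <= tolerance * abs(expected)
            else:
                ok = expected == value
            (matched if ok else conflicting).append(key)
        results.append((entry.name, matched, conflicting))
    # Fewest conflicts first, then most matches.
    results.sort(key=lambda r: (len(r[2]), -len(r[1])))
    return results


sample = {"melting_point_c": 800, "flame_color": "yellow"}
for name, matched, conflicting in identify(sample):
    print(name, "| matched:", matched, "| conflicting:", conflicting)
```

The name is only the label on the result. The identification, and the explanation of why the entry was chosen, comes from the properties.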

Subjects, other than substances, also have properties, including relationships to other subjects.

Identifiers and locators in topic maps are quick and convenient ways to navigate topic maps and the subjects represented therein.

We should not allow that convenience to blind us to the deeper complexity of reliable identification of subjects by their properties.

Indexing based upon more than names and identifiers looks like a largely unexplored landscape and one where topic maps could make an original contribution to the art of indexing.

Well, to be honest, topic maps would be making explicit what indexers have been doing for years. Which would make it even more valuable.

Indexing by Properties. Has a nice ring to it, doesn’t it?

Has a number of implications for semantic web technologies, but more on that anon.

LingPipe Baseline for MITRE Name Matching Challenge

Tuesday, February 22nd, 2011

LingPipe Baseline for MITRE Name Matching Challenge.

Bob Carpenter walks through the use of LingPipe in connection with the MITRE Name Matching Challenge.

There are many complex issues in data mining but doing well on basic tasks is always a good starting place.
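To give a feel for what a "basic task" baseline can look like, here is a small sketch of name matching by character trigram overlap. It does not use or reproduce LingPipe's API (LingPipe is a Java library); the functions, the padding scheme and the candidate list are my own assumptions, chosen because character n-grams are a common starting point for fuzzy name matching.

```python
from collections import Counter
from math import sqrt


def char_ngrams(name, n=3):
    """Counter of character n-grams over a lowercased, space-padded name."""
    padded = f" {' '.join(name.lower().split())} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_cosine(a, b, n=3):
    """Cosine similarity between the n-gram count vectors of two names."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * gb[gram] for gram, count in ga.items())
    norm = sqrt(sum(c * c for c in ga.values())) * sqrt(sum(c * c for c in gb.values()))
    return dot / norm if norm else 0.0


def best_match(query, candidates):
    """Return the candidate whose n-gram profile is closest to the query."""
    return max(candidates, key=lambda c: ngram_cosine(query, c))


# Hypothetical candidate list.
candidates = ["Jon Smith", "Joan Smythe", "Robert Carpenter"]
print(best_match("John Smith", candidates))
```

A real system would add weighting (for example TF/IDF over the n-grams) and an index so that not every candidate has to be scored, but even this much tolerates small spelling differences.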

Names, Identifiers, LOD, and the Semantic Web

Sunday, November 28th, 2010

I have been watching the identifier debate in the LOD community with its revisionists, personal accounts and other takes on what the problem is, if there is a problem and how to solve the problem if there is one.

I have a slightly different question: What happens when we have a name/identifier?

Short of being present when someone points to or touches an object, themselves, or you (if the TSA), and says a name or identifier, what happens?

Try this experiment. Take a sheet of paper and write: George W. Bush.

Now write 10 facts about George W. Bush.

Please circle the ones that you think must match to identify George W. Bush.

So, even though you knew the name George W. Bush, isn’t it fair to say that the circled facts are what you would use to identify George W. Bush?

Here’s the fun part: Get a colleague or co-worker to do the same experiment. (Substitute Lady Gaga if your friends don’t know enough facts about George W. Bush.)

Now compare several sets of answers for the same person.

Working from the same name, you most likely listed different facts and different ones you would use to identify that subject.

Even though most of you would agree that some or all of the facts listed go with that person.

It sounds like even though we use identifiers/names, those just clue us in on facts, some of which we use to make the identification.

That’s the problem, isn’t it?

A name or identifier can make us think of different facts (possibly identifying different subjects) and even if the same subject, we may use different facts to identify the subject.

Assuming we are at a set of facts (RDF graph, whatever) we need to know: What facts identify the subject?

And a subject may have different identifying properties, depending on the context of identification.
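As a sketch of what "identifying facts per context" might mean in practice, the snippet below stores facts as simple (subject, predicate, object) triples and declares, for each context, which predicates must agree before two nodes count as the same subject. The node IDs, predicate names, contexts and rules are all hypothetical, and plain tuples stand in for a real RDF library.

```python
# Facts as (subject, predicate, object) triples; IDs and predicates are hypothetical.
FACTS = {
    ("_:p1", "name", "George W. Bush"),
    ("_:p1", "office", "43rd President of the United States"),
    ("_:p1", "born", "1946"),
    ("_:p1", "father", "George H. W. Bush"),
}

OTHER = {
    ("_:x9", "name", "George Bush"),
    ("_:x9", "born", "1946"),
    ("_:x9", "father", "George H. W. Bush"),
}

# Hypothetical per-context rules: the predicates that must match to identify.
IDENTITY_RULES = {
    "politics": {"name", "office"},
    "genealogy": {"born", "father"},
}


def properties_of(facts, subject):
    """Collect predicate -> object for one subject (single-valued for simplicity)."""
    return {p: o for s, p, o in facts if s == subject}


def same_subject(facts_a, subj_a, facts_b, subj_b, context, rules=IDENTITY_RULES):
    """True if every identifying predicate for this context is present and equal."""
    required = rules[context]
    a = properties_of(facts_a, subj_a)
    b = properties_of(facts_b, subj_b)
    return all(p in a and p in b and a[p] == b[p] for p in required)


print("genealogy:", same_subject(FACTS, "_:p1", OTHER, "_:x9", "genealogy"))
print("politics: ", same_subject(FACTS, "_:p1", OTHER, "_:x9", "politics"))
```

The same two sets of facts merge in one context and stay apart in another, and the rule table is the part that would have to be published for others to know which facts were treated as essential.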

Questions:

  1. How to specify essential facts for identification as opposed to the extra ones?
  2. How to answer #1 for an RDF graph?
  3. How do you make others aware of your answer in #2?

Comments/suggestions?