Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

November 17, 2018

Query Expansion Techniques for Information Retrieval: a Survey

Filed under: Query Expansion,Subject Identity,Subject Recognition,Topic Maps — Patrick Durusau @ 9:12 pm

Query Expansion Techniques for Information Retrieval: a Survey by Hiteshwar Kumar Azad, Akshay Deepak.

With the ever increasing size of the web, relevant information extraction on the Internet with a query formed by a few keywords has become a big challenge. To overcome this, query expansion (QE) plays a crucial role in improving the Internet searches, where the user’s initial query is reformulated to a new query by adding new meaningful terms with similar significance. QE — as part of information retrieval (IR) — has long attracted researchers’ attention. It has also become very influential in the field of personalized social documents, Question Answering over Linked Data (QALD), and Text Retrieval Conference (TREC) and REAL sets. This paper surveys QE techniques in IR from 1960 to 2017 with respect to core techniques, data sources used, weighting and ranking methodologies, user participation and applications (of QE techniques) — bringing out similarities and differences.

Another goodie for the upcoming holiday season. At forty-three (43) pages, published in 2017 and already in need of updating, it is a real joy for anyone interested in query expansion.
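
To make the core idea concrete, here is a minimal sketch in Python. The synonym table is invented for illustration; real QE systems mine and weight expansion terms from corpora, thesauri, relevance feedback or query logs, as the survey describes.

  # Minimal query expansion sketch. SYNONYMS is a hand-built stand-in for the
  # data sources (thesauri, corpora, user logs) that real systems draw on.
  SYNONYMS = {
      "car": ["automobile", "vehicle"],
      "cheap": ["inexpensive", "affordable"],
  }

  def expand_query(query):
      """Reformulate the query by adding terms with similar significance."""
      expanded = []
      for term in query.lower().split():
          expanded.append(term)
          expanded.extend(SYNONYMS.get(term, []))
      return " ".join(expanded)

  print(expand_query("cheap car rental"))
  # -> "cheap inexpensive affordable car automobile vehicle rental"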

Writing this post, I realized that something is missing in discussions of query expansion. It is assumed that end users are querying the data set and are called upon to evaluate the results.

What if we change that assumption to an expert user querying the data set and authoring filtered results for end users?

Instead of being presented with a topic map, no matter how clever its merging rules, the end user is presented with a curated information resource.

Granting that an expert may have used a topic map to produce the curated information resource, of what concern is that to the end user?

November 16, 2017

Shape Searching Dictionaries?

Facebook, despite its spying, censorship, and being a shill for the U.S. government, isn’t entirely without value.

For example, this post by Simon St. Laurent:

Drew this response from Peter Cooper:

Following the link in that response, Shapecatcher: Unicode Character Recognition, you find:

Draw something in the box!

And let shapecatcher help you to find the most similar unicode characters!

Currently, there are 11817 unicode character glyphs in the database. Japanese, Korean and Chinese characters are currently not supported.
(emphasis in original)

I take “Japanese, Korean and Chinese characters are currently not supported.” to mean that Anatolian Hieroglyphs; Cuneiform, Cuneiform Numbers and Punctuation, Early Dynastic Cuneiform, Old Persian, Ugaritic; Egyptian Hieroglyphs; Meroitic Cursive, and Meroitic Hieroglyphs are not supported either.

But my first thought wasn’t the discovery of glyphs in the Unicode code charts, useful as that is, but shape searching dictionaries, such as Faulkner’s A Concise Dictionary of Middle Egyptian.

A sample from Faulkner’s (1991 edition):

Or, The Student’s English-Sanskrit Dictionary by Vaman Shivram Apte (1893):

Imagine being able to search by shape in either dictionary! Not just for a single glyph but for a set of glyphs, within any entry!

I suspect that’s doable based on Benjamin Milde‘s explanation of Shapecatcher:


Under the hood, Shapecatcher uses so called “shape contexts” to find similarities between two shapes. Shape contexts, a robust mathematical way of describing the concept of similarity between shapes, is a feature descriptor first proposed by Serge Belongie and Jitendra Malik.

You can find an in-depth explanation of the shape context matching framework that I used in my Bachelor thesis (“On the Security of reCAPTCHA”). In the end, it is quite a bit different from the matching framework that Belongie and Malik proposed in 2000, but still based on the idea of shape contexts.

The engine that runs this site is a rewrite of what I developed during my bachelor thesis. To make things faster, I used CUDA to accelerate some portions of the framework. This is a fairly new technology that enables me to use my NVIDIA graphics card for general purpose computing. Newer cards are quite powerful devices!

That was written in 2011 and no doubt shape matching has progressed since then.
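
To make the shape context idea a bit more concrete, here is a minimal sketch in Python/NumPy of the descriptor itself: for each sampled contour point, a log-polar histogram of where the other points lie. This is only an illustration of the concept, not Shapecatcher’s code or the Belongie–Malik matching framework.

  import numpy as np

  def shape_contexts(points, n_radial_bins=5, n_angle_bins=12):
      """points: (N, 2) array of sampled contour coordinates.
      Returns an (N, n_radial_bins * n_angle_bins) array of descriptors."""
      points = np.asarray(points, dtype=float)
      n = len(points)
      diff = points[None, :, :] - points[:, None, :]   # pairwise offsets (N, N, 2)
      dist = np.hypot(diff[..., 0], diff[..., 1])      # pairwise distances
      angle = np.arctan2(diff[..., 1], diff[..., 0])   # pairwise angles, -pi..pi
      dist = dist / dist[dist > 0].mean()              # normalize for scale invariance
      log_r = np.log(dist + 1e-12)                     # log-polar radial coordinate
      r_edges = np.linspace(log_r[dist > 0].min(), log_r.max(), n_radial_bins + 1)
      a_edges = np.linspace(-np.pi, np.pi, n_angle_bins + 1)
      descriptors = np.zeros((n, n_radial_bins * n_angle_bins))
      for i in range(n):
          others = np.arange(n) != i                   # exclude the point itself
          hist, _, _ = np.histogram2d(log_r[i, others], angle[i, others],
                                      bins=[r_edges, a_edges])
          descriptors[i] = hist.ravel()
      return descriptors

  # Example: a crude "L" shape sampled at a few points
  pts = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [2, 0]])
  print(shape_contexts(pts).shape)   # (5, 60)

Matching two glyphs then reduces to comparing these histograms point by point, which is where the real work (and the CUDA acceleration Milde mentions) comes in.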

No technique will be 100% accurate, but even less-than-perfect accuracy will unlock generations of scholarly dictionaries, in ways not imagined by their creators.

If you are interested, I’m sure Benjamin Milde would love to hear from you.

February 13, 2016

You Can Confirm A Gravity Wave!

Filed under: Physics,Python,Science,Signal Processing,Subject Identity,Subject Recognition — Patrick Durusau @ 5:35 pm

Unless you have been unconscious since last Wednesday, you have heard about the confirmation of Einstein’s 1916 prediction of gravitational waves.

A very incomplete list of popular reports includes:

Einstein, A Hunch And Decades Of Work: How Scientists Found Gravitational Waves (NPR)

Einstein’s gravitational waves ‘seen’ from black holes (BBC)

Gravitational Waves Detected, Confirming Einstein’s Theory (NYT)

Gravitational waves: breakthrough discovery after a century of expectation (Guardian)

For the full monty, see the LIGO Scientific Collaboration itself.

Which brings us to the iPython notebook with the gravitational wave discovery data: Signal Processing with GW150914 Open Data

From the post:

Welcome! This ipython notebook (or associated python script GW150914_tutorial.py ) will go through some typical signal processing tasks on strain time-series data associated with the LIGO GW150914 data release from the LIGO Open Science Center (LOSC):

To begin, download the ipython notebook, readligo.py, and the data files listed below, into a directory / folder, then run it. Or you can run the python script GW150914_tutorial.py. You will need the python packages: numpy, scipy, matplotlib, h5py.

On Windows, or if you prefer, you can use a python development environment such as Anaconda (https://www.continuum.io/why-anaconda) or Enthought Canopy (https://www.enthought.com/products/canopy/).

Questions, comments, suggestions, corrections, etc: email losc@ligo.org

v20160208b

Unlike the toadies at the New England Journal of Medicine (Parasitic Re-use of Data? Institutionalizing Toadyism, Addressing The Concerns Of The Selfish), the scientists who have labored for decades on the gravitational wave question are giving their data away for free!

Not only giving the data away, but striving to help others learn to use it!

Beyond simply “doing the right thing,” and setting an example for other scientists, this is a great opportunity to learn more about signal processing.

Signal processing, when you stop to think about it, is an important method of “subject identification” in a large number of domains.

Detecting a gravity wave is beyond your personal means, but with the data freely available, further analysis is a matter of interest and perseverance.
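
To get started, something like the following sketch is enough to load a strain file and bandpass it into the band where the chirp lives. The HDF5 layout (“strain/Strain” with an “Xspacing” attribute) and the example file name are my assumptions about the LOSC downloads; the official notebook above is the authoritative version.

  import h5py
  import numpy as np
  from scipy.signal import butter, filtfilt

  def load_strain(path):
      """Read strain and sample rate from a LOSC-style HDF5 file (layout assumed)."""
      with h5py.File(path, "r") as f:
          strain = f["strain/Strain"][...]
          dt = f["strain/Strain"].attrs["Xspacing"]   # sample spacing in seconds
      fs = 1.0 / dt
      time = np.arange(len(strain)) * dt
      return time, strain, fs

  def bandpass(strain, fs, low=35.0, high=350.0, order=4):
      """Butterworth bandpass roughly covering the GW150914 chirp band."""
      nyq = 0.5 * fs
      b, a = butter(order, [low / nyq, high / nyq], btype="band")
      return filtfilt(b, a, strain)

  # Hypothetical usage (file name as distributed for the Hanford detector):
  # time, strain, fs = load_strain("H-H1_LOSC_4_V1-1126259446-32.hdf5")
  # filtered = bandpass(strain, fs)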

August 1, 2015

Things That Are Clear In Hindsight

Filed under: Social Sciences,Subject Identity,Subject Recognition — Patrick Durusau @ 12:10 pm

Sean Gallagher recently tweeted:

Oh look, the Triumphalism Trilogy is now a boxed set.


In case you are unfamiliar with the series, The Tipping Point, Blink, Outliers.

Although entertaining reads, particularly The Tipping Point (IMHO), Gladwell does not describe how to recognize a tipping point before it becomes one, how to make good decisions without thinking (Blink), or how to recognize human potential before success (Outliers).

Tipping points, good decisions and human potential can be recognized only when they are manifested.

As you can tell from Gladwell’s book sales, selling the hope of knowing the unknowable remains a viable market.

October 27, 2014

Data Modelling: The Thin Model [Entities with only identifiers]

Filed under: Data Models,Subject Identifiers,Subject Identity,Subject Recognition — Patrick Durusau @ 3:57 pm

Data Modelling: The Thin Model by Mark Needham.

From the post:

About a third of the way through Mastering Data Modeling the authors describe common data modelling mistakes and one in particular resonated with me – ‘Thin LDS, Lost Users‘.

LDS stands for ‘Logical Data Structure’ which is a diagram depicting what kinds of data some person or group wants to remember. In other words, a tool to help derive the conceptual model for our domain.

They describe the problem that a thin model can cause as follows:

[…] within 30 minutes [of the modelling session] the users were lost…we determined that the model was too thin. That is, many entities had just identifying descriptors.

While this is syntactically okay, when we revisited those entities asking, What else is memorable here? the users had lots to say.

When there was flesh on the bones, the uncertainty abated and the session took a positive course.

I found myself making the same mistake a couple of weeks ago during a graph modelling session. I tend to spend the majority of the time focused on the relationships between the bits of data and treat the metadata or attributes almost as an afterthought.

A good example of why subjects need multiple attributes, even multiple identifying attributes.

When sketching just a bare data model, the author, having prepared in advance, is conversant with the scant identifiers. The audience, on the other hand, is not. Additional attributes for each entity quickly remind the audience of the entity in question.
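
As a toy illustration (the entity and attribute names below are invented), the difference between a thin entity and one with identifying attributes is easy to see even in code:

  # A "thin" entity: syntactically fine, but nothing for an audience to recognize.
  thin_customer = {"id": "C-1047"}

  # The same entity with flesh on the bones: identifying attributes that remind
  # readers which real-world thing is meant.
  fleshed_customer = {
      "id": "C-1047",
      "name": "Acme Widgets Ltd.",
      "country": "UK",
      "industry": "manufacturing",
      "first_order_date": "2013-05-02",
  }

  def memorable_attributes(entity):
      """Everything beyond the bare identifier is what an audience can latch onto."""
      return {k: v for k, v in entity.items() if k != "id"}

  print(memorable_attributes(thin_customer))     # {} -- identifier only
  print(memorable_attributes(fleshed_customer))  # name, country, industry, ...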

Take this as anecdotal evidence that multiple attributes assist users in recognition of entities (aka subjects).

Will that impact how you identify subjects for your users?

August 27, 2014

You Say “Concepts” I Say “Subjects”

Researchers are cracking text analysis one dataset at a time by Derrick Harris.

From the post:

Google on Monday released the latest in a string of text datasets designed to make it easier for people outside its hallowed walls to build applications that can make sense of all the words surrounding them.

As explained in a blog post, the company analyzed the New York Times Annotated Corpus — a collection of millions of articles spanning 20 years, tagged for properties such as people, places and things mentioned — and created a dataset that ranks the salience (or relative importance) of every name mentioned in each one of those articles.

Essentially, the goal with the dataset is to give researchers a base understanding of which entities are important within particular pieces of content, an understanding that should then be complemented with background data sources that will provide even more information. So while the number of times a person or company is mentioned in an article can be a very strong sign of which words are important — especially when compared to the usual mention count for that word, one of the early methods for ranking search results — a more telling method of ranking importance would also leverage existing knowledge of broader concepts to capture important words that don’t stand out from a volume perspective.

A summary of some of the recent work on recognizing concepts in text and not just key words.

As topic mappers know, there is no universal one-to-one correspondence between words and subjects (“concepts” in this article). Finding “concepts” means that, whatever words triggered that recognition, we can supply other information known about the same concept.

It will certainly make topic map authoring easier when text analytics can generate occurrence data and decorate existing topic maps with their findings.
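
As a rough sketch of the salience idea in the quoted post, here is a tf-idf style scoring of entity mentions in Python. The entity names and counts are invented, and this is only an approximation of the intuition, not Google’s actual method.

  import math
  from collections import Counter

  def salience_scores(doc_entities, corpus_doc_freq, n_docs):
      """Score entities in one document by mention frequency, discounted by how
      commonly each entity appears across the whole corpus (tf-idf style)."""
      counts = Counter(doc_entities)
      total = sum(counts.values())
      scores = {}
      for entity, count in counts.items():
          tf = count / total
          idf = math.log(n_docs / (1 + corpus_doc_freq.get(entity, 0)))
          scores[entity] = tf * idf
      return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))

  # Invented example: the rarer entity can rank higher despite fewer mentions.
  print(salience_scores(
      ["Apple", "Apple", "Tim Cook", "iPhone"],
      corpus_doc_freq={"Apple": 5000, "Tim Cook": 800, "iPhone": 3000},
      n_docs=20000,
  ))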

April 2, 2013

Construction of Controlled Vocabularies

Filed under: Identity,Subject Identity,Subject Recognition,Vocabularies — Patrick Durusau @ 2:01 pm

Construction of Controlled Vocabularies: A Primer by Marcia Lei Zeng.

From the “why” page:

Vocabulary control is used to improve the effectiveness of information storage and retrieval systems, Web navigation systems, and other environments that seek to both identify and locate desired content via some sort of description using language. The primary purpose of vocabulary control is to achieve consistency in the description of content objects and to facilitate retrieval.

1.1 Need for Vocabulary Control

The need for vocabulary control arises from two basic features of natural language, namely:

• Two or more words or terms can be used to represent a single concept

Example:
salinity/saltiness
  VHF/Very High Frequency

• Two or more words that have the same spelling can represent different concepts

Example:
Mercury (planet)
  Mercury (metal)
  Mercury (automobile)
  Mercury (mythical being)

Great examples not only for vocabulary control but for topic maps as well!

The topic map question is:

What do you know about the subject(s) in either case that would make you say the words mean the same subject or different subjects?

If we can capture the information you think makes them represent the same or different subjects, there is a basis for repeating that comparison.

Perhaps even automatically.
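
A minimal sketch, with property names invented for illustration, of what capturing that information and repeating the comparison might look like:

  # Each use of the word "Mercury" carries the properties that distinguish it.
  mercury_planet = {"label": "Mercury", "kind": "planet", "orbits": "Sun"}
  mercury_metal  = {"label": "Mercury", "kind": "chemical element", "symbol": "Hg"}
  mercury_car    = {"label": "Mercury", "kind": "automobile brand", "parent": "Ford"}
  mercury_god    = {"label": "Mercury", "kind": "mythical being", "pantheon": "Roman"}

  def same_subject(a, b):
      """The shared word is not enough; the distinguishing properties must agree."""
      return a["label"] == b["label"] and a.get("kind") == b.get("kind")

  print(same_subject(mercury_planet, mercury_metal))   # False: planet vs. element
  print(same_subject(mercury_planet, mercury_planet))  # True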

Mary Jane pointed out this resource in a recent comment.

August 5, 2012

> 4,000 Ways to say “You’re OK” [Breast Cancer Diagnosis]

The feasibility of using natural language processing to extract clinical information from breast pathology reports by Julliette M Buckley, et al.

Abstract:

Objective: The opportunity to integrate clinical decision support systems into clinical practice is limited due to the lack of structured, machine readable data in the current format of the electronic health record. Natural language processing has been designed to convert free text into machine readable data. The aim of the current study was to ascertain the feasibility of using natural language processing to extract clinical information from >76,000 breast pathology reports.

Approach and Procedure: Breast pathology reports from three institutions were analyzed using natural language processing software (Clearforest, Waltham, MA) to extract information on a variety of pathologic diagnoses of interest. Data tables were created from the extracted information according to date of surgery, side of surgery, and medical record number. The variety of ways in which each diagnosis could be represented was recorded, as a means of demonstrating the complexity of machine interpretation of free text.

Results: There was widespread variation in how pathologists reported common pathologic diagnoses. We report, for example, 124 ways of saying invasive ductal carcinoma and 95 ways of saying invasive lobular carcinoma. There were >4000 ways of saying invasive ductal carcinoma was not present. Natural language processor sensitivity and specificity were 99.1% and 96.5% when compared to expert human coders.

Conclusion: We have demonstrated how a large body of free text medical information such as seen in breast pathology reports, can be converted to a machine readable format using natural language processing, and described the inherent complexities of the task.

The advantages of using current language practices include:

  • No new vocabulary needs to be developed.
  • No adoption curve for a new vocabulary.
  • No training required to introduce users to a new vocabulary.
  • Works with historical data.

and I am sure there are others.

Add natural language usage to your topic map for immediately useful results for your clients.
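
As a hedged sketch of what adding natural language usage can look like, here is a tiny Python normalizer that maps a few variant phrasings onto one canonical diagnosis. The variant phrases are illustrative only, not the 124 (or >4,000) forms found in the study.

  import re

  # Variant surface forms mapped to one canonical subject (illustrative only).
  CANONICAL = {
      "invasive ductal carcinoma": [
          r"\binvasive ductal carcinoma\b",
          r"\binfiltrating ductal carcinoma\b",
          r"\bidc\b",
      ],
      "invasive lobular carcinoma": [
          r"\binvasive lobular carcinoma\b",
          r"\binfiltrating lobular carcinoma\b",
          r"\bilc\b",
      ],
  }

  def normalize_diagnosis(report_text):
      """Return the canonical diagnoses whose variant phrases appear in the text."""
      text = report_text.lower()
      return {subject for subject, patterns in CANONICAL.items()
              if any(re.search(p, text) for p in patterns)}

  print(normalize_diagnosis("Final diagnosis: infiltrating ductal carcinoma, grade 2."))
  # {'invasive ductal carcinoma'}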

October 24, 2011

Subject Recognition: Discrete or Continuous

Filed under: Artificial Intelligence,Subject Recognition — Patrick Durusau @ 6:43 pm

While creating the entry for Fast Deep/Recurrent Nets for AGI Vision, I took particular note of the unbroken handwriting competitions. That task, for computer vision, is more difficult than “segmented” handwriting with breaks between the letters.

Are there parallels to subject recognition as performed by our computers versus ourselves?

That is, we record and use “discrete” values in the computers we use for subject recognition.

We as human observers report “discrete” values when asked about subject recognition, but in fact we recognize subjects along a non-discrete continuum of values.

I am interested in applying techniques similar to continuous handwriting recognition to subject recognition.

Comments?
