Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

November 25, 2010

Sig.ma – Live views on the Web of Data

Filed under: Indexing,Information Retrieval,Lucene,Mapping,RDF,Search Engines,Semantic Web — Patrick Durusau @ 10:27 am

Sig.ma – Live views on the Web of Data

From the website:

In Sig.ma, elements such as large scale semantic web indexing, logic reasoning, data aggregation heuristics, pragmatic ontology alignments and, last but not least, user interaction and refinement, all play together to provide entity descriptions which become live, embeddable data mash ups.

Read one of the various versions of the article on Sig.ma for the technical details.

From the Web Technologies article cited on the homepage:

Sig.ma revolves around the creation of Entity Profiles. An entity profile – which in the Sig.ma dataflow is represented by the “data cache” storage (Fig. 3) – is a summary of an entity that is presented to the user in a visual interface, or which can be returned by the API as a rich JSON object or a RDF document. Entity profiles usually include information that is aggregated from more than one source. The basic structure of an entity profile is a set of key-value pairs that describe the entity. Entity profiles often refer to other entities, for example the profile of a person might refer to their publications.
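
To make that structure concrete, here is a minimal sketch of an entity profile as key-value pairs with per-value provenance. The field names, values and source URLs are invented for illustration, not Sig.ma’s actual output format.

    # A toy entity profile: key-value pairs aggregated from several
    # sources, with provenance kept per value. Names, values and URLs
    # are invented for illustration.
    entity_profile = {
        "name": [
            {"value": "Giovanni Tummarello", "source": "http://example.org/a.rdf"},
        ],
        "affiliation": [
            {"value": "DERI Galway", "source": "http://example.org/b.rdf"},
        ],
        "publication": [
            # a reference to another entity, as the article describes
            {"value": "http://example.org/paper/sigma", "source": "http://example.org/b.rdf"},
        ],
    }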

No, this isn’t an implementation of the TMRM.

This is an implementation of one way to view entities for a particular type of data. A very exciting one but still limited to a particular data set.

This is a big step forward.

For example, it isn’t hard to imagine entity profiles against particular websites or data sets. Entity profiles that are maintained and leased for use with search engines like Sig.ma.

Or going a bit further and declaring a basis for identification of subjects, such as the existence of properties a…n in an RDF graph.

Questions:

  1. Spend a couple of hours with Sig.ma researching library related questions. (Discussion)
  2. What did you like, dislike or find surprising about Sig.ma? (3-5 pages, no citations)
  3. Entity profiles for library science (Class project)

Sig.ma: Live Views on the web of data – bibliography issues

I normally start with a DOI here so you can see the article in question.

Not here.

Here’s why:

Sig.ma: Live views on the Web of Data. Journal of Web Semantics. (10 pages)

Sig.ma: Live Views on the Web of Data. WWW ’10 Proceedings. (demo, 4 pages)

Sig.ma: Live Views on the Web of Data (8 pages) http://richard.cyganiak.de/2008/papers/sigma-semwebchallenge2009.pdf

Sig.ma: Live Views on the Web of Data (4 pages) http://richard.cyganiak.de/2008/papers/sigma-demo-www2010.pdf

Sig.ma: Live Views on the Web of Data (25 pages) http://fooshed.net/paper/JWS2010.pdf

Before saying anything ugly, ;-), this is some of the most exciting research I have seen in a long time. I will cover that part of it in a following post. But, to the matter at hand, bibliographic control.

Five (5) different articles, two of them published in recognized venues, all with the same name? (The demo articles are the same but have different headers/footers and page numbers, and so would likely be indexed as different articles.)

I will be able to resolve any confusion by obtaining the article in question.

But that isn’t an excuse.

I, along with everyone else interested in this research, will waste a small part of our time resolving the confusion. Confusion that could have been avoided for everyone.

Not unlike everyone who does the same search having to tread the same google glut.

With no way to pass on what we have resolved, for the benefit of others.

Questions:

  1. Help these authors out. How would you suggest they avoid this in the future? Assume continued use of the name is important. (3-5 pages, no citations)
  2. Help the library out. How will you deal with multiple papers with the same title, authors, pub year? (this isn’t uncommon) (3-5 pages, citations optional)
  3. How would you use topic maps to resolve this issue? (3-5 pages, no citations)

The Genomics and Bioinformatics Group

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 7:16 am

The Genomics and Bioinformatics Group

From the website:

The GBG’s mission is to manage and assess molecular interaction data obtained through multiple platforms, increase the understanding of the effect of those interactions on the chemosensitivity of cancer, and create tools that will facilitate that process. Translation of that information will be directed towards the recognition of diagnostic and therapeutic cancer biomarkers, and directed cancer therapy.

If you are interested in bioinformatics and the type of tools currently in use, this is a good place to start.

Questions:

  1. Choose one of the tools. What subject identity test(s) are implicit in the tool? (3-5 pages, no citations)
  2. Can the results or data for the tool be easily mapped to professional literature? Why/Why not? (3-5 pages, no citations)
  3. How does the identity of results differ from the identity of data? If it does? (3-5 pages, no citations)

Virtuoso Open-Source Edition

Filed under: Linked Data,RDF,Semantic Web,Software — Patrick Durusau @ 7:06 am

Virtuoso Open-Source Edition

I ran across Virtuoso while running down the references in the article on SIREn. (Yes, I check references, not all of them, just the most interesting ones, as time permits.)

Has partial support for a variety of “Semantic Web” technologies.

Is the basis for OpenLink Data Spaces.

A named structured data cluster within a distributed data network where each item of data (each “datum”) has a unique identifier. Fundamental characteristics of data spaces include:

  • Each Data Item (or Entity) is endowed with a unique HTTP-based Identifier
  • Entity Identity, Access, and Representation are each distinct from the others
  • Entities are interlinked via attributes and relationship properties
  • Creation, Update, and Deletion privileges are controlled by the space owner

I can think of lots of “data spaces” that don’t fit this description: Large Hadron Collider data, radio and optical astronomy data dumps, TCP/IP data streams, bioinformatics data, commercial transaction databases. Please submit your own.

Still, if you want to learn the ins and outs as well as the limitations of this approach, it costs nothing more than the time to download the software.
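
As a rough illustration of the first two characteristics above (HTTP-based identity, with identity kept distinct from representation), consider this toy sketch in Python. The URIs and data are invented; this is not OpenLink’s API.

    # Toy "data space": every datum keyed by a unique HTTP-based
    # identifier, with representation negotiated separately from
    # identity. URIs and data are invented; this is not OpenLink's API.
    import json

    DATA_SPACE = {
        "http://example.org/space/datum/42": {"type": "Person", "name": "Ada"},
    }

    def dereference(uri, accept="application/json"):
        """Return a representation of the entity behind uri; the
        identifier stays the same no matter which format is asked for."""
        entity = DATA_SPACE.get(uri)
        if entity is None:
            return None
        if accept == "application/json":
            return json.dumps(entity)
        if accept == "text/plain":
            return "; ".join(f"{k}={v}" for k, v in entity.items())
        raise ValueError(f"unsupported format: {accept}")

    print(dereference("http://example.org/space/datum/42", "text/plain"))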

A Node Indexing Scheme for Web Entity Retrieval

Filed under: Entity Resolution,Full-Text Search,Indexing,Lucene,RDF,Topic Maps — Patrick Durusau @ 6:15 am

A Node Indexing Scheme for Web Entity Retrieval Author(s): Renaud Delbru, Nickolai Toupikov, Michele Catasta, Giovanni Tummarello Keywords: entity, entity search, full-text search, semi-structured queries, top-k query, node indexing, incremental index updates, entity retrieval system, RDF, RDFa, Microformats

Abstract:

Now motivated also by the partial support of major search engines, hundreds of millions of documents are being published on the web embedding semi-structured data in RDF, RDFa and Microformats. This scenario calls for novel information search systems which provide effective means of retrieving relevant semi-structured information. In this paper, we present an “entity retrieval system” designed to provide entity search capabilities over datasets as large as the entire Web of Data. Our system supports full-text search, semi-structural queries and top-k query results while exhibiting a concise index and efficient incremental updates. We advocate the use of a node indexing scheme and show that it offers a good compromise between query expressiveness, query processing time and update complexity in comparison to three other indexing techniques. We then demonstrate how such system can effectively answer queries over 10 billion triples on a single commodity machine.

Consider the requirements for this project:

  1. Support for the multiple formats which are used on the Web of Data;
  2. Support for searching an entity description given its characteristics (entity centric search);
  3. Support for context (provenance) of information: entity descriptions are given in the context of a website or a dataset;
  4. Support for semi-structural queries with full-text search, top-k query results, scalability over shard clusters of commodity machines, efficient caching strategy and incremental index maintenance. (emphasis added)

SIREn { Semantic Information Retrieval Engine }

Definitely a package to download, install and start to evaluate. More comments forthcoming.
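
In the meantime, a toy sketch may help fix the idea of entity-centric indexing: index each term together with the entity and attribute it occurs in, so that a semi-structural query becomes a postings lookup. This is only a cartoon; SIREn’s node indexing scheme is far more refined.

    from collections import defaultdict

    # postings: term -> set of (entity, attribute) pairs
    index = defaultdict(set)

    def add_entity(entity_id, description):
        for attribute, value in description.items():
            for term in value.lower().split():
                index[term].add((entity_id, attribute))

    def search(attribute, *terms):
        """Entities where all terms occur in the given attribute."""
        hits = [{e for (e, a) in index[t] if a == attribute} for t in terms]
        return set.intersection(*hits) if hits else set()

    add_entity("e1", {"author": "renaud delbru", "title": "node indexing scheme"})
    add_entity("e2", {"author": "nickolai toupikov", "title": "entity retrieval"})
    print(search("title", "entity"))  # {'e2'}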

Questions (more for topic map researchers):

  1. To what extent can “entity description” = properties of topics, associations, occurrences?
  2. Can XTM, etc., be regarded as “microformats” for the purposes of SIREn?
  3. To what extent does SIREn meet or exceed query requirements for XTM/TMDM based topic maps?
  4. Reports on use of SIREn by topic mappers?

November 24, 2010

TMRM and a “universal information space”

Filed under: Subject Identity,TMDM,TMRM,Topic Maps — Patrick Durusau @ 7:58 pm

As an editor of the TMRM (Topic Maps Reference Model) I feel compelled to point out the TMRM is not a universal information space.

I bring up the universal issue because someone recently mentioned mapping to the TMRM.

There is a lot to say about the TMRM but let’s start with the mapping issue.

There is no mapping to the TMRM. (full stop) The reason is that the TMRM is also not a data model. (full stop)

There is a simple reason why the TMRM was not, is not, nor ever will be a data model or universal information space.

There is no universal information space or data model.

Data models are an absolute necessity and more will be invented tomorrow.

But, to be a data model is to govern some larger or smaller slice of data.

We want to meaningfully access information across past, present and future data models in different information spaces.

Enter the TMRM, a model for disclosure of the subjects represented by a data model. Any data model, in any information space.

A model for disclosure, not a methodology, not a target, etc.

We used key and value because a key/value pair is the simplest expression of a property class: the key represents the definition of a class, the value an instance of that class.

That does not constrain or mandate any particular data model or information space.

Rather than mapping to the TMRM, we should say mapping using the principles of the TMRM.

I will say more in a later post, but for example, what subject does a topic represent?

With disclosure for the TMDM and RDF, we might not agree on the mapping, but it would be transparent. And useful.
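
A toy sketch of that point, with entirely invented keys and normalization rules: two data models carry the same subject under different keys, and disclosure makes the correspondence testable and transparent without forcing either model into a master model.

    # Two records from different data models, same subject, different keys.
    library_record = {"personalName": "Carroll, Lewis"}
    wiki_record = {"label": "Lewis Carroll"}

    # Disclosure: for each key, which subject its values identify and how
    # to normalize a value for comparison. Both are decisions made in the
    # open by the mapping's author, not dictated by the TMRM.
    disclosure = {
        "personalName": ("person", lambda v: frozenset(v.replace(",", "").lower().split())),
        "label": ("person", lambda v: frozenset(v.lower().split())),
    }

    def same_subject(k1, v1, k2, v2):
        s1, norm1 = disclosure[k1]
        s2, norm2 = disclosure[k2]
        return s1 == s2 and norm1(v1) == norm2(v2)

    print(same_subject("personalName", "Carroll, Lewis", "label", "Lewis Carroll"))  # True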

Text Visualization for Visual Text Analytics

Filed under: Authoring Topic Maps,Text Analytics,Visualization — Patrick Durusau @ 7:32 pm

Text Visualization for Visual Text Analytics Authors: John Risch, Anne Kao, Stephen R. Poteet and Y. J. Jason Wu

Abstract:

The term visual text analytics describes a class of information analysis techniques and processes that enable knowledge discovery via the use of interactive graphical representations of textual data. These techniques enable discovery and understanding via the recruitment of human visual pattern recognition and spatial reasoning capabilities. Visual text analytics is a subclass of visual data mining / visual analytics, which more generally encompasses analytical techniques that employ visualization of non-physically-based (or “abstract”) data of all types. Text visualization is a key component in visual text analytics. While the term “text visualization” has been used to describe a variety of methods for visualizing both structured and unstructured characteristics of text-based data, it is most closely associated with techniques for depicting the semantic characteristics of the free-text components of documents in large document collections. In contrast with text clustering techniques which serve only to partition text corpora into sets of related items, these so-called semantic mapping methods also typically strive to depict detailed inter- and intra-set similarity structure. Text analytics software typically couples semantic mapping techniques with additional visualization techniques to enable interactive comparison of semantic structure with other characteristics of the information, such as publication date or citation information. In this way, value can be derived from the material in the form of multidimensional relationship patterns existing among the discrete items in the collection. The ultimate goal of these techniques is to enable human understanding and reasoning about the contents of large and complexly related text collections.

Not the latest word in the area but a useful survey of the issues that arise in text visualization.

Text visualization is important for the creation of topic maps as well as the viewing of information discovered by use of a topic map.

Questions:

  1. Update the bibliography of this paper for the techniques discussed.
  2. Are there new text visualization techniques?
  3. How would you use the techniques in this paper or newer ones, for authoring topic maps? (3-5 pages, citations)

R-Bloggers

Filed under: Data Mining,R — Patrick Durusau @ 4:47 pm

R-Bloggers: R news contributed by (~130) R bloggers

R is a “software environment for statistical computing and graphics.” (so the project page says)

Used extensively in a number of fields for data mining, exploration and display.

R-Bloggers, as the name implies, is a blog site contributed to by a number of R users.

Questions:

  1. Bibliography of use of R in library projects.
  2. Use R for exploring data set for building a topic map. (Project)
  3. Bibliography of use of R in particular subject area.

IRODS

Filed under: Astroinformatics,Software,Space Data — Patrick Durusau @ 2:49 pm

iRODS: Data Grids, Digital Libraries, Persistent Archives, and Real-time Data Systems

From the website:

iRODS™, the Integrated Rule-Oriented Data System, is a data grid software system developed by the Data Intensive Cyber Environments research group (developers of the SRB, the Storage Resource Broker), and collaborators. The iRODS system is based on expertise gained through a decade of applying the SRB technology in support of Data Grids, Digital Libraries, Persistent Archives, and Real-time Data Systems. iRODS management policies (sets of assertions these communities make about their digital collections) are characterized in iRODS Rules and state information. At the iRODS core, a Rule Engine interprets the Rules to decide how the system is to respond to various requests and conditions. iRODS is open source under a BSD license. (emphasis in original)

Provides an umbrella over data sources to present a uniform view to users.

The rules and metadata don’t appear to be as granular as one expects with topic maps.

I mention it here because of its use/importance with space data and as a current research platform into sharing data.

Questions:

  1. Current and annotated bibliography for the project.
  2. What are the main strengths/weaknesses of this approach? (3-5 pages, citations)

Text Analysis with LingPipe 4. Draft 0.2

Filed under: Data Mining,Natural Language Processing,Text Analytics — Patrick Durusau @ 9:53 am

Text Analysis with LingPipe 4. Draft 0.2

Draft 0.2 is up to 363 pages.

Chapters:

  1. Getting Started
  2. Characters and Strings
  3. Regular Expressions
  4. Input and Output
  5. Handlers, Parsers, and Corpora
  6. Classifiers and Evaluation
  7. Naive Bayes Classifiers (not done)
  8. Tokenization
  9. Symbol Tables
  10. Sentence Boundary Detection (not done)
  11. Latent Dirichlet Allocation
  12. Singular Value Decomposition (not done)

Extensive annexes.

Projected to see another 1,000 or so pages. So the (not done) chapters will appear along with additional material in other chapters.

Readers welcome!

Christmas came early this year!

Questions:

  1. Class presentation demonstrating use of one of the techniques on library related data set.
  2. Compare and contrast two of the techniques on a library related data set. (Project)
  3. Annotated and updated bibliography for any chapter.

Update: Same questions as before but look at the updated version of the book (split into text processing and NLP as separate parts): LingPipe and Text Processing Books.

Strange Maps

Filed under: Mapping,Maps — Patrick Durusau @ 9:47 am

Strange Maps

From the website:

Frank Jacobs loves maps, but finds most atlases too predictable. He collects and comments on all kinds of intriguing maps—real, fictional, and what-if ones—and has been writing the Strange Maps blog since 2006, first on WordPress and now for Big Think.

I mention this because maps are often seen as depicting the way things are.

I prefer to think of maps, including maps of subjects, as useful for particular purposes.

That isn’t quite the same thing.

Questions:

  1. What basis for comparison/representation would you like to see used for states or counties? (discussion)
  2. What do you think that would show/demonstrate differently from standard maps? (discussion)
  3. Suggest data sources for creating such a representation. (3-5 pages, citations)

November 23, 2010

Dataists – Blog

Filed under: Data Mining — Patrick Durusau @ 7:21 pm

Dataists

A blog for data hackers.

Data mining is something that underlies every topic map of any size.

Impressive analysis of the Afghan War Diaries.

You may have heard about them and their being posted elsewhere.

Open Bibliographic Working Group

Filed under: Bibliography,British National Bibliography — Patrick Durusau @ 9:54 am

Open Bibliographic Working Group

The group responsible for processing the British National Bibliography.

That was under the JISC OpenBibliography project.

They have several other projects I need to mention here.

If you are interested in bibliographic data, this is one group to follow and if you are able, please contribute to their efforts.

ICCS’11: Conceptual Structures for Discovering Knowledge

Filed under: Conferences,Knowledge Management — Patrick Durusau @ 9:45 am

ICCS’11: Conceptual Structures for Discovering Knowledge 25th – 29th July, University of Derby, United Kingdom

From the announcement:

The 19th International Conference on Conceptual Structures (ICCS 2011) is the latest in a series of annual conferences that have been held in Europe, Asia, Australia, and North America since 1993. The focus of these conferences has been the representation and analysis of conceptual knowledge for research and business applications. ICCS brings together researchers in information technology, arts, humanities and social science to explore novel ways that conceptual structures can be employed in information systems.
….

ICCS 2011’s theme is “Conceptual Structures for Discovering Knowledge”. More and more data is being captured in electronic format (particularly through the Web) and it is emerging that this data is reaching such a critical mass that it is becoming the most recorded form of the world around us. It now represents our business, economic, arts, social, and scientific endeavours to such an extent that we require smart applications that can discover the hitherto hidden knowledge that this mass of data is busily capturing. By bringing together the way computers work with the way humans think, conceptual structures align the productivity of computer processing with the ingenuity of individuals and organisations in a meaningful digital future.

Important Dates:

  • Friday 14 January 2011 – a one page abstract submitted via the conference website (www.iccs.info). NB: Abstracts should clearly state the purpose, results and conclusions of the work to be described in the final paper.
  • Friday 21 January 2011 – full paper in PDF format submitted via the conference website (www.iccs.info)

BTW, the dates are correct, one week gap between abstracts and full papers. I checked with the conference organizers. They use the abstracts to plan allocation of papers to reviewers.

Querying the British National Bibliography

Filed under: British National Bibliography,Dataset,RDF,Semantic Web,SPARQL — Patrick Durusau @ 9:40 am

Querying the British National Bibliography

From the webpage:

Following up on the earlier announcement that the British Library has made the British National Bibliography available under a public domain dedication, the JISC Open Bibliography project has worked to make this data more useable.

The data has been loaded into a Virtuoso store that is queriable through the SPARQL Endpoint and the URIs that we have assigned each record use the ORDF software to make them dereferencable, supporting content auto-negotiation as well as embedding RDFa in the HTML representation.

The data contains some 3 million individual records and some 173 million triples. …

The data is also available for local processing but it isn’t much of a “web” if the first step is to always download a local copy of the data.

It should be interesting to watch for projects that combine the results of queries against this data with the results of other queries against other data sets. Particularly if those other data sets follow different metadata regimes.

Isn’t that the indexing problem all over again?
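
For readers who want to try the endpoint from code, a minimal sketch follows. The endpoint URL and the predicate are placeholders to be checked against the project’s documentation.

    # Minimal SPARQL client sketch (pip install SPARQLWrapper).
    # The endpoint URL and predicate are placeholders, not verified.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://example.org/bnb/sparql")  # placeholder
    sparql.setQuery("""
        SELECT ?title WHERE {
            ?record <http://purl.org/dc/terms/title> ?title .
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["title"]["value"])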

Questions:

  1. What data set would you want to combine with the British National Bibliography (BNB)?
  2. What issues do you see arising from combining the BNB with your data set? (3-5 pages, no citations)
  3. Combining the BNB with another data set. (project)

Mining Interesting Subgraphs….

Filed under: Data Mining,Sampling,Subgraphs — Patrick Durusau @ 7:00 am

Mining Interesting Subgraphs by Output Space Sampling

Mohammad Al Hasan’s dissertation was the winner of the SIGKDD Ph.D. Dissertation Award.

From the dissertation:

Output space sampling is an entire paradigm shift in frequent pattern mining (FPM) that holds enormous promise. While traditional FPM strives for completeness, OSS targets to obtain a few interesting samples. The definition of interestingness can be very generic, so user can sample patterns from different target distributions by choosing different interestingness functions. This is very beneficial as mined patterns are subject to subsequent use in various knowledge discovery tasks, like classification, clustering, outlier detection, etc. and the interestingness score of a pattern varies for various tasks. OSS can adapt to this requirement just by changing the interestingness function. OSS also solves pattern redundancy problem by finding samples that are very different from each other. Note that, pattern redundancy hurts any knowledge based system that builds metrics based on the structural similarity of the patterns.

Nice to see recognition that for some data sets we don’t need a full enumeration of all occurrences.

Something topic map advocates need to remember when proselytizing for topic maps.

The goal is not all the information known about a subject.

The goal is all the information a user wants about a subject.

Not the same thing.
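
The contrast is easy to see in a toy sketch with a made-up interestingness function: draw a few samples weighted by interestingness instead of enumerating every pattern. The dissertation samples by walking the pattern space without enumeration; this sketch enumerates a tiny space only so the weighted sampling is visible.

    import itertools, random
    from collections import Counter

    transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c"}]
    items = sorted(set().union(*transactions))

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    def interestingness(itemset):
        # Made-up score: favor larger itemsets with decent support.
        return len(itemset) * support(itemset)

    # Small enough to enumerate here; the point of output space
    # sampling is that real pattern spaces are not.
    patterns = [frozenset(c) for r in range(1, len(items) + 1)
                for c in itertools.combinations(items, r) if support(set(c)) > 0]
    weights = [interestingness(p) for p in patterns]
    samples = random.choices(patterns, weights=weights, k=5)
    print(Counter(samples).most_common())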

Questions:

  1. What criteria of “interestingness” would you apply in gathering data for easy access by your patrons? (3-5 pages, no citations)
  2. How would you use this technique for authoring and/or testing a topic map? (3-5 pages, no citations. Think of “testing” a topic map as checking its representativeness of a particular data collection.)
  3. Bibliography of material citing the paper or applying this technique.

November 22, 2010

Minimum Description Length (MDL)

Filed under: Data Mining,Minimum Description Length,Pattern Compression — Patrick Durusau @ 8:36 am

mdl-research.org

From the website:

The purpose of statistical modeling is to discover regularities in observed data. The success in finding such regularities can be measured by the length with which the data can be described. This is the rationale behind the Minimum Description Length (MDL) Principle introduced by Jorma Rissanen (Rissanen, 1978).

“The MDL Principle is a relatively recent method for inductive inference. The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally.” (Grünwald, 1998)

The website offers a reading list on MDL, demonstrations (with links to software), a list of researchers, related topics and upcoming conferences.
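
A worked toy example of the principle: choose between two models of a binary sequence by total description length, model bits plus data bits. The sequence and the crude 0.5·log2(n) parameter charge are illustrative only.

    import math

    data = "0001000100000100"  # mostly zeros

    def data_bits(seq, p_one):
        """Code length of seq under a Bernoulli(p_one) model, in bits."""
        return sum(-math.log2(p_one if c == "1" else 1 - p_one) for c in seq)

    n = len(data)
    # Model 1: fair coin, no parameter to transmit.
    fair = data_bits(data, 0.5)
    # Model 2: fitted coin; charge ~0.5*log2(n) bits for the parameter.
    p_hat = data.count("1") / n
    fitted = 0.5 * math.log2(n) + data_bits(data, p_hat)
    print(f"fair coin: {fair:.1f} bits, fitted coin: {fitted:.1f} bits")
    # MDL prefers the shorter total description (the fitted model here).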

Pattern Compression – 7 Magnitudes of Reduction

Filed under: Data Mining,Minimum Description Length,Pattern Compression — Patrick Durusau @ 8:14 am

Making Pattern Mining Useful.

Jilles Vreeken’s dissertation was a runner-up for the 2010 ACM SIGKDD Dissertation Award.

Vreeken proposes “compression” of data patterns on the basis of Minimum Description Length (MDL) (see The Minimum Description Length Principle) and KRIMP, “a heuristic parameter-free algorithm for finding the optimal set of frequent itemsets.” (SIGKDD, vol. 12, issue 1, page 76)

Readers should take note that experience indicates KRIMP achieves seven orders of magnitude of reduction in patterns. Let me say that again: KRIMP achieves seven orders of magnitude of reduction in patterns. In practice, not theory.

Vreeken’s homepage has other materials of interest on this topic.
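
To give a feel for how pattern compression works, here is a toy of the code-table idea: cover each transaction greedily with itemsets, then charge each code by its usage. KRIMP’s actual candidate order, cover strategy and length formulas are considerably more careful than this sketch.

    import math
    from collections import Counter

    transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"c"}]
    # Candidate code table: a few itemsets, larger first, singletons as fallback.
    code_table = [frozenset("abc"), frozenset("ab"),
                  frozenset("a"), frozenset("b"), frozenset("c")]

    def cover(t):
        """Greedily cover a transaction with code-table itemsets."""
        left, used = set(t), []
        for cs in code_table:
            if cs <= left:
                used.append(cs)
                left -= cs
        return used

    usage = Counter(cs for t in transactions for cs in cover(t))
    total = sum(usage.values())
    # Optimal prefix-code length for each code: -log2(usage/total).
    bits = sum(u * -math.log2(u / total) for u in usage.values())
    print(dict(usage), f"total: {bits:.1f} bits")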

Questions:

  1. Application of “minimum description length” in library science? (report for class)
  2. How would you apply “minimum description length” techniques in library science? (3-5 pages, citations)
  3. Introduction to “Minimum Description Length” for librarians (class presentation, examples relevant to librarians)

TxtAtlas

Filed under: Information Retrieval,Interface Research/Design,Mapping — Patrick Durusau @ 7:08 am

TxtAtlas

First noticed on Alex Popescu’s blog.

Text a phone number and then it appears as an entry on a map.

I have an uneasy feeling this may be important.

Not this particular application but the ease of putting content from dispersed correspondents together into a single map.

I wonder if instead of distance the correspondents could be dispersed over time? Say as users of a document archive?*

Questions:

  1. How would you apply these techniques to a document archive? (3-5 pages, no citations)
  2. How would you adapt the mapping of a document archive based on user response? (3-5 pages, no citations)
  3. Design an application of this technique for a document archive. (Project)

*Or for those seeking more real-time applications, imagine GPS coordinates + status updates from cellphones on a more detailed map. Useful for any number of purposes.

A Term Association Inference Model for Single Documents:….

Filed under: Data Mining,Document Classification,Information Retrieval,Summarization — Patrick Durusau @ 6:36 am

A Term Association Inference Model for Single Documents: A Stepping Stone for Investigation through Information Extraction Author(s): Sukanya Manna and Tom Gedeon Keywords: Information retrieval, investigation, Gain of Words, Gain of Sentences, term significance, summarization

Abstract:

In this paper, we propose a term association model which extracts significant terms as well as the important regions from a single document. This model is a basis for a systematic form of subjective data analysis which captures the notion of relatedness of different discourse structures considered in the document, without having a predefined knowledge-base. This is a paving stone for investigation or security purposes, where possible patterns need to be figured out from a witness statement or a few witness statements. This is unlikely to be possible in predictive data mining where the system can not work efficiently in the absence of existing patterns or large amount of data. This model overcomes the basic drawback of existing language models for choosing significant terms in single documents. We used a text summarization method to validate a part of this work and compare our term significance with a modified version of Salton’s [1].

Excellent work that illustrates how re-thinking of fundamental assumptions of data mining can lead to useful results.
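
The flavor of a corpus-free, single-document approach can be suggested in a few lines: score a term by its co-occurrence with other terms across the sentences of one document. This is only a cartoon; the paper’s Gain of Words and Gain of Sentences measures are defined in the article itself.

    from collections import defaultdict
    from itertools import combinations

    document = [
        "the witness saw the car",
        "the car left the scene",
        "the witness described the scene",
    ]

    cooc = defaultdict(int)  # sentence-level term co-occurrence counts
    for sentence in document:
        terms = set(sentence.split()) - {"the"}  # toy stopword list
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1

    # A term's significance here: total co-occurrence with other terms.
    score = defaultdict(int)
    for (a, b), n in cooc.items():
        score[a] += n
        score[b] += n
    print(sorted(score.items(), key=lambda kv: -kv[1]))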

Questions:

  1. Create an annotated bibliography of citations to this article.
  2. Citations of items in the bibliography since this paper (2008)? List and annotate.
  3. How would you use this approach with a document archive project? (3-5 pages, no citations)

A Fun Application of Compact Data Structures to Indexing Geographic Data

Filed under: Geographic Information Retrieval,Indexing,Spatial Index — Patrick Durusau @ 6:07 am

A Fun Application of Compact Data Structures to Indexing Geographic Data Author(s): Nieves R. Brisaboa, Miguel R. Luaces, Gonzalo Navarro, Diego Seco Keywords: geographic data, MBR, range query, wavelet tree

Abstract:

The way memory hierarchy has evolved in recent decades has opened new challenges in the development of indexing structures in general and spatial access methods in particular. In this paper we propose an original approach to represent geographic data based on compact data structures used in other fields such as text or image compression. A wavelet tree-based structure allows us to represent minimum bounding rectangles solving geographic range queries in logarithmic time. A comparison with classical spatial indexes, such as the R-tree, shows that our structure can be considered as a fun, yet seriously competitive, alternative to these classical approaches.

I must confess that after reading this article more than once, I still puzzle over: “Our experiments, featuring GIS-like scenarios, show that our index is a relevant and funnier alternative to classical spatial indexes, such as the R-tree ….”

I admit to being drawn to esoteric and even odd solutions but I would not describe most of them as being “funnier” than an R-tree.

For all that, the article will be useful to anyone developing topic maps for use with spatial indexes and is a good introduction to wavelet trees.
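
For readers new to wavelet trees, a minimal sketch of the structure itself (over a small integer alphabet, supporting rank queries) may help. The paper’s MBR encoding builds on this kind of rank machinery.

    class WaveletTree:
        """Minimal wavelet tree over integers in [lo, hi], rank only."""
        def __init__(self, seq, lo=None, hi=None):
            if lo is None:
                lo, hi = min(seq), max(seq)
            self.lo, self.hi = lo, hi
            if lo == hi:
                return  # leaf: nothing more to store for rank
            mid = (lo + hi) // 2
            # prefix[i] = how many of seq[:i] went to the left child
            self.prefix = [0]
            for x in seq:
                self.prefix.append(self.prefix[-1] + (x <= mid))
            self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
            self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

        def rank(self, c, i):
            """Number of occurrences of symbol c in seq[:i]."""
            if not (self.lo <= c <= self.hi):
                return 0
            if self.lo == self.hi:
                return i
            mid = (self.lo + self.hi) // 2
            if c <= mid:
                return self.left.rank(c, self.prefix[i])
            return self.right.rank(c, i - self.prefix[i])

    wt = WaveletTree([3, 1, 4, 1, 5, 2, 3, 1])
    print(wt.rank(1, 5))  # 1s among the first five symbols -> 2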

Questions:

  1. Create an annotated bibliography of spatial indexes. (date limit, last five (5) years)
  2. Create an annotated bibliography of spatial data resources. (date limit, last five (5) years)
  3. How would you use MBRs (Minimum Bounding Rectangles) for merging purposes in a topic map? (3-5 pages, no citations)

November 21, 2010

Measuring the meaning of words in contexts:…

Filed under: Ambiguity,Co-Words,Collocation,Diaphors,Metaphors,Semantics — Patrick Durusau @ 11:30 am

Measuring the meaning of words in contexts: An automated analysis of controversies about ‘Monarch butterflies,’ ‘Frankenfoods,’ and ‘stem cells’ Author(s): Loet Leydesdorff and Iina Hellsten Keywords: co-words, metaphors, diaphors, context, meaning

Abstract:

Co-words have been considered as carriers of meaning across different domains in studies of science, technology, and society. Words and co-words, however, obtain meaning in sentences, and sentences obtain meaning in their contexts of use. At the science/society interface, words can be expected to have different meanings: the codes of communication that provide meaning to words differ on the varying sides of the interface. Furthermore, meanings and interfaces may change over time. Given this structuring of meaning across interfaces and over time, we distinguish between metaphors and diaphors as reflexive mechanisms that facilitate the translation between contexts. Our empirical focus is on three recent scientific controversies: Monarch butterflies, Frankenfoods, and stem-cell therapies. This study explores new avenues that relate the study of co-word analysis in context with the sociological quest for the analysis and processing of meaning.

Excellent article on shifts of word meaning over time. Reports sufficient detail on methodology that interested readers will be able to duplicate or extend the research reported here.
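
One simple way to operationalize meaning shift in the co-word spirit: compare a word’s co-occurrence profile across two time slices. The corpora here are toys; the authors’ methodology is considerably richer.

    import math
    from collections import Counter

    def context_vector(sentences, word):
        """Co-occurrence profile of word over a set of sentences."""
        ctx = Counter()
        for s in sentences:
            tokens = s.split()
            if word in tokens:
                ctx.update(t for t in tokens if t != word)
        return ctx

    def cosine(u, v):
        dot = sum(u[t] * v[t] for t in set(u) & set(v))
        nu = math.sqrt(sum(c * c for c in u.values()))
        nv = math.sqrt(sum(c * c for c in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    period1 = ["stem cells promise new therapies", "stem cells research expands"]
    period2 = ["stem cells controversy grows", "ethics of stem cells debated"]
    v1 = context_vector(period1, "cells")
    v2 = context_vector(period2, "cells")
    print(f"contextual stability of 'cells': {cosine(v1, v2):.2f}")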

Questions:

  1. Annotated bibliography of research citing this paper.
  2. Design a study of the shifting meaning of a 2 or 3 terms. What texts would you select? (3-5 pages, with citations)
  3. Perform a study of shifting meaning of terms in library science. (Project)

Text Analysis Conference (TAC)

Filed under: Conferences,Knowledge Base Population,Summarization,Textual Entailment — Patrick Durusau @ 11:01 am

Text Analysis Conference (TAC)

From the website:

The Text Analysis Conference (TAC) is a series of evaluation workshops organized to encourage research in Natural Language Processing and related applications, by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. TAC comprises sets of tasks known as “tracks,” each of which focuses on a particular subproblem of NLP. TAC tracks focus on end-user tasks, but also include component evaluations situated within the context of end-user tasks.

  • Knowledge Base Population

    The goal of the Knowledge Base Population track is to develop systems that can augment an existing knowledge representation (based on Wikipedia infoboxes) with information about entities that is discovered from a collection of documents.

  • Recognizing Textual Entailment

    The goal of the RTE Track is to develop systems that recognize when one piece of text entails another.

  • Summarization

    The goal of the Summarization Track is to develop systems that produce short, coherent summaries of text.

Sponsored by the U.S. Department of Defense.

Rumor has it that one intelligence analysis group won a DoD contract without hiring an ex-general. If you get noticed by a prime contractor here, perhaps you won’t have to either. The primes have lots of ex-generals/colonels, etc.

Questions:

  1. Select a paper from one of the TAC conferences. Update on the status of that research. (3-5 pages, citations)
  2. For the authors of #1, annotated bibliography of publications since the paper.
  3. How would you use the technique from #1 in the construction of a topic map? Inform your understanding, selection, data for that map, etc.? (3-5 pages, no citations)

(Yes, I stole the questions from my DUC conference posting. ;-))

DUC: Document Understanding Conferences

Filed under: Conferences,Data Source,Summarization — Patrick Durusau @ 8:19 am

DUC: Document Understanding Conferences

From the website:

There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA’s TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA’s Advanced Question & Answering Program and NIST’s TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.

Within TIDES and among other researchers interested in document understanding, a group grew up which has been focusing on summarization and the evaluation of summarization systems. Part of the initial evaluation for TIDES called for a workshop to be held in the fall of 2000 to explore different ways of summarizing a common set of documents. Additionally a road mapping effort was started in March of 2000 to lay plans for a long-term evaluation effort in summarization.

Data sets, papers, etc., on text summarization.

Yes, DUC has moved to the Text Analysis Conference (TAC) but what they don’t say is that the DUC data and papers for 2001 to 2007 are listed at this site only.

Something to remember when you are looking for text summarization data sets and research.

Questions:

  1. Select a paper from the 2007 DUC conference. Update on the status of that research. (3-5 pages, citations)
  2. For the authors of #1, annotated bibliography of publications since the paper in 2007.
  3. How would you use the technique from #1 in the construction of a topic map? Inform your understanding, selection, data for that map, etc.? (3-5 pages, no citations)

Ontology Based Graphical Query Language Supporting Recursion

Filed under: Ontology,Query Language,Semantic Web,Visual Query Language — Patrick Durusau @ 7:55 am

Ontology Based Graphical Query Language Supporting Recursion Author(s): Arun Anand Sadanandan, Kow Weng Onn and Dickson Lukose Keywords: Visual Query Languages, Visual Query Systems, Visual Semantic Query, Graphical Recursion, Semantic Web, Ontologies

Abstract:

Text based queries often tend to be complex, and may result in non user friendly query structures. However, querying information systems using visual means, even for complex queries has proven to be more efficient and effective as compared to text based queries. This is owing to the fact that visual systems make way for better human-computer communication. This paper introduces an improved query system using a Visual Query Language. The system allows the users to construct query graphs by interacting with the ontology in a user friendly manner. The main purpose of the system is to enable efficient querying on ontologies even by novice users who do not have an in-depth knowledge of internal query structures. The system also supports graphical recursive queries and methods to interpret recursive programs from these visual query graphs. Additionally, we have performed some preliminary usability experiments to test the efficiency and effectiveness of the system.

From the abstract I was expecting visual representation of the subjects that form the query. The interface remains abstract but is a good step in the direction of a more useful query interface for the non-expert. (Which we all are in some domain.)
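
As a plain-text stand-in for what a recursive query computes, consider transitive closure over an ontology relation. The system in the paper lets users express this graphically; the relation below is invented.

    def closure(graph, start):
        """All nodes reachable from start via a relation, recursively."""
        seen = set()
        def walk(node):
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    walk(nxt)
        walk(start)
        return seen

    # A toy subClassOf relation.
    sub_class_of = {"Novel": ["Book"], "Book": ["Document"], "Document": ["Thing"]}
    print(closure(sub_class_of, "Novel"))  # {'Book', 'Document', 'Thing'}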

Questions:

  1. Compare to your experience with query language interfaces. (3-5 pages, no citations)
  2. Are recursive queries important for library catalogs? (3-5 pages, no citations, but use examples to make your case, pro or con)
  3. Suggestions for a visual query language for the current TMQL draft? (research project)

November 20, 2010

From Documents To Targets: Geographic References

Filed under: Associations,Geographic Information Retrieval,Ontology,Spatial Index — Patrick Durusau @ 9:18 pm

Exploiting geographic references of documents in a geographical information retrieval system using an ontology-based index Author(s): Nieves R. Brisaboa, Miguel R. Luaces, Ángeles S. Places and Diego Seco Keywords: Geographic information retrieval, Spatial index, Textual index, Ontology, System architecture

Abstract:

Both Geographic Information Systems and Information Retrieval have been very active research fields in the last decades. Lately, a new research field called Geographic Information Retrieval has appeared from the intersection of these two fields. The main goal of this field is to define index structures and techniques to efficiently store and retrieve documents using both the text and the geographic references contained within the text. We present in this paper two contributions to this research field. First, we propose a new index structure that combines an inverted index and a spatial index based on an ontology of geographic space. This structure improves the query capabilities of other proposals. Then, we describe the architecture of a system for geographic information retrieval that defines a workflow for the extraction of the geographic references in documents. The architecture also uses the index structure that we propose to solve pure spatial and textual queries as well as hybrid queries that combine both a textual and a spatial component. Furthermore, query expansion can be performed on geographic references because the index structure is based in an ontology.

Obviously relevant to the Afghan War Diary materials.

The authors observe:

…concepts such as the hierarchical nature of geographic space and the topological relationships between the geographic objects must be considered….

Interesting, but topic maps would help with “What defensive or offensive assets do I have in a geographic area?”
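
The core of a hybrid query is easy to sketch: a keyword filter combined with a bounding-rectangle test. The paper’s contribution is answering such queries over an ontology-based index instead of the linear scan shown here; the documents are invented.

    documents = [
        {"text": "patrol report from kandahar", "lat": 31.62, "lon": 65.72},
        {"text": "supply convoy near kabul", "lat": 34.53, "lon": 69.17},
    ]

    def hybrid_query(term, lat_range, lon_range):
        """Linear-scan sketch of a combined textual + spatial query."""
        return [d for d in documents
                if term in d["text"]
                and lat_range[0] <= d["lat"] <= lat_range[1]
                and lon_range[0] <= d["lon"] <= lon_range[1]]

    print(hybrid_query("convoy", (34.0, 35.0), (69.0, 70.0)))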

Associations: The Kind They Pay For

Filed under: Associations,Authoring Topic Maps,Data Mining,Data Structures — Patrick Durusau @ 4:56 pm

Fun at a Department Store: Data Mining Meets Switching Theory Author(s): Anna Bernasconi, Valentina Ciriani, Fabrizio Luccio, Linda Pagli Keywords: SOP, Implicants, Data Mining, Frequent Itemsets, Blulife

Abstract:

In this paper we introduce new algebraic forms, SOP+ and DSOP+, to represent functions f: {0,1}ⁿ → ℕ, based on arithmetic sums of products. These expressions are a direct generalization of the classical SOP and DSOP forms.

We propose optimal and heuristic algorithms for minimal SOP+ and DSOP+ synthesis. We then show how the DSOP+ form can be exploited for Data Mining applications. In particular we propose a new compact representation for the database of transactions to be used by the LCM algorithms for mining frequent closed itemsets.

A new technique for extracting associations between items present (or absent) in transactions (sales transactions).

Of interest to people with the funds to pay for data mining and topic maps.

Topic maps are useful to bind the mining of such associations to other information systems, such as supply chains.
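
Frequent itemsets themselves are simple to state, as the toy sketch below shows. The paper’s contribution, the DSOP+ representation of the transaction database fed to the LCM algorithms, is not attempted here.

    from itertools import combinations
    from collections import Counter

    # Sales transactions at the department store.
    transactions = [{"tie", "shirt"}, {"tie", "shirt", "belt"},
                    {"shirt", "belt"}, {"tie", "shirt"}]

    def frequent_itemsets(min_support):
        counts = Counter()
        for t in transactions:
            for r in range(1, len(t) + 1):
                for combo in combinations(sorted(t), r):
                    counts[combo] += 1
        return {s: n for s, n in counts.items() if n >= min_support}

    print(frequent_itemsets(3))
    # {('shirt',): 4, ('tie',): 3, ('shirt', 'tie'): 3}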

Questions:

  1. How would you use data mining of transaction associations to guide collection development? (3-5 pages, with citations)
  2. How would you use topic maps with the mining of transaction associations? (3-5 pages, no citations)
  3. How would you bind an absence of data to other information? (3-5 pages, no citations)

Observation: Intelligence agencies recognize the absence of data as an association. Binding that absence to other data is a job for topic maps.

Subjective Logic = Effective Logic?

Capture of Evidence for Summarization: An Application of Enhanced Subjective Logic

Author(s): Sukanya Manna, B. Sumudu U. Mendis, Tom Gedeon Keywords: subjective logic, opinions, evidence, events, summarization, information extraction

Abstract:

In this paper, we present a method to generate an extractive summary from a single document using subjective logic. The idea behind our approach is to consider words and their co-occurrences between sentences in a document as evidence of their relatedness to the contextual meaning of the document. Our aim is to formulate a measure to find out ‘opinion’ about a proposition (which is a sentence in this case) using subjective logic in a closed environment (as in a document). Stronger opinion about a sentence represents its importance and are hence considered to summarize a document. Summaries generated by our method when evaluated with human generated summaries, show that they are more similar than baseline summaries.

The authors justify their use of “subjective” logic by saying:

…pointed out that a given piece of text is interpreted by different persons in different ways, especially in how they understand and interpret the context. Thus we see that human understanding and reasoning is subjective in nature unlike propositional logic which deals with either truth or falsity of a statement. So, to deal with this kind of situation we used subjective logic to find out sentences which are significant in the context and can be used to summarize a document.

“Subjective” logic means we are more likely to reach the same result as a person reading the text.

Search results as used and evaluated by people.

That sounds like effective logic to me.
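
For readers new to Jøsang’s formalism, the basic object is an opinion (belief, disbelief, uncertainty, base rate) with b + d + u = 1. A minimal sketch follows, including the common mapping from positive/negative evidence counts to an opinion; how the paper derives evidence from word co-occurrence is its own contribution and is not reproduced here.

    from dataclasses import dataclass

    @dataclass
    class Opinion:
        """A subjective-logic opinion: belief + disbelief + uncertainty = 1."""
        belief: float
        disbelief: float
        uncertainty: float
        base_rate: float = 0.5

        def expectation(self):
            # Probability expectation: E = b + a * u
            return self.belief + self.base_rate * self.uncertainty

    def from_evidence(r, s, a=0.5):
        """Map r positive and s negative observations to an opinion:
        b = r/(r+s+2), d = s/(r+s+2), u = 2/(r+s+2)."""
        k = r + s + 2
        return Opinion(r / k, s / k, 2 / k, a)

    print(from_evidence(8, 1).expectation())  # strong support -> ~0.82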

Questions:

  1. Read Audun Jøsang’s article Artificial Reasoning with Subjective Logic.
  2. Summarize three (3) applications (besides the article above) of “subjective” logic. (3-5 pages, citations)
  3. How do you think “subjective” logic should be modeled in topic maps? (3-5 pages, citations optional)

Classification and Pattern Discovery of Mood in Weblogs

Filed under: Classification,Clustering,Pattern Recognition — Patrick Durusau @ 10:18 am

Classification and Pattern Discovery of Mood in Weblogs Author(s): Thin Nguyen, Dinh Phung, Brett Adams, Truyen Tran, Svetha Venkatesh

Abstract:

Automatic data-driven analysis of mood from text is an emerging problem with many potential applications. Unlike generic text categorization, mood classification based on textual features is complicated by various factors, including its context- and user-sensitive nature. We present a comprehensive study of different feature selection schemes in machine learning for the problem of mood classification in weblogs. Notably, we introduce the novel use of a feature set based on the affective norms for English words (ANEW) lexicon studied in psychology. This feature set has the advantage of being computationally efficient while maintaining accuracy comparable to other state-of-the-art feature sets experimented with. In addition, we present results of data-driven clustering on a dataset of over 17 million blog posts with mood groundtruth. Our analysis reveals an interesting, and readily interpreted, structure to the linguistic expression of emotion, one that comprises valuable empirical evidence in support of existing psychological models of emotion, and in particular the dipoles pleasure-displeasure and activation-deactivation.

The classification and pattern discovery of sentiment in weblogs will be a high priority for some topic maps.

Detection of teenagers who post to MySpace about violence, for example.
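
A sketch of the lexicon-based feature idea: average the valence and arousal of a post’s words under an ANEW-style lexicon. The ratings below are invented placeholders; the real ANEW list must be requested from its maintainers (see the links at the end of this post).

    # Invented (word -> (valence, arousal)) ratings, stand-ins for ANEW.
    LEXICON = {
        "happy": (8.2, 6.5), "gloomy": (2.4, 3.8),
        "fight": (3.1, 7.2), "calm": (6.9, 2.0),
    }

    def mood_features(post):
        """Mean valence and arousal over the post's lexicon words."""
        words = [w.strip(".,!?") for w in post.lower().split()]
        hits = [LEXICON[w] for w in words if w in LEXICON]
        if not hits:
            return None
        n = len(hits)
        return (sum(v for v, _ in hits) / n, sum(a for _, a in hits) / n)

    print(mood_features("Feeling gloomy, another fight at school"))  # (2.75, 5.5)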

Questions:

  1. How would you use this technique for research on weblogs? (3-5 pages, no citations)
  2. What other word lists could be applied to research on weblogs? Thoughts on how they could be applied? (3-5 pages, citations)
  3. Does the “mood” of a text impact its classification in traditional schemes? How would you test that question? (3-5 pages, no citations)

Additional resources:

Affective Norms for English Words (ANEW) Instruction Manual and Affective Ratings

ANEW Message: Request form for ANEW word list.

