Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

March 15, 2011

Expiring columns

Filed under: Cassandra,NoSQL — Patrick Durusau @ 5:08 am

Expiring columns

In Cassandra 0.7, there are expiring columns.

From the blog:

Sometimes, data comes with an expiration date, either by its nature or because it’s simply intractable to keep all of a rapidly growing dataset indefinitely.

In most databases, the only way to deal with such expiring data is to write a job running periodically to delete what is expired. Unfortunately, this is usually both error-prone and inefficient: not only do you have to issue a high volume of deletions, but you often also have to scan through lots of data to find what is expired.

Fortunately, Cassandra 0.7 has a better solution: expiring columns. Whenever you insert a column, you can specify an optional TTL (time to live) for that column. When you do, the column will expire after the requested amount of time and be deleted auto-magically (though asynchronously — see below). Importantly, this was designed to be as low-overhead as possible.

Now there is an interesting idea!
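
For the curious, here is a minimal sketch of what a TTL insert looks like from Python with pycassa (the keyspace, column family and column names are mine, purely for illustration):

    import pycassa

    # Connect to a running Cassandra 0.7+ cluster (keyspace name is hypothetical).
    pool = pycassa.ConnectionPool('CallsForPapers', ['localhost:9160'])
    cfp = pycassa.ColumnFamily(pool, 'Submissions')

    # The ttl argument (in seconds) marks the column for automatic expiration.
    # Here the submission link disappears 30 days after it is written.
    cfp.insert('balisage-2011',
               {'submission_url': 'http://example.org/cfp'},
               ttl=30 * 24 * 60 * 60)

After the TTL passes, reads simply stop returning the column; as the post says, the physical delete happens later, asynchronously.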

Goes along with the idea that a topic map does not (should not?) present a timeless view of information. That is, a topic map should maintain state so that we can determine what was known at any particular time.

Take a simple example, a call for papers for a conference. It could be that a group of conferences all share the same call for papers, the form, submission guidelines, etc. And that call for papers is associated with each conference by an association.

Shouldn’t we be able to set an expiration date on that association so that at some point in time, all those facilities are no longer available for that conference? Perhaps it switches over to another set of properties in the same association to note that the submission dates have passed? That would remove the necessity for the association expiring.

But there are cases where associations do expire or at least end. Divorce is an unhappy example. Being hired is a happier one.

Something to think about.

Getting Started with CouchDB

Filed under: CouchDB,NoSQL — Patrick Durusau @ 5:07 am

Getting Started with CouchDB

A tutorial introduction to CouchDB.

Fairly brief but covers most of the essentials.
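
If you want the flavor without the tutorial, CouchDB's interface is plain HTTP and JSON. A minimal sketch with the requests library (the database name and document are mine):

    import requests

    BASE = "http://localhost:5984"

    requests.put(f"{BASE}/cookbook")                     # create a database
    requests.post(f"{BASE}/cookbook",                    # add a JSON document
                  json={"title": "Getting Started with CouchDB",
                        "tags": ["couchdb", "nosql"]})

    # The database (and every document in it) is reachable by URL.
    print(requests.get(f"{BASE}/cookbook").json())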

Note to self: Would not be a bad model for a topic map tutorial introduction.

Redis, from the ground up

Filed under: NoSQL,Redis — Patrick Durusau @ 5:04 am

Redis, from the ground up

Mark J. Russo:

A deep dive into Redis’ origins, design decisions, feature set, and a look at a few potential applications.

Not all that you would want to know about Redis but enough to develop an appetite for more!
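
Since expiring data came up earlier today, worth noting that Redis has the same idea built in. A minimal redis-py sketch (the key name is mine):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    r.set("session:42", "abc123", ex=3600)   # value expires after an hour
    print(r.get("session:42"))               # b'abc123'
    print(r.ttl("session:42"))               # seconds remaining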

March 14, 2011

Topincs in the South-West

Filed under: News,Topic Map Software,Topic Maps — Patrick Durusau @ 8:58 am

Topincs in the South-West

Robert Cerny, author of Topincs:

Topincs, my software for rapid development of web databases. These ‘formalized Wikis’ use forms instead of Wiki-Markup and enable people with common domain knowledge to collaboratively edit data with a web browser in the office or on the road. There is no special technical skill required to participate in such a digital shared memory. A Topincs web database can be set up in little time without programming and can be extended on demand. It offers a generic data viewing and editing approach which lifts the wiki idea to structured data. For the same data a generic domain specific programming interface becomes available. It uses Topic Maps as its core technology.

Last two weeks in April 2011

I am sure Robert’s dance card is going to fill up quickly so:

Contact form: http://www.cerny-online.com/topincs/consulting

Email: robert@cerny-online.com

Let’s join together to make Robert’s trip an enjoyable and successful one.

Remember, a rising tide lifts all boats!

User Interface

Filed under: Interface Research/Design — Patrick Durusau @ 8:06 am

User Interface

From the website:

This is a collaboratively edited question and answer site for user interface researchers and experts. It’s 100% free, no registration required.

Another Q/A site.

Not sure of its immediate use to topic map interface design.

Curious what lessons we can draw from the web portal delivery of topic map content?

That is, were the web portal systems designed more as web interfaces than interfaces to topic maps?

If so, what would be the difference between the two?

Comments?

HTML5 and Topic Maps

Filed under: Interface Research/Design — Patrick Durusau @ 8:00 am

The demonstration at:

Julia Map

makes HTML5 look like a contender for topic map interfaces.*

Some other resources that may be of interest:

Dive into HTML5

HTML5Rocks

Wikipedia entry for HTML5

(X)HTML 5.0 Validator

(For the course: Offer extra credit for projects that use HTML5, with imagination. Duplicating what we can do now is awarded no points.)

*****
*Thinking in terms of the calculations necessary for some visualizations. Well, that and calculations of subject identity if you are inclined in that direction.

SimHash – Depends on Where You Start

Filed under: Duplicates — Patrick Durusau @ 7:58 am

I was reading Detecting Near-Duplicates for Web Crawling when I ran across the following requirement:

Near-Duplicate Detection

Why is it hard in a crawl setting?

  • Scale
    • Tens of billions of documents indexed
    • Millions of pages crawled every day
  • Need to decide quickly!

This presentation and SimHash: Hash-based Similarity Detection are both of interest to the topic maps community, since your near-duplicate may be my same subject.
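
The fingerprinting trick at the heart of SimHash is simple enough to sketch. A toy 64-bit version over word tokens (my own illustration, not code from either paper; production systems weight features and use shingles):

    import hashlib

    def simhash(tokens, bits=64):
        # Weighted bit-voting over per-token hashes: the classic SimHash construction.
        vote = [0] * bits
        for token in tokens:
            h = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16) & ((1 << bits) - 1)
            for i in range(bits):
                vote[i] += 1 if (h >> i) & 1 else -1
        fingerprint = 0
        for i in range(bits):
            if vote[i] > 0:
                fingerprint |= 1 << i
        return fingerprint

    def hamming(a, b):
        # Near-duplicates differ in only a few bit positions.
        return bin(a ^ b).count('1')

    doc1 = "the quick brown fox jumps over the lazy dog".split()
    doc2 = "the quick brown fox jumped over the lazy dog".split()
    print(hamming(simhash(doc1), simhash(doc2)))  # small distance -> near-duplicate candidates

Near-duplicates land within a few bits of each other, which is what makes the "decide quickly" requirement feasible at crawl scale.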

But the other aspect of this work that caught my eye was the starting presumption that near-duplicate detection always occurs under extreme conditions.

Questions:

  1. Do my considerations change if I have only a few hundred thousand documents? (3-5 pages, no citations)
  2. What similarity tests are computationally too expensive for millions/billions but that work for hundreds of thousands? (3-5 pages, no citations)
  3. How would you establish empirically the break point for the application of near-duplicate techniques? (3-5 pages, no citations)
  4. Establish the break points for selected near-duplicate measures. (project)
  5. Analysis of near-duplicate measures. What accounts for the difference in performance? (project)

Groovy

Filed under: Domain-Specific Languages,Groovy,Java — Patrick Durusau @ 7:56 am

Groovy

I am particularly interested in Groovy’s support for Domain-Specific Languages.

It occurs to me that providing users with a domain-specific language is very close to issues that surround the design of interfaces for users.

That is, you don’t write a “domain-specific language” and then expect others to use it. Well, you could, but uptake might be iffy.

Rather, the development of a “domain-specific language” is done with subject matter experts, and their views are incorporated into the language.

Sounds like that might be an interesting approach to authoring topic maps in some contexts.
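
To make the point concrete, here is a toy internal DSL sketch. I am writing it in Python rather than Groovy, purely for illustration; the vocabulary (conference, paper, author) is invented, but it is the kind of surface a subject matter expert could read back to you:

    class Conference:
        """A toy 'vocabulary object' a subject matter expert could read aloud."""
        def __init__(self, name):
            self.name = name
            self.papers = []

        def accepts(self, paper):
            self.papers.append(paper)
            return self

    class Paper:
        def __init__(self, title):
            self.title = title
            self.authors = []

        def by(self, *authors):
            self.authors.extend(authors)
            return self

    # The "program" reads close to how the domain expert talks:
    balisage = Conference("Balisage 2011")
    balisage.accepts(Paper("Subject Identity in the Wild").by("A. Author", "B. Author"))

    for paper in balisage.papers:
        print(paper.title, "-", ", ".join(paper.authors))

Groovy's builders and operator overloading let such a vocabulary read even closer to plain prose, which is part of the attraction.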

From the website:

  • is an agile and dynamic language for the Java Virtual Machine
  • builds upon the strengths of Java but has additional power features inspired by languages like Python, Ruby and Smalltalk
  • makes modern programming features available to Java developers with almost-zero learning curve
  • supports Domain-Specific Languages and other compact syntax so your code becomes easy to read and maintain
  • makes writing shell and build scripts easy with its powerful processing primitives, OO abilities and an Ant DSL
  • increases developer productivity by reducing scaffolding code when developing web, GUI, database or console applications
  • simplifies testing by supporting unit testing and mocking out-of-the-box
  • seamlessly integrates with all existing Java classes and libraries
  • compiles straight to Java bytecode so you can use it anywhere you can use Java

Questions:

  1. What areas of library activity already have Domain-Specific Languages, albeit not in executable computer syntaxes?
  2. Which ones do you think would benefit from the creation of an executable Domain-Specific Language?
  3. How would you use topic maps to document such a Domain-Specific Language?
  4. How would your topic map record changing interpretations over time for apparently constant terms?

MathJax

Filed under: News — Patrick Durusau @ 7:53 am

MathJax

From the website:

MathJax is an open source JavaScript display engine for mathematics that works in all modern browsers.

No more setup for readers. No more browser plugins. No more font installations…. It just works.

I have installed it for this blog.

Please let me know if you have issues viewing mathematics in future posts.

Graph-based Algorithms….

Filed under: Graphs,Information Retrieval,Natural Language Processing — Patrick Durusau @ 7:50 am

Graph-based Algorithms for Information Retrieval and Natural Language Processing

Tutorial at HLT/NAACL 2006 (June 4, 2006)

Rada Mihalcea and Dragomir Radev

From the slides:

  • Motivation
    • Graph-theory is a well studied discipline
    • So are the fields of Information Retrieval (IR) and Natural Language Processing (NLP)
    • Often perceived as completely different disciplines
  • Goal of the tutorial: provide an overview of methods and applications in IR and NLP that rely on graph-based algorithms, e.g.
    • Graph-based algorithms: graph traversal, min-cut algorithms, random walks
    • Applied to: Web search, text understanding, text summarization, keyword extraction, text clustering

Nice introduction to graph-theory and why we should care. A lot.
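
As a taste of the "random walks" entry, here is a minimal sketch of PageRank-style scoring over a small word co-occurrence graph (my own illustration, in the spirit of the keyword extraction the tutorial covers):

    def pagerank(graph, damping=0.85, iterations=50):
        # graph: node -> set of neighbour nodes (undirected co-occurrence links)
        nodes = list(graph)
        score = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iterations):
            new = {}
            for n in nodes:
                incoming = sum(score[m] / len(graph[m]) for m in nodes if n in graph[m])
                new[n] = (1 - damping) / len(nodes) + damping * incoming
            score = new
        return score

    # Words that co-occur in a window become neighbours.
    graph = {
        "graph":     {"random", "walk", "algorithm"},
        "random":    {"graph", "walk"},
        "walk":      {"graph", "random"},
        "algorithm": {"graph"},
    }
    for word, s in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
        print(f"{word:10s} {s:.3f}")

Swap words for documents and co-occurrence links for hyperlinks and the same walk is Web search.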

TextGraphs-6: Graph-based Methods for Natural Language Processing

Filed under: Conferences,Graphs,Natural Language Processing — Patrick Durusau @ 7:47 am

TextGraphs-6: Graph-based Methods for Natural Language Processing

From the website:

TextGraphs is at its SIXTH edition! This shows that two seemingly distinct disciplines, graph theoretic models and computational linguistics, are in fact intimately connected, with a large variety of Natural Language Processing (NLP) applications adopting efficient and elegant solutions from graph-theoretical framework. The TextGraphs workshop series addresses a broad spectrum of research areas and brings together specialists working on graph-based models and algorithms for NLP and computational linguistics, as well as on the theoretical foundations of related graph-based methods. This workshop series is aimed at fostering an exchange of ideas by facilitating a discussion about both the techniques and the theoretical justification of the empirical results among the NLP community members.

Special Theme: “Graphs in Structured Input/Output Learning”

Recent work in machine learning has provided interesting approaches to globally represent and process structures, e.g.:

  • graphical models, which encode observations, labels and their dependencies as nodes and edges of graphs
  • kernel-based machines which can encode graphs with structural kernels in the learning algorithms
  • SVM-struct and other max margin methods and the structured perceptron that allow for outputting entire structures like for example graphs

Important dates:

April 1, 2011 Submission deadline
April 25th, 2011 Notification of acceptance
May 6th, 2011 Camera-ready copies due
June 23rd, 2011 Textgraphs workshop at ACL-HLT 2011

As if Neo4J and Gremlin weren’t enough of an incentive to be interested in graph approaches. 😉

Association for Computational Linguistics: Human Language Technologies (2011 Portland)

Filed under: Computational Linguistics,Conferences — Patrick Durusau @ 7:40 am

Association for Computational Linguistics: Human Language Technologies (49th annual meeting)

The time for submitting papers is past but a quick look at the list of accepted papers gives plenty of reasons to attend.

To be held at the Portland Marriott Downtown Waterfront in Portland, Oregon, USA, June 19-24, 2011.

So you don’t miss 2012, it will be held on Jeju Island, Republic of Korea. I have been to Jeju Island. It is awesome!

Sixth International Conference on Knowledge Capture – K-Cap 2011

Sixth International Conference on Knowledge Capture – K-Cap 2011

From the website:

In today’s knowledge-driven world, effective access to and use of information is a key enabler for progress. Modern technologies not only are themselves knowledge-intensive technologies, but also produce enormous amounts of new information that we must process and aggregate. These technologies require knowledge capture, which involve the extraction of useful knowledge from vast and diverse sources of information as well as its acquisition directly from users. Driven by the demands for knowledge-based applications and the unprecedented availability of information on the Web, the study of knowledge capture has a renewed importance.

Researchers that work in the area of knowledge capture traditionally belong to several distinct research communities, including knowledge engineering, machine learning, natural language processing, human-computer interaction, artificial intelligence, social networks and the Semantic Web. K-CAP 2011 will provide a forum that brings together members of disparate research communities that are interested in efficiently capturing knowledge from a variety of sources and in creating representations that can be useful for reasoning, analysis, and other forms of machine processing. We solicit high-quality research papers for publication and presentation at our conference. Our aim is to promote multidisciplinary research that could lead to a new generation of tools and methodologies for knowledge capture.

Conference:

25 – 29 June 2011
Banff Conference Centre
Banff, Alberta, Canada

Call for papers has closed. Will try to post a note about the conference earlier next year.

Proceedings from previous conferences available through the ACM Digital Library – Knowledge Capture.

Let me know if you have trouble with the ACM link. I sometimes don’t get all of the tracking cruft stripped off of URLs correctly. There really should be a “clean” URL option for sites like the ACM.

Personal Semantic Data – PSD 2011

Filed under: Conferences,RDF,Semantic Web,Semantics — Patrick Durusau @ 6:51 am

Personal Semantic Data – PSD 2011

From the website:

Personal Semantic Data is scattered over several media, and while semantic technologies are already successfully deployed on the Web as well as on the desktop, data integration is not always straightforward. The transition from the desktop to a distributed system for Personal Information Management (PIM) raises new challenges which need to be addressed. These challenges overlap areas related to human-computer interaction, user modeling, privacy and security, information extraction, retrieval and matching.

With the growth of the Web, a lot of personal information is kept online, on websites like Google, Amazon, Flickr, YouTube, Facebook. We also store pieces of personal information on our computers, on our phones and other devices. All the data is important, that’s why we keep it, but managing such a fragmented system becomes a chore on its own instead of providing support and information for doing the tasks we have to do. Adding to the challenge are proprietary formats and locked silos (online or offline in applications).

The Semantic Web enables the creation of structured and interlinked data through the use of common vocabularies to describe it, and a common representation – RDF. Through projects like Linking Open Data (LOD), SIOC and FOAF, large amounts of data is available now on the Web in structured form, including personal information about people and their social relationships. Applying semantic technologies to the desktop resulted in the Semantic Desktop, which provides a framework for linking data on the desktop.

The challenge lies in extending the benefits of the semantic technologies across the borders of the different environments, and providing a uniform view of one’s personal information regardless of where it resides, which vocabularies were used to describe it and how it is represented. Sharing personal semantic data is also challenging, with privacy and security being two of the most important and difficult issues to tackle.

Important Dates:

15 April 2011 – Submission deadline
30 April 2011 – Author notification
10 May 2011 – Camera-ready version
26 June 2011 – Workshop day

I think the secret of semantic integration is that the more information becomes available, the more heterogeneous the systems and information become, and the greater the need for topic maps.

Mostly because replacing that many systems in a coordinated way, over the vast diversity of interests and users, simply isn’t possible.

Would be nice to have a showing of interest by topic maps at this workshop.

March 13, 2011

Text Analytics Tools and Runtime for IBM LanguageWare

Filed under: Text Analytics,Topic Maps — Patrick Durusau @ 4:26 pm

Text Analytics Tools and Runtime for IBM LanguageWare

From the website:

IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text. With support for more than 20 languages, LanguageWare is the ideal solution for extracting the value locked up in unstructured text information and exposing it to business applications. With the emerging importance of Business Intelligence and the explosion in text-based information, the need to exploit this “hidden” information has never been so great. LanguageWare technology not only provides the functionality to address this need, it also makes it easier than ever to create, manage and deploy analysis engines and their resources.

It comprises Java libraries with a large set of features and the linguistic resources that supplement them. It also comprises an easy-to-use Eclipse-based development environment for building custom text analysis applications. In a few clicks, it is possible to create and deploy UIMA (Unstructured Information Management Architecture) annotators that perform everything from simple dictionary lookups to more sophisticated syntactic and semantic analysis of texts using dictionaries, rules and ontologies.

The LanguageWare libraries provide the following non-exhaustive list of features: dictionary look-up and fuzzy look-up, lexical analysis, language identification, spelling correction, hyphenation, normalization, part-of-speech disambiguation, syntactic parsing, semantic analysis, facts/entities extraction and relationship extraction. For more details see the documentation.

The LanguageWare Resource Workbench provides a complete development environment for the building and customization of dictionaries, rules, ontologies and associated UIMA annotators. This environment removes the need for specialist knowledge of the underlying technologies of natural language processing or UIMA. In doing so, it allows the user to focus on the concepts and relationships of interest, and to develop analyzers which extract them from text without having to write any code. The resulting application code is wrapped as UIMA annotators, which can be seamlessly plugged into any application that is UIMA-compliant.

IBM has attracted a lot of attention with its Jeopardy-playing “Watson,” and that isn’t necessarily a bad thing.

Personally I am hopeful that it will spur a greater interest in both the humanities and CS. Humanities, because without them CS lacks a lot of interesting problems; CS, because that can result in software for the rest of us to use.

Many years ago, before CS became professional, or at least as professional as it is now, there was a healthy mixture of mathematicians, engineers, humanists and what would become computer scientists in computer science projects.

This software package may be a good way to attract a better cross-section of people to a project.

Not sure if finding others for collaboration will be easier in a university setting (with sharp department lines) or in a public setting where people may be looking for projects outside of work in the public interest.

Possible project questions:

  1. Define a project where you would use these text analytic tools. (3-5 pages, no citations)
  2. What other disciplines would you involve and how would you persuade them to participate? (3-5 pages, no citations)
  3. How would you involve topic maps in your project and why? (3-5 pages, no citations)
  4. How would you use these tools to populate your topic maps? (5-7 pages, no citations)

Zotonic – The Erlang CMS

Filed under: NoSQL,Zotonic — Patrick Durusau @ 4:25 pm

Zotonic – The Erlang CMS

From the documentation:

The Zotonic data model has two tables at its core. The rsc (resource aka page) table and the edge table. All other tables are for access control, visitor administration, configuration and other purposes.

For simplicity of communication the rsc record is often referred to as a page. As every rsc record can have their own page on the web site.

Zotonic is a mix between a traditional database and a triple store. Some page (rsc record) properties are stored as columns, some are serialized in a binary column and some are represented as directed edges to other pages.

In Zotonic there is no real distinction between rsc records that are a person, a news item, a video or something else. The only difference is the category of the rsc record. And the rsc’s category can be changed. Even categories and predicates are represented as rsc records and can, subsequently, have their own page on the web site.

Interesting last sentence: “Even categories and predicates are represented as rsc records and can, subsequently, have their own page on the web site.”

And one assumes the same to be true for categories and predicates in those “own page[s] on the web site.”
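
To see why that matters, here is a toy sketch of the rsc/edge split in Python. This is not Zotonic's schema, just my illustration of the point that predicates and categories are records like everything else:

    # Every "thing" -- pages, categories, even predicates -- is an rsc record.
    rsc = {
        1: {"title": "Zotonic", "category": "software"},
        2: {"title": "Erlang", "category": "language"},
        3: {"title": "written_in", "category": "predicate"},  # the predicate has its own record
    }

    # Edges are (subject, predicate, object) triples over rsc ids.
    edges = [
        (1, 3, 2),  # Zotonic --written_in--> Erlang
    ]

    for subj, pred, obj in edges:
        print(rsc[subj]["title"], rsc[pred]["title"], rsc[obj]["title"])

Give the predicate its own record and it can carry properties, have a page, be the subject of other edges. Which is very close to the move topic maps make when association types are themselves topics.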

Questions:

  1. What use would you make of a CMS in a library environment? (3-5 pages, no citations)
  2. What subject identity issues are left unresolved by a CMS, such as Zotonic? (3-5 pages, no citations)
  3. What use cases would you write for your library director/board/funding organization to add subject identity management to Zotonic? (3-5 pages, no citations)

It isn’t enough that you recognize a problem and have a cool solution, even an effective one.

That is a necessary but not sufficient condition for success.

An effective librarian can:

  1. Recognize an information problem
  2. Find an effective solution for it (within resource/budget constraints)
  3. Communicate #1 and #2 to others, especially decision makers

I know lots of people who can do #1.

A fair number who can do #2, but who sit around at lunch or the snack machine and bitch about how if they were in charge things would be different. Yeah, probably worse.

The trick is to be able to do #3.

Eye-Tracking Results

Filed under: Interface Research/Design — Patrick Durusau @ 4:23 pm

The Use of Eye-Tracking to Evaluate the Effects of Format, Search Type, and Search Engine on the User’s Processing of a Search Results Page

You know, they really should offer a short course on writing effective paper titles.

Maybe I should offer a strictly non-credit course, once a month, say for 3 hours.

Consisting solely of re-writing article titles to be interesting, informative and != complete sentences. 😉

I found the data section of the report puzzling because it appeared to me to present the data in both graphic and prose forms.

Puzzling because if the data could be clearly presented either way, then why both?

Not to mention that AOI appears 94 times in a 24-page document.

The research itself is interesting and merits a better presentation than it gets in this paper.

Do read the paper and look past its editorial issues.

More research like this, at least in terms of the care shown the design and execution of the research, could prove to be quite useful.

That is assuming the resulting data is publicly archived.

Maven-Lucene-Plugin

Filed under: Lucene — Patrick Durusau @ 4:22 pm

Maven-Lucene-Plugin

From the website:

This project is a maven plugin for Apache Lucene. Using it, a Lucene index (configuration inside a xml file) can be created from different datasources ( file/database/xml etc.). A Searcher Util helps in searching the index. Use Lucene without coding.

New project that is looking for volunteers.

Looks like a good way to learn more about Lucene while possibly making a contribution to the community.

March 12, 2011

UK Science, Media, Railway Data Dump!

Filed under: Dataset — Patrick Durusau @ 6:48 pm

Documentation for collections data from Science Museum, National Media Museum, National Railway Museum (NMSI) released as CSV was the original title.

OK, so I took some liberties with the title.

It is one thing to have an interesting data set. It is quite another to get enough attention to encourage its use.

Pass this along to science, media and railroad sites and lists. I am sure some of the partisans there will be interested.

Questions: (Remember, I promised to return to these.)

  1. Choose one of the collections. Describe the topic map you would create with the data. (4-6 pages, no citations)
  2. What aspects of your topic map make it easier to incorporate additional information from other sources? (4-6 pages, no citations)
  3. Outline your design of an interface for delivery of content from your topic map. (4-6 pages, no citations)
  4. For extra credit, up to and including no final, create your topic map. (subject to instructor approval)

Social Analytics on MongoDB

Filed under: Analytics,MongoDB,NoSQL — Patrick Durusau @ 6:48 pm

Social Analytics on MongoDB: Patrick Stokes of Buddy Media gives a highly entertaining presentation on MongoDB and its adoption by Buddy Media.

Unfortunately the slides don’t display during the presentation.

Still, refreshing in the honesty about the development process.

PS: I have written to ask about where to find the slides.

Update

You can find the slides at: http://www.slideshare.net/pstokes2/social-analytics-with-mongodb

Neo4j 1.3 “Abisko Lampa” M04 – Size really does matter

Filed under: Graphs,Neo4j,NoSQL — Patrick Durusau @ 6:47 pm

Neo4j 1.3 “Abisko Lampa” M04 – Size really does matter

Fourth milestone release on the way to Neo4J 1.3 release.

From the announcement:

A database can now contain 32 billion nodes, 32 billion relationships and 64 billion properties. Before this, you had to make do with a puny 4 billion nodes, 4 billion relationships and 4 billion properties. Finally, every single person on Earth can have their own personal node! And did we mention this is happening without adding even one byte to the size of your database?

Well, they are graph database, not population, experts. (The current world population being just a few shy of 32 billion. At last count.)

😉

Still, shadows of things that will be, must be.

Allura

Filed under: Marketing,Software,Topic Maps — Patrick Durusau @ 6:47 pm

Allura

From the website:

Allura is an open source implementation of a software “forge”, a web site that manages source code repositories, bug reports, discussions, mailing lists, wiki pages, blogs and more for any number of individual projects.

SourceForge.net is running an instance of Allura (aka New Forge, or Forge 2.0)….

Among the many areas where topic maps could make a noticeable difference is software development.

If you have ever tried to use any of the report databases, maintained by either commercial vendors or open source projects, you know what I mean.

Some are undoubtedly better than others but I have never seen one I would want to re-visit.

But, no source code management project is going to simply adopt topic maps because you or I suggest it or someone else thinks it is a good idea.

Well, it’s an open project, so here is your chance to work towards topic maps becoming part of this project!

Before you join the discussion lists, etc., a few questions/suggestions:

  1. Spend some time studying the project and its code. What are its current priorities? How can you contribute to those, so that later suggestions by you may find favor?
  2. Where in a source code management system is subject identity the most critical? Suggest you find 2 or at the most 3 and then propose changes for only 1 initially.
  3. How would you measure the difference that management of subject identity makes for participants? (Whether they are aware of the contribution of topic maps or not.)

Learn MongoDB Basics

Filed under: MongoDB,NoSQL — Patrick Durusau @ 6:47 pm

Learn MongoDB Basics

Covers the basics of MongoDB.
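
If you want to follow along locally rather than in the browser, the same basics look roughly like this with pymongo (database and collection names are mine):

    from pymongo import MongoClient

    client = MongoClient("localhost", 27017)
    db = client.tutorial                      # databases and collections are created lazily
    posts = db.posts

    posts.insert_one({"title": "Learn MongoDB Basics", "tags": ["mongodb", "nosql"]})
    for doc in posts.find({"tags": "nosql"}):
        print(doc["title"])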

One nice aspect is the immediate feedback/reinforcement of the principles being taught.

I can imagine someone creating a resource for topic maps along these lines.

Hopefully both a web as well as local version.

I don’t think the benefits of using topic maps are in dispute.

What is unclear is how to convey those benefits to users.

Comments/suggestions?

Questions:

  1. What aspects of the Learn MongoDB site did you find the least/most helpful? (2-3 pages, no citations)
  2. How would you construct a topic map tutorial using this as a guide? (4-6 pages, no citations)
  3. What other illustrations would you use to convey the advantages of topic maps? (4-6 pages, no citations)

Combining Pattern Classifiers: Methods and Algorithms

Filed under: Bayesian Models,Classifier,Classifier Fusion,Linear Regression,Neighbors — Patrick Durusau @ 6:46 pm

Combining Pattern Classifiers: Methods and Algorithms, Ludmila I. Kuncheva (2004)

WorldCat entry: Combining Pattern Classifiers: Methods and Algorithms

From the preface:

Everyday life throws at us an endless number of pattern recognition problems: smells, images, voices, faces, situations, and so on. Most of these problems we solve at a sensory level or intuitively, without an explicit method or algorithm. As soon as we are able to provide an algorithm the problem becomes trivial and we happily delegate it to the computer. Indeed, machines have confidently replaced humans in many formerly difficult or impossible, now just tedious pattern recognition tasks such as mail sorting, medical test reading, military target recognition, signature verification, meteorological forecasting, DNA matching, fingerprint recognition, and so on.

In the past, pattern recognition focused on designing single classifiers. This book is about combining the “opinions” of an ensemble of pattern classifiers in the hope that the new opinion will be better than the individual ones. “Vox populi, vox Dei.”

The field of combining classifiers is like a teenager: full of energy, enthusiasm, spontaneity, and confusion; undergoing quick changes and obstructing the attempts to bring some order to its cluttered box of accessories. When I started writing this book, the field was small and tidy, but it has grown so rapidly that I am faced with the Herculean task of cutting out a (hopefully) useful piece of this rich, dynamic, and loosely structured discipline. This will explain why some methods and algorithms are only sketched, mentioned, or even left out and why there is a chapter called “Miscellanea” containing a collection of important topics that I could not fit anywhere else.
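
The core idea, combining "opinions", fits in a few lines. A toy majority-vote ensemble (my own illustration, nothing from the book):

    from collections import Counter

    def majority_vote(classifiers, x):
        # Each classifier is a function x -> label; the ensemble takes the most common opinion.
        votes = [clf(x) for clf in classifiers]
        return Counter(votes).most_common(1)[0][0]

    # Three deliberately crude "classifiers" over a single numeric feature.
    ensemble = [
        lambda x: "spam" if x > 0.5 else "ham",
        lambda x: "spam" if x > 0.7 else "ham",
        lambda x: "spam" if x > 0.3 else "ham",
    ]

    print(majority_vote(ensemble, 0.6))  # two of the three say "spam"

Much of the book is about when and why the combined opinion beats the best individual one, and about smarter combiners than a straight vote.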

Appreciate the author’s suggestion of older material to see how pattern recognition developed.

Suggestions/comments on this or later literature on pattern recognition?

March 11, 2011

Now We Are 1 (and a few days)

Filed under: News — Patrick Durusau @ 7:39 pm

I suppose it was just the press of blog entries, but I missed the one-year anniversary of Another Word For It.

Hard to think that a year has gone by since Is 00.7% of Relevant Documents Enough?, 1 March 2010.

Although I write as much for myself as for any audience, I do hope that at least some of my posts prove useful for others.

One of the things I have learned in the past year is that topic map relevant content is just about anywhere you look for it.

From data sets that would be interesting to see topic mapped, to data mining techniques that would help in identifying likely subjects in such data sets, to processing techniques for both data sets and topic maps, to say nothing of all the issues related to interfaces, there is no shortage of material.

I have gotten lax on suggesting questions and activities for most posts, so look for that to make a comeback.

One of the reasons for the blog is to gather material for my graduate class on topic maps.

Either I can do the questions as I gather the materials or face a big rush to do them when I am putting course materials together. The better idea is to do them as I find the material.

In the coming year I need to review my prior posts for follow-ups and expansions where necessary.

Speaking of spending time on the blog, let me call your attention to my Donations page.

Your support will help keep this blog an active source of information on topic maps and semantic diversity! (Thanks!)

factorie: Probabilistic programming with imperatively-defined factor graphs

Filed under: Factor Graphs,Probabilistic Programming — Patrick Durusau @ 7:01 pm

factorie: Probabilistic programming with imperatively-defined factor graphs

The website says factorie has been applied to:

FACTORIE has been successfully applied to various tasks in natural language processing and information integration, including

  • named entity recognition
  • entity resolution
  • relation extraction
  • parsing
  • schema matching
  • ontology alignment
  • latent-variable generative models, including latent Dirichlet allocation.

Sound like topic map tasks to me!

Currently at version 0.90, but the website indicates that the project is planning on a 1.0 release in early 2011.

Just so you know what you are looking forward to:

FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference. Key features:

  • It is object-oriented, enabling encapsulation, abstraction and inheritance in the definition of random variables, factors, inference and learning methods.
  • It is scalable, with demonstrated success on problems with many millions of variables and factors, and on models that have changing structure, such as case factor diagrams. It has also been plugged into a database back-end, representing a new approach to probabilistic databases capable of handling billions of variables.
  • It is flexible, supporting multiple modeling and inference paradigms. Its original emphasis was on conditional random fields, undirected graphical models, MCMC inference, online training, and discriminative parameter estimation. However, it now also supports directed generative models (such as latent Dirichlet allocation), and has preliminary support for variational inference, including belief propagation and mean-field methods.
  • It is embedded into a general purpose programming language, providing model authors with familiar and extensive resources for implementing the procedural aspects of their solution, including the ability to beneficially mix data pre-processing, diagnostics, evaluation, and other book-keeping code in the same files as the probabilistic model specification.
  • It allows the use of imperative (procedural) constructs to define the factor graph—an unusual and powerful facet that enables significant efficiencies and also supports the injection of both declarative and procedural domain knowledge into model design.

The structure of generative models can be expressed as a program that describes the generative storyline. The structure of undirected graphical models can be specified in an entity-relationship language, in which the factor templates are expressed as compatibility functions on arbitrary entity-relationship expressions; alternatively, factor templates may also be specified as formulas in first-order logic. However, most generally, data can be stored in arbitrary data structures (much as one would in deterministic programming), and the connectivity patterns of factor templates can be specified in a Turing-complete imperative style. This usage of imperative programming to define various aspects of factor graph construction and operation is an innovation originated in FACTORIE; we term this approach imperatively-defined factor graphs. The above three methods for specifying relational factor graph structure can be mixed in the same model.
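
Factorie itself is a Scala library, so to keep the sketches on this blog in one language, here is a language-neutral toy of what a factor graph is: variables, plus factors that score joint assignments. This is my illustration of the concept, not factorie's API:

    import itertools

    # Two binary variables and two factors that score their joint assignments.
    variables = {"RAIN": (0, 1), "WET_GRASS": (0, 1)}

    def prior_rain(a):
        # Unary factor: rain is less likely than no rain.
        return 0.2 if a["RAIN"] == 1 else 0.8

    def rain_wets_grass(a):
        # Pairwise factor: wet grass is much more compatible with rain.
        table = {(1, 1): 0.9, (1, 0): 0.1, (0, 1): 0.3, (0, 0): 0.7}
        return table[(a["RAIN"], a["WET_GRASS"])]

    factors = [prior_rain, rain_wets_grass]

    def score(a):
        result = 1.0
        for f in factors:
            result *= f(a)
        return result

    # Brute-force the most probable assignment; at factorie's scale the hard part is inference.
    best = max(
        (dict(zip(variables, values)) for values in itertools.product(*variables.values())),
        key=score,
    )
    print(best)  # {'RAIN': 0, 'WET_GRASS': 0}

Factorie's point, per the quote above, is that the connectivity of those factors can be defined imperatively, in ordinary Scala, rather than only in a fixed template language.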

S3T 2011: Third International Conference on Software, Services and Semantic Technologies

Filed under: Conferences,Semantics — Patrick Durusau @ 7:00 pm

S3T 2011: Third International Conference on Software, Services and Semantic Technologies

From the announcement:

S3T 2011 is the third conference in a series aimed at providing a forum for connecting researchers and international research communities for worldwide dissemination and sharing of ideas and results in the areas of Information and Communication Technologies, and more specifically in Software, Services and Intelligent Content and Semantics. Four coherently interrelated tracks will be arranged in the two-day conference including Software and services, Intelligent content and semantics, Technology enhanced learning, and Knowledge management, Business intelligence, and Innovation. Researchers and graduate students are welcomed to participate in paper presentations, doctoral student consortia and panel discussions under the themes of the conference tracks. The conference is sponsored by the F7 EU SISTER Project and hosted by Sofia University.

Important Dates:

Submission of papers May 4, 2011
Notification of acceptance June 15, 2011
Submission of final versions July 15, 2011
Early registration July 05, 2011
Registration July 25, 2011

Conference: September 1 – 2, 2011, Bourgas, Bulgaria

Darina Dicheva, one of the conference chairs, is a long-time topic map supporter/booster, so let’s show our support by submitting topic map-based papers for this conference!

agamemnon

Filed under: Cassandra,Graphs,NoSQL — Patrick Durusau @ 6:59 pm

agamemnon

From the website:

Agamemnon is a thin library built on top of pycassa. It allows you to use the Cassandra database (http://cassandra.apache.org) as a graph database. Much of the api was inspired by the excellent neo4j.py project (http://components.neo4j.org/neo4j.py/snapshot/)

Thanks to Jack Park for pointing this out!

March 10, 2011

evo*2011

Filed under: Data Mining,Evolutionary,Machine Learning — Patrick Durusau @ 12:32 pm

evo*2011

From the website:

evo* comprises the premier co-located conferences in the field of Evolutionary Computing: eurogp, evocop, evobio and evoapplications.

Featuring the latest in theoretical and applied research, evo* topics include recent genetic programming challenges, evolutionary and other meta-heuristic approaches for combinatorial optimization, evolutionary algorithms, machine learning and data mining techniques in the biosciences, in numerical optimization, in music and art domains, in image analysis and signal processing, in hardware optimization and in a wide range of applications to scientific, industrial, financial and other real-world problems.

Conference is 27-29 April 2011 in Torino, Italy.

Even if you are not in the neighborhood, the paper abstracts make an interesting read!

PSB 2012

Filed under: Bioinformatics,Biomedical,Conferences — Patrick Durusau @ 11:49 am

PSB 2012

From the website:

The Pacific Symposium on Biocomputing (PSB) 2012 is an international, multidisciplinary conference for the presentation and discussion of current research in the theory and application of computational methods in problems of biological significance. Papers and presentations are rigorously peer reviewed and are published in an archival proceedings volume. PSB 2012 will be held January 3-7, 2012 at the Fairmont Orchid on the Big Island of Hawaii. Tutorials will be offered prior to the start of the conference.

PSB 2012 will bring together top researchers from the US, the Asian Pacific nations, and around the world to exchange research results and address open issues in all aspects of computational biology. PSB is a forum for the presentation of work in databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology.

The PSB has been designed to be responsive to the need for critical mass in sub-disciplines within biocomputing. For that reason, it is the only meeting whose sessions are defined dynamically each year in response to specific proposals. PSB sessions are organized by leaders in the emerging areas and targeted to provide a forum for publication and discussion of research in biocomputing’s “hot topics.” In this way, PSB provides an early forum for serious examination of emerging methods and approaches in this rapidly changing field.

Proceedings from 1996 are available online (approx. 90%).

I will be looking through the proceedings to pull out the ones that may be of particular interest to the topic maps community.
