Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

June 4, 2012

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN [part 4]

Filed under: Bayesian Data Analysis,Neural Networks,Support Vector Machines — Patrick Durusau @ 4:29 pm

Predictive Analytics: NeuralNet, Bayesian, SVM, KNN by Ricky Ho.

From the post:

Continuing from my previous blog, I’m walking down the list of Machine Learning techniques. In this post, we’ll be covering Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor. Again, we’ll be using the same iris data set that we prepared in the last blog.

Ricky continues his march through machine learning techniques. This post promises one more to go.
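
Not Ricky's code, but a rough sketch of the same exercise in Python (scikit-learn and its bundled iris data are assumed here), for readers who want to try the techniques as they read:

```python
# Illustrative only: fit the classifiers Ricky covers on the iris data set
# and compare test accuracy. Assumes scikit-learn is installed.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42)

for name, model in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                    ("SVM", SVC(kernel="rbf")),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```

Each classifier gets the same train/test split, which keeps the accuracy numbers roughly comparable.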

June 3, 2012

Predictive Analytics: Generalized Linear Regression [part 3]

Filed under: Linear Regression,Machine Learning,Predictive Analytics — Patrick Durusau @ 3:41 pm

Predictive Analytics: Generalized Linear Regression by Ricky Ho.

From the post:

In the previous 2 posts, we have covered how to visualize input data to explore strong signals as well as how to prepare input data into a form that is suitable for learning. In this and subsequent posts, I’ll go through various machine learning techniques to build our predictive model.

  1. Linear regression
  2. Logistic regression
  3. Linear and Logistic regression with regularization
  4. Neural network
  5. Support Vector Machine
  6. Naive Bayes
  7. Nearest Neighbor
  8. Decision Tree
  9. Random Forest
  10. Gradient Boosted Trees

There are two general types of problems that we are interested in for this discussion: Classification is about predicting a category (a value that is discrete and finite, with no ordering implied), while Regression is about predicting a numeric quantity (a value that is continuous and infinite, with ordering).

For the classification problem, we use the “iris” data set and predict its “species” from the “width” and “length” measures of its sepals and petals. Here is how we set up our training and testing data.

Ricky walks you through linear regression, logistic regression and linear and logistic regression with regularization.
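
To make the classification/regression distinction above concrete, here is a minimal Python sketch (again assuming scikit-learn; this is not Ricky's R code): species prediction as classification, petal width prediction as regression.

```python
# Minimal sketch of the two problem types on iris: a discrete label
# (species) via logistic regression, a continuous value (petal width)
# via ordinary linear regression.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

iris = load_iris()

# Classification: predict the discrete species label.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=200).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: predict the continuous petal width from the other measures.
X = iris.data[:, :3]   # sepal length, sepal width, petal length
y = iris.data[:, 3]    # petal width
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X, y, test_size=0.2, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```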

Semi-Supervised Named Entity Recognition:… [Marketing?]

Filed under: Entities,Entity Extraction,Entity Resolution,Marketing — Patrick Durusau @ 3:40 pm

Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision by David Nadeau (PhD Thesis, University of Ottawa, 2007).

Abstract:

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

Nadeau demonstrates the successful construction of a Named Entity Recognition (NER) system using only a few supplied examples for each entity type.

But what explains the lack of annotated data where the entities are well known? Take the King James Bible and search for “Joseph”: we know that not all occurrences of “Joseph” represent the same entity.

Looking at the client list for Infoglutton, is there a lack of interest in named entity recognition?

Have we focused on techniques and issues that interest us, and then, as an afterthought, tried to market the results to consumers?
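
Marketing aside, the seed-plus-bootstrap idea at the heart of the thesis is easy to illustrate. The toy below is not Nadeau's system, just the general pattern: a few seed names per type, contexts learned from where the seeds occur, new candidates harvested from those contexts. The corpus, the seeds and the one-word context window are all invented for the example.

```python
# Toy illustration of seed-based, semi-supervised NER (not Nadeau's system):
# learn left contexts from seed names, then propose new names from those
# contexts.
import re
from collections import defaultdict

corpus = [
    "Dr. Smith visited Paris last spring.",
    "Dr. Jones met the mayor of Berlin.",
    "Dr. Brown flew to Paris for the meeting.",
]
seeds = {"PERSON": {"Smith", "Jones"}, "CITY": {"Paris", "Berlin"}}

# Learn one-word-left contexts for each seed type.
contexts = defaultdict(set)
for etype, names in seeds.items():
    for sentence in corpus:
        tokens = re.findall(r"\w+\.?", sentence)
        for i, tok in enumerate(tokens):
            if tok in names and i > 0:
                contexts[etype].add(tokens[i - 1])

# Harvest new candidates that share a learned context.
candidates = defaultdict(set)
for etype, ctxs in contexts.items():
    for sentence in corpus:
        tokens = re.findall(r"\w+\.?", sentence)
        for i in range(1, len(tokens)):
            if tokens[i - 1] in ctxs and tokens[i] not in seeds[etype]:
                candidates[etype].add(tokens[i])

print(dict(contexts))    # e.g. PERSON context: 'Dr.'
print(dict(candidates))  # e.g. 'Brown' proposed as a PERSON
```

Real systems add the noise filtering and disambiguation steps the abstract describes; the toy happily proposes junk candidates.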

A Survey of Named Entity Recognition and Classification

Filed under: Entities,Entity Extraction,Entity Resolution — Patrick Durusau @ 3:40 pm

A Survey of Named Entity Recognition and Classification by David Nadeau, Satoshi Sekine (Journal of Linguisticae Investigationes 30:1; 2007)

Abstract:

The term “Named Entity”, now widely used in Natural Language Processing, was coined for the Sixth Message Understanding Conference (MUC-6) (R. Grishman & Sundheim 1996). At that time, MUC was focusing on Information Extraction (IE) tasks where structured information of company activities and defense related activities is extracted from unstructured text, such as newspaper articles. In defining the task, people noticed that it is essential to recognize information units like names, including person, organization and location names, and numeric expressions including time, date, money and percent expressions. Identifying references to these entities in text was recognized as one of the important sub-tasks of IE and was called “Named Entity Recognition and Classification (NERC)”. We present here a survey of fifteen years of research in the NERC field, from 1991 to 2006. While early systems were making use of handcrafted rule-based algorithms, modern systems most often resort to machine learning techniques. We survey these techniques as well as other critical aspects of NERC such as features and evaluation methods. It was indeed concluded in a recent conference that the choice of features is at least as important as the choice of technique for obtaining a good NERC system (E. Tjong Kim Sang & De Meulder 2003). Moreover, the way NERC systems are evaluated and compared is essential to progress in the field. To the best of our knowledge, NERC features, techniques, and evaluation methods have not been surveyed extensively yet. The first section of this survey presents some observations on published work from the point of view of activity per year, supported languages, preferred textual genre and domain, and supported entity types. It was collected from the review of a hundred English language papers sampled from the major conferences and journals. We do not claim this review to be exhaustive or representative of all the research in all languages, but we believe it gives a good feel for the breadth and depth of previous work. Section 2 covers the algorithmic techniques that were proposed for addressing the NERC task. Most techniques are borrowed from the Machine Learning (ML) field. Instead of elaborating on techniques themselves, the third section lists and classifies the proposed features, i.e., descriptions and characteristic of words for algorithmic consumption. Section 4 presents some of the evaluation paradigms that were proposed throughout the major forums. Finally, we present our conclusions.

A bit dated now (2007), but a good starting point for named entity recognition research. The bibliography runs a little over four pages, and running those citations forward should capture most of the current research.

Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity

Filed under: Entity Extraction,Entity Resolution,Named Entity Mining — Patrick Durusau @ 3:38 pm

Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity by David Nadeau, Peter D. Turney and Stan Matwin.

Abstract:

In this paper, we propose a named-entity recognition (NER) system that addresses two major limitations frequently discussed in the field. First, the system requires no human intervention such as manually labeling training data or creating gazetteers. Second, the system can handle more than the three classical named-entity types (person, location, and organization). We describe the system’s architecture and compare its performance with a supervised system. We experimentally evaluate the system on a standard corpus, with the three classical named-entity types, and also on a new corpus, with a new named-entity type (car brands).

The authors report the successful application of their techniques to more than 50 named-entity types.

They also describe the heuristics they apply to texts during the mining process.

Is there a common repository of observations or heuristics for mining texts? Just curious.

Source code for the project: http://balie.sourceforge.net.

Answer to the question I just posed?

A Resource-Based Method for Named Entity Extraction and Classification

Filed under: Entities,Entity Extraction,Entity Resolution,Law,Named Entity Mining — Patrick Durusau @ 3:37 pm

A Resource-Based Method for Named Entity Extraction and Classification by Pablo Gamallo and Marcos Garcia. (Lecture Notes in Computer Science, vol. 7026, Springer-Verlag, pp. 610-623. ISSN: 0302-9743).

Abstract:

We propose a resource-based Named Entity Classification (NEC) system, which combines named entity extraction with simple language-independent heuristics. Large lists (gazetteers) of named entities are automatically extracted making use of semi-structured information from the Wikipedia, namely infoboxes and category trees. Language independent heuristics are used to disambiguate and classify entities that have been already identified (or recognized) in text. We compare the performance of our resource-based system with that of a supervised NEC module implemented for the FreeLing suite, which was the winner system in CoNLL-2002 competition. Experiments were performed over Portuguese text corpora taking into account several domains and genres.

Of particular interest if you want to add NEC resources to the FreeLing project.

The introduction starts off:

Named Entity Recognition and Classification (NERC) is the process of identifying and classifying proper names of people, organizations, locations, and other Named Entities (NEs) within text.

Curious: what happens if you don’t have a “named” entity? That is, an entity mentioned in the text that doesn’t (yet) have a proper name?

Thinking of legal texts where some provision may apply to all corporations that engage in activity Y and that have a gross annual income in excess of amount X.

I may want to “recognize” that entity so I can then put a name with that entity.
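
A hedged sketch of what that "recognition before naming" might look like: a pattern that extracts the description of the unnamed corporate entity so it can later be reconciled with named corporations. The provision text, field names and threshold are all made up for the illustration.

```python
# Illustrative only: pull out the *described* (unnamed) entity from a
# legal-style provision so it can later be matched against named entities.
import re

provision = ("This section applies to any corporation that engages in "
             "activity Y and has gross annual income in excess of $10,000,000.")

pattern = re.compile(
    r"corporation that engages in (?P<activity>[\w\s]+?) "
    r"and has gross annual income in excess of \$(?P<threshold>[\d,]+)",
    re.IGNORECASE)

m = pattern.search(provision)
if m:
    entity = {
        "kind": "corporation",   # an entity class, not a proper name
        "activity": m.group("activity").strip(),
        "income_floor": int(m.group("threshold").replace(",", "")),
    }
    print(entity)
    # {'kind': 'corporation', 'activity': 'activity Y', 'income_floor': 10000000}
```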

Reconcile – Coreference Resolution Engine

Filed under: Coreference Resolution,Natural Language Processing — Patrick Durusau @ 3:36 pm

Reconcile – Coreference Resolution Engine

While we are on the topic of NLP tools:

Reconcile is an automatic coreference resolution system that was developed to provide a stable test-bed for researchers to implement new ideas quickly and reliably. It achieves roughly state of the art performance on many of the most common coreference resolution test sets, such as MUC-6, MUC-7, and ACE. Reconcile comes ready out of the box to train and test on these common data sets (though the data sets are not provided) as well as the ability to run on unlabeled texts. Reconcile utilizes supervised machine learning classifiers from the Weka toolkit, as well as other language processing tools such as the Berkeley Parser and Stanford Named Entity Recognition system.

The source language is Java, and it is freely available under the GPL.

Just in case you want to tune/tweak your coreference resolution against your data sets.

FreeLing 3.0 – An Open Source Suite of Language Analyzers

FreeLing 3.0 – An Open Source Suite of Language Analyzers

Features:

Main services offered by FreeLing library:

  • Text tokenization
  • Sentence splitting
  • Morphological analysis
  • Suffix treatment, retokenization of clitic pronouns
  • Flexible multiword recognition
  • Contraction splitting
  • Probabilistic prediction of unknown word categories
  • Named entity detection
  • Recognition of dates, numbers, ratios, currency, and physical magnitudes (speed, weight, temperature, density, etc.)
  • PoS tagging
  • Chart-based shallow parsing
  • Named entity classification
  • WordNet based sense annotation and disambiguation
  • Rule-based dependency parsing
  • Nominal coreference resolution

[Not all features are supported for all languages, see Supported Languages.]
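
For a feel of what the first two services (tokenization and sentence splitting) produce, here is a regex-only toy in Python. It is emphatically not FreeLing's API, which is a C++ library with its own analyzers; it just shows the shape of the output such a pipeline gives you.

```python
# Rough, regex-only illustration of tokenization and sentence splitting.
import re

text = "FreeLing 3.0 is out. It analyzes text in several languages!"

# Sentence splitting on terminal punctuation followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Tokenization: numbers (keeping internal periods), words, and punctuation.
tokens = [re.findall(r"\d+(?:\.\d+)?|\w+|[^\w\s]", s) for s in sentences]

for s, toks in zip(sentences, tokens):
    print(s, "->", toks)
```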

TOC for the user manual.

Something for your topic map authoring toolkit!

(Source: Jack Park)

Creating a Semantic Graph from Wikipedia

Creating a Semantic Graph from Wikipedia by Ryan Tanner, Trinity University.

Abstract:

With the continued need to organize and automate the use of data, solutions are needed to transform unstructured text into structured information. By treating dependency grammar functions as programming language functions, this process produces “property maps” which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. By inputting Wikipedia, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.

Of particular interest is Ryan’s approach to the problem:

Most approaches to this problem rely on extracting as much information as possible from a given input. My approach comes at the problem from the opposite direction and tries to extract a little bit of information very quickly but over an extremely large input set. My hypothesis is that by doing so a large collection of texts can be quickly processed while still yielding useful output.

A refreshing change from semantic orthodoxy that has a happy result.

Printing the thesis now for a close read.
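
In the meantime, here is a toy version of the "property map" idea as I read the abstract (not Ryan's code): relations extracted from text attach small facts to entities, and the per-entity maps merge into one browsable graph. The triples are invented for the example.

```python
# Illustrative only: merge (head, relation, dependent) triples, as a
# dependency parse might yield them, into per-entity property maps.
from collections import defaultdict

triples = [
    ("Napoleon", "born_in", "Corsica"),
    ("Napoleon", "crowned", "Emperor"),
    ("Corsica", "part_of", "France"),
]

graph = defaultdict(lambda: defaultdict(set))
for head, rel, dep in triples:
    graph[head][rel].add(dep)               # property map for the head entity
    graph[dep]["mentioned_with"].add(head)  # cheap back-link for browsing

for entity, props in graph.items():
    print(entity, dict((k, sorted(v)) for k, v in props.items()))
```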

(Source: Jack Park)

Managing Highly Connected Data in Neo4j

Filed under: Graphs,Neo4j — Patrick Durusau @ 3:34 pm

Managing Highly Connected Data in Neo4j by Jim Webber.

In this talk for the Progressive NOSQL Tutorials on Managing Highly Connected Data in Neo4j, Jim Webber will discuss how connected data is driving new classes of innovative applications and investigate the strengths and weaknesses of common NOSQL families for handling it. The focus is on the characteristics of Neo4j for managing connected data, and to reinforce how useful graphs are, we provide a rapid, code-focussed example using Neo4j covering the APIs for manipulating and traversing graphs. We’ll then use this knowledge to explore domains as disparate as social recommendations and the Doctor Who universe, using Neo4j to infer knowledge from connected, semi-structured data.
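
To make the "connected data" pitch concrete: a friend-of-a-friend recommendation is just a two-hop traversal. The sketch below uses a plain Python adjacency map rather than Neo4j's API, and the little social graph is invented.

```python
# Not Neo4j's API, just a plain-Python adjacency map: friend-of-a-friend
# recommendations as a two-hop traversal.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "eve"},
    "dave":  {"bob"},
    "eve":   {"carol"},
}

def recommend(person):
    """People two hops away who are not already direct friends."""
    direct = friends[person]
    two_hop = set()
    for friend in direct:
        two_hop |= friends[friend]
    return two_hop - direct - {person}

print(recommend("alice"))   # {'dave', 'eve'}
```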

An entertaining but sanitized view of data structures and database history.

Hypergraphs have been studied and used for decades.

To say nothing of Ted Nelson’s Project Xanadu.

Both of which built on older foundations.

I haven’t seen a presentation that traces those foundations, even partially. Something to think about.

Discussion of scholarly information in research blogs

Filed under: Bibliometrics,Blogs,Citation Analysis,Citation Indexing — Patrick Durusau @ 3:13 pm

Discussion of scholarly information in research blogs by Hadas Shema.

From the post:

As some of you know, Mike Thelwall, Judit Bar-Ilan (both are my dissertation advisors) and myself published an article called “Research Blogs and the Discussion of Scholarly Information” in PLoS One. Many people showed interest in the article, and I thought I’d write a “director’s commentary” post. Naturally, I’m saving all your tweets and blog posts for later research.

The Sample

We characterized 126 blogs with 135 authors from Researchblogging.Org (RB), an aggregator of blog posts dealing with peer-review research. Two over-achievers had two blogs each, and 11 blogs had two authors.

While our interest in research blogs started before we ever heard of RB, it was reading an article using RB that really kick-started the project. Groth & Gurney (2010) wrote an article titled “Studying scientific discourse on the Web using bibliometrics: A chemistry blogging case study.” The article made for a fascinating read, because it applied bibliometric methods to blogs. Just like it says in the title, Groth & Gurney took the references from 295 blog posts about Chemistry and analyzed them the way one would analyze citations from peer-reviewed articles. They managed that because they used RB, which aggregates only posts by bloggers who take the time to formally cite their sources. Major drooling ensued at that point. People citing in a scholarly manner out of their free will? It’s Christmas!

Questions that stand out for me on blogs:

Will our indexing/searching of blogs have the same all-or-nothing granularity as scholarly articles?

If not, why not?

June 2, 2012

A Competent CTO Can Say No

Filed under: Government,Government Data,Health care — Patrick Durusau @ 9:05 pm

Todd Park, CTO of the United States, should be saying no.

Todd has mandated six months for progress on:

  1. MyGov: Reimagine the relationship between the federal government and its citizens through an online footprint developed not just for the people, but also by the people.
  2. Open Data Initiatives: Stimulate a rising tide of innovation and entrepreneurship that utilizes government data to create tools that help Americans in numerous ways – e.g., apps and services that help people find the right health care provider, identify the college that provides the best value for their money, save money on electricity bills through smarter shopping, or keep their families safe by knowing which products have been recalled.
  3. Blue Button for America: Develop apps and create awareness of tools that help individuals get access to their personal health records — current medications and drug allergies, claims and treatment data, and lab reports – that can improve their health and healthcare.
  4. RFP-EZ: Build a platform that makes it easier for small high-growth businesses to navigate the federal government, and enables agencies to quickly source low-cost, high-impact information technology solutions.
  5. The 20% Campaign: Create a system that enables US government programs to seamlessly move from making cash payments to support foreign policy, development assistance, government operations or commercial activities to using electronic payments such as mobile devices, smart cards and other methods.

This is a classic “death march” pattern.

Having failed to make progress on any of these fronts in forty-two months, President Obama wants to mandate progress in six months.

Progress cannot be mandated and a competent CTO would say no. To the President and anyone who asks.

Progress is possible but only with proper scoping and requirements development.

Don’t further incompetence.

Take the pledge:

I refuse to apply for or if appointed to serve as a Presidential Innovation Fellow “…to deliver significant results in six months.” /s/ Patrick Durusau, Covington, Georgia, 2 June 2012.

(Details: US CTO seeks to scale agile thinking and open data across federal government)

TechAmerica Foundation Big Data Commission

Filed under: BigData,Government,Government Data — Patrick Durusau @ 9:04 pm

TechAmerica Foundation Big Data Commission

From the post:

Big Data Commission Launch

Data in the world is doubling every 18 months. Across government everyone is talking about the concept of Big Data, and how this new technology will transform the way Washington does business. But looking past the excitement, questions abound. What is Big Data, really? How is it defined? What capabilities are required to succeed? How do you use Big Data to make intelligent decisions? How will agencies effectively govern and secure huge volumes of information, while protecting privacy and civil liberties? And perhaps most importantly, what value will it really deliver to the US Government and the citizenry we serve?

To help answer these questions, and provide guidance to our Government’s senior policy and decision makers, TechAmerica is pleased to announce the formation of the Big Data Commission.

The Commission:

The Commission will be chaired by senior executives from IBM and SAP with vice chairs from Amazon and Wyle and will assemble 25-30 industry leaders, academia, along with a government advisory board with the objective of providing guidance on how Government Agencies should be leveraging Big Data to address their most critical business imperatives, and how Big Data can drive U.S. innovation and competitiveness.

Unlike Todd Park (soon to be former CTO of the United States) in A Competent CTO Can Say No, TechAmerica doesn’t promise “significant results” in six months.

Until the business imperatives of government agencies are understood, it isn’t possible for anyone, however well-intentioned or skilled, to give them useful advice.

Can’t say how well the commission will do at that task, to say nothing of determining what advice to give, but at least it isn’t starting with an arbitrary, election driven deadline.

New open data platform launches

Filed under: Government,Government Data,Graphics,Visualization — Patrick Durusau @ 6:25 pm

New open data platform launches

Kim Rees (of Flowing Data) writes:

Open data is everywhere. However, open data initiatives often manifest as mere CSV dumps on a forlorn web page. Junar, Lunfardo (Argentine slang) for “to know” or “to view,” seeks to help governments and organizations take the guesswork out of developing their own software for such efforts.

If you are looking to explore options for making data available, this is worth a stop.

It won’t make you an expert at data visualization, any more than a copy of Excel™ will make you a business analyst. But having the right tools for a job never hurts.

dipLODocus[RDF]

Filed under: RDF,Semantic Web — Patrick Durusau @ 6:17 pm

dipLODocus[RDF]

From the webpage:

dipLODocus[RDF] is a new system for RDF data processing supporting both simple transactional queries and complex analytics efficiently. dipLODocus[RDF] is based on a novel hybrid storage model considering RDF data both from a graph perspective (by storing RDF subgraphs or RDF molecules) and from a “vertical” analytics perspective (by storing compact lists of literal values for a given attribute).

Overview

Our system is built on three main structures: RDF molecule clusters (which can be seen as hybrid structures borrowing both from property tables and RDF subgraphs), template lists (storing literals in compact lists as in a column-oriented database system) and an efficient hash-table indexing URIs and literals based on the clusters they belong to.

The figure below (not reproduced here) gives a simple example of a few molecule clusters—storing information about students—and of a template list—compactly storing lists of student IDs. Molecules can be seen as horizontal structures storing information about a given object instance in the database (like rows in relational systems). Template lists, on the other hand, store vertical lists of values corresponding to one type of object (like columns in a relational system).

Interesting performance numbers:

  • 30x RDF-3X on LUBM queries
  • 350x Virtuoso on analytic queries

Combines data structures as opposed to adopting one single approach.

Perhaps data structures will be explored and optimized for data, rather than the other way around?
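
A toy of the hybrid idea as I read the description above (not the actual dipLODocus[RDF] code): the same triples kept once as per-subject "molecules" for record-style lookups and once as per-predicate "template lists" for column-style analytics. The student data is invented.

```python
# Illustrative only: store the same triples in two complementary views.
triples = [
    ("student1", "name", "Ann"),
    ("student1", "id", 101),
    ("student2", "name", "Bo"),
    ("student2", "id", 102),
]

# Molecule view: everything about one subject, like a row.
molecules = {}
for s, p, o in triples:
    molecules.setdefault(s, {})[p] = o

# Template-list view: all values of one predicate, like a column.
template_lists = {}
for s, p, o in triples:
    template_lists.setdefault(p, []).append(o)

print(molecules["student1"])    # {'name': 'Ann', 'id': 101}
print(template_lists["id"])     # [101, 102] -> cheap aggregate scans
print(sum(template_lists["id"]) / len(template_lists["id"]))  # analytics
```

The price, of course, is storing everything twice and keeping the two views in sync.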

dipLODocus[RDF] | Short and Long-Tail RDF Analytics for Massive Webs of Data by Marcin Wylot, Jigé Pont, Mariusz Wisniewski, and Philippe Cudré-Mauroux (paper – PDF).

I first saw this at the SemanticWeb.com.

Blockly – A Visual Programming Language

Filed under: Blockly,Programming — Patrick Durusau @ 4:44 pm

Blockly – A Visual Programming Language

From the web page:

Blockly is a web-based, graphical programming language. Users can drag blocks together to build an application. No typing required.

This proposal or something quite close to it has potential for a graphical subject identity or merging language.

Or does it?

You hit the wall with graphical, rubber-ducky sort of languages when you need a block that hasn’t been specified.

Doesn’t bode well for a general visual subject identity language. Possible for a specific domain.

Types of motor vehicles, aircraft, or watercraft distinguished by outline, for example.

Perhaps the most common subjects could have icons, supplemented with a small number of pre-defined key/value pairs.

Could work, would depend on the domain and skill of the users.
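
For what the underlying operation might look like, here is a purely illustrative sketch (not a proposed standard): topics carry identifying key/value pairs, and topics whose pairs overlap are merged. Whatever blocks a visual language offered would ultimately have to express something like this. The VINs and names are invented.

```python
# Illustrative only: merge topics that share an identifying key/value pair.
def merge_by_identity(topics):
    merged = []
    for topic in topics:
        for existing in merged:
            if topic["identity"] & existing["identity"]:   # shared identifier
                existing["identity"] |= topic["identity"]
                existing["names"] |= topic["names"]
                break
        else:
            merged.append({"identity": set(topic["identity"]),
                           "names": set(topic["names"])})
    return merged

topics = [
    {"identity": {("vin", "1HGCM82633A004352")}, "names": {"my car"}},
    {"identity": {("vin", "1HGCM82633A004352")}, "names": {"Honda Accord"}},
    {"identity": {("vin", "5YJSA1E26JF000001")}, "names": {"neighbor's car"}},
]
print(merge_by_identity(topics))   # first two topics collapse into one
```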

Depixelizing Pixel Art

Filed under: Graphics,Visualization — Patrick Durusau @ 4:13 pm

Depixelizing Pixel Art by Johannes Kopf and Dani Lischinski.

Abstract:

We describe a novel algorithm for extracting a resolution-independent vector representation from pixel art images, which enables magnifying the results by an arbitrary amount without image degradation. Our algorithm resolves pixel-scale features in the input and converts them into regions with smoothly varying shading that are crisply separated by piecewise-smooth contour curves. In the original image, pixels are represented on a square pixel lattice, where diagonal neighbors are only connected through a single point. This causes thin features to become visually disconnected under magnification by conventional means, and it causes connectedness and separation of diagonal neighbors to be ambiguous. The key to our algorithm is in resolving these ambiguities. This enables us to reshape the pixel cells so that neighboring pixels belonging to the same feature are connected through edges, thereby preserving the feature connectivity under magnification. We reduce pixel aliasing artifacts and improve smoothness by fitting spline curves to contours in the image and optimizing their control points.

Strikes me as the inverse of the thinning you see in: Split a Single-Pixel-Width Connected Line Graph Into Line Segments by The Hit-and-Miss Transformation.

What do you think?

Are there pixel representations other than pixel art where smoothing of the visual representation would be useful?

High-Performance Domain-Specific Languages using Delite

Filed under: Delite,DSL,Machine Learning,Parallel Programming,Scala — Patrick Durusau @ 12:50 pm

High-Performance Domain-Specific Languages using Delite

Description:

This tutorial is an introduction to developing domain specific languages (DSLs) for productivity and performance using Delite. Delite is a Scala infrastructure that simplifies the process of implementing DSLs for parallel computation. The goal of this tutorial is to equip attendees with the knowledge and tools to develop DSLs that can dramatically improve the experience of using high performance computation in important scientific and engineering domains. In the first half of the day we will focus on example DSLs that provide both high-productivity and performance. In the second half of the day we will focus on understanding the infrastructure for implementing DSLs in Scala and developing techniques for defining good DSLs.

The graph manipulation language Green-Marl is one of the subjects of this tutorial.

This resource should be located and “boosted” by a search engine tuned to my preferences.

Skipping breaks, etc., you will find:

  • Introduction To High Performance DSLs (Kunle Olukotun)
  • OptiML: A DSL for Machine Learning (Arvind Sujeeth)
  • Liszt: A DSL for solving mesh-based PDEs (Zach Devito)
  • Green-Marl: A DSL for efficient Graph Analysis (Sungpack Hong)
  • Scala Tutorial (Hassan Chafi)
  • Delite DSL Infrastructure Overview (Kevin Brown)
  • High Performance DSL Implementation Using Delite (Arvind Sujeeth)
  • Future Directions in DSL Research (Hassan Chafi)

Compare your desktop computer to the MANIAC 1 (calculations for the first hydrogen bomb).

What have you invented/discovered lately?

Green-Marl

Filed under: DSL,Graphs,Green-Marl — Patrick Durusau @ 10:54 am

Green-Marl

From the website:

Green-Marl [1] is a domain-specific language that is specially designed for graph data analysis. For the further information for the Green-Marl language, refer to the language specification draft [2], which can also be found in this directory in the source package.

‘gm_comp’ is a compiler for Green-Marl. It reads a Green-Marl file and generates an equivalent, efficient and parallelized C++ implementation, i.e. .cc file. More specifically, the compiler produces a C++ function for each Green-Marl procedure. The generated c++ functions can be compiled with gcc and therefore can be merged into any user application that are compilable with gcc.

The C++ codes that are generated by ‘gm_comp’ assume the following libraries:

  • gcc (with builtin atomic functions)
  • gcc (with OpenMp support)
  • a custom graph library and runtime (gm_graph)

The first two are supported by any recent gcc distributions (version 4.2 or higher); the third one is included in this source package.

‘gm_comp’ is also able to generate codes for a completely different target environment (See Section 5).

This is the sort of resource that should appear in a daily “update” about topic map relevant material on the WWW or in the published literature.

The paper, Green-Marl: A DSL for Easy and Efficient Graph Analysis (ASPLOS 2012), by Sungpack Hong, Hassan Chafi and Eric Sedlar, is quite good.

I first saw Green-Marl at Pete Warden’s Five Short Links.

Social Meets Search with the Latest Version of Bing…

Filed under: Search Engines,Searching,Social Media — Patrick Durusau @ 10:29 am

Social Meets Search with the Latest Version of Bing…

Two things are obvious:

  • I am running a day behind.
  • Bing isn’t my default search engine. (Or I would have noticed this yesterday.)

From the post:

A few weeks ago, we introduced you to the most significant update to Bing since our launch three years ago, combining the best of search with relevant people from your social networks, including Facebook and Twitter. After the positive response to the preview, the new version of Bing is available today in the US at www.bing.com. You can now access Bing’s new three-column design, including the snapshot feature and social features.

According to a recent internal survey, nearly 75% of people spend more time than they would like searching for information online. With Bing’s new design, you can access information from the Web including friends you do know and relevant experts that you may not know, letting you spend less time searching and more time doing.

(screenshot omitted)

Today, we’re also unveiling a new advertising campaign to support the introduction of search plus social and announcing the Bing Summer of Doing, in celebration of the new features and designed to inspire people to do amazing things this summer.

BTW, I have corrected the broken HTML link for Bing in the quoted post: www.bing.com.

When I arrived, the “top” searches were:

  • Nazi parents
  • Hosni Mubarak

“Popular” searches ranging from the inane to the irrelevant.

I need something a bit more focused on subjects of interest to me.

Perhaps automated queries that are filtered, then processed into a topic map?

Something to think about over the summer. More posts to follow on that theme.

Fuzzy machine learning framework v1.2

Filed under: Fuzzy Logic,Machine Learning — Patrick Durusau @ 9:48 am

Fuzzy machine learning framework v1.2

From the announcement:

The software is a library as well as a GTK GUI front-end for machine learning projects. Features:

  • Based on intuitionistic fuzzy sets and the possibility theory;
  • Features are fuzzy;
  • Fuzzy classes, which may intersect and can be treated as features;
  • Numeric, enumeration features and ones based on linguistic variables;
  • Derived and evaluated features;
  • Classifiers as features for building hierarchical systems;
  • User-defined features;
  • An automatic classification refinement in case of dependent features;
  • Incremental learning;
  • Object-oriented software design;
  • Features, training sets and classifiers are extensible objects;
  • Automatic garbage collection;
  • Generic data base support (through ODBC);
  • Text I/O and HTML routines for features, training sets and classifiers;
  • GTK+ widgets for features, training sets and classifiers;
  • Examples of use.

This release is packaged for Windows, Fedora (yum) and Debian (apt). The software is public domain (licensed under GM GPL).

http://www.dmitry-kazakov.de/ada/fuzzy_ml.htm
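
For readers new to the "intuitionistic fuzzy sets" in the feature list, a back-of-the-envelope illustration (nothing to do with this library's own Ada API): each element gets a membership degree mu and a non-membership degree nu with mu + nu <= 1, and the leftover is the hesitation margin. The spam-scoring numbers are invented.

```python
# Intuitionistic fuzzy membership: mu (membership) and nu (non-membership)
# with mu + nu <= 1; pi = 1 - mu - nu is the hesitation margin.
def hesitation(mu, nu):
    assert 0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0 and mu + nu <= 1.0
    return 1.0 - mu - nu

# "Is this email spam?" as judged by an uncertain classifier.
samples = {"msg1": (0.7, 0.2), "msg2": (0.4, 0.4), "msg3": (0.1, 0.85)}
for name, (mu, nu) in samples.items():
    print(name, "membership", mu, "non-membership", nu,
          "hesitation", round(hesitation(mu, nu), 2))
```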

Unless you have time to waste, I would skip the religious discussion about licensing options.

For IP issues, hire lawyers, not programmers.

June 1, 2012

Extracting Conflict-free Information from Multi-labeled Trees

Filed under: Trees — Patrick Durusau @ 3:03 pm

Extracting Conflict-free Information from Multi-labeled Trees by Akshay Deepak, David Fernández-Baca, and Michelle M. McMahon.

Abstract:

A multi-labeled tree, or MUL-tree, is a phylogenetic tree where two or more leaves share a label, e.g., a species name. A MUL-tree can imply multiple conflicting phylogenetic relationships for the same set of taxa, but can also contain conflict-free information that is of interest and yet is not obvious. We define the information content of a MUL-tree T as the set of all conflict-free quartet topologies implied by T, and define the maximal reduced form of T as the smallest tree that can be obtained from T by pruning leaves and contracting edges while retaining the same information content. We show that any two MUL-trees with the same information content exhibit the same reduced form. This introduces an equivalence relation in MUL-trees with potential applications to comparing MUL-trees. We present an efficient algorithm to reduce a MUL-tree to its maximally reduced form and evaluate its performance on empirical datasets in terms of both quality of the reduced tree and the degree of data reduction achieved.

You may not agree with:

That is, for every MUL-tree $T$ there exists a singly-labeled tree that displays all the conflict-free quartets of $T$ — and possibly some other quartets as well. Motivated by this, we only view conflict-free quartet topologies as informative, and define the information content of a MUL-tree as the set of all conflict-free quartet topologies it implies.

I would prefer to view conflicts as information content, but to each his own.

I suspect “multi-labeled” trees are more common than one might expect.

Other examples?
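
For concreteness, here is the smallest possible MUL-tree sketch (illustrative only; the paper's quartet and reduction machinery goes far beyond this): a nested-tuple tree in which one leaf label occurs twice.

```python
# Detect which leaf labels are duplicated, i.e. what makes a tree a MUL-tree.
from collections import Counter

# ((a, b), (a, (c, d))) -- label 'a' appears on two leaves.
mul_tree = (("a", "b"), ("a", ("c", "d")))

def leaves(tree):
    if isinstance(tree, tuple):
        for child in tree:
            yield from leaves(child)
    else:
        yield tree

counts = Counter(leaves(mul_tree))
duplicated = [label for label, n in counts.items() if n > 1]
print("leaf labels:", dict(counts))   # {'a': 2, 'b': 1, 'c': 1, 'd': 1}
print("multi-labels:", duplicated)    # ['a']
```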

Are You Going to Balisage?

Filed under: Conferences,RDF,RDFa,Semantic Web,XML,XML Database,XML Schema,XPath,XQuery,XSLT — Patrick Durusau @ 2:48 pm

To the tune of “Are You Going to Scarborough Fair:”

Are you going to Balisage?
Parsley, sage, rosemary and thyme.
Remember me to one who is there,
she once was a true love of mine.

Tell her to make me an XML shirt,
Parsley, sage, rosemary, and thyme;
Without any seam or binary code,
Then she shall be a true lover of mine.

….

Oh, sorry! There you will see:

  • higher-order functions in XSLT
  • Schematron to enforce consistency constraints
  • relation of the XML stack (the XDM data model) to JSON
  • integrating JSON support into XDM-based technologies like XPath, XQuery, and XSLT
  • XML and non-XML syntaxes for programming languages and documents
  • type introspection in XQuery
  • using XML to control processing in a document management system
  • standardizing use of XQuery to support RESTful web interfaces
  • RDF to record relations among TEI documents
  • high-performance knowledge management system using an XML database
  • a corpus of overlap samples
  • an XSLT pipeline to translate non-XML markup for overlap into XML
  • comparative entropy of various representations of XML
  • interoperability of XML in web browsers
  • XSLT extension functions to validate OCL constraints in UML models
  • ontological analysis of documents
  • statistical methods for exploring large collections of XML data

Balisage is an annual conference devoted to the theory and practice of descriptive markup and related technologies for structuring and managing information. Participants typically include XML users, librarians, archivists, computer scientists, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, members of the working groups which define the specifications, academics, industrial researchers, representatives of governmental bodies and NGOs, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Discussion is open, candid, and unashamedly technical.

The Balisage 2012 Program is now available at: http://www.balisage.net/2012/Program.html

A Computable Universe,
Understanding Computation and
Exploring Nature As Computation

Filed under: Cellular Automata,Computation,Computer Science — Patrick Durusau @ 9:53 am

Foreword: A Computable Universe, Understanding Computation and Exploring Nature As Computation by Roger Penrose.

Abstract:

I am most honoured to have the privilege to present the Foreword to this fascinating and wonderfully varied collection of contributions, concerning the nature of computation and of its deep connection with the operation of those basic laws, known or yet unknown, governing the universe in which we live. Fundamentally deep questions are indeed being grappled with here, and the fact that we find so many different viewpoints is something to be expected, since, in truth, we know little about the foundational nature and origins of these basic laws, despite the immense precision that we so often find revealed in them. Accordingly, it is not surprising that within the viewpoints expressed here is some unabashed speculation, occasionally bordering on just partially justified guesswork, while elsewhere we find a good deal of precise reasoning, some in the form of rigorous mathematical theorems. Both of these are as should be, for without some inspired guesswork we cannot have new ideas as to where to look in order to make genuinely new progress, and without precise mathematical reasoning, no less than in precise observation, we cannot know when we are right — or, more usually, when we are wrong.

An unlikely volume to search for data mining or semantic modeling algorithms or patterns.

But one that should be read for the mental exercise/discipline of its reading.

The asking price of $138 (US) promises a limited readership.

Plus a greatly diminished impact.

When asked to participate in collections, scholars/authors should ask themselves:

How many books have I read from publisher X?*

*Read, not cited, is the appropriate test. Make your decision appropriately.


If republished as an accessible paperback, may I suggest: “Exploring the Nature of Computation”?

The committee title makes the collage nature of the volume a bit too obvious.

Linked Data Patterns

Filed under: Linked Data — Patrick Durusau @ 8:57 am

Linked Data Patterns by Leigh Dodds and Ian Davis.

Leigh Dodds posted an email note this morning to announce a new revision:

There have been a number of revisions across the pattern catalogue, including addition of new introductory sections to each chapter. There are a total of 12 new patterns, many of which cover data management patterns relating to use of named graphs.

Without a local copy of the previous version, I can’t say which patterns have been added.

Obvious version information (other than date) on the cover “page” and access to prior versions would be a real plus.

Graph Processing Berlin

Filed under: Conferences,Graphs — Patrick Durusau @ 7:17 am

Graph Processing Berlin

From the sign-up/homepage:

Are you curious and interested to learn more about Graph Processing technologies? Are you willing to visit the coolest city in Europe right now? or are you living in town?

If the answer to that questions is positive, we are planning to launch a one day workshop about Graph Processing in Berlin for the last months of this year.

There you will learn the basics for working your graph from the best guys who build, work or collaborate with the most important projects on the field right now.

Our focus will be technologies like Neo4j, Giraph and Hadoop, OrientDB, with a high focus on applications like Recommendations systems, Graph Theory, Analytics, etc…

Register here, and we will send you more information as soon as we reach the minimum level of audience!

Send it to your colleges and friends, as much as we are more fun it will be!

The Graph Processing Berlin team.

If you register you will see:

Relationships are a two way street. You are asking people to give you their recommendation, so what are you giving them?

Asking you for email addresses to advertise the conference.

Today I got an email saying I can track the number of people whose email addresses I have entered.

#1 Good conferences don’t need guilt-tripping (“…what are you giving them?”) to get viral advertising.

#2 Spamming my friends isn’t being viral, just vulgar.

The conference may be a good one. Let’s hope its marketing mis-steps are just that, mis-steps.
