Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

August 28, 2012

Center for Intelligent Information Retrieval (CIIR) [University of Massachusetts Amherst]

Filed under: Heterogeneous Data,Information Retrieval,Multimedia — Patrick Durusau @ 1:56 pm

Center for Intelligent Information Retrieval (CIIR)

From the webpage:

The Center for Intelligent Information Retrieval (CIIR) is one of the leading research groups working in the areas of information retrieval and information extraction. The CIIR studies and develops tools that provide effective and efficient access to large networks of heterogeneous, multimedia information.

CIIR accomplishments include significant research advances in the areas of retrieval models, distributed information retrieval, information filtering, information extraction, topic models, social network analysis, multimedia indexing and retrieval, document image processing, search engine architecture, text mining, structured data retrieval, summarization, evaluation, novelty detection, resource discovery, interfaces and visualization, digital libraries, computational social science, and cross-lingual information retrieval.

The CIIR has published more than 900 papers on these areas, and has worked with over 90 government and industry partners on research and technology transfer. Open source software supported by the Center is being used worldwide.

Please contact us to talk about potential new projects, collaborations, membership, or joining us as a graduate student or visiting researcher.

To get an idea of the range of their activities, visit the publications page and just browse.

Misconceptions holding back use of data integration tools [Selling tools or data integration?]

Filed under: Data Integration,Marketing — Patrick Durusau @ 1:23 pm

Misconceptions holding back use of data integration tools by Rick Sherman.

From the post:

There’s no question that data integration technology is a good thing. So why aren’t businesses using it as much as they should be?

Data integration software has evolved significantly from the days when it primarily consisted of extract, transform and load (ETL) tools. The technologies available now can automate the process of integrating data from source systems around the world in real time if that’s what companies want. Data integration tools can also increase IT productivity and make it easier to incorporate new data sources into data warehouses and business intelligence (BI) systems for users to analyze.

But despite tremendous gains in the capabilities and performance of data integration tools, as well as expanded offerings in the marketplace, many of the data integration projects in corporate enterprises are still being done through manual coding methods that are inefficient and often not documented. As a result, most companies haven’t gained the productivity and code-reuse benefits that automated data integration processes offer. Instead, they’re deluged with an ever-expanding backlog of data integration work, including the need to continually update and fix older, manually coded integration programs.

Rick’s first sentence captures the problem with promoting data integration:

“There’s no question that data integration technology is a good thing.”

Hypothetical survey of Fortune 1,000 CEOs:

Question                                        Agree     Disagree
Data integration may be a good thing            100%      0%
Data integration technology is a good thing     0.001%    99.999%

Data integration may be a good thing. Whether it is depends on what goal or mission the integration furthers.

Data integration, whether by hand coding or data mining tools, isn’t an end unto itself; it is only a means to an end.

Specific data integration, tied to a mission or goal of an organization, has a value to be evaluated against the cost of the tool or service.

Otherwise, we are selling tools of no particular value for some unknown purpose.

Sounds like a misconception of the sales process to me.

‘The Algorithm That Runs the World’ [Optimization, Identity and Polytopes]

Filed under: Algorithms,Dimensions,Identification,Identity,Polytopes — Patrick Durusau @ 12:28 pm

“The Algorithm That Runs the World” by Erwin Gianchandani.

From the post:

New Scientist published a great story last week describing the history and evolution of the simplex algorithm — complete with a table capturing “2000 years of algorithms”:

The simplex algorithm directs wares to their destinations the world over [image courtesy PlainPicture/Gozooma via New Scientist]. Its services are called upon thousands of times a second to ensure the world’s business runs smoothly — but are its mathematics as dependable as we thought?

You might not have heard of the algorithm that runs the world. Few people have, though it can determine much that goes on in our day-to-day lives: the food we have to eat, our schedule at work, when the train will come to take us there. Somewhere, in some server basement right now, it is probably working on some aspect of your life tomorrow, next week, in a year’s time.

Perhaps ignorance of the algorithm’s workings is bliss. The door to Plato’s Academy in ancient Athens is said to have borne the legend “let no one ignorant of geometry enter”. That was easy enough to say back then, when geometry was firmly grounded in the three dimensions of space our brains were built to cope with. But the algorithm operates in altogether higher planes. Four, five, thousands or even many millions of dimensions: these are the unimaginable spaces the algorithm’s series of mathematical instructions was devised to probe.

Perhaps, though, we should try a little harder to get our heads round it. Because powerful though it undoubtedly is, the algorithm is running into a spot of bother. Its mathematical underpinnings, though not yet structurally unsound, are beginning to crumble at the edges. With so much resting on it, the algorithm may not be quite as dependable as it once seemed [more following the link].

A fund manager might similarly want to arrange a portfolio optimally to balance risk and expected return over a range of stocks; a railway timetabler to decide how best to roster staff and trains; or a factory or hospital manager to work out how to juggle finite machine resources or ward space. Each such problem can be depicted as a geometrical shape whose number of dimensions is the number of variables in the problem, and whose boundaries are delineated by whatever constraints there are (see diagram). In each case, we need to box our way through this polytope towards its optimal point.

This is the job of the algorithm.

Its full name is the simplex algorithm, and it emerged in the late 1940s from the work of the US mathematician George Dantzig, who had spent the second world war investigating ways to increase the logistical efficiency of the U.S. air force. Dantzig was a pioneer in the field of what he called linear programming, which uses the mathematics of multidimensional polytopes to solve optimisation problems. One of the first insights he arrived at was that the optimum value of the “target function” — the thing we want to maximise or minimise, be that profit, travelling time or whatever — is guaranteed to lie at one of the corners of the polytope. This instantly makes things much more tractable: there are infinitely many points within any polytope, but only ever a finite number of corners.

If we have just a few dimensions and constraints to play with, this fact is all we need. We can feel our way along the edges of the polytope, testing the value of the target function at every corner until we find its sweet spot. But things rapidly escalate. Even just a 10-dimensional problem with 50 constraints — perhaps trying to assign a schedule of work to 10 people with different expertise and time constraints — may already land us with several billion corners to try out.
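Dantzig’s corner insight is easy to try for yourself. Here is a hedged sketch (my own toy example, not the simplex method itself) that enumerates the corners of a small 2-D polytope and picks the one where the target function is largest:

```python
from itertools import combinations

# Toy illustration of Dantzig's insight: the optimum of a linear
# target function over a polytope lies at one of its corners, so for
# a tiny 2-D problem we can simply enumerate corners and compare.
#
# maximize 2x + 3y subject to:
#   x + y <= 4,  x + 3y <= 6,  x >= 0,  y >= 0
A = [(1, 1), (1, 3), (-1, 0), (0, -1)]   # constraint coefficients
rhs = [4, 6, 0, 0]                       # right-hand sides
c = (2, 3)                               # the target function

def corner(i, j):
    """Intersection of the boundaries of constraints i and j, if any."""
    (a1, b1), (a2, b2) = A[i], A[j]
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None                      # parallel boundaries
    x = (rhs[i] * b2 - rhs[j] * b1) / det
    y = (a1 * rhs[j] - a2 * rhs[i]) / det
    return (x, y)

def feasible(p, eps=1e-9):
    return all(a * p[0] + b * p[1] <= r + eps
               for (a, b), r in zip(A, rhs))

corners = [p for i, j in combinations(range(len(A)), 2)
           if (p := corner(i, j)) is not None and feasible(p)]

best = max(corners, key=lambda p: c[0] * p[0] + c[1] * p[1])
print(best)  # -> (3.0, 1.0), where 2x + 3y = 9
```

The escalation the article describes is exactly why the simplex algorithm walks along edges from corner to corner rather than enumerating every corner as this toy does.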

Apologies but I saw this article too late to post within the “free” days allowed by New Scientist.

But, I think from Erwin’s post and long quote from the original article, you can see how the simplex algorithm may be very useful where identity is defined in multidimensional space.

The literature in this area is vast and it may not offer an appropriate test for all questions of subject identity.

For example, the possessor of a credit card is presumed to be its owner. Other assumptions are possible, but fraud costs are recouped from fees paid by customers, which creates little interest in more stringent identity tests.

On the other hand, if your situation requires multidimensional identity measures, this may be a useful approach.


PS: Be aware that naming confusion, the sort that can be managed (not solved) by topic maps abounds even in mathematics:

The elements of a polytope are its vertices, edges, faces, cells and so on. The terminology for these is not entirely consistent across different authors. To give just a few examples: Some authors use face to refer to an (n−1)-dimensional element while others use face to denote a 2-face specifically, and others use j-face or k-face to indicate an element of j or k dimensions. Some sources use edge to refer to a ridge, while H. S. M. Coxeter uses cell to denote an (n−1)-dimensional element. (Polytope)
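Managing (not solving) that confusion can be as simple as recording each author’s usage against a neutral description, topic-map style. A hypothetical sketch, with made-up author labels standing in for real sources:

```python
# A toy mapping from (term, author) pairs to a neutral description of
# what the term denotes -- the divergent usages become explicit data
# instead of a silent source of error. Author labels are hypothetical.
usages = {
    ("face", "author A"): "element of dimension n-1",
    ("face", "author B"): "element of dimension 2",
    ("cell", "Coxeter"): "element of dimension n-1",
    ("edge", "some sources"): "element of dimension n-2 (a ridge)",
}

def what_is_meant(term, author):
    """Resolve a term relative to who is using it."""
    return usages.get((term, author), "unknown usage")

print(what_is_meant("cell", "Coxeter"))
print(what_is_meant("face", "author B"))
```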

23 Mathematical Challenges [DARPA – A Modest Challenge]

Filed under: Challenges,Mathematics,Mathematics Indexing — Patrick Durusau @ 10:50 am

23 Mathematical Challenges [DARPA]

From the webpage:

Discovering novel mathematics will enable the development of new tools to change the way the DoD approaches analysis, modeling and prediction, new materials and physical and biological sciences. The 23 Mathematical Challenges program involves individual researchers and small teams who are addressing one or more of the following 23 mathematical challenges, which if successfully met, could provide revolutionary new techniques to meet the long-term needs of the DoD:

  • Mathematical Challenge 1: The Mathematics of the Brain
  • Mathematical Challenge 2: The Dynamics of Networks
  • Mathematical Challenge 3: Capture and Harness Stochasticity in Nature
  • Mathematical Challenge 4: 21st Century Fluids
  • Mathematical Challenge 5: Biological Quantum Field Theory
  • Mathematical Challenge 6: Computational Duality
  • Mathematical Challenge 7: Occam’s Razor in Many Dimensions
  • Mathematical Challenge 8: Beyond Convex Optimization
  • Mathematical Challenge 9: What are the Physical Consequences of Perelman’s Proof of Thurston’s Geometrization Theorem?
  • Mathematical Challenge 10: Algorithmic Origami and Biology
  • Mathematical Challenge 11: Optimal Nanostructures
  • Mathematical Challenge 12: The Mathematics of Quantum Computing, Algorithms, and Entanglement
  • Mathematical Challenge 13: Creating a Game Theory that Scales
  • Mathematical Challenge 14: An Information Theory for Virus Evolution
  • Mathematical Challenge 15: The Geometry of Genome Space
  • Mathematical Challenge 16: What are the Symmetries and Action Principles for Biology?
  • Mathematical Challenge 17: Geometric Langlands and Quantum Physics
  • Mathematical Challenge 18: Arithmetic Langlands, Topology and Geometry
  • Mathematical Challenge 19: Settle the Riemann Hypothesis
  • Mathematical Challenge 20: Computation at Scale
  • Mathematical Challenge 21: Settle the Hodge Conjecture
  • Mathematical Challenge 22: Settle the Smooth Poincare Conjecture in Dimension 4
  • Mathematical Challenge 23: What are the Fundamental Laws of Biology?

(Details of each challenge omitted. See the webpage for descriptions.)

Worthy mathematical challenges all, but what about a more modest challenge? One that may help solve a larger one?

Such as cutting across the terminology barriers of approaches and fields of mathematics to collate the prior, present and ongoing research on each of these challenges?

Not only would the curated artifact be useful to researchers, but the act of curation, the reading and mapping of what is known on a particular problem, could spark new approaches to the main problem as well.

DARPA should consider a history curation project on one or more of these challenges.

Such a project could produce a useful information artifact for researchers, train math graduate students in searching across approaches/fields, and might trigger a creative insight into a possible challenge solution.

I first saw this at Beyond Search: DARPA May Be Hilbert

August 27, 2012

K-Nearest-Neighbors and Handwritten Digit Classification

Filed under: Classification,Clustering,K-Nearest-Neighbors — Patrick Durusau @ 6:36 pm

K-Nearest-Neighbors and Handwritten Digit Classification by Jeremy Kun.

From the post:

The Recipe for Classification

One important task in machine learning is to classify data into one of a fixed number of classes. For instance, one might want to discriminate between useful email and unsolicited spam. Or one might wish to determine the species of a beetle based on its physical attributes, such as weight, color, and mandible length. These “attributes” are often called “features” in the world of machine learning, and they often correspond to dimensions when interpreted in the framework of linear algebra. As an interesting warm-up question for the reader, what would be the features for an email message? There are certainly many correct answers.

The typical way of having a program classify things goes by the name of supervised learning. Specifically, we provide a set of already-classified data as input to a training algorithm, the training algorithm produces an internal representation of the problem (a model, as statisticians like to say), and a separate classification algorithm uses that internal representation to classify new data. The training phase is usually complex and the classification algorithm simple, although that won’t be true for the method we explore in this post.

More often than not, the input data for the training algorithm are converted in some reasonable way to a numerical representation. This is not as easy as it sounds. We’ll investigate one pitfall of the conversion process in this post, but in doing this we separate the data from the application domain in a way that permits mathematical analysis. We may focus our questions on the data and not on the problem. Indeed, this is the basic recipe of applied mathematics: extract from a problem the essence of the question you wish to answer, answer the question in the pure world of mathematics, and then interpret the results.

We’ve investigated data-oriented questions on this blog before, such as, “is the data linearly separable?” In our post on the perceptron algorithm, we derived an algorithm for finding a line which separates all of the points in one class from the points in the other, assuming one exists. In this post, however, we make a different structural assumption. Namely, we assume that data points which are in the same class are also close together with respect to an appropriate metric. Since this is such a key point, it bears repetition and elevation in the typical mathematical fashion. The reader should note the following is not standard terminology, and it is simply a mathematical restatement of what we’ve already said.

Modulo my concerns about assigning non-metric data to metric spaces, this is a very good post on classification.
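For readers who want to experiment before working through Jeremy’s post, here is a minimal sketch of the classifier under discussion, on my own toy data rather than the post’s handwritten digits. It leans on exactly the structural assumption quoted above: points in the same class are close under the chosen metric.

```python
from collections import Counter
from math import dist  # Euclidean metric (Python 3.8+)

def knn_classify(train, query, k=3, metric=dist):
    """train: list of (point, label) pairs. Returns the majority
    label among the k training points nearest to query."""
    neighbors = sorted(train, key=lambda pl: metric(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# two well-separated toy clusters
train = [((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
         ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b")]

print(knn_classify(train, (2, 2)))  # -> a
print(knn_classify(train, (7, 8)))  # -> b
```

Swapping in a different `metric` function is the whole point of the post’s later discussion: the classifier is only as good as the distance it measures with.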

HAIL – Only Aggressive Elephants are Fast Elephants

Filed under: Hadoop,Indexing — Patrick Durusau @ 6:26 pm

HAIL – Only Aggressive Elephants are Fast Elephants

From the post:

Typically we store data based on any one of the different physical layouts (such as row, column, vertical, PAX etc). And this choice determines its suitability for a certain kind of workload while making it less optimal for other kinds of workloads. Can we store data under different layouts at the same time? Especially within a HDFS environment where each block is replicated a few times. This is the big idea that HAIL (Hadoop Aggressive Indexing Library) pursues.

At a very high level, it looks like to understand the workings of HAIL we will have to look at the three distinct workflows the system is organized around, namely:

  1. The data/file upload pipeline
  2. The indexing pipeline
  3. The query pipeline

Every unit of information makes its journey through these three pipelines.

Be sure to see the original paper.
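To make the “different layouts at the same time” idea concrete, here is a hypothetical miniature (names and structure are mine, not HAIL’s actual code): the same records kept in both a row layout and a column layout, with a single-attribute query routed to the columnar copy.

```python
# HAIL's big idea, caricatured: HDFS keeps several replicas of each
# block anyway, so keep each replica in a *different* physical layout
# and route each job to the friendliest one.
records = [
    {"uri": "/a", "size": 10},
    {"uri": "/b", "size": 20},
    {"uri": "/c", "size": 30},
]

# replica 1: row layout (good for whole-record scans)
row_replica = list(records)

# replica 2: column layout (good for single-attribute scans)
col_replica = {key: [r[key] for r in records] for key in records[0]}

# a query over one attribute touches only the columnar replica
total_size = sum(col_replica["size"])
print(total_size)  # 60
```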

How much of what we “know” about modeling is driven by the needs of ancestral storage layouts?

Given the performance of modern chips, are those “needs” still valid considerations?

Or perhaps better, at what size data store or processing requirement do the physical storage model needs re-assert themselves?

Not just a performance question but also one of uniformity of identification.

What was once a “performance” requirement, that data have some common system of identification, may no longer be the case.

POSIX Threads Programming

Filed under: Parallel Programming,POSIX Threads,Programming — Patrick Durusau @ 2:58 pm

POSIX Threads Programming by Blaise Barney, Lawrence Livermore National Laboratory

From the webpage:

In shared memory multiprocessor architectures, such as SMPs, threads can be used to implement parallelism. Historically, hardware vendors have implemented their own proprietary versions of threads, making portability a concern for software developers. For UNIX systems, a standardized C language threads programming interface has been specified by the IEEE POSIX 1003.1c standard. Implementations that adhere to this standard are referred to as POSIX threads, or Pthreads.

The tutorial begins with an introduction to concepts, motivations, and design considerations for using Pthreads. Each of the three major classes of routines in the Pthreads API are then covered: Thread Management, Mutex Variables, and Condition Variables. Example codes are used throughout to demonstrate how to use most of the Pthreads routines needed by a new Pthreads programmer. The tutorial concludes with a discussion of LLNL specifics and how to mix MPI with pthreads. A lab exercise, with numerous example codes (C Language) is also included.

Level/Prerequisites: This tutorial is one of the eight tutorials in the 4+ day “Using LLNL’s Supercomputers” workshop. It is ideal for those who are new to parallel programming with threads. A basic understanding of parallel programming in C is required. For those who are unfamiliar with Parallel Programming in general, the material covered in EC3500: Introduction To Parallel Computing would be helpful.

The capacity for parallelism in computing has become commonplace. How well that parallelism is being used is a much more difficult question.

Fortunately, exploration of parallelism isn’t limited to cloistered and carefully guarded CS installations. It is quite likely that the computer on your desk has some capacity for parallel processing.

Not enough to simulate the origin of the universe or an atomic bomb explosion but enough to learn the basics of parallelism. You may discover insights that have been overlooked by others.

Won’t know unless you try.
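As a first such experiment, the tutorial’s core concepts, mutexes and condition variables in particular, can be tried from Python’s threading module, which sits on top of POSIX threads on Unix systems. A hedged sketch of the classic wait/notify pattern, with rough pthreads counterparts noted in comments:

```python
import threading

# The pthreads routine classes have direct analogues in Python's
# threading module, so the coordination pattern can be explored
# without writing C first.
cond = threading.Condition()   # bundles a mutex + condition variable
items = []

def producer():
    with cond:                 # ~ pthread_mutex_lock / unlock
        items.append("work")
        cond.notify()          # ~ pthread_cond_signal

def consumer(results):
    with cond:
        while not items:       # guard against spurious wakeups
            cond.wait()        # ~ pthread_cond_wait
        results.append(items.pop())

results = []
t = threading.Thread(target=consumer, args=(results,))  # ~ pthread_create
t.start()
producer()
t.join()                       # ~ pthread_join
print(results)  # ['work']
```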

I first saw this at Christopher Lalanne’s A bag of tweets / August 2012.

PS: If you learn POSIX threads, you might want to consider mapping the terminology to vendor specific thread terminology.

Grinstead and Snell’s Introduction to Probability

Filed under: Mathematics,Probability — Patrick Durusau @ 2:37 pm

Grinstead and Snell’s Introduction to Probability

From the preface:

Probability theory began in seventeenth century France when the two great French mathematicians, Blaise Pascal and Pierre de Fermat, corresponded over two problems from games of chance. Problems like those Pascal and Fermat solved continued to influence such early researchers as Huygens, Bernoulli, and DeMoivre in establishing a mathematical theory of probability. Today, probability theory is a well-established branch of mathematics that finds applications in every area of scholarly activity from music to physics, and in daily experience from weather prediction to predicting the risks of new medical treatments.

This text is designed for an introductory probability course taken by sophomores, juniors, and seniors in mathematics, the physical and social sciences, engineering, and computer science. It presents a thorough treatment of probability ideas and techniques necessary for a firm understanding of the subject. The text can be used in a variety of course lengths, levels, and areas of emphasis.

What promises to be an entertaining and even literate book on probability.
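In that spirit, one of the classic games-of-chance questions from the Pascal and Fermat era, whether betting on at least one six in four throws of a die is favorable, takes only a few lines to check, exactly and by simulation:

```python
import random

# Exact answer: P(at least one six in four throws) = 1 - (5/6)^4,
# about 0.5177 -- so the bet is favorable, but only slightly.
p_exact = 1 - (5 / 6) ** 4

# A quick Monte Carlo check of the same quantity.
random.seed(0)                 # fixed seed for reproducibility
trials = 100_000
wins = sum(any(random.randint(1, 6) == 6 for _ in range(4))
           for _ in range(trials))

print(round(p_exact, 4), round(wins / trials, 4))
```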

I first saw this at Christopher Lalanne’s A bag of tweets / August 2012.

Hadoop on your PC: Cloudera’s CDH4 virtual machine

Filed under: Cloudera,Hadoop — Patrick Durusau @ 2:10 pm

Hadoop on your PC: Cloudera’s CDH4 virtual machine by Andrew Brust.

From the post:

Want to learn Hadoop without building your own cluster or paying for cloud resources? Then download Cloudera’s Hadoop distro and run it in a virtual machine on your PC. I’ll show you how.

As good a way as any to get your feet wet with Hadoop.

Don’t be surprised if in a week or two you have both the nucleus of a cluster and a cloud account. Reasoning that you need to be prepared for any client’s environment of choice.

Or at least that is what you will tell your significant other.

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra

Filed under: Hadoop,HBase,JRuby,Pig — Patrick Durusau @ 2:01 pm

Pig as Hadoop Connector, Part Two: HBase, JRuby and Sinatra by Russell Jurney.

From the post:

Hadoop is about freedom as much as scale: providing you disk spindles and processor cores together to process your data with whatever tool you choose. Unleash your creativity. Pig as duct tape facilitates this freedom, enabling you to connect distributed systems at scale in minutes, not hours. In this post we’ll demonstrate how you can turn raw data into a web service using Hadoop, Pig, HBase, JRuby and Sinatra. In doing so we will demonstrate yet another way to use Pig as connector to publish data you’ve processed on Hadoop.

When (not if) the next big cache of emails or other “sensitive” documents drops, everyone who has followed this and similar tutorials should be ready.

Reasoning with the Variation Ontology using Apache Jena #OWL #RDF

Filed under: Bioinformatics,Jena,OWL,RDF,Reasoning — Patrick Durusau @ 1:46 pm

Reasoning with the Variation Ontology using Apache Jena #OWL #RDF by Pierre Lindenbaum.

From the post:

The Variation Ontology (VariO), “is an ontology for standardized, systematic description of effects, consequences and mechanisms of variations”.

In this post I will use the Apache Jena library for RDF to load this ontology. It will then be used to extract a set of variations that are a sub-class of a given class of Variation.
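The heart of that extraction, RDFS subClassOf being transitive, is a reachability query. A toy sketch with hypothetical VariO-like class names (not the real ontology, and no Jena involved), just to show the shape of what the reasoner computes:

```python
# Toy subclass hierarchy: child -> parent. Class names are invented
# for illustration, not taken from the actual Variation Ontology.
subclass_of = {
    "point_mutation": "DNA_variation",
    "insertion": "DNA_variation",
    "DNA_variation": "variation",
    "splice_variant": "RNA_variation",
    "RNA_variation": "variation",
}

def descendants(cls):
    """All classes transitively declared subClassOf cls."""
    found, frontier = set(), {cls}
    while frontier:
        frontier = {child for child, parent in subclass_of.items()
                    if parent in frontier} - found
        found |= frontier
    return found

print(sorted(descendants("variation")))
```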

If you are interested in this example, you may also be interested in the Variation Ontology.

The VariO homepage reports:

VariO allows

  • consistent naming
  • annotation of variation effects
  • data integration
  • comparison of variations and datasets
  • statistical studies
  • development of software tools

It isn’t clear on a quick read, how VariO accomplishes:

  • data integration
  • comparison of variations and datasets

Unless it means that uniform recording with VariO enables “data integration” and “comparison of variations and datasets”?

True, but what nomenclature, uniformly used, does not enable “data integration” and “comparison of variations and datasets”?

Is there one?
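The point can be made concrete: pick any vocabulary, apply it uniformly, and integration follows, because the work is in the mapping, not in the particular nomenclature chosen. A hypothetical sketch with made-up lab vocabularies:

```python
# Two labs count the same variation types under different labels.
lab_a = {"pt. mutation": 12, "ins.": 3}
lab_b = {"point mutation": 7, "insertion": 5}

# The integration step is this mapping -- the choice of shared terms
# is arbitrary; their *uniform* use is what does the work.
to_shared = {
    "pt. mutation": "point_mutation",
    "point mutation": "point_mutation",
    "ins.": "insertion",
    "insertion": "insertion",
}

merged = {}
for counts in (lab_a, lab_b):
    for term, n in counts.items():
        shared = to_shared[term]
        merged[shared] = merged.get(shared, 0) + n

print(merged)  # {'point_mutation': 19, 'insertion': 8}
```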

SIGIR 2013 : ACM International Conference on Information Retrieval

Filed under: Conferences,Information Retrieval — Patrick Durusau @ 1:20 pm

SIGIR 2013 : ACM International Conference on Information Retrieval

21 January 2013: Abstracts for full research papers due
28 January 2013: Full research paper due
4 February 2013: Workshop proposals due
18 February 2013: Posters, demonstration, and tutorial proposals due
11 March 2013: Notification of workshop acceptances
11 March 2013: Doctoral consortium proposals due
15 April 2013: All other acceptance notifications
28 July 2013: Conference Begins

From the webpage:

We are delighted to welcome SIGIR 2013 to Dublin, Ireland. SIGIR was last held in Dublin almost 20 years ago in 1994. The intervening years have seen huge growth in the field of information retrieval and we look forward to receiving submissions to help us build an exciting programme reporting latest developments in information retrieval.

Updates to follow but thought you might want extra time to plan for Dublin.

OAIR 2013 : Open Research Areas in Information Retrieval

Filed under: Conferences,Information Retrieval — Patrick Durusau @ 10:46 am

OAIR 2013 : Open Research Areas in Information Retrieval

When: May 22 – May 24, 2013
Where: Lisbon, Portugal
Submission Deadline: Dec 10, 2012
Notification Due: Feb 4, 2013

From the homepage:

Welcome to OAIR 2013 (the 10th International Conference in the RIAO series), taking place in Lisbon, Portugal from May 22 to 24, 2013.

The World Wide Web is the largest source of openly accessible data, and the most common means to connect people and share resources.

However, exploiting these interconnected Webs to obtain information is still an unsolved problem. This conference calls for papers describing recent research in Information Retrieval concerning the integration between a Web of Data and a Web of People, to transform pure data into information, and information into usable knowledge.

The Open research Areas in Information Retrieval (OAIR) conference is a triennial conference, addressing research topics related to the design of robust and large-scale scientific and industrial solutions to information processing.

OAIR 2013 conference is an opportunity to show main research activities, to share knowledge among IR scientific community and to get updates on new scientific work developed by IR community.

This conference is connected to the main IR personalities (see the Steering Committee list) and a considerable number of attendees is expected.

We look forward to seeing you in Europe’s westernmost and sunniest capital, LISBON!

Topics of interest include:

  • Adapting search to Users
  • Advertising and ad targeting
  • Aggregation of Results
  • Community and Context Aware Search
  • Community-based Filtering and Recommender Systems
  • Community-based IR Theory
  • Community-oriented Content Representation
  • Evaluation of Social IR
  • Improving Web via Social Media
  • Including Crowdsourcing in Search
  • Merging Heterogeneous Web Data
  • Modeling the web of people
  • Personal semantics search
  • Query log analysis
  • Search over Social Networks
  • Sentiment analysis
  • Social Multimedia and Multimodal IR
  • Social Topic detection
  • Structuring Unstructured Data
  • System Architectures for Social IR
  • User Interfaces and Interactive IR

Having connections to data, assuming anyone knows its whereabouts, isn’t quite the same as making use of it.

Experts vs. Crowds (How to Distinguish, CIA and Drones)

Filed under: Crowd Sourcing,Intelligence — Patrick Durusau @ 9:22 am

Reporting on the intelligence community’s view of crowd-sourcing, Ken Dilanian reports:

“I don’t believe in the wisdom of crowds,” said Mark Lowenthal, a former senior CIA and State Department analyst (and 1988 “Jeopardy!” champion) who now teaches classified courses about intelligence. “Crowds produce riots. Experts produce wisdom.”

I would modify Lowenthal’s assessment to read:

Crowds produce diverse judgements. Experts produce highly similar judgements.

Or to put it another way: the smaller the group, the less variation in opinion you will find over time, and the further the group’s opinion diverges from reality as experienced by non-group members.

No real surprise Beltway denizens failed to predict the Arab Spring. None of the concerns that led to the Arab Spring are part of the “experts’” concerns. Not just on a conscious level but as a social experience.

The more diverse the opinion/experience pool, the less likely a crowd judgement is to be completely alien to reality as experienced by others.

Which is how I would explain the performance of the crowd thus far in the experiment.
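There is even a tidy piece of arithmetic behind this, often called the diversity prediction theorem: squared crowd error equals the average individual squared error minus the variance (diversity) of the estimates. Diverse guesses cancel; similar guesses share their bias. A hedged numerical sketch with invented numbers:

```python
from statistics import mean

truth = 100  # the quantity being estimated

def crowd_error(estimates):
    """Squared error of the crowd's mean estimate; also checks the
    identity: crowd error = avg individual error - diversity."""
    crowd = mean(estimates)
    crowd_err = (crowd - truth) ** 2
    avg_err = mean((e - truth) ** 2 for e in estimates)
    diversity = mean((e - crowd) ** 2 for e in estimates)
    assert abs(crowd_err - (avg_err - diversity)) < 1e-9
    return crowd_err

experts = [115, 118, 120, 117]        # similar, and similarly biased
diverse_crowd = [70, 95, 140, 105]    # individually worse, but diverse

print(crowd_error(experts), crowd_error(diverse_crowd))
```

The experts are each closer to agreement, yet collectively further from the truth; the diverse estimates bracket it.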

Dilanian’s speculation:

Crowd-sourcing would mean, in theory, polling large groups across the 200,000-person intelligence community, or outside experts with security clearances, to aggregate their views about the strength of the Taliban, say, or the likelihood that Iran is secretly building a nuclear weapon.

reflects a failure to appreciate the nature of crowd-sourced judgements.

First, crowd-sourcing will be more effective if the “intelligence community” is only a small part of the crowd. Choosing only people with security clearances, I suspect, automatically excludes many Taliban sympathizers. You are not going to get good results if the crowd is poorly chosen.

Think of it as trying to re-create the “dance” that bees do as a means of communicating the location of pollen. I would trust the CIA to build a bee hive with only drones. And then complain that crowd behavior didn’t work.

Second, crowd-sourcing can do factual questions, like guessing the weight of an animal, but only if everyone has the same information. Otherwise, use crowd-sourcing to gauge the likely impact of policies, changes in policies, etc. Pulse of the “public” as it were.

The “likelihood that Iran is secretly building a nuclear weapon” isn’t a crowd-source question. No crowd can supply information about an effort that is “secret.” There is no information precisely because Iran is keeping it secret.

Properly used, crowd-sourcing can be a very valuable tool.

The ad agencies call it public opinion polling.

Imagine appropriate polling activities on the ground in the Middle East, asking ordinary people about their hopes, desires, and dreams. If credited over the summarized and sanitized results of experts, such polling could lead to policies that benefit the people, not to say the governments, of the Middle East. (Another reason some prefer experts: experts support current governments.)

Los Angeles Times, in: U.S. intelligence tests crowd-sourcing against its experts.

August 26, 2012

MongoDB: Pumping Fractal Iron

Filed under: Fractal Trees,MongoDB — Patrick Durusau @ 5:46 pm

10x Insertion Performance Increase for MongoDB with Fractal Tree Indexes by Tim Callaghan.

From the post:

The challenge of handling massive data processing workloads has spawned many new innovations and techniques in the database world, from indexing innovations like our Fractal Tree® technology to a myriad of “NoSQL” solutions (here is our Chief Scientist’s perspective). Among the most popular and widely adopted NoSQL solutions is MongoDB and we became curious if our Fractal Tree indexing could offer some advantage when combined with it. The answer seems to be a strong “yes”.

Earlier in the summer we kicked off a small side project and here’s what we did: we implemented a “version 2” IndexInterface as a Fractal Tree index and ran some benchmarks. Note that our integration only affects MongoDB’s secondary indexes; primary indexes continue to rely on MongoDB’s indexing code. All the changes we made to the MongoDB source are available here. Caveat: this was a quick and dirty project – the code is experimental grade so none of it is supported or went through any careful design analysis.

For our initial benchmark we measured the performance of a single threaded insertion workload. The inserted documents contained the following: URI (character), name (character), origin (character), creation date (timestamp), and expiration date (timestamp). We created a total of four secondary indexes: URI, name, origin, and creation date. The point of the benchmark is to insert enough documents such that the indexes are larger than main memory and show the insertion performance from an empty database to one that is largely dependent on disk IO. We ran the benchmark with journaling disabled, then again with journaling enabled.

Not for production use but the performance numbers should give you pause.

A long pause.
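For the curious, the shape of the workload is easy to reproduce. Here is a rough Python sketch of the single-threaded insert benchmark the post describes (the field values are hypothetical and the real harness drives MongoDB directly; this only mimics the document shape and measurement loop):

```python
import random
import string
import time
import uuid

def make_doc():
    """One benchmark document, shaped as the post describes:
    URI, name, origin (strings) plus creation/expiration timestamps."""
    now = time.time()
    return {
        "uri": "http://example.org/%s" % uuid.uuid4().hex,
        "name": "".join(random.choice(string.ascii_lowercase) for _ in range(12)),
        "origin": random.choice(["us", "eu", "asia"]),
        "created": now,
        "expires": now + 86400,
    }

def run_benchmark(insert, n_docs, report_every=10000):
    """Single-threaded insert loop; returns cumulative docs/sec samples."""
    start = time.time()
    rates = []
    for i in range(1, n_docs + 1):
        insert(make_doc())
        if i % report_every == 0:
            elapsed = max(time.time() - start, 1e-9)  # guard timer resolution
            rates.append(i / elapsed)
    return rates

# Against a live server you would pass collection.insert_one as `insert`
# and first build the four secondary indexes, e.g.:
#   for field in ("uri", "name", "origin", "created"):
#       collection.create_index(field)
```

The interesting regime starts once those four secondary indexes outgrow main memory, which is exactly where fractal tree indexes claim their advantage.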

HTML5, Apps and JavaScript Video

Filed under: HTML5,Javascript — Patrick Durusau @ 3:34 pm

HTML5, Apps and JavaScript Video by Brad Stenger.

From the post:

Here are the videos from last week’s TimesOpen event on HTML5, Apps and JavaScript:

If you missed the TimesOpen event on August 15th on HTML5, Apps and JavaScript, videos have been posted!

In case you are curious about good days to be in New York City in the near future, check out:

Times Open. (There’s some other reason to go to New York?)

Semantic University

Filed under: Semantic Web,Semantics — Patrick Durusau @ 2:18 pm

Semantic University

From the homepage:

Semantic University will be the single largest and most accessible source of educational material relating to semantic technologies. Moreover, it will fill several important gaps in current material by providing:

  • Lessons suitable to those brand new to the space.
  • Comparisons, both high-level and in-depth, with related technologies, such as NoSQL and Big Data.
  • Interactive, hands on tutorials.

Have you used these materials? Comparison to others?

Metric Spaces — A Primer [Semantic Metrics?]

Filed under: Distance,Metric Spaces,Semantics — Patrick Durusau @ 1:45 pm

Metric Spaces — A Primer by Jeremy Kun.

The Blessing of Distance

We have often mentioned the idea of a “metric” on this blog, and we briefly described a formal definition for it. Colloquially, a metric is simply the mathematical notion of a distance function, with certain well-behaved properties. Since we’re now starting to cover a few more metrics (and things which are distinctly not metrics) in the context of machine learning algorithms, we find it pertinent to lay out the definition once again, discuss some implications, and explore a few basic examples.

The most important thing to take away from this discussion is that not all spaces have a notion of distance. For a space to have a metric is a strong property with far-reaching mathematical consequences. Essentially, metrics impose a topology on a space, which the reader can think of as the contortionist’s flavor of geometry. We’ll explore this idea after a few examples.

On the other hand, from a practical standpoint one can still do interesting things without a true metric. The downside is that work relying on (the various kinds of) non-metrics doesn’t benefit as greatly from existing mathematics. This can often spiral into empirical evaluation, where justifications and quantitative guarantees are not to be found.

An enjoyable introduction to metric spaces.

Absolutely necessary for machine learning and computational tasks.

However, I am mindful that the mapping from semantics to a location in metric space is an arbitrary one. Our evaluations of metrics assigned to any semantic, are wholly dependent upon that mapping.

Not that we can escape that trap but to urge caution when claims are made on the basis of arbitrarily assigned metric locations. (A small voice should be asking: What if we change the assigned metric locations? What result then?)
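Kun's "well-behaved properties" can at least be checked mechanically on a finite sample of points. A small Python sketch (a finite check can refute the axioms but never prove them for a whole space; the `skewed` function below is a made-up example of a non-metric):

```python
import itertools
import math

def is_metric(d, points, tol=1e-9):
    """Check the four metric axioms on a finite sample of points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:                      # non-negativity
            return False
        if (d(x, y) < tol) != (x == y):         # identity of indiscernibles
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True

euclidean = lambda x, y: math.hypot(x[0] - y[0], x[1] - y[1])

# A "distance" that fails symmetry, hence is distinctly not a metric:
skewed = lambda x, y: abs(x[0] - y[0]) + (0.5 if x[0] > y[0] else 0.0)
```

Many similarity measures used in machine learning (KL divergence, for instance) fail one of these axioms, which is exactly the "non-metrics" situation Kun mentions.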

Linked Legal Data: A SKOS Vocabulary for the Code of Federal Regulations

Filed under: Law,Law - Sources,Linked Data,SKOS — Patrick Durusau @ 1:17 pm

Linked Legal Data: A SKOS Vocabulary for the Code of Federal Regulations by Núria Casellas.

Abstract:

This paper describes the application of Semantic Web and Linked Data techniques and principles to regulatory information for the development of a SKOS vocabulary for the Code of Federal Regulations (in particular of Title 21, Food and Drugs). The Code of Federal Regulations is the codification of the general and permanent enacted rules generated by executive departments and agencies of the Federal Government of the United States, a regulatory corpus of large size, varied subject-matter and structural complexity. The CFR SKOS vocabulary is developed using a bottom-up approach for the extraction of terminology from text based on a combination of syntactic analysis and lexico-syntactic pattern matching. Although the preliminary results are promising, several issues (a method for hierarchy cycle control, expert evaluation and control support, named entity reduction, and adjective and prepositional modifier trimming) require improvement and revision before it can be implemented for search and retrieval enhancement of regulatory materials published by the Legal Information Institute. The vocabulary is part of a larger Linked Legal Data project that aims at using Semantic Web technologies for the representation and management of legal data.

Considers use of nonregulatory vocabularies, conversion of existing indexing materials and finally settles on NLP processing of the text.
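To make "lexico-syntactic pattern matching" concrete, here is a toy Hearst-style extractor in Python. It is far cruder than the paper's combination of syntactic analysis and patterns, and the example phrase is hypothetical, but it shows how broader/narrower candidates fall out of surface patterns:

```python
import re

# Hearst-style pattern: "X such as Y, Z" suggests Y, Z are narrower than X.
# Crude on purpose: up to two words for the broader term, single-word
# narrower terms in a comma-separated list.
PATTERN = re.compile(r"(\w+(?: \w+)?) such as (\w+(?:, \w+)*)")

def extract_broader_pairs(text):
    """Return (narrower, 'skos:broader', broader) candidate triples."""
    pairs = []
    for m in PATTERN.finditer(text):
        broader = m.group(1)
        for narrower in m.group(2).split(", "):
            pairs.append((narrower, "skos:broader", broader))
    return pairs
```

Every weakness of this sketch (modifier trimming, hierarchy cycles, evaluation) is one of the open issues the abstract lists.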

Granting that Title 21, Food and Drugs is no walk in the park, take a peek at the regulations for Title 26, Internal Revenue Code. 😉

A difficulty that I didn’t see mentioned is the changing semantics in statutory law and regulations.

The definition of “person,” for example, varies widely depending upon where it appears, both diachronically and synchronically.

Moreover, if I have a nonregulatory vocabulary and/or CFR indexes, why shouldn’t that map to the CFR SKOS vocabulary?

I may not have the “correct” index but the one I prefer to use. Shouldn’t that be enabled?

I first saw this at Legal Informatics.

Patterns for research in machine learning

Filed under: Machine Learning — Patrick Durusau @ 11:03 am

Patterns for research in machine learning by S. M. Ali Eslami.

From the post:

There are a handful of basic code patterns that I wish I was more aware of when I started research in machine learning. Each on its own may seem pointless, but collectively they go a long way towards making the typical research workflow more efficient. Here they are:

Perhaps “good practices” or Harper’s incremental improvement. Either way, likely to be useful in topic maps research.

Designing a Better Sales Pipeline Dashboard

Filed under: Interface Research/Design,Marketing — Patrick Durusau @ 10:50 am

Designing a Better Sales Pipeline Dashboard by Zach Gemignani

From the post:

What would your perfect sales pipeline dashboard look like?

The tools that so effectively capture sales information (Salesforce, PipelineDeals, Highrise) tend to do a pretty lousy job of providing visibility into that very same data. The reporting or analytics is often just a table with lots of filtering features. That doesn’t begin to answer important questions like:

  • What is the value of the pipeline?
  • Where is it performing efficiently? Where is it failing?
  • How are things likely to change in the next month?

I’ve been annoyed by this deficiency in sales dashboards for a while. Ken and I put together some thoughts about what a better sales pipeline information interface would look like and how it would function. Here’s what we came up with:

A sales dashboard that at least two people like better than most offerings.

What would you add to this dashboard that topic maps would be able to supply?

Yes, I am divorcing the notion of “interface” from “topic map.”

Interface being how a user accomplishes a task or accesses information.

Completely orthogonal to the underlying technology.

Exposing the underlying technology demonstrates how clever we are.

Is not succeeding in the marketplace clever?*


*Ask yourself how many MS Office users can even stumble through a “big block” diagram of how MS Word works?

Compare that number to the number of MS Word users. Express as:

“MS Word users/MS Word users who understand the technology.”

That’s my target ratio for:

“topic map users/topic map users who understand the technology.”

At a conference? Need a dataset? Neo4j at NOSQL NOW

Filed under: Graphs,Neo4j — Patrick Durusau @ 9:56 am

At a conference? Need a dataset? Neo4j at NOSQL NOW

From the post:

For the “Lunch and Learn around Neo4j” with Andreas Kollegger we wanted to use a dataset that is easy to understand and interesting enough for attendees of the conference.

So we chose to use just that day’s conference program as the dataset. Conference data is usually well connected and offers the opportunity for challenging data model discussions and insightful queries.

Interesting exercise.
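If you want to play along without a server, here is a toy in-memory version of the idea — conference talks, speakers, and rooms as a property graph (all names and node IDs below are hypothetical, not from the actual dataset):

```python
# Hypothetical slice of a conference program as a property graph:
# nodes carry properties, edges are typed relationships.
nodes = {
    "t1": {"label": "Talk", "title": "Intro to Graphs", "slot": "10:00"},
    "t2": {"label": "Talk", "title": "Scaling Neo4j", "slot": "11:00"},
    "s1": {"label": "Speaker", "name": "A. Kollegger"},
    "r1": {"label": "Room", "name": "Ballroom A"},
}
edges = [
    ("s1", "PRESENTS", "t1"),
    ("s1", "PRESENTS", "t2"),
    ("t1", "IN_ROOM", "r1"),
    ("t2", "IN_ROOM", "r1"),
]

def out_neighbors(node, rel):
    """Follow edges of one relationship type from a node."""
    return [dst for src, r, dst in edges if src == node and r == rel]

# "Which talks does s1 present, and in which rooms?"
talks = out_neighbors("s1", "PRESENTS")
rooms = {t: out_neighbors(t, "IN_ROOM") for t in talks}
```

In Neo4j the same question is a short Cypher traversal; the point of using a conference program is that these connected questions arise naturally.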

If while you are downloading the sample data you notice that “neo4j-jdbc-driver-demo-people-dataset.zip” is reported as “(0.00MB)” don’t despair.

It is a rounding artifact. The file is small but does have content (I checked).

Grant Seeking/Funding As Computer Science Activity

Filed under: CS Lectures,Funding — Patrick Durusau @ 9:36 am

Robert Harper writes in: Believing in Computer Science:

It’s not every day that I can say that I agree with Bertrand Meyer, but today is an exception. Meyer has written an opinion piece in the current issue of C.ACM about science funding that I think is worth amplifying. His main point is that funding agencies, principally the NSF and the ERC, are constantly pushing for “revolutionary” research, at the expense of “evolutionary” research. Yet we all (including the funding agencies) know full well that, in almost every case, real progress is made by making seemingly small advances on what is already known, and that whether a body of research is revolutionary or not can only be assessed with considerable hindsight. Meyer cites the example of Hoare’s formulation of his logic of programs, which was at the time a relatively small increment on Floyd’s method for proving properties of programs. For all his brilliance, Hoare didn’t just invent this stuff out of thin air, he built on and improved upon the work that had gone before, as of course have hundreds of others built on his in turn. This all goes without saying, or ought to, but as Meyer points out, we computer scientists are constantly bombarded with direct and indirect exhortations to abandon all that has gone before, and to make promises that no one can honestly keep.

Meyer’s rallying cry is for incrementalism. It’s a tough row to hoe. Who could possibly argue against fostering earth-shattering research that breaks new ground and summarily refutes all that has gone before? And who could possibly defend work that is obviously just another slice of the same salami, perhaps with a bit of mustard this time? And yet what he says is obviously true. Funding agencies routinely beg the very question under consideration by stipulating a priori that there is something wrong with a field, and that an entirely new approach is required. With all due respect to the people involved, I would say that calls such as these are both ill-informed and outrageously arrogant.

What Harper and Meyer write is true, but misses a critical point.

To illustrate: What do you think would happen if one or more of the “impossible” funding proposals succeeded?

Consider the funding agency and its staff. If even one of its perennial funding requests were to succeed, what would it replace it with for next time? Can’t have a funding apparatus, with clerks, rule books, procedures, judges, etc., without some problem to be addressed. Solving any sufficiently large problem would be a nightmare for a funding agency.

On a par with the March of Dimes solving the problem of polio: it had the choice of finding a new mission or dissolving. Can you imagine a funding agency presented with that choice?

CS funding agencies avoid that dilemma by funding research that by definition is very unlikely to succeed.

And what of the grant seekers?

What if they can only accept graduate students who can solve nearly impossible CS problems? They would not have very large departments with that limitation, and consequently would have very small budgets, limited publication venues, conferences, etc.

I completely agree with Harper and Meyer, but CS departments should start teaching grant seeking/funding as a CS activity.

Perhaps even a Masters of CS/Grants&Funding? (There may be one already, check your local course catalog.)

“Real” CS will proceed incrementally, but then it always has.

I retained the link in Robert’s post, but you should forward Long Live Incremental Research!, http://cacm.acm.org/blogs/blog-cacm/109579-long-live-incremental-research/fulltext, so your non-ACM friends can enjoy Meyer’s post.

Index your blog using tags and lucene.net

Filed under: .Net,Lucene — Patrick Durusau @ 4:56 am

Index your blog using tags and lucene.net by Ricci Gian Maria.

From the post:

In the last part of my series on Lucene I showed how simple it is to add tags to documents for simple tag-based categorization; now it is time to explain how you can automate this process and how to use some advanced characteristics of Lucene. First of all I wrote a specialized analyzer called TagSnowballAnalyzer, based on the standard SnowballAnalyzer plus a series of keywords associated with various tags; here is how I construct it.

There is various code around the net on how to add synonyms with weights, as described in this Stack Overflow question; standard Java Lucene has a SynonymTokenFilter in the codebase, but this example shows how simple it is to write a Filter that adds tags as synonyms of related words. First of all the filter is initialized with a dictionary of keywords and Tags, where Tag is a simple helper class that stores the tag string and its relative weight; it also has a ConvertToToken() method that returns the tag enclosed by | (pipe) characters. The pipe character is used to explicitly mark tags in the token stream: any word enclosed by pipes is by convention a tag.

Not the answer for every situation involving synonymy (as in “same subject,” i.e., topic maps) but certainly a useful one.
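The pipe-token convention is easy to mimic outside Lucene. A language-neutral Python sketch of the idea (this is not the Lucene.NET API; the tag dictionary is hypothetical, and a real filter would turn the weight into a token boost rather than ignoring it):

```python
# Keyword -> [(tag, weight), ...]; entries here are invented examples.
TAG_DICT = {
    "lucene": [("search", 0.8)],
    "mongodb": [("nosql", 0.9)],
}

def expand(tokens):
    """Emit each token, followed by its tags as pipe-enclosed synonym
    tokens at the same logical position (the post's convention)."""
    out = []
    for tok in tokens:
        out.append(tok)
        for tag, weight in TAG_DICT.get(tok.lower(), []):
            out.append("|%s|" % tag)   # pipe marks a tag token
    return out
```

Because the tags sit in the same index as the words, a query for `|search|` retrieves every document whose text mentioned any keyword mapped to that tag.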

August 25, 2012

FragVLib a free database mining software for generating “Fragment-based Virtual Library” using pocket similarity…

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:13 pm

FragVLib a free database mining software for generating “Fragment-based Virtual Library” using pocket similarity search of ligand-receptor complexes Raed Khashan. Journal of Cheminformatics 2012, 4:18 doi:10.1186/1758-2946-4-18.

Abstract:

Background

With the exponential increase in the number of available ligand-receptor complexes, researchers are becoming more dedicated to mine these complexes to facilitate the drug design and development process. Therefore, we present FragVLib, free software which is developed as a tool for performing similarity search across database(s) of ligand-receptor complexes for identifying binding pockets which are similar to that of a target receptor.

Results

The search is based on 3D-geometric and chemical similarity of the atoms forming the binding pocket. For each match identified, the ligand’s fragment(s) corresponding to that binding pocket are extracted, thus, forming a virtual library of fragments (FragVLib) that is useful for structure-based drug design.

Conclusions

An efficient algorithm is implemented in FragVLib to facilitate the pocket similarity search. The resulting fragments can be used for structure-based drug design tools such as Fragment-Based Lead Discovery (FBLD). They can also be used for finding bioisosteres and as an idea generator.

Suggestions of other uses of 3D-geometric shapes for similarity?
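As a baseline answer to my own question: the simplest 3D-geometric similarity between two pre-aligned point sets is RMSD. A Python sketch (real pocket comparison, as in FragVLib, also handles alignment and the chemical typing of atoms, neither of which appears here):

```python
import math

def rmsd(pocket_a, pocket_b):
    """Root-mean-square deviation between two equally sized,
    pre-aligned 3D point sets -- a crude pocket-similarity score."""
    assert len(pocket_a) == len(pocket_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(pocket_a, pocket_b))
    return math.sqrt(sq / len(pocket_a))
```

Lower is more similar; identical pockets score zero. Anything built on this — shape histograms, alignment-free descriptors — trades exactness for speed in the database scan.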

Clojure/Datomic creator Rich Hickey on Deconstructing the Database

Filed under: Clojure,Datomic — Patrick Durusau @ 4:33 pm

Clojure/Datomic creator Rich Hickey on Deconstructing the Database

From the description:

Rich Hickey, author of Clojure, and designer of Datomic presents a new way to look at database architectures in this talk from JaxConf 2012. What happens when you deconstruct the traditional monolithic database – separating transaction processing, storage and query into independent cooperating services? Coupled with a data model based around atomic facts and awareness of time, you get a significantly different set of capabilities and tradeoffs. This talk will discuss how these ideas play out in the design and architecture of Datomic, a new database for the JVM.

I truly appreciate the description of database updates as “a miracle occurs.”

There is much to enjoy and consider here.
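The "atomic facts and awareness of time" data model is easy to sketch. A toy Python version of datom-style facts with assertions and retractions (the tuple shape is my approximation of the idea, not Datomic's actual storage format, and the example entities are invented):

```python
# Datom-style atomic facts: (entity, attribute, value, tx, added?).
facts = [
    (1, "name", "Alice", 100, True),
    (1, "email", "a@old.example", 100, True),
    (1, "email", "a@old.example", 200, False),   # retraction at tx 200
    (1, "email", "a@new.example", 200, True),
]

def as_of(facts, tx):
    """View of the database at transaction tx: replay every
    assertion/retraction with a transaction id <= tx."""
    state = {}
    for e, a, v, t, added in facts:
        if t <= tx:
            if added:
                state[(e, a)] = v
            elif state.get((e, a)) == v:
                del state[(e, a)]
    return state
```

Nothing is overwritten in place — the "miracle" of an update becomes two plain facts, and any past state remains queryable.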

The State of Hawai’i Demands a New Search Engine

Filed under: GIS,Maps,Searching — Patrick Durusau @ 4:06 pm

The State of Hawai’i Demands a New Search Engine by Matthew Hurst.

Matthew writes:

We will soon be embarking on a short trip to Hawai’i. Naturally, I’m turning to search engines to find out about the best beaches to go to. However, it turns out that this simple problem – where to go on vacation – is terribly under supported by today’s search engines.

Firstly, there is the problem with the Web Proposition. The web proposition – the reason for traditional web search engines to exist at all – states that there is a page containing the information you seek somewhere online. While there are many pages that list the ‘best beaches in Hawai’i’ as the analysis below demonstrates these are just sets of opinions – often very different in nature. An additional problem with the Web Proposition is that information and monetization don’t always align. Many of the ‘best’ beaches pages are really channels through which hotel and real estate commerce is done. Thus a balance is needed between objective information and commercial interests.

Secondly, beaches are not considered local entities by search engines. While the query {beaches in kauai} is very similar in form to the query {restaurants in kauai} the latter generates results of entities of type while the former generates results of entities of type . While local search sounds like search over entities which have location, it is largely limited to local entities with commercial intent.

Finally, there is general confusion due to the fact that the state of Hawai’i contains a sub-region (an island) called Hawai’i.

As you may have guessed, had Matthew’s searches been successful, there would be no blog post.

How would you use topic maps to solve the shortfalls that Matthew identifies?

What other content would you aggregate with beaches?

Data Mining Blogs

Filed under: Data Mining — Patrick Durusau @ 3:58 pm

Data Mining Blogs by Sandro Saitta.

From the post:

I posted an earlier version of this data mining blog list previously on DMR. Here is an updated version (blogs recently added to the list have the logo “new”). I will keep this version up-to-date. You can access it at any time from the DMR top bar. Here is a link to the OPML version. If you know a data mining blog that is not in this list, feel free to post a comment so I can add the link. Also, if you see any broken link, please mention it.

Consider this a starter set of locators for a custom web crawl on data mining.

RSS feeds are great, but only for current content.
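If you grab the OPML version, turning it into crawl seeds is a few lines of Python (the `xmlUrl` attribute on `outline` elements is the standard OPML convention for feed links):

```python
import xml.etree.ElementTree as ET

def feed_urls(opml_text):
    """Collect xmlUrl attributes from an OPML blog list, in document
    order -- seed URLs for a custom crawl."""
    root = ET.fromstring(opml_text)
    return [o.get("xmlUrl") for o in root.iter("outline") if o.get("xmlUrl")]
```

Point a crawler at those URLs' sites (not just the feeds) and you get the archives too, not only current content.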

Algebraic Topology and Machine Learning

Filed under: Machine Learning,Topological Data Analysis,Topology — Patrick Durusau @ 2:58 pm

Algebraic Topology and Machine Learning – In conjunction with Neural Information Processing Systems (NIPS 2012)

September 16, 2012 – Submissions Due
October 7, 2012 – Acceptance Notices
December 7 or 8 (TBD), 2012, Lake Tahoe, Nevada, USA.

From the call for papers:

Topological methods and machine learning have long enjoyed fruitful interactions as evidenced by popular algorithms like ISOMAP, LLE and Laplacian Eigenmaps which have been borne out of studying point cloud data through the lens of topology/geometry. More recently several researchers have been attempting to understand the algebraic topological properties of data. Algebraic topology is a branch of mathematics which uses tools from abstract algebra to study and classify topological spaces. The machine learning community thus far has focused almost exclusively on clustering as the main tool for unsupervised data analysis. Clustering however only scratches the surface, and algebraic topological methods aim at extracting much richer topological information from data.

The goals of this workshop are:

  1. To draw the attention of machine learning researchers to a rich and emerging source of interesting and challenging problems.
  2. To identify problems of interest to both topologists and machine learning researchers and areas of potential collaboration.
  3. To discuss practical methods for implementing topological data analysis methods.
  4. To discuss applications of topological data analysis to scientific problems.

We also invite submissions in a variety of areas, at the intersection of algebraic topology and learning, that have witnessed recent activity. Areas of focus for submissions include but are not limited to:

  1. Statistical approaches to robust topological inference.
  2. Novel applications of topological data analysis to problems in machine learning.
  3. Scalable methods for topological data analysis.

NIPS2012 site. You will appreciate the “dramatization.” 😉

Put on your calendar and/or watch for papers!
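To see one sense in which "clustering however only scratches the surface": clustering is essentially the 0-dimensional part of the topological story. A Python sketch counting connected components of a point cloud at scale eps — which is just single-linkage clustering; higher-dimensional features (loops, voids) are where the richer algebraic-topological information lives:

```python
import math

def connected_components(points, eps):
    """0-dimensional topology of a point cloud: number of components
    of the graph linking points closer than eps (single-linkage)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(points))})
```

Sweeping eps from small to large and tracking how components merge is the 0-dimensional case of persistence, the workhorse of topological data analysis.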

Machine Learning [Andrew Ng]

Filed under: CS Lectures,Machine Learning — Patrick Durusau @ 2:44 pm

Machine Learning [Andrew Ng]

The machine learning course by Andrew Ng started up on 20 August 2012, so there is time to enroll and catch up.

From the post:

What Is Machine Learning?

Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI. In this class, you will learn about the most effective machine learning techniques, and gain practice implementing them and getting them to work for yourself. More importantly, you’ll learn about not only the theoretical underpinnings of learning, but also gain the practical know-how needed to quickly and powerfully apply these techniques to new problems. Finally, you’ll learn about some of Silicon Valley’s best practices in innovation as it pertains to machine learning and AI.

About the Course

This course provides a broad introduction to machine learning, data mining, and statistical pattern recognition. Topics include: (i) Supervised learning (parametric/non-parametric algorithms, support vector machines, kernels, neural networks). (ii) Unsupervised learning (clustering, dimensionality reduction, recommender systems, deep learning). (iii) Best practices in machine learning (bias/variance theory; innovation process in machine learning and AI). The course will also draw from numerous case studies and applications, so that you’ll also learn how to apply learning algorithms to building smart robots (perception, control), text understanding (web search, anti-spam), computer vision, medical informatics, audio, database mining, and other areas.
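If you want a taste before enrolling: the course opens with univariate linear regression fit by batch gradient descent. A plain-Python sketch of that exercise (my own, not from the course materials):

```python
def gradient_descent(xs, ys, alpha=0.05, iters=2000):
    """Fit y ~ theta0 + theta1 * x by batch gradient descent on the
    mean squared error, the first algorithm taught in the course."""
    t0 = t1 = 0.0
    m = len(xs)
    for _ in range(iters):
        err = [(t0 + t1 * x) - y for x, y in zip(xs, ys)]
        t0 -= alpha * sum(err) / m                           # d/d theta0
        t1 -= alpha * sum(e * x for e, x in zip(err, xs)) / m  # d/d theta1
    return t0, t1
```

On data generated from y = 1 + 2x this converges to theta0 ≈ 1, theta1 ≈ 2; the course then generalizes the same update to many features, regularization, and beyond.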
