Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 6, 2011

Game Theory

Filed under: CS Lectures,Game Theory,Games — Patrick Durusau @ 8:04 pm

Game Theory by Matthew Jackson and Yoav Shoham.

Another Stanford course for the Spring of 2012!

From the description:

Popularized by movies such as “A Beautiful Mind”, game theory is the mathematical modeling of strategic interaction among rational (and irrational) agents. Beyond what we call ‘games’ in common language, such as chess, poker, soccer, etc., it includes the modeling of conflict among nations, political campaigns, competition among firms, and trading behavior in markets such as the NYSE. How could you begin to model eBay, Google keyword auctions, and peer to peer file-sharing networks, without accounting for the incentives of the people using them? The course will provide the basics: representing games and strategies, the extensive form (which computer scientists call game trees), Bayesian games (modeling things like auctions), repeated and stochastic games, and more. We’ll include a variety of examples including classic games and a few applications.

Just in time for an election year so you will be able to model what you think is rational or irrational behavior on the part of voters in the U.S. 😉
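If you want a concrete taste of what "representing games and strategies" looks like, here is a minimal sketch (mine, not course material) of a two-player normal-form game in Python, with a brute-force search for pure-strategy Nash equilibria. The payoffs are the classic Prisoner's Dilemma, used only for illustration.

    # A minimal sketch (not from the course) of a two-player normal-form game
    # and a brute-force enumeration of its pure-strategy Nash equilibria.
    payoffs = {
        # (row strategy, column strategy): (row payoff, column payoff)
        ("cooperate", "cooperate"): (-1, -1),
        ("cooperate", "defect"):    (-3,  0),
        ("defect",    "cooperate"): ( 0, -3),
        ("defect",    "defect"):    (-2, -2),
    }
    strategies = ["cooperate", "defect"]

    def is_nash(row, col):
        """True if neither player gains by deviating unilaterally."""
        r_pay, c_pay = payoffs[(row, col)]
        best_row = all(payoffs[(r, col)][0] <= r_pay for r in strategies)
        best_col = all(payoffs[(row, c)][1] <= c_pay for c in strategies)
        return best_row and best_col

    equilibria = [(r, c) for r in strategies for c in strategies if is_nash(r, c)]
    print(equilibria)  # [('defect', 'defect')]

Mutual defection is the only equilibrium even though both players would be better off cooperating, which is exactly the kind of strategic tension the course is about.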

The requirements:

You must be comfortable with mathematical thinking and rigorous arguments. Relatively little specific math is required; you should be familiar with basic probability theory (for example, you should know what a conditional probability is) and with basic calculus (for instance, taking a derivative).

For those of you not familiar with game theory, I think the course will be useful in teaching you a different way to view the world. Not necessarily more or less accurate than other ways, just different.

Being able to adopt a different world view and see its intersections with other world views is a primary skill in crossing domain borders for new insights or information. The more world views you learn, the better you may become at seeing intersections of world views.

Neo4j Koans – How do I begin?

Filed under: Neo4j — Patrick Durusau @ 8:03 pm

Neo4j Koans – How do I begin?

An excellent post documenting how to set up and then work through the Neo4j koans. Documentation that did not accompany the koan files themselves.

In their defense, a little anyway, I think the authors assume we know as much about the project as they do, so they only document what they themselves need help remembering. We need that too, but most of us could use a bit more documentation than the software's creators.

In true open source fashion, Michael has stepped up and supplied the missing documentation and more with this post!

Onward to the Koans!

IR – Foundation?

Filed under: Information Retrieval,Semantics — Patrick Durusau @ 8:02 pm

I find the following statement troubling. See if you can spot what's missing from it:

In terms of research, the area may be studied from two rather distinct and complementary points of view: a computer-centered one and a human-centered one. In the computer-centered view, IR consists mainly of building up efficient indexes, processing user queries with high performance, and developing ranking algorithms to improve the results. In the human-centered view, IR consists mainly of studying the behavior of the user, understanding their main needs, and of determining how such understanding affects the organization and operation of the retrieval system. In this book, we focus mainly on the computer-centered view of IR, which is dominant in academia and in the market place. (page 1, Modern Information Retrieval, 2nd ed., Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Pearson 2011)

I am not challenging the accuracy of the statement, although I might explain some of it differently than the authors do.

The terminology by which computer-centered IR is described is one clue: "…efficient…, …high performance…, …improve the results." That is, computer-centered IR is mostly concerned with measurable results, things to which we can put numbers so that one result can be ranked higher than another. Nothing wrong with that; personally I have a great deal of interest in such approaches.

Human-centered IR is described in terms of: "…behavior…, …needs…, …understanding…organization and operation…." Human-centered IR is mostly concerned with how users perform IR, which is not as measurable but is just as important as computer-centered IR. As the authors point out, computer-centered IR dominates in academia and in the marketplace, I suspect because what can be easily measured is more attractive.

Do you notice something missing yet?

I thought it was quite remarkable that semantics weren't mentioned. That is, whichever computer- or human-centered approach you take, its efficacy is going to vary with the semantics of the language on which IR is being performed. If that seems like an odd claim, consider the utility of an IR system that does not properly sort European, much less Asian, words, whether written in their own scripts or in transliteration.

True enough, we can build a very fast IR system that simply ignores the correct sort orders for such languages, and in the past we have taught readers of those languages to accept what the IR system provided. So the behavior of the users was adapted to the systems. Human-centered, I suppose, but not in the way I usually think about it.
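The sort-order point fits in a few lines of Python (a sketch of mine, not from the book, and it assumes a German locale such as de_DE.UTF-8 is installed on the system): naive code-point sorting pushes accented words past "z", while locale-aware collation puts them where a reader expects them.

    # Naive code-point sorting vs. locale-aware collation.
    # Assumes the "de_DE.UTF-8" locale is installed; otherwise setlocale raises.
    import locale

    words = ["Zebra", "Äpfel", "Apfel", "Österreich", "Ozean"]

    print(sorted(words))                      # code-point order: accented words sort after 'Zebra'
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))  # the order a German reader expects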

And, after all, semantics are the reason we want to do IR in the first place. If the contents we were searching had no semantics, it is very unlikely we would want to search them at all, no matter how efficient or well organized a system might be.

My real concern is that semantics are being assumed as a matter of course. We all “know” the semantics. Hardly worth discussing. But that is why search results so seldom meet our expectations. We didn’t discuss the semantics up front. Everyone from system architect, programmer, UI designer, content author, all the way to and including the searcher, “knew” the semantics.

Trouble is, the semantics they “know,” are often different.

Of course the authors are free to include or exclude any content they wish, and to fully cover semantic issues in general would require a volume at least as long as this one (a little over 900 pages with the index).

I would start with something like:

to make the point that we always start with languages and semantics and that data/texts are recorded in systems using languages and semantics. Our data structures are not neutral bystanders. They determine as much of what we can find as they determine how we will interpret it.

Try running a modern genealogy for someone, and when you find an arrest record of a close relative for being a war criminal or child molester, see if the family wants that included. Suddenly that will be more important than other prizes or honors they have won. Still the same person, but the label on the data, "arrest record," makes us suspect the worst. Had it read: "False Arrests, a record of false charges during the regime of XXX," we would likely react differently.

I am going to use Baeza-Yates and Ribeiro-Neto as one of the required texts in the next topic maps class. So we can cover some of the mining techniques that will help populate topic maps.

But I will also cover the issue of languages/semantics as well as data/texts (in how they are stored and the semantics of the same).

Does anyone have a favorite single volume on languages/semantics? I would lean towards Doing What Comes Naturally by Stanley Fish, but I am sure there are other volumes equally good.

The data/text formats and their semantics are likely to be harder to come by. I don't know of anything offhand that focuses on that in a monograph-length treatment. Suggestions?

PS: I know I got the image wrong but I am about to post. I will post a slightly amended image tomorrow when I have thought about it some more.

Don’t let that deter you from posting criticisms of the current image in the meantime.

Translation Memory

Filed under: Memory,Translation,Translation Memory — Patrick Durusau @ 7:59 pm

Translation Memory

As we mentioned in Teaching Etsy to Speak a Second Language, developers need to tag English content so it can be extracted and then translated. Since we are a company with a continuous deployment development process, we do this on a daily basis and as a result get a significant number of new messages to be translated along with changes or deletions of existing ones that have already been translated. Therefore we needed some kind of recollection system to easily reuse or follow the style of existing translations.

A translation memory is an organized collection of text extracted from a source language with one or more matching translations. A translation memory system stores this data and makes it easily accessible to human translators in order to assist with their tasks. There’s a variety of translation memory systems and related standards in the language industry. Yet, the nature of our extracted messages (containing relevant PHP, Smarty, and JavaScript placeholders) and our desire to maintain a translation style curated by a human language manager made us develop an in-house solution.

Go ahead, read the rest of the post, I’ll wait.

Interesting yes?

What if the title of my post were identification memory?

There is not really that much difference between translating language to language and mapping identification to identification, when we are talking about the same subject.

Hardly any difference at all when you think about it.

I am sure your current vendors will assure you their methods of identification are the best and they may be right. But on the other hand, they may also be wrong.

And there is always the issue of other data sources that have chosen to identify the same subjects differently. Like your own company down the road, say five years from now. Preparing now for that "translation" project in the not-too-distant future may save you from losing critical information.

Preserving access to critical data is a form of translation memory. Yes?
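Here is a hypothetical sketch of what an "identification memory" might look like: just as a translation memory stores source segments with their known translations, an identification memory stores the different identifiers systems have used for the same subject, so old data stays findable. The class and identifier names below are invented for illustration, not any vendor's API.

    # A hypothetical "identification memory": many identifiers, one subject.
    class IdentificationMemory:
        def __init__(self):
            self._subject_of = {}    # identifier -> canonical subject key
            self._identifiers = {}   # canonical subject key -> set of identifiers

        def record(self, subject, identifier):
            """Remember that `identifier` has been used for `subject`."""
            self._subject_of[identifier] = subject
            self._identifiers.setdefault(subject, set()).add(identifier)

        def lookup(self, identifier):
            """Return every identifier known for the subject behind `identifier`."""
            subject = self._subject_of.get(identifier)
            return self._identifiers.get(subject, {identifier})

    im = IdentificationMemory()
    im.record("acme-customer-42", "CUST-42")        # today's ERP key
    im.record("acme-customer-42", "0042-ACME-EU")   # the key in use five years from now
    print(im.lookup("0042-ACME-EU"))                # {'CUST-42', '0042-ACME-EU'}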

Introducing Shep

Filed under: Hadoop,Shep,Splunk — Patrick Durusau @ 7:57 pm

Introducing Shep

From the post:

These are exciting times at Splunk, and for Big Data. During the 2011 Hadoop World, we announced our initiative to combine Splunk and Hadoop in a new offering. The heart of this new offering is an open source component called Shep. Shep is what will enable seamless two-way data-flow across the systems, as well as opening up two-way compute operations across data residing in both systems.

Use Cases

The thing that intrigues us most is the synergy between Splunk and Hadoop. The ways to integrate are numerous, and as the field evolves and the project progresses, we can see more and more opportunities to provide powerful solutions to common problems.

Many of our customers are indexing terabytes per day, and have also spun up Hadoop initiatives in other parts of the business. Splunk integration with Hadoop is part of a broader goal at Splunk to break down barriers to data-silos, and open them up to availability across the enterprise, no matter what the source. To itemize some categories we’re focused on, listed here are some key use cases:

  • Query both Splunk and Hadoop data, using Splunk as a “single-pane-of-glass”
  • Data transformation utilizing Splunk search commands
  • Real-time analytics of data streams going to multiple destinations
  • Splunk as data warehouse/marts for targeted exploration of HDFS data
  • Data acquisition from logs and apis via Splunk Universal Forwarder

Read the post to learn the features that are supported now or soon will be in Shep.

Now in private beta but it sounds worthy of a “heads up!”

December 5, 2011

Automated Security Remediation On The Rise

Filed under: Security — Patrick Durusau @ 7:54 pm

Automated Security Remediation On The Rise

APTs and other types of sophisticated attacks are undoubtedly changing information security processes, technologies, and skills, but ESG found another interesting transition in progress: Given the volume, sophistication, and surreptitious nature of APTs, large organizations are apparently willing to adopt more automated security technologies as a means for attack remediation. ESG’s recently published research report on APTs indicates that 20% of enterprises believe this development will happen “to a great extent” while another 54% say this will happen “to some extent.” (See this link for more information about the ESG Research Report, U.S. Advanced Persistent Threat Analysis).

I think this was the link omitted from the article: http://www.enterprisestrategygroup.com/2011/11/apt/

Guess what the #2 requirement was?

Reputation data must play a role. Aside from internal network analysis, security intelligence must understand if a source/destination IP address, URL, application, DNS record, or file is known to be suspicious or malicious. Reputation data from Blue Coat, Check Point, Cisco, and Trend Micro must be part of the mix.

Err, how about who owns the IP address, DNS record, etc. and links to information on them?

Assume you are sitting on reports from credit reporting agencies (domestic intelligence agencies), DNS records, and information from "other" sources that you can sell upstream to enterprise security vendors. Does that sound like a topic map based startup to you?
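As a rough sketch of the kind of linking being suggested, imagine a reputation feed and an ownership table both keyed by IP address; the join is the interesting part. The data, field names, and addresses below are invented for illustration (the IPs come from the documentation ranges), not from any real feed.

    # Hypothetical: join reputation verdicts with ownership records by IP.
    reputation = {
        "203.0.113.7":  {"verdict": "malicious",  "source": "feed-A"},
        "198.51.100.9": {"verdict": "suspicious", "source": "feed-B"},
    }
    ownership = {
        "203.0.113.7": {"registrant": "Example Hosting Ltd", "asn": "AS64500"},
    }

    def enrich(ip):
        """Merge the reputation verdict with whatever ownership data we hold."""
        record = {"ip": ip}
        record.update(reputation.get(ip, {}))
        record.update(ownership.get(ip, {"registrant": "unknown"}))
        return record

    print(enrich("203.0.113.7"))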

What is the “hashing trick”?

Filed under: Hashing,Machine Learning — Patrick Durusau @ 7:53 pm

What is the “hashing trick”?

I suspect this question:

I’ve heard people mention the “hashing trick” in machine learning, particularly with regards to machine learning on large data.

What is this trick, and what is it used for? Is it similar to the use of random projections?

(Yes, I know that there’s a brief page about it here. I guess I’m looking for an overview that might be more helpful than reading a bunch of papers.)

comes up fairly often. The answer given is unusually helpful, so I wanted to point it out here.
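For readers who want the one-paragraph version in code, here is a minimal sketch of feature hashing, a generic illustration rather than any particular library's implementation: feature names are hashed straight to column indexes in a fixed-width vector, so no feature dictionary has to be built or stored.

    # The "hashing trick": hash feature names to indexes in a fixed-width vector.
    import hashlib

    def hashed_features(tokens, dim=2**20):
        """Map a bag of tokens to a sparse vector of length `dim`."""
        vec = {}
        for tok in tokens:
            digest = hashlib.md5(tok.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "little") % dim
            sign = 1 if digest[4] % 2 == 0 else -1   # optional sign hash reduces bias
            vec[index] = vec.get(index, 0) + sign
        return vec

    print(hashed_features("the hashing trick maps features to indexes".split()))

Collisions are possible but rare for reasonable dimensions, and the memory savings are what make the trick attractive for large data.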

Released OrientDB v1.0rc7

Filed under: Graphs,NoSQL,OrientDB — Patrick Durusau @ 7:51 pm

Released OrientDB v1.0rc7: Improved transactions and first Multi-Master replication (alpha)

From the post:

Hi all, after about 2 months a new release is available for all: OrientDB 1.0rc7.

OrientDB embedded and server: http://code.google.com/p/orient/downloads/detail?name=orientdb-1.0rc7.zip
OrientDB Graph(ed): http://code.google.com/p/orient/downloads/detail?name=orientdb-graphed-1.0rc7.zip

According to the community answer this release should contains the new management of links using the OMVRB-Tree, but it’s in alpha stage yet and will be available in this week as 1.0rc8-SNAPSHOT. I preferred to release something really stable with all the 34 issues fixed (more below) till now. Furthermore tomorrow the TinkerPop team will release the new version of its amazing technology stack (Blueprints, Gremlin, etc.) and we couldn’t miss the chance to include latest release of OrientDB with it, don’t you?

Thanks to all the contributors every weeks more!

Changes

  • Transactions: Improved speed, up to 500x! (issue 538)
  • New Multi-Master replication (issue 589). Will be final in the next v1.0
  • SQL insert supports MAP syntax (issue 582), new date() function
  • HTTP interface: JSONP support (issue 587), new create database (issue 566), new import/export database (issue 567, 568)
  • Many bugs fixed, 34 issues in total

Full list: http://code.google.com/p/orient/issues/list?can=1&q=label%3Av1.0rc7

Thanks Luca!

US intelligence group seeks Machine Learning breakthroughs

Filed under: Funding,Intelligence — Patrick Durusau @ 7:50 pm

US intelligence group seeks Machine Learning breakthroughs

From the post:

Machine Learning technology is found in everything from spam detection programs to intelligent thermostats, but can the technology make a huge leap to handle the exponentially larger amounts of information and advanced applications of the future?

Researchers from the government’s cutting edge research group, the Intelligence Advanced Research Projects Activity (IARPA), certainly hope so and this week announced that they are looking to the industry for new ideas that may become the basis for cutting edge Machine Learning projects.


From IARPA: The focus of our request for information is on recent advances toward automatic machine learning, including automation of architecture and algorithm selection and combination, feature engineering, and training data scheduling for usability by non-experts, as well as scalability for handling large volumes of data. Machine Learning is used extensively in application areas of interest, including speech, language, vision, sensor processing, and the ability to meld that data into a single, what IARPA calls "multi-modal," system.

“In many application areas, the amount of data to be analyzed has been increasing exponentially (sensors, audio and video, social network data, web information) stressing even the most efficient procedures and most powerful processors. Most of these data are unorganized and unlabeled and human effort is needed for annotation and to focus attention on those data that are significant,” IARPA stated.

This could be interesting, depending on how the interface is developed. What if the system actually learned from its users while it was being used? So that not only would it provide faster access to more accurate information, it would "learn" how to do its job better from the analysts using the software.

Especially if part of that “learning” was on what basis to merge information from disparate sources.

Note: Responses to the RFI are due by 27 January 2012.

Graphs Ensure Something from Nothing

Filed under: Graphs — Patrick Durusau @ 7:49 pm

Graphs Ensure Something from Nothing by Marko Rodriguez.

From the post:

There is a reason that there is something as opposed to nothing. However, it is simple for something to be nothing if that something is but one thing. One thing could be nothing even if it is not no thing. For that one thing must be both that which is called something and nothing.

I’m not sure about this one. But I pass it on just in case it means something to you. 😉 Sorry!

Talend 5

Filed under: Data Governance,Data Integration — Patrick Durusau @ 7:48 pm

Talend 5

Talend 5 consists of:

  • Talend Open Studio for Data Integration (formerly Talend Open Studio), the most widely used open source data integration/ETL tool in the world.
  • Talend Open Studio for Data Quality (formerly Talend Open Profiler), the only open source enterprise data profiling tool.
  • Talend Open Studio for MDM (formerly Talend MDM Community Edition), the first – and only – open source MDM solution.
  • Talend Open Studio for ESB (formerly Talend ESB Studio Standard Edition), the easy to use open source ESB based on leading Apache Software Foundation integration projects.

From BusinessWire article.

Massive file downloads running now.

Are you using Talend? Thoughts/suggestions on testing/comparisons?

Sharing and Integrating Ontologies

Filed under: Logic,Ontology — Patrick Durusau @ 7:44 pm

Sharing and Integrating Ontologies

Peter Yim, organizer and promoter of all things ontological, reminded me of this effort quite recently.

If you were constrained by:

The semantics defined by ISO/IEC 24707 for Common Logic should be the basis for the logics used to define ontologies.

could you still write a topic map?

My suggestion would be yes, since a topic map is “without form and void” prior to being filled in by an author.

True, prior to doing that “filling in,” you will have selected a form to fill in, that is a data model (we call them legends) so your topic map already has some rules and topics in place before you start.

But, the fact remains you could write a topic map using the semantics of ISO/IEC 24707 and identify those semantics so that ontologies could be mapped to them.

Medical Text Indexer (MTI)

Filed under: Bioinformatics,Biomedical,Indexing — Patrick Durusau @ 7:42 pm

Medical Text Indexer (MTI) (formerly the Indexing Initiative System (IIS))

From the webpage:

The MTI system consists of software for applying alternative methods of discovering MeSH headings for citation titles and abstracts and then combining them into an ordered list of recommended indexing terms. The top portion of the diagram consists of three paths, or methods, for creating a list of recommended indexing terms: MetaMap Indexing, Trigrams and PubMed Related Citations. The first two paths actually compute UMLS Metathesaurus® concepts which are passed to the Restrict to MeSH process. The results from each path are weighted and combined using the Clustering process. The system is highly parameterized not only by path weights but also by several parameters specific to the Restrict to MeSH and Clustering processes.

A prototype MTI system described below had two additional indexing methods which were removed because their results were subsumed by the three remaining methods.
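To illustrate the weighting-and-combination step in miniature (a rough sketch only; the terms, scores, weights, and the simple summation below are invented and are not NLM's actual Clustering process): each indexing path proposes headings with scores, each path gets a weight, and the merged scores give one ordered list of recommendations.

    # Hypothetical weighted combination of recommendations from three paths.
    path_results = {
        "metamap":           {"Neoplasms": 0.9, "Genes": 0.4},
        "trigrams":          {"Neoplasms": 0.6, "Mutation": 0.5},
        "related_citations": {"Genes": 0.7, "Neoplasms": 0.3},
    }
    path_weights = {"metamap": 0.5, "trigrams": 0.2, "related_citations": 0.3}

    combined = {}
    for path, terms in path_results.items():
        for term, score in terms.items():
            combined[term] = combined.get(term, 0.0) + path_weights[path] * score

    recommended = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    print(recommended)  # ordered list of recommended indexing terms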

Deeply interesting and relevant work to topic maps.

MetaMap Portal

Filed under: Bioinformatics,Biomedical,MetaMap,Metathesaurus — Patrick Durusau @ 7:41 pm

MetaMap Portal

About MetaMap:

MetaMap is a highly configurable program developed by Dr. Alan (Lan) Aronson at the National Library of Medicine (NLM) to map biomedical text to the UMLS Metathesaurus or, equivalently, to discover Metathesaurus concepts referred to in text. MetaMap uses a knowledge-intensive approach based on symbolic, natural-language processing (NLP) and computational-linguistic techniques. Besides being applied for both IR and data-mining applications, MetaMap is one of the foundations of NLM’s Medical Text Indexer (MTI) which is being used for both semiautomatic and fully automatic indexing of biomedical literature at NLM. For more information on MetaMap and related research, see the SKR Research Information Site.

Improvement in the October 2011 Release:

MetaMap2011 includes some significant enhancements, most notably algorithmic improvements that enable MetaMap to very quickly process input text that had previously been computationally intractable.

These enhancements include:

  • Algorithmic Improvements
  • Candidate Set Pruning
  • Re-Organization of Additional Data Models
  • Single-character alphabetic tokens
  • Improved Treatment of Apostrophe-“s”
  • New XML Command-Line Options
  • Numbered Mappings
  • User-Defined Acronyms and Abbreviations

Starting with MetaMap 2011, MetaMap is now available for Windows XP and Windows 7.

One of several projects that sound very close to being topic map mining programs.

MTI ML

Filed under: Machine Learning — Patrick Durusau @ 4:31 pm

MTI ML

From the webpage:

This package provides machine learning algorithms optimized for large text categorization tasks and is able to combine several text categorization solutions. The advantages of this package compared to existing approaches are: 1) its speed, 2) it is able to work with a large number of categorization problems and, 3) it provides the ability to compare several text categorization tools based on meta-learning. This website describes how to download, install and run MTI ML. An example data set is provided to verify the installation of the tool. More detailed instructions on using the tool are available here.

As usual with NIH projects, high quality work, lots of data.

December 4, 2011

Clustering Large Attributed Graphs: An Efficient Incremental Approach

Filed under: Algorithms,Clustering,Graphs — Patrick Durusau @ 8:19 pm

Clustering Large Attributed Graphs: An Efficient Incremental Approach by Yang Zhou, Hong Cheng, and Jeffrey Xu Yu. (PDF file)

Abstract:

In recent years, many networks have become available for analysis, including social networks, sensor networks, biological networks, etc. Graph clustering has shown its effectiveness in analyzing and visualizing large networks. The goal of graph clustering is to partition vertices in a large graph into clusters based on various criteria such as vertex connectivity or neighborhood similarity. Many existing graph clustering methods mainly focus on the topological structures, but largely ignore the vertex properties which are often heterogeneous. Recently, a new graph clustering algorithm, SA-Cluster, has been proposed which combines structural and attribute similarities through a unified distance measure. SA-Cluster performs matrix multiplication to calculate the random walk distances between graph vertices. As the edge weights are iteratively adjusted to balance the importance between structural and attribute similarities, matrix multiplication is repeated in each iteration of the clustering process to recalculate the random walk distances which are affected by the edge weight update.

In order to improve the efficiency and scalability of SA-Cluster, in this paper, we propose an efficient algorithm Inc-Cluster to incrementally update the random walk distances given the edge weight increments. Complexity analysis is provided to estimate how much runtime cost Inc-Cluster can save. Experimental results demonstrate that Inc-Cluster achieves significant speedup over SA-Cluster on large graphs, while achieving exactly the same clustering quality in terms of intra-cluster structural cohesiveness and attribute value homogeneity.
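A back-of-the-envelope sketch of the computation the paper is optimizing: l-step random walk probabilities are powers of the row-normalized transition matrix, and a restart-weighted sum of those powers gives pairwise random-walk proximities, so recomputing them whenever edge weights change means repeated matrix multiplication. The graph and the weighting below are generic illustrations, not the paper's exact formulation.

    # Random-walk proximities as a weighted sum of transition-matrix powers.
    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)      # adjacency matrix
    P = A / A.sum(axis=1, keepdims=True)           # row-normalized transition matrix

    c, L = 0.2, 4                                  # restart probability, max walk length
    R = np.zeros_like(P)
    P_l = np.eye(len(P))
    for l in range(1, L + 1):
        P_l = P_l @ P                              # the repeated matrix multiplication
        R += c * (1 - c) ** l * P_l                # weight longer walks less

    print(np.round(R, 3))                          # pairwise random-walk proximities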

Seeing this reminded me that I need to review the other papers presented at the 2010 IEEE International Conference on Data Mining. The problem is that papers that seemed the most relevant at one time don't seem as relevant six months later. Same papers, same person looking at them. The passage of time and other papers, I suspect.

Graph algorithms continue to improve. If you are working with large graphs, I suggest you give some time to this paper.

Corpus of Erotica Stories

Filed under: Erotica,Text Corpus — Patrick Durusau @ 8:18 pm

Corpus of Erotica Stories from InfoChimps.

From the webpage:

Excellent resource for working with natural language processing and machine learning. This corpus consists of 4771 raw text erotica stories collected from www.textfiles.com/sex/EROTICA. A logical flow from the encouragement of writing on BBSes, people have been writing some form of erotica or sexual narrative for others for quite some time. With the advent of Fidonet and later Usenet, these stories achieved wider and wider distribution. Unfortunately, the nature of erotica is that it is often uncredited, undated, and hard to fix in time. As a result, you might be looking at stories much older or much newer than you might think.

Well, you have been looking for an interesting text for NLP and machine learning. Here’s your chance.

The subjects just abound.

One imagines the same could be done with an appropriate Twitter stream and writing it to a file.

Math Documentaries

Filed under: Mathematics — Patrick Durusau @ 8:18 pm

Math Documentaries

Thirty-six documentaries about mathematics.

Question: If compelling and interesting documentaries can be made about mathematics, why don’t we have a collection of documentaries about semantics, subject identity and similar topics?

Or are such documentaries out there and I have simply overlooked them? (Entirely possible since I don’t as a rule watch much TV.)

Suggestions/comments?

Oh, I didn’t list this simply to complain about the lack of semantic documentaries; I think these are good to recommend, particularly to young people. Understanding when math is being used to lie is as important a skill as knowing mathematics, if not more so.

ParaView

Filed under: Graphics,Visualization — Patrick Durusau @ 8:18 pm

ParaView

From the homepage:

ParaView is an open-source, multi-platform data analysis and visualization application. ParaView users can quickly build visualizations to analyze their data using qualitative and quantitative techniques. The data exploration can be done interactively in 3D or programmatically using ParaView’s batch processing capabilities.

ParaView was developed to analyze extremely large datasets using distributed memory computing resources. It can be run on supercomputers to analyze datasets of terascale as well as on laptops for smaller data.

You can see a summary of its features. I didn’t see it until late in the weekend but even if I had seen it early last week, I would not have had time to summarize its features or to point out which ones are the most relevant to topic maps.

I am going to install the software and work my way through it and pop up on occasion with updates. Feel free to contribute your insights.

Translating math into code with examples in Java, Racket, Haskell and Python

Filed under: Haskell,Java,Mathematics,Python — Patrick Durusau @ 8:17 pm

Translating math into code with examples in Java, Racket, Haskell and Python by Matthew Might.

Any page that claims Okasaki’s Purely Functional Data Structures as an “essential reference” has to be interesting.

And…, it turns out to be very interesting!

If I have a complaint, it is that it ended too soon! See what you think.

All the software a geoscientist needs. For free!

Filed under: Geo Analytics,Geographic Data,Geographic Information Retrieval — Patrick Durusau @ 8:17 pm

All the software a geoscientist needs. For free! by John A. Stevenson.

It is quite an impressive list and what’s more, John has provided a script to install it on a Linux machine.

If you have any mapping or geoscience type needs, you would do well to consider some of the software listed here.

A handy set of tools if you are working with geoscience types on topic map applications as well.

CS545: Machine Learning (Fall 2011)

Filed under: Machine Learning,Python — Patrick Durusau @ 8:17 pm

CS545: Machine Learning (Fall 2011)

From the Overview page:

In this class you will learn about a variety of approaches to using a computer to discover patterns in data. The approaches include techniques from statistics, linear algebra, and artificial intelligence. Students will be required to solve written exercises, implement a number of machine learning algorithms and apply them to sets of data, and hand in written reports describing the results.

For implementations, we will be using Python. You may download and install Python on your computer, and work through the on-line tutorials to help prepare for this course. For the written reports, we will be using LaTeX, a document preparation system freely available on all platforms.

There has always been a lot of CS stuff online but the last couple of years it seems to have exploded. Python jockeys will like this one.

Mongoid_fulltext

Filed under: MongoDB,N-Grams — Patrick Durusau @ 8:16 pm

Mongoid_fulltext: full-text n-gram search for your MongoDB models by Daniel Doubrovkine.

From the post:

We’ve been using mongoid_search for sometime now for auto-complete. It’s a fine component that splits sentences and uses MongoDB to index them. Unfortunately it doesn’t rank them, so results come in order of appearance. In contrast, mongoid-fulltext uses n-gram matching (with n=3 right now), so we index all of the substrings of length 3 from text that we want to search on. If you search for “damian hurst”, mongoid_fulltext does lookups for “dam”, “ami”, “mia”, “ian”, “an “, “n h”, ” hu”, “hur”, “urs”, and “rst” and combines the results to get a most likely match. This also means users can make simple spelling mistakes and still find something relevant. In addition, you can index multiple collections in a single index, producing best matching results within several models. Finally, mongoid-fulltext leverages MongoDB native indexing and map-reduce.
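The n-gram step is easy to picture with a few lines of Python (a sketch of the general technique, not the gem's own code); it reproduces exactly the trigrams listed in the quote.

    # Overlapping character n-grams, as used for fuzzy full-text matching.
    def ngrams(text, n=3):
        """Return the list of overlapping character n-grams in text."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(ngrams("damian hurst"))
    # ['dam', 'ami', 'mia', 'ian', 'an ', 'n h', ' hu', 'hur', 'urs', 'rst']

At query time the same function is applied to the search string, and documents sharing the most trigrams rank highest, which is why small spelling mistakes still find relevant matches.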

And see: https://github.com/aaw/mongoid_fulltext.

Might want to think about this for your next text input by user application.

FACTA

Filed under: Associations,Bioinformatics,Biomedical,Concept Detection,Text Analytics — Patrick Durusau @ 8:16 pm

FACTA – Finding Associated Concepts with Text Analysis

From the Quick Start Guide:

FACTA is a simple text mining tool to help discover associations between biomedical concepts mentioned in MEDLINE articles. You can navigate these associations and their corresponding articles in a highly interactive manner. The system accepts an arbitrary query term and displays relevant concepts on the spot. A broad range of concepts are retrieved by the use of large-scale biomedical dictionaries containing the names of important concepts such as genes, proteins, diseases, and chemical compounds.

A very good example of an exploration tool that isn’t overly complex to use.

Digital Methods

Filed under: Graphs,Interface Research/Design,Web Applications,WWW — Patrick Durusau @ 8:16 pm

Digital Methods

From the website:

Welcome to the Digital Methods course, which is a focused section of the more expansive Digital Methods wiki. The Digital Methods course consists of seven units with digital research protocols, specially developed tools, tutorials as well as sample projects. In particular this course is dedicated to how else links, Websites, engines and other digital objects and spaces may be studied, if methods were to follow the medium, as opposed to importing standard methods from the social sciences more generally, including surveys, interviews and observation. Here digital methods are central. Short literature reviews are followed by distinctive digital methods approaches, step-by-step guides and exemplary projects.

Jack Park forwarded this link. A site that merits careful exploration. You will find things that you did not expect. Much like using the WWW. 😉

Curious what parts of it you find to be the most useful/interesting?

The section on digital tools is my current favorite. I suspect that may change as I continue to explore the site.

Enjoy!

Power Modeling And Querying with Neo4j

Filed under: Cypher,Neo4j — Patrick Durusau @ 8:15 pm

Power Modeling And Querying with Neo4j

From the description:

Neo4j 1.5 has just been released, with another batch of features and enhancements. In this talk, Alistair Jones will demonstrate two significant changes. First we’ll look at the new custom visualisation support, that turns Neo4j Server into a sophisticated analysis tool. Then we’ll turn to the Cypher query language, first introduced in Neo4j 1.4, and now beefed up with even more powerful graph-oriented features. Alistair will demonstrate how simple cypher queries can now find answers that would otherwise require a lot of code, or which would have been nearly impossible in a relational database.

Neo4j 1.5 highlights:

  • Kernel: better, smaller property storage
  • Web admin: custom visualization
  • Cypher: more powerful queries

Well, as one watcher of the video, I was out of luck: the kernel details weren’t covered! 😉 I understand the reasoning and time constraints, but hard core presentations are appreciated as well.

The style for displays looks quite interesting but overly complicated. Suggestion: Names should be the defaults for nodes and edges, not a user defined style.

If attendees are thought to have trouble reading screens, consider the plight of people watching the podcast. Capture the screen separately.

On Cypher, “more logical to read” isn’t a good marketing point; “easier to follow than the SQL query format” would be.

Expect someone to write a SQL-like DSL if Neo4j does not.

Variable Length Paths – an ability that surpasses SQL!

Jump to time mark: 52:53 (or a bit before, I had trouble with the back/forward control):

  • Any relationship
  • Directed
  • Typed
  • Limited length
  • Shortest path

The entire video is quite nice but watch this part if no other.

Applications will suggest themselves.

Unified Graph Views with Multigraph

Filed under: Graphs,Merging — Patrick Durusau @ 8:15 pm

Unified Graph Views with Multigraph by Joshua Shinavier.

From the post:

Next in line in this parade of new Blueprints utilities: a Graph implementation called MultiGraph which has just been pushed to Tinkubator:

https://github.com/tinkerpop/tinkubator

MultiGraph wraps multiple, lower-level Graph implementations and provides a combined graph view, unifying vertices and edges by id. So, for example, if you have a vertex with an id of “Arthur” in graph #1 and another vertex with an id of “Arthur” in graph #2, and you put those graphs into a MultiGraph, the unified vertex with the id “Arthur” will have all of the properties of either vertex, as well as all of the edges to or from either vertex. Any vertices and edges which exist in some graphs but not in others will also exist in the MultiGraph view.

Using ids to trigger merging, and precedence to cope with conflicting values for edges, is a start. I don’t think “precedence” is going to be very robust in the long run, but every project has to start somewhere. Preserving provenance after merging is likely to be a requirement in many applications.
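A hypothetical sketch of the merge-by-id idea, with plain Python dictionaries of vertex properties standing in for Blueprints graphs (this is not the MultiGraph API, and only the property-merging side is shown): the same vertex id in two graphs becomes one unified vertex, with a precedence order, here simply "later graph wins", settling conflicting values.

    # Unify vertices by id across graphs; later graphs take precedence on conflicts.
    g1 = {"Arthur": {"species": "human", "home": "Earth"}}
    g2 = {"Arthur": {"home": "Heart of Gold"}, "Ford": {"species": "betelgeusian"}}

    def unify(graphs):
        """Combine vertex property maps, keyed by vertex id."""
        unified = {}
        for graph in graphs:
            for vertex_id, props in graph.items():
                unified.setdefault(vertex_id, {}).update(props)
        return unified

    print(unify([g1, g2]))
    # {'Arthur': {'species': 'human', 'home': 'Heart of Gold'},
    #  'Ford': {'species': 'betelgeusian'}}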

This and similar discussions happen at the Gremlin-Users group.

December 3, 2011

A Path Algebra for Multi-Relational Graphs

Filed under: Graphs,Multi-Relational,Neo4j,Path Algebra — Patrick Durusau @ 8:23 pm

A Path Algebra for Multi-Relational Graphs by Marko A. Rodriguez, Peter Neubauer.

Abstract:

A multi-relational graph maintains two or more relations over a vertex set. This article defines an algebra for traversing such graphs that is based on an $n$-ary relational algebra, a concatenative single-relational path algebra, and a tensor-based multi-relational algebra. The presented algebra provides a monoid, automata, and formal language theoretic foundation for the construction of a multi-relational graph traversal engine.

Only four (4) pages, but it is heavy sledding from the first paragraph to the last. 😉 Still, if you want a peek at what fine minds like Rodriguez and Neubauer think about when they see Neo4j and its future, this will be worth the effort.
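The core intuition is easy to sketch: keep one adjacency matrix per relation (stacked together, a 3-way tensor) and compose relations along a path with matrix multiplication. The example below is a generic illustration of that idea, not the algebra defined in the article, and the vertex and relation names are invented.

    # A multi-relational graph as one adjacency matrix per relation;
    # composing relations along a path is matrix multiplication.
    import numpy as np

    # vertices: 0, 1, 2 (names are illustrative)
    created = np.array([[0, 1, 0],
                        [0, 0, 0],
                        [0, 0, 0]])        # 0 --created--> 1
    depends_on = np.array([[0, 0, 0],
                           [0, 0, 1],
                           [0, 0, 0]])     # 1 --depends_on--> 2

    # "created then depends_on": which vertices are connected by that two-step path?
    composed = created @ depends_on
    print(composed)                        # nonzero at [0, 2]: a path from 0 to 2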

Evolutionary Subject Tagging in the Humanities…

Filed under: Classification,Digital Culture,Digital Library,Humanities,Tagging — Patrick Durusau @ 8:22 pm

Evolutionary Subject Tagging in the Humanities: Supporting Discovery and Examination in Digital Cultural Landscapes by Jack Ammerman, Vika Zafrin, Dan Benedetti, and Garth W. Green.

Abstract:

In this paper, the authors attempt to identify problematic issues for subject tagging in the humanities, particularly those associated with information objects in digital formats. In the third major section, the authors identify a number of assumptions that lie behind the current practice of subject classification that we think should be challenged. We move then to propose features of classification systems that could increase their effectiveness. These emerged as recurrent themes in many of the conversations with scholars, consultants, and colleagues. Finally, we suggest next steps that we believe will help scholars and librarians develop better subject classification systems to support research in the humanities.

Truly remarkable piece of work!

Just to entice you into reading the entire paper, the authors challenge the assumption that knowledge is analogue. Successfully in my view but I already held that position so I was an easy sell.

BTW, if you are in my topic maps class, this paper is required reading. Summarize what you think are the strong/weak points of the paper in 2 to 3 pages.

A quick study of Scholar-ly Citation

Filed under: HCIR,Interface Research/Design,Searching — Patrick Durusau @ 8:22 pm

A quick study of Scholar-ly Citation by Gene Golovchinsky.

From the post:

Google recently unveiled Citations, its extension to Google Scholar that helps people to organize the papers and patents they wrote and to keep track of citations to them. You can edit metadata that wasn’t parsed correctly, merge or split references, connect to co-authors’ citation pages, etc. Cool stuff. When it comes to using this tool for information seeking, however, we’re back to that ol’ Google command line. Sigh.

Gene covers use of the Citations interface in some detail and then offers suggestions and pointers to resources that could help Google create a better interface.

Can’t say whether Google will take Gene’s advice or not. If you are smart, when you are designing an interface for similar material, you will.

Or as Gene concludes:

In short, Google seems to have taken the lessons from general web search, and applied them to Google Scholar, with predictable results. Instead, they should look at Google Scholar as an opportunity to learn about HCIR, about exploratory search with long-running, evolving information needs, and to apply those lessons to the web search interface.

