Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

August 15, 2011

Skiena’s Algorithms Lectures

Filed under: Algorithms,CS Lectures — Patrick Durusau @ 7:31 pm

Skiena’s Algorithms Lectures by Steven Skiena, Stony Brook University.

From the website:

Below are audio, video and lecture slides for 1997 and 2007. Since the lectures are 10 years apart, some of the topics covered by the course have changed. The 1997 lectures have better quality video and audio than the 2007 lectures, although the 2007 lectures cover the newer material and have better lecture notes.

If you found this useful also check out the video lectures of my Discrete Mathematics, Computational Biology, and Computational Finance courses.

I have the first edition of “The Algorithm Design Manual,” which is now out in a second edition. Guess it is time for an upgrade. 😉

There are going to be startups that re-implement assumptions based on prior hardware limitations, and startups that shed those assumptions and gain a competitive advantage. Which one do you want to be?

Saw this in a tweet from @CompSciFact.

Visitor Conversion with Bayesian Discriminant and Hadoop

Filed under: Bayesian Models,Hadoop,Marketing — Patrick Durusau @ 7:31 pm

Visitor Conversion with Bayesian Discriminant and Hadoop

From the post:

You have lots of visitors on your eCommerce web site and obviously you would like most of them to convert. By conversion, I mean buying your product or service. It could also mean the visitor taking an action which could potentially benefit the business financially, e.g., opening an account or signing up for an email newsletter. In this post, I will cover some predictive data mining techniques that may facilitate a higher conversion rate.

Wouldn't it be nice if, for any ongoing session, you could predict the odds of the visitor converting during the session, based on the visitor's behavior during the session?

Armed with such information, you could take different kinds of actions to enhance the chances of conversion. You could entice the visitor with a discount offer. Or you could engage the visitor in a live chat to answer any product related questions.

There are simple predictive analytic techniques to predict the probability of a visitor converting. When the predicted probability crosses a predefined threshold, the visitor could be considered to have high potential of converting.
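Purely to illustrate the flavor of such a discriminant (this is not the post's code; the feature names and probabilities below are invented), a naive Bayes style conversion score might look like this in Python:

```python
# Minimal naive Bayes discriminant for visitor conversion (illustrative only).
# Feature names, priors and likelihoods are invented for the example.

# P(feature present | class), estimated from historical sessions
likelihoods = {
    "viewed_product_page": {"convert": 0.90, "no_convert": 0.40},
    "added_to_cart":       {"convert": 0.70, "no_convert": 0.05},
    "used_site_search":    {"convert": 0.60, "no_convert": 0.30},
}
priors = {"convert": 0.03, "no_convert": 0.97}

def conversion_probability(session_features):
    """Return P(convert | observed binary session features)."""
    score = {}
    for cls in priors:
        p = priors[cls]
        for feature, present in session_features.items():
            p_f = likelihoods[feature][cls]
            p *= p_f if present else (1.0 - p_f)
        score[cls] = p
    return score["convert"] / sum(score.values())

session = {"viewed_product_page": True, "added_to_cart": True,
           "used_site_search": False}
p = conversion_probability(session)
if p > 0.5:  # the predefined threshold the post mentions
    print(f"High conversion potential: {p:.2f}")
```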

I would ask the question of “conversion” more broadly.

That is, how can we dynamically change the model of subject identity in a topic map to match a user's expectations? What user behavior would we track, and how would we track it, to reach such an end?

My reasoning is that users are more interested in, and more likely to support, topic maps that reinforce their world views. And selling someone topic map output that they find agreeable is easier than selling output they find disagreeable.

A Workflow for Digital Research Using Off-the-Shelf Tools

Filed under: Authoring Topic Maps,Digital Research,Research Methods — Patrick Durusau @ 7:30 pm

A Workflow for Digital Research Using Off-the-Shelf Tools by William J. Turkel.

An excellent overview of useful tools for digital research.

One or more of these will be useful in authoring your next topic map.

Quantitative structure-activity relationship (QSAR)

Filed under: Cheminformatics,QSAR — Patrick Durusau @ 7:30 pm

I ran across enough materials while researching AZOrange that I needed to make a separate post on QSAR:

The Cheminformatics and QSAR Society

An Introduction to QSAR Methodology by Allen B. Richon

Quantitative structure-activity relationship – Wikipedia

QSAR World

Of greatest interest for people involved in cheminformatics, toxicity, drug development, etc.

Subject identity cuts across every field.

Now that would be an ambitious and interesting book, “Subject Identity.” An edited volume with contributions from experts in a variety of fields.

August 14, 2011

KDnuggets

Filed under: Analytics,Conferences,Data Mining,Humor — Patrick Durusau @ 7:13 pm

KDnuggets

A good site to follow for data mining and analytics resources, ranging from conference announcements, data mining sites and forums, and software to crossword puzzles.

See: Analytics Crossword Puzzle 2.

I like that it has a timer. One that starts automatically.

Maybe topic maps need a crossword puzzle or two. Pointers? Suggestions for content/clues?

Planet Cassandra

Filed under: Cassandra,NoSQL — Patrick Durusau @ 7:13 pm

Planet Cassandra

Aggregation of feeds on Cassandra. If you need to follow Cassandra closely, this would be among your first stops.

disclojure

Filed under: Clojure,Functional Programming — Patrick Durusau @ 7:12 pm

disclojure: public disclosure of all things clojure

The “Where To Get Started With Clojure” tab is particularly helpful.

Springy

Filed under: Algorithms,Graphs,Visualization — Patrick Durusau @ 7:12 pm

Springy

From the webpage:

Springy is a force directed graph layout algorithm.

So what does this ‘force directed’ stuff mean anyway? Excellent question!

It basically means that it uses some real world physics to try and figure out how to show a network graph in a nice way.

Try to imagine it as a bunch of springs connected to each other.

Written in JavaScript. No surprises but you may find it useful for webpages.
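If the spring metaphor is new to you, here is a toy sketch of one force-directed iteration (Python rather than Springy's JavaScript, with invented parameter values): every pair of nodes repels, every edge pulls its endpoints together like a spring.

```python
import math
import random

# Toy force-directed layout: repulsion between all node pairs,
# spring attraction along edges, then small position updates.
nodes = {n: [random.random(), random.random()] for n in "ABCD"}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

def step(nodes, edges, repulsion=0.01, spring=0.1, rest_length=0.3):
    forces = {n: [0.0, 0.0] for n in nodes}
    names = list(nodes)
    for i, a in enumerate(names):              # pairwise repulsion
        for b in names[i + 1:]:
            dx = nodes[a][0] - nodes[b][0]
            dy = nodes[a][1] - nodes[b][1]
            d = math.hypot(dx, dy) or 1e-6
            f = repulsion / (d * d)
            forces[a][0] += f * dx / d; forces[a][1] += f * dy / d
            forces[b][0] -= f * dx / d; forces[b][1] -= f * dy / d
    for a, b in edges:                         # spring attraction along edges
        dx = nodes[b][0] - nodes[a][0]
        dy = nodes[b][1] - nodes[a][1]
        d = math.hypot(dx, dy) or 1e-6
        f = spring * (d - rest_length)
        forces[a][0] += f * dx / d; forces[a][1] += f * dy / d
        forces[b][0] -= f * dx / d; forces[b][1] -= f * dy / d
    for n in nodes:                            # move each node a little
        nodes[n][0] += forces[n][0]
        nodes[n][1] += forces[n][1]

for _ in range(100):
    step(nodes, edges)
print(nodes)
```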

Structr

Filed under: Graphs,Neo4j,structr — Patrick Durusau @ 7:11 pm

Structr

Bills itself as:

More than a CMS: structr is a content application framework based on the graph database Neo4j.

Version 0.4.M01 is available at Github.

It occurs to me that a content creation interface could offer a query-on-properties feature to enable node discovery. Or that could be automatic, such that when I type “structr” this post is added as an occurrence to the topic for structr. At the end of the post, the interface offers me the opportunity to say what is correct/incorrect. (A toy sketch of the idea follows.)
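Just to make that auto-linking idea concrete (hypothetical, nothing to do with Structr's actual API; topic names and post text are invented):

```python
# Hypothetical sketch of the auto-occurrence idea above: scan a draft post for
# known topic names and propose occurrences for the author to confirm.
known_topics = {"structr": "topic:structr", "neo4j": "topic:neo4j"}

def propose_occurrences(post_id, text):
    """Return (topic, occurrence) proposals for every known topic named in the text."""
    lowered = text.lower()
    return [{"topic": topic_id, "occurrence": post_id}
            for name, topic_id in known_topics.items() if name in lowered]

draft = "Structr is a content application framework based on Neo4j."
for p in propose_occurrences("post:2011-08-15-structr", draft):
    # in the imagined interface the author confirms or rejects each proposal
    print(f"Proposed occurrence: link {p['occurrence']} to {p['topic']}")
```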

Graph Databases, NoSQL and Neo4j

Filed under: Graphs,Neo4j,NoSQL — Patrick Durusau @ 7:11 pm

Graph Databases, NoSQL and Neo4j by Peter Neubauer.

From the post:

Of the many different data models, the relational model has been dominating since the 80s, with implementations like Oracle, MySQL and MSSQL – also known as Relational Database Management Systems (RDBMS). Lately, however, in an increasing number of cases the use of relational databases leads to problems, both because of deficits and problems in the modeling of data and because of constraints on horizontal scalability over several servers and big amounts of data. There are two trends that are bringing these problems to the attention of the international software community:

  1. The exponential growth of the volume of data generated by users, systems and sensors, further accelerated by the concentration of a large part of this volume on big distributed systems like Amazon, Google and other cloud services.
  2. The increasing interdependency and complexity of data, accelerated by the Internet, Web 2.0, social networks and open and standardized access to data sources from a large number of different systems.

Relational databases have increasing problems coping with these trends. This has led to a number of different technologies targeting special aspects of these problems, which can be used together with or as alternatives to the existing RDBMS – also known as Polyglot Persistence. Alternative databases are nothing new; they have been around for a long time in the form of, e.g., Object Databases (OODBMS), Hierarchical Databases (e.g. LDAP) and many more. But during the last few years a large number of new projects have been started which together are known under the name NOSQL databases.

This article aims to give an overview of the position of Graph Databases in the NOSQL-movement. The second part is an introduction to Neo4j, a Java-based Graph Database.

Excellent article with heavy cross-linking to additional information.
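To make the property-graph idea concrete without any particular product, here is a miniature graph and a traversal in plain Python (this is not Neo4j's API; the nodes and relationships are invented):

```python
# A property graph in miniature: nodes and relationships carry properties,
# and queries are traversals rather than joins.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
}
relationships = [
    {"from": 1, "to": 2, "type": "KNOWS"},
    {"from": 2, "to": 3, "type": "KNOWS"},
]

def neighbors(node_id, rel_type):
    return [r["to"] for r in relationships
            if r["from"] == node_id and r["type"] == rel_type]

def friends_of_friends(node_id):
    found = set()
    for friend in neighbors(node_id, "KNOWS"):
        found.update(neighbors(friend, "KNOWS"))
    found.discard(node_id)
    return [nodes[n]["name"] for n in found]

print(friends_of_friends(1))   # ['Carol']
```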

Dorothy: Graphviz from the comfort of Clojure

Filed under: Clojure,Graphs,Graphviz — Patrick Durusau @ 7:10 pm

Dorothy: Graphviz from the comfort of Clojure

From the post:

I’ve used Graphviz quite a bit in the past. When I did, I was almost always generating dot files from code; I never wrote them by hand other than to experiment with various Graphviz features. Well, string-slinging is a pain. Generating one language from another is a pain. So, inspired by Hiccup, I threw together Dorothy over the last couple evenings. It’s a Clojure DSL for generating DOT graph representations as well as rendering them to images.

For a few hours work, the documentation is pretty thorough, so I’ll leave off with one simple example which is translated from the Entity Relationship example in the Graphviz gallery. Any feedback or criticism is welcome and encouraged.

Instructive on Clojure, useful for DOT graph representations. That’s a win-win situation!
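Dorothy itself is Clojure, but the payoff of generating DOT from a data structure instead of string-slinging is easy to see even in an unrelated, illustrative Python sketch (the graph and attributes are invented):

```python
# Illustrative only (not Dorothy): build DOT text from a data structure
# instead of concatenating strings by hand.
def to_dot(name, edges, attrs=None):
    lines = [f"digraph {name} {{"]
    for key, value in (attrs or {}).items():
        lines.append(f'  {key}="{value}";')
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

print(to_dot("er", [("order", "customer"), ("order", "item")],
             attrs={"rankdir": "LR"}))
```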

Infogrid 2.9.5 Released

Filed under: Graphs,Infogrid — Patrick Durusau @ 7:10 pm

Infogrid 2.9.5 Released

If you haven’t looked at Infogrid recently, it’s time for a visit.

Infogrid consists of a number of sub-projects:

InfoGrid Graph Database Project

Develops the GraphDatabase at the heart of InfoGrid. Can be used as a standalone graph database or in addition to the other InfoGrid projects.

InfoGrid Graph Database (Grid) Project

Augments the GraphDatabase with a replication protocol, so that many distributed GraphDatabases can collaborate in managing very large graphs.

InfoGrid Stores Project

Provides an abstract common interface to storage technologies such as SQL databases and distributed NoSQL hashtables. This enables an InfoGrid GraphDatabase to persist its data using any of several different storage technologies but with the same API for application developers. (A toy version of this idea is sketched after the project list below.)

InfoGrid User Interface Project

REST-fully maps the content of a GraphDatabase to browser-accessible URLs. Viewlets allow developers to define how individual objects and sub-graphs are rendered. The project also implements a library of Viewlets, and the MeshWorld and NetMeshWorld example applications.

InfoGrid Light-Weight Identity Project

Implements user-centric identity technologies such as LID and OpenID.

InfoGrid Model Library Project

Defines a library of reusable object models that can be used as schemas for InfoGrid applications.

InfoGrid Probe Project

Implements the Probe Framework, which enables application developers to treat any data source on the internet as a graph of objects. This project also implements a library of Probes for common data formats.

InfoGrid Utilities Project

Collects common object frameworks and utility code used throughout InfoGrid.
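Here is the toy version of the Stores idea promised above (not InfoGrid's actual interface; class and method names are invented): one abstract API, several interchangeable storage backends.

```python
# Toy sketch of "same API over different storage technologies".
import sqlite3
from abc import ABC, abstractmethod

class Store(ABC):
    """Common key/value interface a graph database could persist through."""
    @abstractmethod
    def put(self, key, value): ...
    @abstractmethod
    def get(self, key): ...

class InMemoryStore(Store):
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class SqliteStore(Store):
    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    def put(self, key, value):
        self._conn.execute(
            "INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
    def get(self, key):
        row = self._conn.execute(
            "SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None

# Application code is written against Store and never cares which backend it got.
for store in (InMemoryStore(), SqliteStore()):
    store.put("node:1", '{"name": "Alice"}')
    print(type(store).__name__, store.get("node:1"))
```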

Mining Data in Motion

Filed under: Data Mining,Marketing,Topic Maps — Patrick Durusau @ 7:09 pm

Mining Data in Motion by Chris Nott says: “…The scope for business innovation is considerable.”

Or in context:

Mining data in motion. On the face of it, this seems to be a paradox: data in motion is transitory and so can’t be mined. However, this is one of the most powerful concepts for businesses to explore innovative opportunities if they can only release themselves from the constraints of today’s IT thinking.

Currently analytics are focused on data at rest. But exploiting information as it arrives into an organisation can open up new opportunities. This might include influencing customers as they interact based on analytics triggered by web log insight, social media analytics, a real-time view of business operations, or all three. The scope for business innovation is considerable.

The ability to mine this live information in real time is a new field of computer science. The objective is to process information as it arrives, using the knowledge of what has occurred in the past, but the challenge is in organising the data in a way that it is accessible to the analytics, processing a stream of data in motion.
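As a toy illustration of processing information as it arrives (nothing from the article; window size and events are invented), think of a sliding-window aggregate that updates with each event:

```python
# Sliding-window mean over a stream of events (illustrative only).
from collections import deque
from time import time

class SlidingWindowMean:
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()           # (timestamp, value) pairs
        self.total = 0.0

    def add(self, value, now=None):
        now = time() if now is None else now
        self.events.append((now, value))
        self.total += value
        self._expire(now)
        return self.mean()

    def _expire(self, now):
        while self.events and now - self.events[0][0] > self.window:
            _, old = self.events.popleft()
            self.total -= old

    def mean(self):
        return self.total / len(self.events) if self.events else 0.0

window = SlidingWindowMean(window_seconds=60)
for order_value in (25.0, 40.0, 12.5):   # events arriving on a stream
    print("running mean:", window.add(order_value))
```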

Innovation in this context is going to require subject recognition, whether so phrased or not, and collation with other information, some of which may also be from live feeds.

I am curious whether standards for warranting the reliability of identifications, or of information in general, are going to arise. I suspect there will be explicit liability limitations tied to information and the effort made to verify it. Free information will likely carry a disclaimer for any use for any purpose. Take your chances.

How reliable the information you are supplied is would depend upon the level of liability you have purchased.

I wonder how an “information warranty” economy would affect information suppliers who now disavow responsibility for their information content. Interesting because businesses would not hire lawyers or accountants who did not take some responsibility for their work. Perhaps there are more opportunities in data mining than just data stream mining.

Perhaps: Topic Maps – How Much Information Certainty Can You Afford?

Information could range from the fun house stuff you see on Fox to full traceability to sources that expands in real time. Depends on what you can afford.

Recommendation Engine Powered By Hadoop

Filed under: Hadoop,Recommendation — Patrick Durusau @ 7:09 pm

Recommendation Engine Powered By Hadoop by Pranab Ghosh.

A nice set of slides on the use of Hadoop to power a recommendation engine. (Subject recognition is implicit, and it is therefore difficult to fashion explicit merging with data from different sources.)

At least on Slideshare the additional resource links aren’t working on the slides. So, for your reading pleasure:

Pranab's blog: Mawazo. A number of interesting posts on NoSQL and related technologies.

Including Pranab’s two-part blog post on Hadoop and recommendation engines:

Recommendation Engine Powered by Hadoop (Part 1)

Recommendation Engine Powered by Hadoop (Part 2)

and, Mining of Massive Datasets by Anand Rajaraman and Jeff Ullman.

August 13, 2011

Connecting the dots in big data (Tuesday 16 August 2011)

Filed under: Graphs,InfiniteGraph — Patrick Durusau @ 3:48 pm

Connecting the dots in big data

From the post:

Join us Tuesday August 16, 2011 (11:00am PT / 2:00pm ET), for a webinar with InfiniteGraph and DBTA (Database Trends and Applications), where we'll be giving an introduction to InfiniteGraph, and speaking about connecting the dots to find meaning in big data.

Big Data problems are quickly presenting themselves in almost every area of computing, from Social Network Analysis to File Processing. Many technologies, such as those in the NoSQL space, were developed in response to the limitations of current storage systems as an effective mechanism to deal with these mountains of data. And much of that data is interconnected in ways that, when organized properly, give interesting and often valuable information.

InfiniteGraph, the distributed and scalable graph database, was designed specifically to traverse connections and provide the framework for a new set of products built to provide real-time business decision support and relationship analytics. This presentation examines the technology behind InfiniteGraph and explores a couple of common use cases involving very large scale graph processing.

Introducing Gephi 0.7

Filed under: Gephi,Graphs,Visualization — Patrick Durusau @ 3:47 pm

Introducing Gephi 0.7

Yes, I know that Gephi 0.8 is out in alpha release but this video is worth viewing, even though it is about the “old” version.

From the description:

The video highlights the following features:

  • grouping: Group nodes into clusters and navigate in multi-level graphs.
  • multi-level layout: Very fast layout algorithm that coarsens the graph to reduce computation.
  • interaction: Highlight neighbors and interact directly with the visualization when using tools.
  • partitioning: Use data attributes to colorize partitions and communities.
  • ranking: Use degree, metrics or data attributes to set nodes'/edges' color and size.
  • metrics: Run various algorithms in one click and get an HTML report page.
  • data laboratory: Data table view with search feature.
  • dynamics: Use the Timeline to explore dynamic graphs.
  • filtering: Dynamic queries; create and combine a large set of filters.
  • auto update: The application updates its core and plugins itself.
  • vectorial preview: Switch to the preview tab to put the final touch before exporting to SVG or PDF.

CAS Registry Number & The Semantic Web

Filed under: Cheminformatics,Identifiers,Indexing — Patrick Durusau @ 3:47 pm

CAS Registry Number

Another approach to the problem of identification, assign an arbitrary identifier for which you hold the key.

If you start early enough in a particular era, you can gain enough of an advantage to deter most competitors. Particularly if you curate the professional literature so that you can provide effective searching based on your (and other) identifiers.

The similarity to the Semantic Web’s assignment of a URL to every subject is not accidental.

The main differences with the Semantic Web:

  1. Economically important activity was focus of the project.
  2. Professional literature base with obvious value-add potential for research and production.
  3. Single source curators of the identifiers (did not whine at others to create them).
  4. Identification where there was user demand to support the effort.

The Wiki page reports (in part):

CAS Registry Numbers are unique numerical identifiers assigned by the "Chemical Abstracts Service" to every chemical described in the open scientific literature (currently including those described from at least 1957 through the present) and including elements, isotopes, organic and inorganic compounds, organometallics, metals, alloys, coordination compounds, minerals, and salts; as well as standard mixtures, compounds, polymers; biological sequences including proteins & nucleic acids; nuclear particles; and nonstructurable materials (aka 'UVCBs', i.e., materials of Unknown, Variable Composition, or Biological origin). They are also referred to as CAS RNs, CAS Numbers, etc.

The Registry maintained by CAS is an authoritative collection of disclosed chemical substance information. Currently the CAS Registry identifies more than 56 million organic and inorganic substances and 62 million sequences, plus additional information about each substance; and the Registry is updated with approximately 12,000 new substances daily.

Historically, chemicals have been identified by a wide variety of synonyms. Frequently these are arcane and constructed according to regional naming conventions relating to chemical formulae, structures or origins. Well-known chemicals may additionally be known via multiple generic, historical, commercial, and/or black-market names.

PS: The index is now at 61+ million substances.

InChI – IUPAC International Chemical Identifier

Filed under: Cheminformatics,Identifiers — Patrick Durusau @ 3:47 pm

The Chemical Entity Semantic Specification was useful in pointing me towards InChI – the IUPAC International Chemical Identifier (Wiki page).

From the Wiki page:

The identifiers describe chemical substances in terms of layers of information – the atoms and their bond connectivity, tautomeric information, isotope information, stereochemistry, and electronic charge information. Not all layers have to be provided; for instance, the tautomer layer can be omitted if that type of information is not relevant to the particular application.

InChIs differ from the widely used CAS registry numbers in three respects:

  • they are freely usable and non-proprietary;
  • they can be computed from structural information and do not have to be assigned by some organization;
  • most of the information in an InChI is human readable (with practice).

I like the compute from structural information aspect. Reminds me of Eric Freese and his topic map example that calculated extended family relationships based on parent/child, sibling relationships.

What other areas would benefit from computable identifications, and how would you go about constructing them such that the same set of inputs results in the same identifier?
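A crude sketch of what "same inputs, same identifier" looks like in practice (not InChI and not any standard scheme; the record fields are invented): canonicalize the structured description, then derive the identifier deterministically.

```python
# Computable identifier sketch: canonicalize the structured description,
# then hash it, so identical inputs always yield identical identifiers.
import hashlib
import json

def computed_identifier(record):
    # Canonical form: lower-cased, trimmed keys and values, sorted keys.
    canonical = {str(k).strip().lower(): str(v).strip().lower()
                 for k, v in record.items()}
    serialized = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()[:16]

a = {"Family": "Smith", "Given": "John", "Born": "1950-01-01"}
b = {"given": "John ", "family": "smith", "born": "1950-01-01"}
assert computed_identifier(a) == computed_identifier(b)
print(computed_identifier(a))
```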

The Wiki page cites a number of other resources on chemical identification that will be useful if you are straying into work with chemical databases.

Chemical Entity Semantic Specification

Filed under: Cheminformatics,RDF,Semantic Web — Patrick Durusau @ 3:46 pm

Chemical Entity Semantic Specification

From the website:

Chemical Entity Semantic Specification (CHESS) framework strives to provide a means of representing chemical data with the goal of facile chemical information federation and addressing increasingly rich and complex queries for biological, pharmaceutical, and synthetic chemistry applications. The principal emphasis of CHESS is data representation to assist in metabolic fate determination, synthetic pathway construction, and automatic chemical entity classification. With explicit semantic specification of reactions for example, CHESS allows the tracing of the mechanisms of chemical transformations on the level of individual atoms, bonds, functional groups, or molecules, as well as the individual “histories” of elements of chemical entities in a pathway. Further, the CHESS framework draws on CHEMINF and SIO ontologies to provide methods for specifying uncertainty, conformer-specific information, units, and circumstances for physical measurements at variable levels of granularity, permitting rich, cross-domain queries over this data. In addition to this, CHESS provides a set of specifications to address data federation through the adoption of unique, canonical identifiers for many classes of chemical entities.

Interesting project but appears to lack uptake.

As of 13 August 2011, I get nine (9) “hits” from a popular search engine on the name as a string.

Useful as a resource for existing ontologies and identification schemes.

August 12, 2011

Apache CouchDB & Elasticsearch

Filed under: CouchDB,ElasticSearch — Patrick Durusau @ 7:22 pm

Apache CouchDB & Elasticsearch by Benoît Chesneau.

With one hundred and twenty-five (125) slides you can get off into the weeds and talk about the details. Very much worth your time to take a look.

Gephi News: new Visualization API

Filed under: Gephi,Visualization — Patrick Durusau @ 7:22 pm

Gephi News: new Visualization API

Work is underway on a new visualization API for Gephi. If you are interested in writing graph visualization software, here's your opportunity to make a difference.

5 real-world uses of big data (?)

Filed under: BigData — Patrick Durusau @ 7:21 pm

5 real-world uses of big data (?)

I ran across this post by David Smith that starts off well enough:

In the past year, big data has emerged as one of the most closely watched trends in IT. Organizations today are generating more data in a single day than the entire Internet generated as recently as 2000. The explosion of "big data" – much of it in complex and unstructured formats – has presented companies with a tremendous opportunity to leverage their data for better business insights through analytics.

Wal-Mart was one of the early pioneers in this field, using predictive analytics to better identify customer preferences on a regional basis and stock their branch locations accordingly. It was an incredibly effective tactic that yielded strong ROI and allowed them to separate themselves from the retail pack. Other industries took notice of Wal-Mart's tactics – and the success they gleaned from processing and analyzing their data – and began to employ the same tactics.

But then none of the examples were Big Data:

  • Afghanistan War Diaries (I don’t remember there being “terabytes” of data. Gigabytes, yes, but not terabytes.)
  • Guatemala's National Police records, some 80 million of them, documenting genocide against people of Mayan descent (another smallish data set)
  • Bill James (he of Moneyball fame) is a well-known figure in the world of both baseball and statistics at this point, but that has not always been the case. (But it was in 2003 when he went to work for the Boston Red Sox, a bit out of range for a Big Data story.)
  • BP Oil Spill – “NIST used the open source R language to run an uncertainty analysis that harmonized the estimates from various sources to come up with actionable intelligence around which disaster response efforts could be coordinated.” (Big Disaster, yes, but not Big Data.)
  • CardioDX – “researchers at the company analyzed over 100 million gene samples to ultimately identify the 23 primary predictive genes for coronary artery disease.” (True, but not all at one time. Over a period of weeks if not months, running R routines to analyze data. Time consuming but not a Big Data story.)

All are great examples of data analysis and should be celebrated as such. But let’s reserve Big Data for data sets that pose storage or processing challenges that are not routinely met by the average desktop machine.

A day’s output from the Large Hadron Collider or one of the all-sky survey telescopes, something that undoubtedly is Big Data.

Whether “Big” or “small,” or “in-between” data, the real key is useful analysis.

NoSQL standouts: New databases for new applications

Filed under: Cassandra,CouchDB,FlockDB,Neo4j — Patrick Durusau @ 7:21 pm

NoSQL standouts: New databases for new applications: Cassandra, CouchDB, MongoDB, Redis, Riak, Neo4J, and FlockDB reinvent the data store.

From the post:

Was it just two or three years ago when choosing a database was easy? Those with a Cadillac budget bought Oracle, those in a Microsoft shop installed SQL Server, those with no budget chose MySQL. Everyone in between tried to figure out where they belonged.

Those days are gone forever. Everyone and his brother are coming out with their own open source project for storing information. In most cases, these projects are tossing aside many of the belts-and-suspenders protections that people expect from the classic databases. There are enough of them now that some joker started calling them NoSQL and claiming, perhaps tongue-in-cheek, that the acronym stood for Not Only SQL.

I remember reading somewhere that the #1 reason for firing sysadmins was failure to maintain proper backups. An RDBMS isn't a magic answer to data security, and anyone who thinks it is, is probably a former sysadmin at one or more locations. 😉

You need to read Jim Gray's Transaction Processing: Concepts and Techniques if you want to design reliable systems. Or that is at least one of the works you need to read.

Do use the “print” option so you can read the article while avoiding most of the annoying distractions typical for this type of site.

Not detailed enough to be particularly useful. Actually I haven't seen a comparison yet that was detailed enough to be really useful. I suppose that is in part because the approaches are different, so it would be hard to compare apples with apples.

What might be useful would be to compare the use cases where each system claims to excel. Now that might be a continuum of interest to readers.

What do you think?

August 11, 2011

Do No Evil (but patent the obvious)

Filed under: Search Algorithms,Searching — Patrick Durusau @ 6:38 pm

On August 2, 2011, the US Patent Office issued patent 7,991,780, Performing multiple related searches:

The Abstract reads:

A first search is performed in response to a received search query. The first search is based at least in part on a first portion of the search query. In the first search, a first set of content items are searched over to identify a first set of search results. Each result in the first set of search results identifies at least one content item of the first set of content items. A second set of content items for performing a second search is determined based at least in part on one or more of the results in the first set of search results. The second set of content items includes content items not included in the first set of search results. A second search is performed, searching over the second set of content items to identify a second set of search results. The second search is based at least in part on a second portion of the search query. Each result in the second set of search results identifies at least one content item of the second set of content items.

I have known about this for several days but had to give myself a “time out” so that my post would not be laced with obscenities and speculations on parentage and personal habits of which I have no personal knowledge.

I must also say that nothing in this post should be construed or taken as legal advice or opinion on the form, substance or history of this patent or any of the patents mentioned in this patent.

What I would like to say is that I remember subset searching in the days of Lexis/Nexis hard wired terminals in law offices (I had one) and that personal experience was twenty-five (25) years ago. The patent uses the magic words web pages and web addresses but I recognize the searches I did so long ago.
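For readers who never sat in front of one of those terminals, the "second search over the first search's results" pattern looks roughly like this (a hypothetical sketch, not the patented method and not Lexis/Nexis code; documents and queries are invented):

```python
# Subset searching: run one query, then run a second query restricted to the
# documents the first query returned.
documents = {
    1: "graph databases and topic maps",
    2: "topic maps for subject identity",
    3: "relational databases and SQL",
}

def search(query, doc_ids=None):
    pool = doc_ids if doc_ids is not None else documents.keys()
    return [d for d in pool if query.lower() in documents[d].lower()]

first = search("topic maps")          # first portion of the query
second = search("subject", first)     # second search limited to the first results
print(second)                         # [2]
```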

Perhaps I can describe integer arithmetic in a patent application using web page(s) and web address(es) terminology. If I can, I will assign a one-half interest in the patent to the first person to front the fees and expenses. Contact me privately and we will arrive at a figure for the application. Just don't tell Intel. 😉

BTW, topic maps, by merging information about a subject, avoid the need for a secondary search over a subset of material in some cases.

Something that is long over-due and uniquely suited for topic maps would be a survey of the IP in the area of search and retrieval. Not in the sense of legal advice but simply putting together all the relevant patents. With a mapping of the various claims.

The joy of algorithms and NoSQL: a MongoDB example (part 2)

Filed under: Algorithms,Cheminformatics,MapReduce,MongoDB — Patrick Durusau @ 6:35 pm

The joy of algorithms and NoSQL: a MongoDB example (part 2)

From the post:

In part 1 of this article, I described the use of MongoDB to solve a specific Chemoinformatics problem, namely the computation of molecular similarities. Depending on the target Tanimoto coefficient, the MongoDB solution is able to screen a database of a million compounds in subsecond time. To make this possible, queries only return chemical compounds which, in theory, are able to satisfy the particular target Tanimoto. Even though this optimization is in place, the number of compounds returned by this query increases significantly when the target Tanimoto is lowered. The example code in the GitHub repository, for instance, imports and indexes ~25000 chemical compounds. When a target Tanimoto of 0.8 is employed, the query returns ~700 compounds. When the target Tanimoto is lowered to 0.6, the number of returned compounds increases to ~7000. Using the MongoDB explain functionality, one is able to observe that the internal MongoDB query execution time increases slightly, compared to the execution overhead to transfer the full list of 7000 compounds to the remote Java application. Hence, it would make more sense to perform the calculations local to where the data is stored. Welcome to MongoDB's built-in map-reduce functionality!

Screening a million compounds in subsecond time sounds useful in a topic map context.
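If the Tanimoto coefficient is unfamiliar: for two fingerprint sets A and B it is |A ∩ B| / |A ∪ B|. A minimal sketch in plain Python (not the post's MongoDB map-reduce code; the fingerprints are invented):

```python
# Tanimoto screening sketch: fingerprints as sets of feature identifiers.
def tanimoto(a, b):
    return len(a & b) / len(a | b) if (a or b) else 0.0

target = {1, 5, 9, 12, 40}
database = {
    "compound_a": {1, 5, 9, 12, 40, 41},
    "compound_b": {2, 3, 4},
}
threshold = 0.8

hits = {}
for name, fingerprint in database.items():
    score = tanimoto(target, fingerprint)
    if score >= threshold:
        hits[name] = round(score, 3)
print(hits)   # {'compound_a': 0.833}
```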

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

Filed under: Concept Detection,Synonymy,UIMA — Patrick Durusau @ 6:34 pm

UIMA Concept Mapping Interface to Lucene/Neo4j Datastore

From the post:

Over the past few weeks (months?) I have been trying to build a system for concept mapping text against a structured vocabulary of concepts stored in a RDBMS. The concepts (and their associated synonyms) are passed through a custom Lucene analyzer chain and stored in a Neo4j database with a Lucene Index retrieval backend. The concept mapping interface is an UIMA aggregate Analysis Engine (AE) that uses this Neo4j/Lucene combo to annotate HTML, plain text and query strings with concept annotations.

Sounds interesting.

In particular:

…concepts (and their associated synonyms) are passed through….

sounds like topic map talk to me partner!

Depends on where you put your emphasis.
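The "concepts and their associated synonyms" piece is what sounds like topic map talk. A toy normalization sketch (not the post's UIMA/Lucene/Neo4j pipeline; the vocabulary entries are invented for illustration):

```python
# Toy synonym-to-concept lookup: a structured vocabulary maps every synonym
# to one canonical concept, and annotation is a scan for those synonyms.
vocabulary = {
    "myocardial infarction": "concept:heart-attack",
    "heart attack": "concept:heart-attack",
    "mi": "concept:heart-attack",
}

def annotate(text):
    """Return (start, end, concept_id) annotations for synonyms found in text."""
    annotations = []
    lowered = text.lower()
    for synonym, concept_id in vocabulary.items():
        start = lowered.find(synonym)
        if start != -1:
            annotations.append((start, start + len(synonym), concept_id))
    return annotations

print(annotate("Patient presented with a heart attack in 2009."))
```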

PouchDB ( Portable CouchDB JavaScript implementation )

Filed under: CouchDB,PouchDB — Patrick Durusau @ 6:34 pm

PouchDB ( Portable CouchDB JavaScript implementation )

If you are looking for portable storage for your topic map application PouchDB may be of interest:

PouchDB is a complete implementation of the CouchDB storage and views API that supports peer-to-peer replication with other CouchDB instances. The browser version is written for IndexedDatabase (part of HTML5). An additional implementation is in progress for leveldb.

Storage and Consistency

Unlike the other current couch-like browser APIs built on WebStorage (http://dev.w3.org/html5/webstorage/), PouchDB's goal is to maintain the same kinds of consistency guarantees Apache CouchDB provides across concurrent connections, across the multiple tabs a user might be using to concurrently access a PouchDB database. This is something that just isn't possible with the BrowserStorage API that previous libraries like BrowserCouch and lawnchair use.

PouchDB also keeps a by-sequence index across the entire database. This means that, just like Apache CouchDB, PouchDB can replicate with other CouchDB instances and provide the same conflict resolution system for eventual consistency across nodes.
BrowserCouch

At this time PouchDB is completely independent from BrowserCouch. The main reason is just to keep the code base concise and focused as the IndexedDatabase specification is being flushed out.

After IndexedDatabase is more solidified it's possible that BrowserCouch and PouchDB might merge to provide a simple fallback option for browsers that do not yet support IndexedDatabase.
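PouchDB itself is JavaScript, but the replication it speaks is CouchDB's. For the general shape of it, here is a sketch of triggering CouchDB-style replication over HTTP from Python (hosts and database names are hypothetical; CouchDB exposes a /_replicate endpoint that takes a source and a target):

```python
# Sketch of kicking off CouchDB replication over HTTP (hypothetical hosts).
import requests

def replicate(couch_url, source, target, continuous=False):
    resp = requests.post(
        f"{couch_url}/_replicate",
        json={"source": source, "target": target, "continuous": continuous},
    )
    resp.raise_for_status()
    return resp.json()

# e.g. pull a remote database into a local one:
# replicate("http://localhost:5984", "http://example.com:5984/notes", "notes")
```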

Building The Ocean With Big Data

Filed under: Analytics,BigData,Data Analysis — Patrick Durusau @ 6:33 pm

Building The Ocean With Big Data

From the post:

While working at an agency with a robust analytics group is exciting, it can also be frustrating at times. Clients challenge us with questions that are often difficult to answer with a simple data pull/request. For example, an auto client may ask how digital media is driving auto sales for a specific model in a specific location. Another client may like to better understand how much they need to spend on digital media, and to that end, which media sequencing is most effective (e.g. search -> display -> search -> social, etc.). Questions like these require multiple large sets of data, often in varying formats and time ranges. So the question becomes, with data collection and aggregation more important than ever, what steps can we take to make sure we analyze Big Data in a meaningful way?

Topic maps face the same issues as analysis of Big Data, where do you start?

If you start with no plan, or a poorly planned one, you can work very hard for little or no gain. This article, while framed for analysis, has good principles for organizing the analysis or mapping of Big Data.

A Periodic Table of Visualizations

Filed under: Graphics,Visualization — Patrick Durusau @ 6:32 pm

A Periodic Table of Visualizations

When you mouse over entries, a small example of the visualization pops up.

It's not Jacques Bertin's Semiology of Graphics: Diagrams, Networks, Maps (reprinted last year, by the way), but it is still a useful exercise.

Cassandra: Introduction for System Administrators

Filed under: Cassandra,NoSQL — Patrick Durusau @ 6:32 pm

Cassandra: Introduction for System Administrators by Nathan Milford.

Introductory slide deck for administrators interested in Cassandra (or being asked to participate in its use).
