Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 16, 2011

Hadoop User Group UK: Data Integration

Filed under: Data Integration,Flume,Hadoop,MapReduce,Pig,Talend — Patrick Durusau @ 4:12 pm

Hadoop User Group UK: Data Integration

Three presentations captured as podcasts from the Hadoop User Group UK:

LEVERAGING UNSTRUCTURED DATA STORED IN HADOOP

FLUME FOR DATA LOADING INTO HDFS / HIVE (SONGKICK)

LEVERAGING MAPREDUCE WITH TALEND: HADOOP, HIVE, PIG, AND TALEND FILESCALE

Fresh as of 13 October 2011.

Thanks to Skills Matter for making the podcasts available!

Partially Observable Markov Decision Processes

Filed under: Markov Decision Processes,Partially Observable,POMDPs — Patrick Durusau @ 4:12 pm

Partially Observable Markov Decision Processes

From the webpage:

This web site is devoted to information on partially observable Markov decision processes.

Choose a sub-topic below:

  • POMDP FAQ
  • POMDP Tutorial – I made a simplified POMDP tutorial a while back. It is still in a somewhat crude form, but people tell me it has served a useful purpose.
  • POMDP Papers – For research papers on POMDPs, see this page.
  • POMDP Code – In addition to the format and examples, I have C-code for solving POMDPs that is available.
  • POMDP Examples – From other literature sources and our own work, we have accumulated a bunch of POMDP examples.
  • POMDP Talks – Miscellaneous material for POMDP talks.

Problems?

Well, the site has not been updated since 2009.

But, given the timeless nature of the WWW, it shows up just after the Wikipedia entry on “Partially Observable Markov Decision Processes.” That is to say, it was #2 on the list of relevant resources.

Could be that no one has been talking about POMDPs for the last two years. Except that a quick search at Citeseer shows 18 papers there with POMDP in the text.

I understand that interests change, but we need to develop ways to evaluate resources for the timeliness of their data and, perhaps just as importantly, ways to keep such resources updated.

Both of those are very open issues and I am interested in any suggestions for how to approach them.

POMDPs for Dummies

Filed under: Markov Decision Processes,Partially Observable,POMDPs — Patrick Durusau @ 4:11 pm

POMDPs for Dummies: partially observable Markov decision processes (POMDPs)

From the webpage:

This is a tutorial aimed at trying to build up the intuition behind solution procedures for partially observable Markov decision processes (POMDPs). It sacrifices completeness for clarity. It tries to present the main problems geometrically, rather than with a series of formulas. In fact, we avoid the actual formulas altogether, try to keep notation to a minimum and rely on pictures to build up the intuition.

I just found this today and even with pictures it is slow going. But, I thought you might appreciate something “different” for the week. Something to read, think about, then reread.

If you are taking the Stanford AI course, you may remember the mention of “partially observable” in week 1. There was a promise of further treatment later in the course.
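If you want one formula to hang the tutorial’s pictures on, the heart of every POMDP solver is the belief update. Here is a minimal Python sketch; the two-state model and its numbers are invented for illustration, not taken from the tutorial:

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One step of the POMDP belief update:
    b'(s') is proportional to O[a][s', o] * sum_s b(s) * T[a][s, s']."""
    b_next = O[a][:, o] * (b @ T[a])   # predict with the transition model, weight by observation likelihood
    return b_next / b_next.sum()       # renormalize to a probability distribution

# Toy two-state model (all numbers invented for illustration).
T = {"listen": np.array([[1.0, 0.0],     # listening leaves the hidden state where it is
                         [0.0, 1.0]])}
O = {"listen": np.array([[0.85, 0.15],   # P(observation | next state) after listening
                         [0.15, 0.85]])}

b = np.array([0.5, 0.5])                 # start maximally uncertain
b = belief_update(b, "listen", 0, T, O)
print(b)                                 # belief shifts toward state 0: [0.85 0.15]
```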

Google Prediction API graduates from labs

Filed under: Prediction,Predictive Model Markup Language (PMML) — Patrick Durusau @ 4:10 pm

Google Prediction API graduates from labs, adds new features by Zachary Goldberg, Product Manager.

From the post:

Since the general availability launch of the Prediction API this year at Google I/O, we have been working hard to give every developer access to machine learning in the cloud to build smarter apps. We’ve also been working on adding new features, accuracy improvements, and feedback capability to the API.

Today we take another step by announcing Prediction v1.4. With the launch of this version, Prediction is graduating from Google Code Labs, reflecting Google’s commitment to the API’s development and stability. Version 1.4 also includes two new features:

  • Data Anomaly Analysis
    • One of the hardest parts of building an accurate predictive model is gathering and curating a high quality data set. With Prediction v1.4, we are providing a feature to help you identify problems with your data that we notice during the training process. This feedback makes it easier to build accurate predictive models with proper data.
  • PMML Import
    • PMML has become the de facto industry standard for transmitting predictive models and model data between systems. As of v1.4, the Google Prediction API can programmatically accept your PMML for data transformations and preprocessing.
    • The PMML spec is vast and covers many, many features. You can find more details about the specific features that the Google Prediction API supports here.

(I added a paragraph break in the first text block for readability. It should be re-written but I am quoting.)

Suggest you take a close look at the features of PMML that Google does not support. Quite an impressive array of non-support.

A Visual Taxonomy Of Every Chocolate Candy, Ever

Filed under: Humor,Taxonomy,Visualization — Patrick Durusau @ 4:09 pm

A Visual Taxonomy Of Every Chocolate Candy, Ever

Just a reminder that information science need not be bland or even tasteless.

Not to mention this is a very clever use of visual layout to convey fairly complex information.

I got here via Chocolate as a Teaching Tool, at TaxoDiary. How could a title like that fail to catch your attention? 😉

The same source has The Very, Very, Many Varieties of Beer. It only lists 300 but I thought Lars Marius might find it useful in plotting a course to taste every brew on the planet. Would make a nice prize for the 2012 Balisage Conference.

Project ISO 25964-1 Thesauri and interoperability with other vocabularies

Filed under: Standards,Thesaurus,Vocabularies — Patrick Durusau @ 4:09 pm

Project ISO 25964-1 Thesauri and interoperability with other vocabularies

From the webpage:

This is an international standard development project of ISO Technical Committee 46 (Information and documentation) Subcommittee 9 (Identification and description). The assigned Working Group (known as ISO TC46/SC9/WG8) is revising, merging, and extending two existing international standards: ISO 2788 and ISO 5964. The end product is a new standard—ISO 25964, Information and documentation – Thesauri and interoperability with other vocabularies—supporting the development and application of thesauri in today’s expanding context of networking opportunities. It is being published in two parts, as follows:

ISO 25964, Thesauri and interoperability with other vocabularies

  • Part 1: Thesauri for information retrieval
  • Part 2: Interoperability with other vocabularies

Part 1 was published in August, 2011 and Part 2 is due to appear by the end of 2011.

Unless you have $332 (US) burning a hole in your pocket, you probably want to visit: Format for Exchange of Thesaurus Data Conforming to ISO 25964-1, which has the XML schema, documentation, etc., awaiting your use.

I am very interested in how they handled interoperability in part 2.

CENDI Agency Indexing System Descriptions: A Baseline Report

Filed under: Government Data,Indexing,Thesaurus — Patrick Durusau @ 4:08 pm

CENDI Agency Indexing System Descriptions: A Baseline Report (1998)

In some ways a bit dated, but also a snapshot in time of the indexing practices of the:

  • National Technical Information Service (NTIS),
  • Department of Energy, Office of Scientific and Technical Information (DOE OSTI),
  • US Geological Survey/Biological Resources Division (USGS/BRD),
  • National Aeronautics and Space Administration, STI Program (NASA),
  • National Library of Medicine/National Institutes of Health (NLM),
  • National Air Intelligence Center (NAIC),
  • Defense Technical Information Center (DTIC).

The summary reads:

Software/technology identification for automatic support to indexing. As the resources for providing human indexing become more precious, agencies are looking for technology support. DTIC, NASA, and NAIC already have systems in place to supply candidate terms. New systems are under development and are being tested at NAIC and NLM. The aim of these systems is to decrease the burden of work borne by indexers.

Training and personnel issues related to combining cataloging and indexing functions. DTIC and NASA have combined the indexing and cataloging functions. This reduces the paper handling and the number of “stations” in the workflow. The need for a separate cataloging function decreases with the advent of EDMS systems and the scanning of documents with some automatic generation of cataloging information based on this scanning. However, the merger of these two diverse functions has been a challenge, particularly given the difference in skill level of the incumbents.

Thesaurus maintenance software. Thesaurus management software is key to the successful development and maintenance of controlled vocabularies. NASA has rewritten its system internally for a client/server environment. DTIC has replaced its systems with a commercial-off-the-shelf product. NTIS and USGS/BRD are interested in obtaining software that would support development of more structured vocabularies.

Linked or multi-domain thesauri. Both NTIS and USGS/BRD are interested in this approach. NTIS has been using separate thesauri for the main topics of the document. USGS/BRD is developing a controlled vocabulary to support metadata creation and searching but does not want to develop a vocabulary from scratch. In both cases, there is concern about the resources for development and maintenance of an agency-specific thesaurus. Being able to link to multiple thesauri that are maintained by their individual “owners” would reduce the investment and development time.

Full-text search engines and human indexing requirements. It is clear that the explosion of information on the web (both relevant web sites and web-published documents) cannot be indexed in the old way. There are not enough resources; yet, the chaos of the web begs for more subject organization. The view of current full-text search engines is that the users often miss relevant documents and retrieve a lot of “noise”. The future of web searching is unclear and the demands or requirements that it might place on indexing are unknown.

Quality Control in a production environment. As resources decrease and timeliness becomes more important, there are fewer resources available for quality control of the records. The aim is to build the quality in at the beginning, when the documents are being indexed, rather than add review cycles. However, it is difficult to maintain quality in this environment.

Training time. The agencies face indexer turnover and the need to produce at ever-increasing rates. Training time has been shortened over the years. There is a need to determine how to make shorter training periods more effective.

Indexing systems designed for new environments, especially distributed indexing. An alternative to centralized indexers is a more distributed environment that can take advantage of cottage labor and contract employees. However, this puts increasing demands on the indexing system. It must be remotely accessible, yet secure. It must provide equivalent levels of validation and up-front quality control.

Major project: Update this report, focusing on the issues listed in the summary.

Large Data Sets and Ever Changing Terminology

Filed under: BigData,Indexing — Patrick Durusau @ 4:08 pm

Large Data Sets and Ever Changing Terminology from TaxoDiary.

From the post:

Indexing enables accurate, consistent retrieval to the full depth and breadth of the collection. This does not mean that the statistics-based systems the government loves so much will go away, but they are learning to embrace the addition of taxonomy terms as indexing.

To answer your question, relevant metadata, tagging, normalization of entity references and similar indexing functions just make it easier to allow a person to locate what’s needed. Easy to say and very hard to do.

Search is like having to stand in a long line waiting to order a cold drink on a hot day. So there will always be dissatisfaction because “search” stands between you and what you want. You want the drink but hate the line. That said, I think the reason controlled indexing (taxonomy or thesaurus) is so popular compared to the free ranging keywords is that they have control. They make moving through the line efficient. You know how long the wait is and what terms you need to achieve the result.

I like the “…cold drink on a hot day” comparison. Goes on to point out the problems created by “ever changing terminology.” Which isn’t something that is going to stop happening. People have been inventing new terminologies probably as long as we have been able to communicate. However poorly we do that.

The post does advocate use of MAI (machine assisted indexing) from the author’s company but the advantages of indexing ring true whatever you use to achieve that result.

I do think the author should get kudos for pointing out that indexing is a hard problem. No magic cures, no syntax to save everyone, and no static solutions. As domains change, so do indexes. It is just as simple as that.

Going Head to Head with Google (and winning)

Filed under: Search Engines,Searching — Patrick Durusau @ 4:07 pm

ETDEWEB versus the World-Wide-Web: A Specific Database/Web Comparison

I really need to do contract work writing paper titles. 😉

Cutting to the chase:

For the 15 topics in this study, ETDEWEB was shown to bring the user unique results not shown by Google or Google Scholar 86.7% of the time.

Caveat: these were topics where ETDEWEB is strong, and they did not include soft/hard porn, political blogs and similar material.

Abstract:

A study was performed comparing user search results from the specialized scientific database on energy related information, ETDEWEB, with search results from the internet search engines Google and Google Scholar. The primary objective of the study was to determine if ETDEWEB (the Energy Technology Data Exchange – World Energy Base) continues to bring the user search results that are not being found by Google and Google Scholar. As a multilateral information exchange initiative, ETDE’s member countries and partners contribute cost- and task-sharing resources to build the largest database of energy related information in the world. As of early 2010, the ETDEWEB database has 4.3 million citations to world-wide energy literature. One of ETDEWEB’s strengths is its focused scientific content and direct access to full text for its grey literature (over 300,000 documents in PDF available for viewing from the ETDE site and over a million additional links to where the documents can be found at research organizations and major publishers globally). Google and Google Scholar are well-known for the wide breadth of the information they search, with Google bringing in news, factual and opinion-related information, and Google Scholar also emphasizing scientific content across many disciplines. The analysis compared the results of 15 energy-related queries performed on all three systems using identical words/phrases. A variety of subjects was chosen, although the topics were mostly in renewable energy areas due to broad international interest. Over 40,000 search result records from the three sources were evaluated. The study concluded that ETDEWEB is a significant resource to energy experts for discovering relevant energy information. For the 15 topics in this study, ETDEWEB was shown to bring the user unique results not shown by Google or Google Scholar 86.7% of the time. Much was learned from the study beyond just metric comparisons. Observations about the strengths of each system and factors impacting the search results are also shared along with background information and summary tables of the results. If a user knows a very specific title of a document, all three systems are helpful in finding the user a source for the document. But if the user is looking to discover relevant documents on a specific topic, each of the three systems will bring back a considerable volume of data, but quite different in focus. Google is certainly a highly-used and valuable tool to find significant ‘non-specialist’ information, and Google Scholar does help the user focus on scientific disciplines. But if a user’s interest is scientific and energy-specific, ETDEWEB continues to hold a strong position in the energy research, technology and development (RTD) information field and adds considerable value in knowledge discovery.

October 15, 2011

Making Sense of Unstructured Data in Medicine Using Ontologies – October 19th

Filed under: Bioinformatics,Biomedical,Ontology — Patrick Durusau @ 4:30 pm

From the email announcement:

The next NCBO Webinar will be presented by Dr. Nigam Shah from Stanford University on “Making Sense of Unstructured Data in Medicine Using Ontologies” at 10:00am PT, Wednesday, October 19. Below is information on how to join the online meeting via WebEx and accompanying teleconference. For the full schedule of the NCBO Webinar presentations see: http://www.bioontology.org/webinar-series.

ABSTRACT:

Changes in biomedical science, public policy, information technology, and electronic heath record (EHR) adoption have converged recently to enable a transformation in the delivery, efficiency, and effectiveness of health care. While analyzing structured electronic records have proven useful in many different contexts, the true richness and complexity of health records—roughly 80 percent—lies within the clinical notes, which are free-text reports written by doctors and nurses in their daily practice. We have developed a scalable annotation and analysis workflow that uses public biomedical ontologies and is based on the term recognition tools developed by the National Center for Biomedical Ontology (NCBO). This talk will discuss the applications of this workflow to 9.5 million clinical documents—from the electronic health records of approximately one million adult patients from the STRIDE Clinical Data Warehouse—to identify statistically significant patterns of drug use and to conduct drug safety surveillance. For the patterns of drug use, we validate the usage patterns learned from the data against FDA-approved indications as well as external sources of known off-label use such as Medi-Span. For drug safety surveillance, we show that drug–disease co-occurrences and the temporal ordering of drugs and disease mentions in clinical notes can be examined for statistical enrichment and used to detect potential adverse events.

WEBEX DETAILS:
——————————————————-
To join the online meeting (Now from mobile devices!)
——————————————————-
1. Go to https://stanford.webex.com/stanford/j.php?ED=108527772&UID=0&PW=NZDdmNWNjOGMw&RT=MiM0
2. If requested, enter your name and email address.
3. If a password is required, enter the meeting password: ncbo
4. Click “Join”.

——————————————————-
To join the audio conference only
——————————————————-
To receive a call back, provide your phone number when you join the meeting, or call the number below and enter the access code.
Call-in toll number (US/Canada): 1-650-429-3300
Global call-in numbers: https://stanford.webex.com/stanford/globalcallin.php?serviceType=MC&ED=108527772&tollFree=0

Access code: 929 613 752

SolrMarc 2.3.1 – Critical Bug Fix

Filed under: Solr,SolrMarc — Patrick Durusau @ 4:29 pm

SolrMarc 2.3.1 – Critical Bug Fix

Robert Haschart writes:

The recently released SolrMarc 2.3 has a serious problem where commits to a local Solr index set the expungeDeletes flag, which causes a segment merge that can be nearly as expensive as an index optimize. Furthermore, changes in the defaults for certain configuration properties cause the above behavior to be chosen by default. At UVA the processing time for our nightly updates jumped from about 30 minutes (of which about 20 minutes is the index optimize) to about 2hr 30 minutes.

So the error has been fixed and an updated version has been released. If you have recently downloaded a copy of SolrMarc version 2.3, discard it, and download a copy of the updated release, SolrMarc version 2.3.1.

Federal Election Commission Campaign Data Analysis

Filed under: Data Source,FEC,Graphs,Neo4j — Patrick Durusau @ 4:29 pm

Federal Election Commission Campaign Data Analysis by Dave Fauth.

From the post:

This post is inspired by Marko Rodriguez’ excellent post on a Graph-Based Movie Recommendation engine. I will use many of the same concepts that he describes in his post in order to load the data into Neo4J and then begin to analyze the data. This post will focus on the data loading. Follow-on posts will look at further analysis based on the relationships.

Background

The Federal Election Commission has made campaign contribution data publicly available for download here. The FEC has provided campaign finance maps on its home page. The Sunlight Foundation has created the Influence Explorer to provide similar analysis.

This post and follow-on posts will look at analyzing the Campaign Data using the graph database Neo4j, and the graph traversal language Gremlin. This post will go about showing the data preparation, the data modeling and then loading into Neo4J.

I think the advantage that Dave’s work will have over the Sunlight Foundation “Influence Explorer” is that the “Influence Explorer” has a fairly simple model. Candidate gets money, therefore owned by contributor. To some degree true but how does that work when both sides of an issue are contributing money?

Tracing out the webs of influence that lead to particular positions is going to take something like Neo4j, primed with campaign contribution information but then decorated with other relationships and actors.
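To make “webs of influence” a little more concrete, here is a toy sketch in Python. Every name, relationship and edge below is invented; this is not Dave’s data model, just the flavor of the traversals a graph makes easy once the FEC data is loaded:

```python
from collections import defaultdict, deque

# Toy edges: (source, relationship, target). Every name below is invented.
edges = [
    ("Donor Smith", "CONTRIBUTED_TO", "PAC Alpha"),
    ("PAC Alpha",   "CONTRIBUTED_TO", "Candidate X"),
    ("PAC Alpha",   "CONTRIBUTED_TO", "Candidate Y"),   # the same money reaches both sides
    ("Candidate X", "SPONSORED",      "Bill 123"),
    ("Candidate Y", "OPPOSED",        "Bill 123"),
]

graph = defaultdict(list)
for src, rel, dst in edges:
    graph[src].append((rel, dst))

def influence_paths(start, goal):
    """Breadth-first search returning every relationship path from start to goal."""
    found, queue = [], deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            found.append(path)
            continue
        for rel, nxt in graph[node]:
            if nxt not in path:                    # avoid cycles
                queue.append(path + [f"-[{rel}]->", nxt])
    return found

for p in influence_paths("Donor Smith", "Bill 123"):
    print(" ".join(p))
# Donor Smith -[CONTRIBUTED_TO]-> PAC Alpha -[CONTRIBUTED_TO]-> Candidate X -[SPONSORED]-> Bill 123
# Donor Smith -[CONTRIBUTED_TO]-> PAC Alpha -[CONTRIBUTED_TO]-> Candidate Y -[OPPOSED]-> Bill 123
```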

BaseX

Filed under: BaseX,XML Database,XPath,XQuery — Patrick Durusau @ 4:29 pm

BaseX

From the webpage:

BaseX is a very light-weight and high-performance XML database system and XPath/XQuery processor, including full support for the W3C Update and Full Text extensions. An interactive and user-friendly GUI frontend gives you great insight into your XML documents and collections.

To maximize your productivity and workflows, we offer professional support, tailor-made software solutions and individual trainings on XML, XQuery and BaseX. The product itself is completely Open Source (BSD-licensed) and platform independent. Join our mailing lists to get regular updates!

But most important: BaseX runs out of the box and is easy to use…

For those of us who don’t think documents, even XML documents, are all that weird. 😉

25 years of Machine Learning Journal

Filed under: Machine Learning — Patrick Durusau @ 4:29 pm

25 years of Machine Learning Journal

KDNuggets reports free access to Machine Learning Journal until 31 October 2011.

Take the time to decide if you need access on a regular basis.

RadioVision: FMA Melds w Echo Nest’s Musical Brain

Filed under: Data Mining,Machine Learning,Natural Language Processing — Patrick Durusau @ 4:28 pm

RadioVision: FMA Melds w Echo Nest’s Musical Brain

From the post:

The Echo Nest has indexed the Free Music Archive catalog, integrating the most incredible music intelligence platform with the finest collection of free music.

The Echo Nest has been called “the most important music company on Earth” for good reason: 12 years of research at UC Berkeley, Columbia and MIT factored into the development of their “musical brain.” The platform combines large-scale data mining, natural language processing, acoustic analysis and machine learning to automatically understand how the online world describes every artist, extract musical attributes like tempo and time signature, learn about music trends (see: “hotttnesss“), and a whole lot more. Echo Nest then shares all of this data through a free and open API. [read more here]

Add music to your topic map!

Neo4j SQL Importer

Filed under: Neo4j,SQL — Patrick Durusau @ 4:28 pm

Neo4j SQL Importer

From Peter Neubauer, from whom so many Neo4j goodies come!

In a discussion thread on importing, Rick Otten mentions http://symmetricds.codehaus.org/ having the potential:

to feed data from your Oracle (or other JDBC accessible Relational database) into Neo4j – live, as the data changes.

As Peter replies:

Very interesting – I like!

Would love to play around with it. Thanks for the tip Rick!

Emil Eifrem discusses why Neo4J is relevant to Java Development

Filed under: Java,Neo4j — Patrick Durusau @ 4:27 pm

Emil Eifrem discusses why Neo4J is relevant to Java Development

From the description:

Emil Eifrem has run a technology startup from both Malmo, Sweden and, now, Silicon Valley. He discusses the differences between Silicon Valley-based startups when compared to Swedish startups and he explains the reasons why Neo4J is relevant to Java developers. Emil discusses some of the challenges involved in running a startup and how Neo4J can help address database scalability issues. This interview with O’Reilly Media was conducted at Oracle’s OpenWorld/JavaOne 2011 in San Francisco, CA.

Each to his own but “why Neo4J is relevant to Java Development” isn’t the take away I have from the interview.

It isn’t fair to say the Semantic Web activity is “academic” and that is why it is failing. Google Wave wasn’t an academic project and it failed pretty quickly. I suppose being an academic at heart, I resent the notion that academics are impractical. Some are, some aren’t. Just as all commercial products don’t succeed simply because they have commercial backing. Oh, web servers for example.

Emil’s stronger point is that the Semantic Web does not solve a high priority problem for most users. Solve a problem that only a few people care about, or that they aren’t willing to pay the cost to solve, and your project isn’t going very far.

Neo4j, for example, solves problems with highly connected data that cannot be addressed without the use of graph databases. That makes graph databases, of which Neo4j is one, very attractive and successful.

My take away: Emil Eifrem on Successful Startups (Ones With Solutions For High Priority Problems).

Great interview Emil!

Alternatives to full text queries

Filed under: Lucene,Solr,Sphinx,Xapian — Patrick Durusau @ 4:27 pm

Alternatives to full text queries (part 1) and Alternatives to full text queries (part 2) by Fernando Doglio.

Useful pair of posts but I found the title misleading.

From the post:

Another point of interest to consider is that though on the long run, all four solutions provide very similar services; they do it a bit differently, since they can be categorized into two places:

  • Full text search servers: They provide a finished solution, ready for the developers to install and interact with. You don’t have to integrate them into your application; you only have to interact with them. In here we have Solr and Sphinx.
  • Full text search APIs: They provide the functionalities needed by the developer, but at a lower level. You’ll need to integrate these APIs into your application, instead of just consuming its services through a standard interface (like what happens with the servers). In here, we have the Lucene Project and the Xapian project.

But neither option is an “alternative” to “full text queries.” Alternatives to “full text queries” would include LCSH or MeSH or similar systems.

Useful posts as I said but the area is cloudy enough without inventing non-helpful distinctions.
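For the “server” category the interaction really is just HTTP. A minimal Python sketch against a local Solr instance follows; the host, port and core name “docs” are placeholders, and the field names in the results depend on your schema:

```python
import requests

# "Server" style: Solr runs as a separate process and you talk to it over HTTP.
# Assumes a local Solr instance with a core named "docs" -- host, port and core are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/docs/select"

def solr_search(query, rows=10):
    """Send a query to Solr's select handler and return the matching documents."""
    resp = requests.get(SOLR_SELECT, params={"q": query, "rows": rows, "wt": "json"})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# "API" style (Lucene, Xapian) has no server to call: you embed the library,
# open the index in-process and build the query objects yourself.
if __name__ == "__main__":
    for doc in solr_search("full text search"):
        print(doc.get("id"), doc.get("title"))
```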

Code For America

Filed under: eGov,Government Data,Marketing — Patrick Durusau @ 4:27 pm

Code For America

I hesitated over this post. But, being willing to promote topic maps for governments, near-governments, governments in the wings, wannabe governments and groups of various kinds opposed by governments, I should not stick at nationalistic or idealistic groups in the United States.

Projects that will benefit from topic maps in government circles work as well in Boston as in Mogadishu and Kandahar.

With some adaptation for local goals and priorities, of course, but the underlying technical principles remain the same.

On 9/11, the siloed emergency responders could not communicate effectively with each other. Care to guess who can’t communicate effectively with each other in most major metropolitan areas? Just one example of the siloed nature of state, local and city government. (To use U.S.-centric terminology; supply your own local equivalents.)

Keep an eye out for the software that is open sourced as a result of this project. Maybe adaptable to your local circumstances or silo. Or you may need a topic map.

Open Net – What a site!

Filed under: Interface Research/Design,Search Interface,Searching — Patrick Durusau @ 4:26 pm

Open Net – What a site!

From the post:

I was doing some poking around to find out about OpenNet (which the Department of State uses), and I came across a DOE implementation of it (they apparently helped invent it.) Clicking the author link works really well! The site is clean and crisp. Very professional looking.

The “Document Categories” list on the Advanced Search page gave me pause.

There are about 70 categories listed, in no discernible order, except for occasional apparent groupings of consecutive listings. One of those groupings, strangely enough, is “Laser Isotope Separation” and “Other Isotope Separation Information”, while “Isotope Separation” is the first category in the entire list. “Other Weapon Topics” is near the end; various weapon categories are sprinkled throughout the list. I guess you have to go through the whole list to see if your weapon of choice is “other”.

Read on. There is discussion of a DOE thesaurus and other analysis that I think you will find useful.

I had to grin at the option under “Declassification Status” that reads Never. Maybe I should not pick that one for any searches that I do. Probably just accessing the site has set off alarm bells at the local FBI office. 😉 BTW, have you seen my tin hat? (Actually “never” means never classified.)

Seriously, this interface is deeply troubled, in part for the reasons cited in the post but also from an interface design perspective. For example, an accession number, assuming it means what it means in most libraries (probably a bad assumption), is the number a particular copy of a document was assigned when it was cataloged in a particular library.

If you know that, why the hell are you “searching” for it? Can’t get much more specific than an accession number, which is unique to a particular library.

Someone did devote a fair amount of time to a help file that makes the interface a little less mystic.

Extra credit: How would you change the interface? Just sketch it out in broad strokes. Say 2-3 pages, no citations.

Optimizing HTTP: Keep-alive and Pipelining

Filed under: Web Applications — Patrick Durusau @ 8:42 am

Optimizing HTTP: Keep-alive and Pipelining by Ilya Grigorik.

From the post:

The last major update to the HTTP spec dates back to 1999, at which time RFC 2616 standardized HTTP 1.1 and introduced the much needed keep-alive and pipelining support. Whereas HTTP 1.0 required strict “single request per connection” model, HTTP 1.1 reversed this behavior: by default, an HTTP 1.1 client and server keep the connection open, unless the client indicates otherwise (via Connection: close header).

Why bother? Setting up a TCP connection is very expensive! Even in an optimized case, a full one-way route between the client and server can take 10-50ms. Now multiply that three times to complete the TCP handshake, and we’re already looking at a 150ms ceiling! Keep-alive allows us to reuse the same connection between different requests and amortize this cost.

The only problem is, more often than not, as developers we tend to forget this. Take a look at your own code, how often do you reuse an HTTP connection? Same problem is found in most API wrappers, and even standard HTTP libraries of most languages, which disable keepalive by default.

I know, way over on the practical side but some topic maps deliver content outside of NSA pipes and some things are important enough to bear repeating. This article covers one of those. Enjoy.
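If you want to see the difference in your own code, here is a minimal sketch with the Python requests library (example.com stands in for whatever endpoint you actually talk to): requests.get opens a fresh connection per call, while a Session keeps the connection alive and reuses it.

```python
import time
import requests

URL = "https://example.com/"            # placeholder endpoint

def fetch_many(n, get=requests.get):
    """Issue n GET requests and return the elapsed time in seconds."""
    start = time.perf_counter()
    for _ in range(n):
        get(URL)
    return time.perf_counter() - start

# requests.get sets up a new connection for every call...
print("without keep-alive:", fetch_many(10))

# ...while a Session pools connections, so the TCP (and TLS) handshake is paid once.
with requests.Session() as session:
    print("with keep-alive:   ", fetch_many(10, get=session.get))
```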

October 14, 2011

Cypher Cookbook

Filed under: Cypher,Graphs,Hypergraphs,Neo4j — Patrick Durusau @ 6:25 pm

Cypher Cookbook

I have been learning to bake bread so you can imagine my disappointment when I saw “Cypher Cookbook” only to find that Peter was talking about Neo4j queries. Really! 😉

From the first entry:

Hyperedges and Cypher

Imagine a user being part of different groups. A group can have different roles, and a user can be part of different groups. He also can have different roles in different groups apart from the membership. The association of a User, a Group and a Role can be referred to as a HyperEdge. However, it can be easily modeled in a property graph as a node that captures this n-ary relationship, as depicted below in the U1G2R1 node.

The graph model is necessary to illustrate the query, but hyperedges also need to be treated under modeling, or in the current Domain Modeling Gallery. I would argue that full examples should be provided for as many domains as possible. (See how easy it is to assign work to others? I don’t know how soon, but I hope to be a contributor in that respect.)

Another “cookbook” section could address importing data into Neo4j. Particularly from some of the larger public databases.

If anyone who wants wider adoption of Neo4j needs a motivating example, consider the number of people who use DocBook (it’s an XML format) versus ODF or OOXML (used by OpenOffice and MS Office; well, MS Office saves as both). If you want wide adoption (which I personally think is a good idea for graph databases), then using it can’t be a test of user dedication or integrity.
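To make the hyperedge-as-node pattern from the first entry concrete, here is a minimal sketch using the current official Neo4j Python driver (the 2011-era bindings looked different). The labels, relationship type and connection details are my own placeholders, not Peter’s:

```python
from neo4j import GraphDatabase

# Connection details are placeholders; adjust for your own server.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# The (User, Group, Role) hyperedge is modeled as an ordinary node that all three
# participants point at -- the U1G2R1 pattern from the cookbook entry.
CREATE = """
CREATE (u:User {name: 'User1'}),
       (g:Group {name: 'Group2'}),
       (r:Role {name: 'Role1'}),
       (h:HyperEdge {name: 'U1G2R1'}),
       (u)-[:IN]->(h), (g)-[:IN]->(h), (r)-[:IN]->(h)
"""

# Ask, through the hyperedge node, which role User1 holds in Group2.
MATCH = """
MATCH (:User {name: 'User1'})-[:IN]->(h:HyperEdge)<-[:IN]-(:Group {name: 'Group2'}),
      (h)<-[:IN]-(r:Role)
RETURN r.name AS role
"""

with driver.session() as session:
    session.run(CREATE)
    for record in session.run(MATCH):
        print(record["role"])             # -> Role1

driver.close()
```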

Couchbase Server 2.0: Most Common Questions (and Answers)

Filed under: Couchbase,NoSQL — Patrick Durusau @ 6:24 pm

Couchbase Server 2.0: Most Common Questions (and Answers) by Perry Krug.

From the post:

I just finished up a nine-week technical webinar series highlighting the features of our upcoming release of Couchbase Server 2.0. It was such a blast interacting with the hundreds of participants, and I was blown away by the level of excitement, engagement and anticipation for this new product.

(By the way, if you missed the series, all nine sessions are available for replay.) There were some great questions generated by users throughout the webinar series, and my original plan was to use this blog entry to highlight them all. I quickly realized there were too many to expect anyone to read through all of them, so I’ve taken a different tack. This blog will feature the most common/important/interesting questions and answer them here for everyone’s benefit. Before diving in, I’ll answer the question that was by far the most commonly asked: “How long until the GA of Couchbase Server 2.0?” We are currently on track to release it before the end of the year. In the meantime, please feel free to experiment with the Developer Preview that is already available. As for the rest of the questions, here goes!

This looks very good but I have a suggestion.

I am going to write to Perry to suggest that he post all the questions that came up, wiki style, and let the user community explore answering them.

That could be a very useful community project and it would get all the questions that came up out in the open.

MongoGraph – MongoDB Meets the Semantic Web

Filed under: MongoDB,RDF,Semantic Web,SPARQL — Patrick Durusau @ 6:24 pm

MongoGraph – MongoDB Meets the Semantic Web

From the post (Franz Inc.):

Recorded Webcast: MongoGraph – MongoDB Meets the Semantic Web From October 12, 2011

MongoGraph is an effort to bring the Semantic Web to MongoDB developers. We implemented a MongoDB interface to AllegroGraph to give Javascript programmers both Joins and the Semantic Web. JSON objects are automatically translated into triples and both the MongoDB query language and SPARQL work against your objects.

Join us for this webcast to learn more about working on the level of objects instead of individual triples, where an object would be defined as all the triples with the same subject. We’ll discuss the simplicity of the MongoDB interface for working with objects and all the properties of an advanced triplestore, in this case joins through SPARQL queries, automatic indexing of all attributes/values, ACID properties all packaged to deliver a simple entry into the world of the Semantic Web.

I haven’t watched the video, yet, but:

working on the level of objects instead of individual triples, where an object would be defined as all the triples with the same subject.

certainly caught my eye.

Curious, if this means simply using the triples as sources of values and not “reasoning” with them?
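My reading of “an object is all the triples with the same subject” in a few lines of Python; this is an illustration of the idea, not Franz’s actual mapping:

```python
def json_to_triples(doc, id_key="_id"):
    """Flatten a JSON-style object into (subject, predicate, object) triples.
    The 'object' view is then simply: all the triples that share one subject."""
    subject = doc[id_key]
    triples = []
    for key, value in doc.items():
        if key == id_key:
            continue
        for v in value if isinstance(value, list) else [value]:
            triples.append((subject, key, v))
    return triples

# Invented example document.
person = {"_id": "person:42", "name": "Ada", "knows": ["person:7", "person:13"]}
for t in json_to_triples(person):
    print(t)
# ('person:42', 'name', 'Ada')
# ('person:42', 'knows', 'person:7')
# ('person:42', 'knows', 'person:13')
```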

Microsoft unites SQL Server with Hadoop

Filed under: Hadoop,SQL Server — Patrick Durusau @ 6:24 pm

Microsoft unites SQL Server with Hadoop by Ted Samson.

From the post:

Microsoft today revealed more details surrounding Windows and SQL Server 12 support for big data analytics via cozier integration with Apache Hadoop, the increasingly popular open source cloud platform for handling the vast quantities of unstructured data spawned daily.

With this move, Microsoft may be able to pull off a feat that has eluded other companies: bring big data to the mainstream. As it stands, only large-scale companies with fat IT budgets have been able to reap that analytical bounty, as the tools on the market tend to be both complex and pricey.

Microsoft’s strategy is to groom Linux-friendlier Hadoop to fit snugly into Windows environments, thus giving organizations on-tap, seamless, and simultaneous access to both structured and unstructured data via familiar desktop apps, such as Excel, as well as BI tools such as Microsoft PowerPivot.

That’s the thing, isn’t it? There are only so many DoD-size contracts to go around. True enough, MS will get its share of those as well (enterprises don’t call the corner IT shop). But the larger market is all the non-supersized enterprises with only internal IT shops and limited budgets.

By making MS apps the information superhighway to information stored/processed elsewhere/elsehow (read non-MS), MS opens up an entire world for its user base. Needs to be seamless but I assume MS will be devoting sufficient resources to that cause.

The more seamless MS makes its apps with non-MS innovations, such as Hadoop, the more attractive its apps become to its user base.

The ultimate irony. Non-MS innovators driving demand for MS products.

Jasondb

Filed under: Jasondb,JSON,NoSQL — Patrick Durusau @ 6:24 pm

Jasondb

From the website:

A Cloud NoSQL JSON Database

I don’t know that you will find this a useful entry into the Cloud/NoSQL race but it does come with comics. 😉

I haven’t signed up for the beta but did skim the blog.

In his design principles, the author complains about HTTP being slow. Maybe I should send him a pointer to: Optimizing HTTP: Keep-alive and Pipelining. What do you think?

If you join the beta, let me know what you think are strong/weak points from a topic map perspective. Thanks!

OrientDB version 1.0rc6

Filed under: NoSQL,OrientDB — Patrick Durusau @ 6:23 pm

OrientDB version 1.0rc6

From the post:

Hi all,
after some delays the new release is between us: OrientDB 1.0rc6. This is supposed to be the latest SNAPSHOT before official the 1.0.

Before to go in deep with this release I’d like to report you the chance to hack all together against OrientDB & Graph stuff at the next Berlin GraphDB Dojo event: http://www.graph-database.org/2011/09/28/call-for-participations-berlin-dojo/.

Direct download links

OrientDB embedded and server: http://code.google.com/p/orient/downloads/detail?name=orientdb-1.0rc6.zip
OrientDB Graph(ed): http://code.google.com/p/orient/downloads/detail?name=orientdb-graphed-1.0rc6.zip

List of changes

  • SQL engine: improved link navigation (issue 230)
  • Console: new “list databases” command (issue 389)
  • Index: supported composite indexes (issue 405), indexing of collections (issue 554)
  • JPA: supported @Embedded (issue 436) and @Transient annotations
  • Object Database: Disable/Enable lazy loading (issue 563)
  • Server: new Automatic backup task (issue 556), now installable as Windows Service (issue 61)
  • Client: Load balancing in clustered configuration (issue 557)
  • 34 issues closed

This looks great!

I want to call your attention to the composite indexes issue (issue 405). An index built across multiple fields. Hmmm, composite identifiers anyone?
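The composite-index idea in a few toy lines of Python, nothing OrientDB-specific: a lookup keyed on a tuple of fields, which is exactly the shape a composite identifier would take.

```python
# A composite index is a lookup keyed on a tuple of fields; swap "index" for
# "identifier" and the same shape identifies a subject by several properties at once.
records = [
    {"first": "Ada",  "last": "Lovelace", "born": 1815},
    {"first": "Alan", "last": "Turing",   "born": 1912},
]

index = {(r["first"], r["last"]): r for r in records}   # composite key: (first, last)

print(index[("Ada", "Lovelace")]["born"])                # -> 1815
```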

dmoz: computers: artificial intelligence

Filed under: Artificial Intelligence,Data Source — Patrick Durusau @ 6:23 pm

dmoz: computers: artificial intelligence

I ran across this listing of resources, some 1,294 of them as of this morning.

Amusing to note that despite the category being “Artificial Intelligence,” “Programming Languages” shows “(0).”

Before you leap to the defense of dmoz, yes, I know that if you follow the “Programming Languages” link, you will find numerous Lisp resources (as well as others).

Three problems:

First, it isn’t immediately obvious that you should follow “Programming Languages” to find Lisp. After all, it says “(0).” What does that usually mean?

Second, the granularity (or lack thereof) of such a resource listing enables easier navigation, but at the expense of detail. Surely, in a post-print world, we can create “views” on the fly that serve the need to navigate as well as varying needs for detail or different navigations.

Third, and most importantly from my perspective, how do you stay aware of new materials and find old materials at these sites? RSS feeds can help with changes, but they don’t gather similar reports together and certainly don’t help with material already posted.

Another rich lode of resources where delivery could be greatly improved.

Hierarchical Temporal Memory related Papers and Books

Filed under: Artificial Intelligence,Hierarchical Temporal Memory (HTM) — Patrick Durusau @ 6:23 pm

Hierarchical Temporal Memory related Papers and Books

From the post:

I’m writing a report about using Hierarchical Temporal Memory to model kids behaviour learning a second Language. I have Googled many times to find related works. But I noticed that there are just some works related to the HTM. I’ll upload them all here to have a quick reference. I didn’t put link to the original materials to have always a copy of the originals and to be affected by web-site changes. Take note that some of the uploaded contents (in special Numenta Inc. published articles) are licensed and must be used according to the respective License.

I haven’t explored the area, yet, but this is as good a starting point as any.

Hierarchical Temporal Memory

Filed under: CS Lectures,Hierarchical Temporal Memory (HTM),Machine Learning — Patrick Durusau @ 6:23 pm

Hierarchical Temporal Memory: How a Theory of the Neocortex May Lead to Truly Intelligent Machines by Jeff Hawkins.

Don’t skip because of the title!

Hawkins covers his theory of the neocortex, but however you feel about that, two-thirds of the presentation is on algorithms, completely new material.

Very cool presentation on “Fixed Sparsity Distributed Representation” and lots of neural science stuff. Need to listen to it again and then read the books/papers.

What I liked about it was the notion that, even in very noisy or missing-data contexts, highly reliable identifications can be made.

True enough, Hawkins was talking about vision, etc., but he didn’t bring up any reasons why that could not work in other data environments.

In other words, when can a program treat extra data about a subject as noise and recognize it anyway?

Or if some information is missing about a subject, have a program reliably recognize it.

Or if we only want to store some information and yet have reliable recognition?

Don’t know if any, some or all of those are possible but it is certainly worth finding out.
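A rough sketch of why that kind of robustness is plausible, using the sparse distributed representations Hawkins talks about. The vector size, sparsity and noise levels below are invented for illustration; this is the flavor of the argument, not his algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)
N, BITS = 2048, 40                       # 2048-bit vectors with only 40 active bits (sparse)

def random_sdr():
    v = np.zeros(N, dtype=bool)
    v[rng.choice(N, BITS, replace=False)] = True
    return v

stored = random_sdr()                    # the pattern the system has "learned"

noisy = stored.copy()
lost = rng.choice(np.flatnonzero(stored), 15, replace=False)
noisy[lost] = False                      # drop 15 of the 40 active bits (missing data)
noisy[rng.choice(N, 15, replace=False)] = True    # add 15 spurious bits (extra/noisy data)

unrelated = random_sdr()

print("overlap with noisy copy:", int(np.count_nonzero(stored & noisy)))      # ~25 of 40
print("overlap with unrelated :", int(np.count_nonzero(stored & unrelated)))  # almost always 0-2
```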

Description:

Jeff Hawkins (Numenta founder) presents as part of the UBC Department of Computer Science’s Distinguished Lecture Series, March 18, 2010.

Coaxing computers to perform basic acts of perception and robotics, let alone high-level thought, has been difficult. No existing computer can recognize pictures, understand language, or navigate through a cluttered room with anywhere near the facility of a child. Hawkins and his colleagues have developed a model of how the neocortex performs these and other tasks. The theory, called Hierarchical Temporal Memory, explains how the hierarchical structure of the neocortex builds a model of its world and uses this model for inference and prediction. To turn this theory into a useful technology, Hawkins has created a company called Numenta. In this talk Hawkins will describe the theory, its biological basis, and progress in applying Hierarchical Temporal Memory to machine learning problems.

Part of this theory was described in Hawkins’ 2004 book, On Intelligence. Further information can be found at www.Numenta.com

