Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

April 13, 2012

Neo4J Tales from the Trenches: A Recommendation Engine Case Study

Filed under: Neo4j,Recommendation — Patrick Durusau @ 4:43 pm

Neo4J Tales from the Trenches: A Recommendation Engine Case Study

25 April 2012 – at 18:30 (“Oh to be in London,” he wished. Not for the last time.)

From the post:

In this talk for the Neo4j User Group, Nicki Watt and Michal Bachman present the lessons learned (and being learned) on an active Neo4J project – Opigram.

Opigram is a socially orientated recommendation engine which is already live, with some 150k users and growing. Nicki and Michal will outline their usage of Neo4j, and some of the challenges they have encountered, as well as the approaches and implications taken to address them.

Sounds like a good introduction to Neo4j in the context of an actual project.
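
The talk itself is not reproduced here, but the shape of a graph-backed recommendation is easy to sketch. Below is a toy friend-of-friend recommender in Python over an in-memory graph (not Opigram's implementation, just the general idea, with made-up names):

```python
# Toy friend-of-friend recommender: suggest items liked by a user's friends
# that the user has not liked yet. Plain dictionaries stand in for the graph store.
from collections import Counter

friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}
likes = {
    "alice": {"neo4j", "topic maps"},
    "bob": {"neo4j", "lucene"},
    "carol": {"solr", "lucene"},
    "dave": {"hadoop"},
}

def recommend(user, top_n=3):
    scores = Counter()
    for friend in friends.get(user, set()):
        for item in likes.get(friend, set()) - likes.get(user, set()):
            scores[item] += 1          # one vote per friend who likes the item
    return [item for item, _ in scores.most_common(top_n)]

print(recommend("alice"))   # ['lucene', 'solr']
```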

Sentiment Lexicons (a list)

Filed under: Lexicon,Sentiment Analysis — Patrick Durusau @ 4:42 pm

Sentiment Lexicons (a list)

From the post:

For those interested in sentiment analysis, I culled some of the sentiment lexicons mentioned in Jurafsky’s NLP class lecture 7-3 and also discussed in Chris Potts’ notes here:

Suggestions of other sentiment or other lexicons? The main ones are fairly well known.

The main ones are just that, the main ones. They may or may not reflect the sentiment in particular locales.
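
For readers new to the area, these lexicons are usually applied by nothing more elaborate than counting matches. A minimal sketch, with a toy lexicon of my own rather than any of the lists above:

```python
# Minimal lexicon-based sentiment scoring: count positive and negative hits.
# The two word sets below are toy stand-ins, not an actual published lexicon.
positive = {"good", "great", "excellent", "happy", "love"}
negative = {"bad", "terrible", "awful", "sad", "hate"}

def sentiment_score(text):
    tokens = text.lower().split()
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos - neg        # > 0 leans positive, < 0 leans negative

print(sentiment_score("great service but terrible coffee"))   # 0 (one of each)
```

The locale issue shows up exactly here: the same surface word can land in a different set, or in no set at all, depending on where the lexicon came from.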

Probabilistic Programming

Filed under: Church,Probabilistic Programming — Patrick Durusau @ 4:40 pm

Probabilistic Programming by Deniz Yuret.

From the post:

The probabilistic programming language Church brings together two of my favorite subjects: Scheme and Probability. I highly recommend this tutorial to graduate students interested in machine learning and statistical inference. The tutorial explains probabilistic inference through programming starting from simple generative models with biased coins and dice leading up to hierarchical, non-parametric, recursive and nested models. Even at the undergraduate level, I have long thought probability and statistics should be taught in an integrated manner instead of their current almost independent treatment. One roadblock is that even the simplest statistical inference (e.g. three tosses of a coin with an unknown (uniformly distributed) weight results in H, H, T; what is the fourth toss?) requires some calculus at the undergraduate level. Using a programming language like Church may allow an instructor to introduce basic concepts without students getting confused about the details of integration.
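
The coin example in the quote is easy to make concrete. Here is a minimal sketch of the same inference in Python, using rejection sampling in place of Church's machinery; the exact answer, by Laplace's rule of succession, is (2+1)/(3+2) = 0.6:

```python
# Rejection sampling for the coin example: weight ~ Uniform(0, 1),
# observe H, H, T in three tosses, estimate P(fourth toss = H).
import random

def flip(weight):
    return "H" if random.random() < weight else "T"

accepted = []
while len(accepted) < 20000:
    weight = random.random()                      # uniform prior on the weight
    if [flip(weight) for _ in range(3)] == ["H", "H", "T"]:
        accepted.append(weight)                   # keep weights consistent with the data

# P(next toss is heads) is the posterior mean of the weight; exact value is 0.6
print(sum(accepted) / len(accepted))
```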

Good pointers on probabilistic programming resources. Enjoy!

Operations, machine learning and premature babies

Filed under: Bioinformatics,Biomedical,Machine Learning — Patrick Durusau @ 4:40 pm

Operations, machine learning and premature babies: An astonishing connection between web ops and medical care. By Mike Loukides.

From the post:

Julie Steele and I recently had lunch with Etsy’s John Allspaw and Kellan Elliott-McCrea. I’m not sure how we got there, but we made a connection that was (to me) astonishing between web operations and medical care for premature infants.

I’ve written several times about IBM’s work in neonatal intensive care at the University of Toronto. In any neonatal intensive care unit (NICU), every baby is connected to dozens of monitors. And each monitor is streaming hundreds of readings per second into various data systems. They can generate alerts if anything goes severely out of spec, but in normal operation, they just generate a summary report for the doctor every half hour or so.

IBM discovered that by applying machine learning to the full data stream, they were able to diagnose some dangerous infections a full day before any symptoms were noticeable to a human. That’s amazing in itself, but what’s more important is what they were looking for. I expected them to be looking for telltale spikes or irregularities in the readings: perhaps not serious enough to generate an alarm on their own, but still, the sort of things you’d intuitively expect of a person about to become ill. But according to Anjul Bhambhri, IBM’s Vice President of Big Data, the telltale signal wasn’t spikes or irregularities, but the opposite. There’s a certain normal variation in heart rate, etc., throughout the day, and babies who were about to become sick didn’t exhibit the variation. Their heart rate was too normal; it didn’t change throughout the day as much as it should.

That observation strikes me as revolutionary. It’s easy to detect problems when something goes out of spec: If you have a fever, you know you’re sick. But how do you detect problems that don’t set off an alarm? How many diseases have early symptoms that are too subtle for a human to notice, and only accessible to a machine learning system that can sift through gigabytes of data?
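
The “too normal” signal is simple to sketch, even though IBM's actual models surely are not. The idea is to flag a monitor stream whose variation falls well below its own baseline; the window sizes and thresholds below are made up for illustration:

```python
# Flag windows where heart-rate variability drops well below baseline.
# Thresholds and window sizes are illustrative only, not IBM's method.
import numpy as np

def low_variability_windows(heart_rate, window=60, fraction=0.5):
    hr = np.asarray(heart_rate, dtype=float)
    n_windows = len(hr) // window
    stds = np.array([hr[i * window:(i + 1) * window].std() for i in range(n_windows)])
    baseline = np.median(stds)                        # typical variability for this baby
    return np.where(stds < fraction * baseline)[0]    # suspiciously "quiet" windows

rng = np.random.default_rng(0)
normal = rng.normal(130, 5, 600)                      # normal variation
quiet = rng.normal(130, 1, 120)                       # variability collapses
print(low_variability_windows(np.concatenate([normal, quiet])))   # flags the last windows
```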

The post goes on to discuss how our servers may exhibit behaviors that machine learning could recognize but that we can’t specify.

That may be Rumsfeld’s “unknown unknowns,” however much we all laughed at the time.

There are “unknown unknowns” and tireless machine learning may be the only way to identify them.

In topic map lingo, I would say there are subjects that we haven’t yet learned to recognize.

Lucene Core 3.6.0 and Solr 3.6.0 Available

Filed under: Lucene,Solr — Patrick Durusau @ 3:01 pm

Lucene Core 3.6.0 and Solr 3.6.0 Available

You weren’t seriously planning on doing spring cleaning this weekend, were you?

Thanks to the Lucene/Solr release, which you naturally have to evaluate before Monday, that has been pushed off another week.

Hopefully something big will drop in the Hadoop ecosystem this coming week or perhaps from one of the graph databases. Will keep an eye out.

The Lucene PMC is pleased to announce the availability of Apache Lucene 3.6.0 and Apache Solr 3.6.0

Lucene can be downloaded from http://lucene.apache.org/core/mirrors-core-latest-redir.html and Solr can be downloaded from http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Highlights of the Lucene release include:

  • In addition to Java 5 and Java 6, this release has now full Java 7 support (minimum JDK 7u1 required).
  • TypeTokenFilter filters tokens based on their TypeAttribute.
  • Fixed offset bugs in a number of CharFilters, Tokenizers and TokenFilters that could lead to exceptions during highlighting.
  • Added phonetic encoders: Metaphone, Soundex, Caverphone, Beider-Morse, etc.
  • CJKBigramFilter and CJKWidthFilter replace CJKTokenizer.
  • Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation.
  • Static index pruning (Carmel pruning) removes postings with low within-document term frequency.
  • QueryParser now interprets ‘*’ as an open end for range queries.
  • FieldValueFilter excludes documents missing the specified field.
  • CheckIndex and IndexUpgrader allow you to specify the specific FSDirectory implementation to use with the new -dir-impl command-line option.
  • FSTs can now do reverse lookup (by output) in certain cases and can be packed to reduce their size. There is now a method to retrieve top N shortest paths from a start node in an FST.
  • New WFSTCompletionLookup suggester supports finer-grained ranking for suggestions.
  • FST based suggesters now use an offline (disk-based) sort, instead of in-memory sort, when pre-sorting the suggestions.
  • ToChildBlockJoinQuery joins in the opposite direction (parent down to child documents).
  • New query-time joining is more flexible (but less performant) than index-time joins.
  • Added HTMLStripCharFilter to strip HTML markup.
  • Security fix: Better prevention of virtual machine SIGSEGVs when using MMapDirectory: Code using cloned IndexInputs of already closed indexes could possibly crash VM, allowing DoS attacks to your application.
  • Many bug fixes.

Highlights of the Solr release include:

  • New SolrJ client connector using Apache Http Components http client (SOLR-2020)
  • Many analyzer factories are now ‘multi term query aware’ allowing for things like field type aware lowercasing when building prefix & wildcard queries. (SOLR-2438)
  • New Kuromoji morphological analyzer tokenizes Japanese text, producing both compound words and their segmentation. (SOLR-3056)
  • Range Faceting (Dates & Numbers) is now supported in distributed search (SOLR-1709)
  • HTMLStripCharFilter has been completely re-implemented, fixing many bugs and greatly improving the performance (LUCENE-3690)
  • StreamingUpdateSolrServer now supports the javabin format (SOLR-1565)
  • New LFU Cache option for use in Solr’s internal caches. (SOLR-2906)
  • Memory performance improvements to all FST based suggesters (SOLR-2888)
  • New WFSTLookupFactory suggester supports finer-grained ranking for suggestions. (LUCENE-3714)
  • New options for configuring the amount of concurrency used in distributed searches (SOLR-3221)
  • Many bug fixes
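
One small item above, ‘*’ as an open end for range queries, is easy to try once a server is running. A hedged sketch against Solr’s standard /select handler, assuming a local Solr 3.6 instance with a numeric field named price (both the URL and the field are assumptions, not anything from the announcement):

```python
# Query a local Solr instance for documents with price >= 100 (open-ended range).
# The URL, core layout and field name are assumptions for illustration.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "price:[100 TO *]",   # '*' as the open end of the range
    "wt": "json",              # ask for a JSON response
    "rows": 10,
})
with urllib.request.urlopen("http://localhost:8983/solr/select?" + params) as resp:
    results = json.load(resp)

print(results["response"]["numFound"])
```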

Neo4j 1.7.M03 – Feature Complete

Filed under: Graphs,Neo4j — Patrick Durusau @ 6:28 am

Neo4j 1.7.M03 – Feature Complete

Andreas Kollegger writes:

The full general release of Neo4j 1.7 is now in view, with this milestone marking feature completeness. This 1.7.M03 release is recommended for migrating your test servers, client applications and drivers in anticipation of 1.7.GA, since there will be no more visible API changes.

Would appreciate your taking a look at: Atomic Array -[:renamed_to]-> Garbage Collection Resistant in the Neo4j blog post.

I may be missing some subtle “graph” humor.

I thought cache invalidation and naming were subtly related. Yes? Or am I being overly technical?

It would be helpful to have a list of the “features” when a release is “feature complete.” You can look at GitHub for 1.7.M03, but information push is a more effective means of communication.

Attend to the discussion at the Neo4j Google group. (Google, being a proper name, always begins with a capital letter. One of those “naming thing” issues. 😉 )

April 12, 2012

Drizzle: An Open Source Microkernel DBMS for High Performance Scale-Out Applications

Filed under: Database,Drizzle,MySQL — Patrick Durusau @ 7:07 pm

Drizzle: An Open Source Microkernel DBMS for High Performance Scale-Out Applications

From the webpage:

The Global Drizzle Development Team is pleased to announce the immediate availability of Drizzle 7.1.33-stable. The first stable release of Drizzle 7.1 and the result of 12 months of hard work from contributors around the world.

Improvements in Drizzle 7.1 compared to 7.0

  • Xtrabackup is included (in-tree) by Stewart Smith
  • Multi-source replication by David Shrewsbury
  • Improved execute parser by Brian Aker and Vijay Samuel
  • Servers are identified with UUID in replication by Joe Daly
  • HTTP JSON API (experimental) by Stewart Smith
  • Percona Innodb patches merged by Laurynas Biveinis
  • JS plugin: execute JavaScript code as a Drizzle function by Henrik Ingo
  • IPV6 data type by Muhammad Umair
  • Improvements to libdrizzle client library by Andrew Hutchings and Brian Aker
  • Query log plugin and auth_schema by Daniel Nichter
  • ZeroMQ plugin by Markus Eriksson
  • Ability to publish transactions to zeromq and rabbitmq by Marcus Eriksson
  • Replication Dictionary by Brian Aker
  • Log output to syslog is enabled by default by Brian Aker
  • Improvements to logging stats plugin
  • Removal of drizzleadmin utility (you can now do all administration from drizzle client itself) by Andrew Hutchings
  • Improved Regex Plugin by Clint Byrum
  • Improvements to pandora build by Monty Taylor
  • New version numbering system and support for it in pandora-build by Henrik Ingo
  • Updated DEB and RPM packages, by Henrik Ingo
  • Revamped testing system Kewpie all-inclusive with suites of randgen, sysbench, sql-bench, and crashme tests by Patrick Crews
  • Removal of HailDB engine by Stewart Smith
  • Removal of PBMS engine
  • Continued code refactoring by Olaf van der Spek, Brian Aker and others
  • many bug fixes
  • Brian Aker, Mark Atwood – Continuous Integration
  • Vijay Samuel – Release Manager

From the documentation page:

Drizzle is a transactional, relational, community-driven open-source database that is forked from the popular MySQL database.

The Drizzle team has removed non-essential code, has re-factored the remaining code, and has converted the code to modern C++ and modern libraries.

Charter

  • A database optimized for Cloud infrastructure and Web applications
  • Design for massive concurrency on modern multi-CPU architectures
  • Optimize memory use for increased performance and parallelism
  • Open source, open community, open design

Scope

  • Re-designed modular architecture providing plugins with defined APIs
  • Simple design for ease of use and administration
  • Reliable, ACID transactional

If you like databases and data structure research, now is a wonderful time to be active.

The most important decision in data mining

Filed under: Data Mining,Topic Maps — Patrick Durusau @ 7:06 pm

The most important decision in data mining

A whimsical post that includes this pearl:

It is a fact that a prediction model of the right target is much better than a good prediction model of the wrong or suboptimal target.

Same is true for a topic map except there we would say: A topic map of the right subject(s) is much better than a good topic map of the wrong subject(s).

That means understanding what your clients/users want to talk about. Not what hypothetical Martians might want to talk about. Unless they land and have something of value to trade for modifications to an existing topic map. 😉

The Guide on the Side

Filed under: Education,Interface Research/Design — Patrick Durusau @ 7:05 pm

The Guide on the Side by Meredith Farkas.

From the post:

Many librarians have embraced the use of active learning in their teaching. Moving away from lectures and toward activities that get students using the skills they’re learning can lead to more meaningful learning experiences. It’s one thing to tell someone how to do something, but to have them actually do it themselves, with expert guidance, makes it much more likely that they’ll be able to do it later on their own.

Replicating that same “guide on the side” model online, however, has proven difficult. Librarians, like most instructors, have largely gone back to a lecture model of delivering instruction. Certainly it’s a great deal more difficult to develop active learning exercises, or even interactivity, in online instruction, but many of the tools and techniques that have been embraced by librarians for developing online tutorials and other learning objects do not allow students to practice what they’re learning while they’re learning. While some software for creating screencasts—video tutorials that film activity on one’s desktop—include the ability to create quizzes or interactive components, users can’t easily work with a library resource and watch a screencast at the same time.

In 2000, the reference desk staff at the University of Arizona was looking for an effective way to build web-based tutorials to embed in a class that had resulted in a lot of traffic at the reference desk. Not convinced of the efficacy of traditional tutorials to instruct students on using databases, the librarians “began using a more step-by-step approach where students were guided to perform specific searches and locate specific articles,” Instructional Services Librarian Leslie Sult told me. The students were then assessed on their ability to conduct searches in the specific resources assigned. Later, Sult, Mike Hagedon, and Justin Spargur of the library’s scholarly publishing and data management team, turned this early active learning tutorial model into Guide on the Side software.

Guide on the Side is an interface that allows librarians at all levels of technological skill to easily develop a tutorial that resides in an online box beside a live web page students can use. Students can read the instructions provided by the librarian while actively using a database, without needing to switch between screens. This allows students to use a database while still receiving expert guidance, much like they could in the classroom.

Meredith goes on to provide links to examples of such “Guide on the Side” resources and promises code to appear on GitHub early this summer.

This looks like a wonderful way to teach topic maps.

Comments/suggestions?

From Zero to Machine Learning in Less than Seven Minutes

Filed under: Machine Learning — Patrick Durusau @ 7:05 pm

From Zero to Machine Learning in Less than Seven Minutes by Charles Parker.

From the post:

Here at BigML, we do a lot of work trying to make machine learning accessible. This involves a lot of thought about everything from classification algorithms, to data visualization, to infrastructure, databases, particle physics, and security.

Okay, not particle physics. But definitely all of that other stuff.

After all that thinking, our hope is that we’ve built something that non-experts can use to build data-driven decisions into their applications and business logic. To get you started, we’ve made a series of short videos showing the key features of the site. Watch and learn. Machine learning is only seven minutes away.

An impressively done “…less than seven minutes.” Watch all the videos and in particular watch for the “live pruning slider.” Worth the time you will spend on the videos.

It elides many of the difficulties found in machine learning, but isn’t that part of being a service? That is, if you tooled this by hand, there would be a lot more detail and many more choices at every turn.

By reducing the number of options and choices, as well as glossing over some of the explanations, this service may bring machine learning to a larger user population.

What would it look like to do something similar for topic maps?

Thoughts?

Up to Speed on R Graphics

Filed under: Graphics,R,Visualization — Patrick Durusau @ 7:05 pm

Up to Speed on R Graphics

Steve Miller writes:

One of the new OpenBI hires tasked with getting up to speed on R approached me a few weeks back to solicit my recommendations for books/websites to turbo-charge her understanding of R graphics.

I was only too happy to offer my $.02 on materials I’ve found productive over the years, hoping to spare her a bit of the learning curve I experienced on my own 10 years ago. Fortunately, R’s statistical graphics are top-notch and there’s now no shortage of excellent sources for eager students.

Rather appropriate given my post about interpreting charts, don’t you think? 😉

Seriously, this is a great resource for R graphics resources.

You may understand your findings but if you can’t communicate them effectively, such as by graphic presentation, you will be the only one who understands your findings. Doesn’t lead to a lot of repeat work.

A chart that stops the story-telling impetus

Filed under: Graphs,Infographics — Patrick Durusau @ 7:05 pm

A chart that stops the story-telling impetus

From the post:

We all like to tell stories. One device that has produced a lot of stories, and provoked much imagination is the dual-axis plot showing two time series. Is there a correlation or is there not? Unfortunately, most of these stories are false.

The post proceeds to illustrate that the relationship it depicts isn’t present in another presentation of the data. (Using “…home sales and median home price in Claremont over the last six years…” as a data set.)

I don’t disagree that a different depiction of the same data is, well, different, but that was the point of the exercise. Yes?

That is to say, I would not make a chart of data that contradicted some point I was trying to make in an argument. Or at least not one that I understood to contradict a point I was trying to make.

My personal rule is that when someone shows me a chart, statistics, test results, analysis of any sort, they are trying to persuade me that one or more facts are the case. What else would they be trying to do? (They could just be trying to annoy me, but let’s set that case to one side.)

I think library students and others need to be aware that vendors use charts and other means of persuasion because they are marketing a product. Not in bad faith because they may really believe their product will suit your needs as well as their need for a sale. A win-win situation.

What you need to do is push back with your understanding of the “facts,” with your own charts or interpretation of their charts.

Just as a tip, have your needs and your users’ needs depicted in colorful charts for sales meetings. So you can put a big red X on any feature you need that the vendor doesn’t offer. That is the card/chart you need to have on top of the stack at all times.

From Beaker to Bits: Graph Theory Yields Computational Model of Human Tissue

Filed under: Bioinformatics,Biomedical,Graphs — Patrick Durusau @ 7:04 pm

From Beaker to Bits: Graph Theory Yields Computational Model of Human Tissue

An all too rare example of how reaching across disciplinary lines can lead to fundamental breakthroughs in more than one area.

First step, alert any graph or data store people you know, along with any medical research types.

Second step, if you are in CS/Math, think about another department that interests you. If you are in other sciences or humanities, strike up a conversation with the CS/Math department types.

In both cases, don’t take “no” or lack of interest as an answer. Talk to the newest faculty or even faculty at other institutions. Or even established companies.

No guarantees that you will strike up a successful collaboration, much less have a successful result. But, we all know how successful a project that never begins will be, don’t we?

Here is a story of a collaborative project that persisted and succeeded:

Computer scientists and biologists in the Data Science Research Center at Rensselaer Polytechnic Institute have developed a rare collaboration between the two very different fields to pick apart a fundamental roadblock to progress in modern medicine. Their unique partnership has uncovered a new computational model called “cell graphs” that links the structure of human tissue to its corresponding biological function. The tool is a promising step in the effort to bring the power of computational science together with traditional biology to the fight against human diseases, such as cancer.

The discovery follows a more than six-year collaboration, breaking ground in both fields. The work will serve as a new method to understand and predict relationships between the cells and tissues in the human body, which is essential to detect, diagnose and treat human disease. It also serves as an important reminder of the power of collaboration in the scientific process.

The new research led by Professor of Biology George Plopper and Professor of Computer Science Bulent Yener is published in the March 30, 2012, edition of the journal PLoS One in a paper titled, “Coupled Analysis of in Vitro and Histology Tissue Samples to Quantify Structure-Function Relationship.” They were joined in the research by Evrim Acar, a graduate student at Rensselaer in Yener’s lab currently at the University of Copenhagen. The research is funded by the National Institutes of Health and the Villum Foundation.

The new, purely computational tool models the relationship between the structure and function of different tissues in body. As an example of this process, the new paper analyzes the structure and function of healthy and cancerous brain, breast and bone tissues. The model can be used to determine computationally whether a tissue sample is cancerous or not, rather than relying on the human eye as is currently done by pathologists around the world each day. The objective technique can be used to eliminate differences of opinion between doctors and as a training tool for new cancer pathologists, according to Yener and Plopper. The tool also helps fill an important gap in biological knowledge, they said.

BTW, if you want to see all the details: Coupled Analysis of in Vitro and Histology Tissue Samples to Quantify Structure-Function Relationship
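
If you want a feel for what a “cell graph” is, here is a toy sketch (not the authors’ pipeline): treat each cell centroid as a node, connect cells that lie closer than a distance threshold, and compute global graph features that a classifier could learn from.

```python
# Toy cell graph: nodes are cell centroids, edges connect nearby cells,
# and a few global graph metrics summarize the tissue's "structure".
import math
import random
import networkx as nx

random.seed(0)
centroids = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(80)]

G = nx.Graph()
G.add_nodes_from(range(len(centroids)))
threshold = 15.0                                   # illustrative distance cutoff
for i, (x1, y1) in enumerate(centroids):
    for j in range(i + 1, len(centroids)):
        x2, y2 = centroids[j]
        if math.hypot(x1 - x2, y1 - y2) < threshold:
            G.add_edge(i, j)

features = {
    "average_degree": sum(d for _, d in G.degree()) / G.number_of_nodes(),
    "clustering_coefficient": nx.average_clustering(G),
    "connected_components": nx.number_connected_components(G),
}
print(features)   # a feature vector a classifier could learn from
```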

The CloudFormation Circle of Life : Part 1

Filed under: Amazon Web Services AWS,Cloud Computing — Patrick Durusau @ 7:04 pm

The CloudFormation Circle of Life : Part 1

From the post:

AWS CloudFormation makes it easier for you to create, update, and manage your AWS resources in a predictable way. Today, we are announcing a new feature for AWS CloudFormation that allows you to add or remove resources from your running stack, enabling your stack to evolve as its requirements change over time. With AWS CloudFormation, you can now manage the complete lifecycle of the AWS resources powering your application.

I think there is a name for this sort of thing. Innovation, that’s right! That’s the name for it!

As topic map services move into the clouds, being able to take advantage of resource stacks is likely to be important. Particularly if you have mapping empowered resources that can be placed in a stack of resources.

The “cloud” in general looks like an opportunity to move away from ETL (Extract-Transform-Load) into more of an ET (Extract-Transform) model. Particularly if you take a functional view of data. Will save on storage costs, particularly if the data sets are quite large.

Definitely a service that anyone working with topic maps in the cloud needs to know more about.

30 Places to Find Open Data on the Web

Filed under: Data,Dataset — Patrick Durusau @ 7:04 pm

30 Places to Find Open Data on the Web by Romy Misra.

From the post:

Finding an interesting data set and a story it tells can be the most difficult part of producing an infographic or data visualization.

Data visualization is the end artifact, but it involves multiple steps – finding reliable data, getting the data in the right format, cleaning it up (an often underestimated step in the amount of time it takes!) and then finding the story you will eventually visualize.

Following is a list of useful resources for finding data. Your needs will vary from one project to another, but this list is a great place to start — and bookmark.

A very good collection of data sources.

From the comments as of April 10, 2012, you may also want to consider:

http://data.gov.uk/

http://thedatahub.org/

http://www.freebase.com/

(The photography link in the comments is spam, don’t bother.)

Other data sources that you would suggest?

Is There A Dictionary In The House? (Savanna – Think Software)

Filed under: Integration,Intelligence,OWL,Semantic Web — Patrick Durusau @ 7:04 pm

Reading a white paper on an integration solution from Thetus Corporation (on its Savanna product line) when I encountered:

Savanna supports the core architectural premise that the integration of external services and components is an essential element of any enterprise platform by providing out-of-the-box integrations with many of the technologies and programs already in use in the DI2E framework. These investments include existing programs, such as: the Intelligence Community Data Layer (ICDL), OPTIC (force protection application), WATCHDOG (Terrorist Watchlist 2.0), SERENGETI (AFRICOM socio-cultural analysis), SCAN-R (EUCOM deep futures analysis); and, in the future: TAC (tripwire search and analysis), and HSCB-funded modeling capabilities, including Signature Analyst and others. To further make use of existing external services and components, the proposed solution includes integration points for commercial and opensource software, including: SOLR (indexing), Open Sextant (geotagging), Apache OpenNLP (entity extraction), R (statistical analysis), ESRI (geo-processing), OpenSGI GeoCache (geospatial data), i2 Analyst’s Notebook (charting and analysis) and a variety of structured and unstructured data repositories.

I have to plead ignorance of the “existing program” alphabet soup but I am familiar with several of the open source packages.

I am not sure what an “integration point” for an unknown future use of any of those packages would look like. Do you? Their output can be used by any program but that hardly qualifies the other program as having an “integration point.”

I am sensitive to the use of “integration” because to me it means there is some basis for integration. So a user having integrated data once, can re-use and possibly enhance the basis for integration of data with other data. (We call that “merging” in topic map land.)

Integration and even reuse is mentioned: “The Savanna architecture prevents creating a set of comparable reuse issues at the enterprise scale by providing a set of interconnected and flexible models that articulate how analysis assets are sourced and created and how they are used by the community.” (page 16)

But not in enough detail to really evaluate the basis for re-use of data, data structures, enrichment of the same, etc.

Looked around for an SDK or such but came up empty.

Point of amusement:

It’s official, we’re debuting our newest release of Savanna at DoDIIS (March 21, 2012) (Department of Defense Intelligence Information Systems Worldwide Conference (DoDIIS))

The next blog entry by date?

Happy Peaceful Birthday to the Peace Corps (March 1, 2012)

I would appreciate hearing from anyone with information or stories to tell about how Savanna works in practice.

In particular I am interested in whether two distinct Savanna installations can share information in a blind interchange? That should be the test of re-use of information by another installation.

Moreover, do I have to convert data between formats or can data structures themselves be entities with properties?

PS: I am not overly impressed with the use of OWL for modeling in Savanna. The experience with “big data” has shown that starting with data first leads to different, perhaps more useful models than the other way around.

Premature modeling with OWL will result in models that are “useful” in meeting the expectations of the creating analyst. That may not be the criteria of “usefulness” that is required.

Amazon CloudSearch – Start Searching in One Hour for Less Than $100 / Month

Filed under: Amazon CloudSearch,Search Engines,Searching — Patrick Durusau @ 10:50 am

Amazon CloudSearch – Start Searching in One Hour for Less Than $100 / Month

Jeff Barr, AWS Evangelist, has the easiest job on the Net! How hard can it be to bring “good news” (the original meaning of evangelist) when it just pours out from AWS. If you are an Amazon veep or some such, be assured that managing that much good news is hard. What do you say first?

From the post:

Continuing along in our quest to give you the tools that you need to build ridiculously powerful web sites and applications in no time flat at the lowest possible cost, I’d like to introduce you to Amazon CloudSearch. If you have ever searched Amazon.com, you’ve already used the technology that underlies CloudSearch. You can now have a very powerful and scalable search system (indexing and retrieval) up and running in less than an hour.

You, sitting in your corporate cubicle, your coffee shop, or your dorm room, now have access to search technology at a very affordable price. You can start to take advantage of many years of Amazon R&D in the search space for just $0.12 per hour (I’ll talk about pricing in depth later).

What is Search?

Search plays a major role in many web sites and other types of online applications. The basic model is seemingly simple. Think of your set of documents or your data collection as a book or a catalog, composed of a number of pages. You know that you can find the desired content quickly and efficiently by simply consulting the index.

Search does the same thing by indexing each document in a way that facilitates rapid retrieval. You enter some terms into a search box and the site responds (rather quickly if you use CloudSearch) with a list of pages that match the search terms.
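
The book-index analogy maps directly onto the data structure doing the work. CloudSearch itself is a hosted service, so this is just the idea, a minimal inverted index in Python:

```python
# A minimal inverted index: map each term to the set of documents containing it,
# then answer a query by intersecting those sets.
from collections import defaultdict

docs = {
    1: "cloud search as a service",
    2: "search technology at an affordable price",
    3: "index each document for rapid retrieval",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    term_sets = [index[t] for t in query.lower().split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("search service"))   # {1}
```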

My only quibble with the announcement is that it makes search sound too easy. Jeff does mention all the complex things you can do but a casual reader is left with the impression that search isn’t all that hard.

Well, I suppose search isn’t that hard but good searching is. Some very large concerns have made mediocre searching a real cash cow.

That model, the mediocre searching model, may not work for you. In that case, you can still use Amazon CloudSearch but you had best get some expert searching advice to go along with it.

Sqoop Graduation Meetup

Filed under: Cloudera,Sqoop — Patrick Durusau @ 9:23 am

Sqoop Graduation Meetup by Kathleen Ting.

From the post:

Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption.

Not only was this Sqoop’s second Meetup but also a celebration for Sqoop’s graduation from the Incubator, cementing its status as a Top-Level Project in Apache Software Foundation. Sqoop’s come a long way since its beginnings three years ago as a contrib module for Apache Hadoop submitted by Aaron Kimball. As a result, it was fitting that Aaron gave the first talk of the night by discussing its history: “Sqoop: The Early Days.” From Aaron, we learned that Sqoop’s original name was “SQLImport” and that it was conceived out of his frustration from the inability to easily query both unstructured and structured data at the same time. (Emphasis added.)

I don’t think the extra 20 people were present because of Sqoop.

Did you see the picture of the cake?

My vote goes for the cake as explanation. Yours? 😉

Congratulations to Sqoop, Sqoop team and community!

Let’s make sure on its first birthday a bigger cake is required!

Red Hat and 10gen: Deeper collaboration around MongoDB

Filed under: MongoDB,Red Hat — Patrick Durusau @ 8:49 am

Red Hat and 10gen: Deeper collaboration around MongoDB

From the post:

Today [April 9, 2012], Red Hat and 10gen jointly announced a deeper collaboration around MongoDB. By combining Red Hat’s traditional strengths in operating systems and middleware with 10gen’s expertise in database technology, we’re developing a robust open source platform on which to develop and deploy your next generation of applications either in your own data centers or in the cloud.

Over the next several months, we’ll be working closely with Red Hat to optimize and integrate MongoDB with a number of Red Hat products. You can look at this effort resulting in a set of reference designs, solutions, packages and documentation for deploying high-performance, scalable and secure applications with MongoDB and Red Hat software. Our first collaboration is around a blueprint for deploying MongoDB on Red Hat Enterprise Linux, which we will release shortly. We’ll follow that up with a number of additional projects around RHEL, JBoss, Red Hat Enterprise Virtualization (RHEV), Cloud Forms, Red Hat Storage (GlusterFS), and of course continue the work we have started with OpenShift. We hope to get much involvement from the Red Hat and MongoDB communities, and any enhancements to MongoDB resulting from this work will, of course, be open sourced.

Have you noticed that open source projects are trending towards bundling themselves with each other?

A healthy recognition that users want solutions rather than sporting with versions and configuration files.

April 11, 2012

Close Counts In Horseshoes, Hand Grenades and Clustering

Filed under: Clustering,Machine Learning,R — Patrick Durusau @ 6:18 pm

Machine Learning in R: Clustering by Ricky Ho.

Ricky writes:

Clustering is a very common technique in unsupervised machine learning to discover groups of data that are “close-by” to each other. It is broadly used in customer segmentation and outlier detection.

It is based on some notion of “distance” (the inverse of similarity) between data points and use that to identify data points that are close-by to each other. In the following, we discuss some very basic algorithms to come up with clusters, and use R as examples.

Covers K-Means, Hierarchical Clustering, Fuzzy C-Means, Multi-Gaussian with Expectation-Maximization, and Density-based Cluster algorithms.

Good introduction to the basics of clustering in R.
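
Ricky’s examples are in R; for comparison, here is a bare-bones k-means in Python (NumPy only), the usual “assign each point to its nearest center, then move the centers” loop:

```python
# Bare-bones k-means: alternate between assigning points to the nearest center
# and moving each center to the mean of its assigned points.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                      # nearest center per point
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(mu, 0.5, (50, 2)) for mu in (0, 5, 10)])
labels, centers = kmeans(X, k=3)
print(centers)   # roughly (0, 0), (5, 5), (10, 10)
```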

The Heat in SharePoint Semantics June 30 to July 6

Filed under: SharePoint,Topic Maps — Patrick Durusau @ 6:17 pm

The Heat in SharePoint Semantics June 30 to July 6

For a paid plug for SharePoint (The Trend Point), it isn’t bad. It covers resources for new users and those who have discovered they need something in addition to SharePoint (second-day users).

The poor performance of SharePoint on findability opens up a robust after-market for topic map products. It is by no means the only or even poorest content product that could benefit from the addition of topic maps. It is one of the more widely used ones and hence there is more commercial opportunity available.

I know that Networked Planet offers topic map based remedies for SharePoint. Others?

PS: I have no explanation for the odd titling of the original post or the rather odd clustering of links in the article. Try them and you will see what I mean.

Timeline Maps

Filed under: Mapping,Maps,Time,Timelines — Patrick Durusau @ 6:17 pm

Timeline Maps

From the post:

Mapping time has long been an interest of cartographers. Visualizing historical events in a timeline or chart or diagram is an effective way to show the rise and fall of empires and states, religious history, and important human and natural occurrences. We have over 100 examples in the Rumsey Map Collection, ranging in date from 1770 to 1967. We highlight a few below.

Sebastian Adams’ 1881 Synchronological Chart of Universal History is 23 feet long and shows 5,885 years of history, from 4004 B.C. to 1881 A.D. It is the longest timeline we have seen. The recently published Cartographies of Time calls it “nineteenth-century America’s surpassing achievement in complexity and synthetic power.” In the key to the map, Adams states that timeline maps enable learning and comprehension “through the eye to the mind.”

Below is a close up detail of a very small part of the chart: (click on the title or the image to open up the full chart)

Stunning visuals.

Our present day narratives aren’t any less arrogant than those of the 19th century but the distance is great enough for us to laugh at their presumption. Which, unlike our own, isn’t “true.” 😉

Worth all the time you can spend with the maps. Likely to provoke insights into how you have viewed “history” as well as how you view current “events.”

Tukey: Integrated Cloud Services For Big Data

Filed under: BigData,Cloud Computing — Patrick Durusau @ 6:17 pm

Open Cloud Consortium Announces First Integrated Set of Cloud Services for Researchers Working with Big Data

I just could not bring myself to use the original title for the post.

BTW, I misread the service name as “Turkey.” Maybe that is how it is pronounced?

Service for researchers, described as:

Today, the Open Cloud Consortium (OCC) announced the availability of Tukey, which is an innovative integrated set of cloud services designed specifically to enable scientific researchers to manage, analyze and make discoveries with big data.

Several public cloud service providers provide resources for individual scientists and small research groups, and large research groups can build their own dedicated infrastructure for big data. However, currently, there is no cloud service provider that is focused on providing services to projects that must work with big data, but are not large enough to build their own dedicated clouds.

Tukey is the first set of integrated cloud services to fill this niche.

Tukey was developed by the Open Cloud Consortium, a not-for-profit multi-organizational partnership. Many scientific projects are more comfortable hosting their data with a not-for-profit organization than with a commercial cloud service provider.

Cloud Service Providers (CSP) that are focused on meeting the needs of the research community are beginning to be called Science Cloud Service Providers or Sci CSPs (pronounced psi-sip). Cloud Service Providers serving the scientific community must support the long term archiving of data, large data flows so that large datasets can be easily imported and exported, parallel processing frameworks for analyzing large datasets, and high end computing.

“The Open Cloud Consortium is one of the first examples of an innovative resource that is being called a Science Cloud Service Provider or Sci CSP,” says Robert Grossman, Director of the Open Cloud Consortium. “Tukey makes it easy for scientific research projects to manage, analyze and share big data, something this is quite difficult to do with the services from commercial Cloud Service Providers.”

The beta version of Tukey is being used by several research projects, including: the Matsu Project, which hosts over two years of data from NASA’s EO-1 satellite; Bionimbus, which is a system for managing, analyzing, and sharing large genomic datasets; and bookworm, which is an applications that extracts patterns from large collections of books.

The services include: hosting large public scientific datasets; standard installations of the open source OpenStack and Eucalyptus systems, which provide instant on demand computing infrastructure; standard installations of the open source Hadoop system, which is the most popular platform for processing big data; standard installations of UDT, which is a protocol for transporting large datasets; and a variety of domain specific applications.

It isn’t clear to me what shortcomings of commercial cloud providers are being addressed.

Many researchers can’t build their own clouds but with commercial cloud providers, why would you want to?

Or take the claim:

“Tukey makes it easy for scientific research projects to manage, analyze and share big data, something this is quite difficult to do with the services from commercial Cloud Service Providers.”

How so? What prevents this with commercial cloud providers? Being on different clouds? But Tukey “corrects” this by requiring membership on its cloud. How is that any better?

Nothing against Tukey but I think being a non-profit isn’t enough of a justification for yet another cloud. What else makes it different from other clouds?

Clouds are important for topic maps as semantics will collide in clouds and making meaningful semantics rain from clouds is going to require topic maps or something quite similar.

Whamcloud, EMC Collaborate on PLFS and Lustre Integration

Filed under: Lustre,Parallel Programming,PLFS — Patrick Durusau @ 6:17 pm

Whamcloud, EMC Collaborate on PLFS and Lustre Integration

From the post:

Whamcloud, a venture-backed company formed from a worldwide network of high-performance computing (HPC) storage industry veterans, today announced it is extending its working relationship with EMC Corporation (NYSE:EMC). The relationship between the two companies began over a year ago and promotes the open source availability of Lustre. Whamcloud and EMC, a fellow member of the OpenSFS consortium, are extending their collaboration for an additional year.

Whamcloud and EMC will continue working together to provide deeper integration between the Parallel Log-structured File System (PLFS) and Lustre. As part of their joint efforts, Whamcloud and EMC will continue augmenting Lustre’s IO functionality, including the enhancement of small file IO and metadata performance. The two companies will look for multiple ways to contribute to the future feature development of Lustre.

PLFS is a parallel IO abstraction layer that rearranges unstructured, concurrent writes by many clients into sequential writes to unique files (N-1 into N-N) to improve the efficiency of the underlying parallel filesystem. PLFS can reduce checkpoint time by up to several orders of magnitude. Lustre is an open source massively parallel file system, generally used for large scale cluster computing. It is found in over 60% of the TOP100 supercomputing sites.

Less important for the business news aspects, more important as a heads-up on Lustre and PLFS.

Parallel semantic monogamy is one thing. Parallel semantic heterogeneity is another. Will your name/company be associated with solutions for the latter?

2012 PyData Workshop Videos

Filed under: BigData,Conferences,Python — Patrick Durusau @ 6:16 pm

2012 PyData Workshop Videos

From the webpage:

Check out these videos from the 2012 PyData Workshop, held on March 2nd & 3rd at the Googleplex in Mountain View, CA, and attended by a core group of data scientists interested in Python and Pythonistas interested in big data.

Joining a solid line-up of speakers was Guido van Rossum, author of the Python language, who engaged in an open panel discussion on the intersection of the Python language and the growth of the scientific community.

You can find all currently published videos below, but stay tuned as we’ll be releasing videos for many of the talks from this two day event.

The videos thus far (10 April 2012):

  • 2012 PyData Workshop Panel with Guido van Rossum
  • Python in Big Data with an overview of NumPy & SciPy
  • The Disco MapReduce Framework
  • Image Processing in Python with scikits-image
  • Boosting NumPy with Numexpr and Cython
  • Data Analysis in Python with Pandas
  • Advanced matplotlib Tutorial with library author John Hunter

Subjects are out there, you just have to find them.

GovTrack Adds Probabilities to Bill Prognosis

Filed under: Law,Legal Informatics — Patrick Durusau @ 6:16 pm

GovTrack Adds Probabilities to Bill Prognosis

From the post:

Dr. Joshua Tauberer (http://razor.occams.info/) of GovTrack has posted Even Better Bill Prognosis: Now with Real Probabilities, on the GovTrack Blog.

In this post, Dr. Tauberer describes the new probability-of-passage figure added to GovTrack’s bill prognosis feature. According to the post:

The analysis has a lot of the factors you would expect but more are certainly possible. Topic maps would be one way to help discover additional factors that should be added.

Personally I favor a “show me the money” type analysis for political decision making processes.

Calculating Word and N-Gram Statistics from the Gutenberg Corpus

Filed under: Gutenberg Corpus,N-Gram,NLTK,Statistics — Patrick Durusau @ 6:16 pm

Calculating Word and N-Gram Statistics from the Gutenberg Corpus by Richard Marsden.

From the post:

Following on from the previous article about scanning text files for word statistics, I shall extend this to use real large corpora. First we shall use this script to create statistics for the entire Gutenberg English language corpus. Next I shall do the same with the entire English language Wikipedia.

A “get your feet wet” sort of exercise with the script included.

The Gutenberg project isn’t “big data” but it is more than your usual inbox.

Think of it as learning about the data set for application of more sophisticated algorithms.
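
If you would rather not run the script from the post, the same kinds of counts take only a few lines with NLTK’s bundled Gutenberg sample (a small subset, not the full corpus the article processes):

```python
# Word and bigram frequencies over NLTK's small Gutenberg sample corpus.
import nltk
from nltk.corpus import gutenberg

nltk.download("gutenberg", quiet=True)

words = [w.lower() for w in gutenberg.words() if w.isalpha()]
unigrams = nltk.FreqDist(words)
bigrams = nltk.FreqDist(nltk.bigrams(words))

print(unigrams.most_common(10))   # the usual suspects: 'the', 'and', 'of', ...
print(bigrams.most_common(10))
```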

Reviews and Natural Language Processing: Clustering

Filed under: Clustering,Natural Language Processing — Patrick Durusau @ 6:15 pm

Reviews and Natural Language Processing: Clustering

From the post:

This quote initiated a Natural Language investigation into the HomeAway Review corpus: do the Traveler reviews (of properties) adhere to some set of standards? Reviews contain text and a “star” rating; does the text align with the rating? Analyzing its various corpora with Natural Language Processing tools allows HomeAway to better listen to – and better serve – its customers.

Interesting. HomeAway is a vacation rental marketplace and so has a pressing interest in the analysis of reviews.

Promises to be a very good grounding in NLP as applied to reviews. Worth watching closely.

Open Street Map GPS users mapped

Filed under: GPS,Mapping,Maps,Open Street Map — Patrick Durusau @ 6:15 pm

Open Street Map GPS users mapped

From the post:

Open Street Map is the data source that keeps on giving. Most recently, the latest release has been a dump of GPS data from its contributors. These are the track files from Sat Nav systems which its users have sourced for the raw data behind OSM.

It’s a huge dataset: 55GB and 2.8bn items. And Guardian Datastore Flickr group user Steven Kay decided to try to visualise it.

This is the result – and it’s only a random sample of the whole. The heatmap shows a random sample of 1% of the points and their distribution, to show where GPS is used to upload data to OSM.

There are just short of 2.8 billion points, so the sample is nearly 28 million points. Red cells have the most points, blue cells have the fewest.
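
The heatmap itself is conceptually just a 2-D histogram over the sampled points. A sketch with made-up coordinates standing in for the OSM dump:

```python
# Bin GPS points into a lat/lon grid; cell counts drive the heatmap colors.
import numpy as np

rng = np.random.default_rng(42)
lats = rng.uniform(-60, 70, 1_000_000)      # stand-in data, not the OSM dump
lons = rng.uniform(-180, 180, 1_000_000)

counts, lat_edges, lon_edges = np.histogram2d(lats, lons, bins=(130, 360))
print(counts.max(), counts.min())           # densest vs. emptiest cells
# e.g. plot with matplotlib: plt.imshow(counts, origin="lower", cmap="coolwarm")
```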

Great data set on its own but possibly the foundation for something even more interesting.

The intelligence types, who can’t analyze a small haystack effectively, want to build a bigger one: Building a Bigger Haystack.

Why not use GPS data such as this to create an “Intelligence Big Data Mining Test?” That is, we assign significance to patterns in the data and see if the intelligence side can come up with the same answers. We can tell them what the answers are because they must still demonstrate how they got there, not just the answer.

Wavii: New Kind Of News Gatherer – (Donii?)

Filed under: Artificial Intelligence,News,Summarization,Wavii — Patrick Durusau @ 4:58 pm

Wavii: New Kind Of News Gatherer by Thomas Claburn.

Wavii, a new breed of aggregator, gives you news feeds culled from across the Web, from sources far beyond Google News. It also understands your interests and summarizes results.

From the post:

Imagine being able to follow topics rather than people on social networks. Imagine a Google Alert that arrived because Google actually had some understanding of your interests beyond what can be gleaned from the keywords you provided. That’s basically what Wavii, entering open beta testing on Wednesday, makes possible: It offers a way to follow topics or concepts and to receive updates in an automatically generated summary format.

Founded in 2009 by Adrian Aoun, an entrepreneur and former employee of Microsoft and Fox Media Interactive, Wavii provides users with news feeds culled from across the Web that can be accessed via Wavii’s website or mobile app. Unlike Google Alerts, these feeds are composed from content beyond Google News. Wavii gathers its information from all over the Web–news, videos, tweets, and beyond–and then attempts to make sense of what it has found using machine learning techniques.

Wavii is not just a pattern-matching system. It recognizes linguistic concepts and that understanding makes its assistance more valuable: Not only is Wavii good at finding information that matches a user’s expressed interests but it also concisely summarizes that information. The company has succeeded at a task that other companies haven’t managed to do quite as well.

Sounds interesting. After the initial rush I will sign up for a test drive.

The story did not report what economic model Wavii will be following. I assume the server space and CPU cycles plus staff time aren’t being donated. Yes? Wonder why that wasn’t worth mentioning. You?

BTW, let’s not be like television where, if one housewife hooker show is successful this season, next season there will be higher and lower end housewives doing the same thing and next year, well, let’s just say one of the partners will be non-human.

Here’s my alternative: Donii – Donii reports donations to you from within 2 degrees of separation of the person in front of you. Custom level settings: Hug; Nod Encouragingly; Glad Hand; Look For Someone Else, Anyone Else.
