Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 17, 2011

Content Analysis

Filed under: Content Analysis,Law - Sources,Legal Informatics,Text Analytics — Patrick Durusau @ 6:33 am

Content Analysis by Michael Heise.

From the post:

Dan Katz (MSU) let me know about a beta release of new website, Legal Language Explorer, that will likely interest anyone who does content analysis as well as those looking for a neat (and, according to Jason Mazzone, addictive) toy to burn some time. The site, according to Dan, allows users: “the chance [free of charge] to search the history of the United States Supreme Court (1791-2005) for any phrase and get a frequency plot and the full text case results for that phrase.” Dan also reports that the developers hope to expand coverage beyond Supreme Court decisions in the future.

The site needs a For Amusement Only sticker. Legal language changes over time and probably no place more so than in Supreme Court decisions.

It was a standing joke in law school that the bar association sponsored the “Avoid Probate” sort of books. If you really want to incur legal fees, just try self-help. The same is true for this site. Use it to argue with your friends, settle bets during football games, etc. Don’t rely on it during nighttime roadside encounters with folks carrying weapons and radios to summon help (police).

Google; almost 50 functions & resources killed in 2011

Filed under: Search Interface,Searching,Semantic Diversity — Patrick Durusau @ 6:32 am

Google; almost 50 functions & resources killed in 2011 by Phil Bradley.

Just in case you want to think of other potential projects over the holidays! 😉

For my topic maps class:

  1. Pick one function or resource
  2. Outline how semantic integration could support or enhance such a function or resource. (3-5 pages, no cites)
  3. Bonus points: What resources would you want to integrate for such a function or resource? (1-2 pages)

Google removes more search functionality

Filed under: Advertising,Search Engines,Search Interface,Searching — Patrick Durusau @ 6:32 am

Google removes more search functionality by Phil Bradley.

From the post:

In Google’s apparently lemming like attempt to throw as much search functionality away as they can, they have now revamped their advanced search page. Regular readers will recall that I wrote about Google making it harder to find, and now they’re reducing the available options. The screen is now following the usual grey/white/read design, but to refresh your memory, this is what it used to look like:

Just in case you are looking for search opportunities in the near future.

The smart money says not to try to be everything to everybody. Pick off a popular (read: advertising-supporting) subpart of all content and work it up really well. Offer users in that area what seem like useful defaults. The defaults for the television/movie types are likely to be different from those for the Guns & Ammo crowd, as would the advertising you would sell.

Remind me to write about using topic maps to create pull-model advertising, so that viewers pre-qualify themselves and you can charge more for “hits” on ads.

Decoding jQuery

Filed under: JQuery — Patrick Durusau @ 6:32 am

Decoding jQuery by Shi Chuan.

From the introduction:

Open source is not open enough. To open the open source, in the Decoding jQuery series, we will break down every single method in jQuery, to study the beauty of the framework, as an appreciation to the creative geniuses behind it.

What looks like a promising series on jQuery!

As you can tell from my recent posts on Nutch, I hope this coming year will see more implementation/application side posts.

IBM Redbooks Reveals Content Analytics

Filed under: Analytics,Data Mining,Entity Extraction,Text Analytics — Patrick Durusau @ 6:31 am

IBM Redbooks Reveals Content Analytics

From Beyond Search:

IBM Redbooks has put out some juicy reading for the azure chip consultants wanting to get smart quickly with IBM Content Analytics Version 2.2: Discovering Actionable Insight from Your Content. The sixteen chapters of this book take the reader from an overview of IBM content analytics, through understanding the details, to troubleshooting tips. The above link provides an abstract of the book, as well as links to download it as a PDF, view in HTML/Java, or order a hardcopy.

Abstract:

With IBM® Content Analytics Version 2.2, you can unlock the value of unstructured content and gain new business insight. IBM Content Analytics Version 2.2 provides a robust interface for exploratory analytics of unstructured content. It empowers a new class of analytical applications that use this content. Through content analysis, IBM Content Analytics provides enterprises with tools to better identify new revenue opportunities, improve customer satisfaction, and provide early problem detection.

To help you achieve the most from your unstructured content, this IBM Redbooks® publication provides in-depth information about Content Analytics. This book examines the power and capabilities of Content Analytics, explores how it works, and explains how to design, prepare, install, configure, and use it to discover actionable business insights.

This book explains how to use the automatic text classification capability, from the IBM Classification Module, with Content Analytics. It explains how to use the LanguageWare® Resource Workbench to create custom annotators. It also explains how to work with the IBM Content Assessment offering to timely decommission obsolete and unnecessary content while preserving and using content that has business value.

The target audience of this book is decision makers, business users, and IT architects and specialists who want to understand and use their enterprise content to improve and enhance their business operations. It is also intended as a technical guide for use with the online information center to configure and perform content analysis with Content Analytics.

The cover article points out the Redbooks have an IBM slant, which isn’t surprising. When you need big iron for an enterprise project, that IBM is one of a handful of possible players isn’t surprising either.

Strong v Weak AI – The Chinese Room in 60 seconds

Filed under: Artificial Intelligence — Patrick Durusau @ 6:31 am

Strong v Weak AI – The Chinese Room in 60 seconds by Mike James.

Whichever side you are on, I think you will agree this is a very amusing and telling presentation. Certainly there is more that can be said for either side but this presentation captures its essence in 60 seconds.

What I keep searching for is a way to capture topic maps and their potential this succinctly.

Broad Institute

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 6:30 am

Broad Institute

In their own words:

The Eli and Edythe L. Broad Institute of Harvard and MIT is founded on two core beliefs:

  1. This generation has a historic opportunity and responsibility to transform medicine by using systematic approaches in the biological sciences to dramatically accelerate the understanding and treatment of disease.
  2. To fulfill this mission, we need new kinds of research institutions, with a deeply collaborative spirit across disciplines and organizations, and having the capacity to tackle ambitious challenges.

The Broad Institute is essentially an “experiment” in a new way of doing science, empowering this generation of researchers to:

  • Act nimbly. Encouraging creativity often means moving quickly, and taking risks on new approaches and structures that often defy conventional wisdom.
  • Work boldly. Meeting the biomedical challenges of this generation requires the capacity to mount projects at any scale — from a single individual to teams of hundreds of scientists.
  • Share openly. Seizing scientific opportunities requires creating methods, tools and massive data sets — and making them available to the entire scientific community to rapidly accelerate biomedical advancement.
  • Reach globally. Biomedicine should address the medical challenges of the entire world, not just advanced economies, and include scientists in developing countries as equal partners whose knowledge and experience are critical to driving progress.

The Detecting Novel Associations in Large Data Sets software and data are from the Broad Institute.

Sounds like the sort of place that would be interested in enhancing research and sharing of information with topic maps.

December 16, 2011

Detecting Novel Associations in Large Data Sets

Filed under: Bioinformatics,Data Mining,Statistics — Patrick Durusau @ 8:23 am

Detecting Novel Associations in Large Data Sets by David N. Reshef, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter J. Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabeti.

Abstract:

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.

Lay version: Tool detects patterns hidden in vast data sets by Haley Bridger.

Data and software: http://exploredata.net/.

From the article:

Imagine a data set with hundreds of variables, which may contain important, undiscovered relationships. There are tens of thousands of variable pairs—far too many to examine manually. If you do not already know what kinds of relationships to search for, how do you efficiently identify the important ones? Data sets of this size are increasingly common in fields as varied as genomics, physics, political science, and economics, making this question an important and growing challenge (1, 2).

One way to begin exploring a large data set is to search for pairs of variables that are closely associated. To do this, we could calculate some measure of dependence for each pair, rank the pairs by their scores, and examine the top-scoring pairs. For this strategy to work, the statistic we use to measure dependence should have two heuristic properties: generality and equitability.

By generality, we mean that with sufficient sample size the statistic should capture a wide range of interesting associations, not limited to specific function types (such as linear, exponential, or periodic), or even to all functional relationships (3). The latter condition is desirable because not only do relationships take many functional forms, but many important relationships—for example, a superposition of functions—are not well modeled by a function (4–7).

By equitability, we mean that the statistic should give similar scores to equally noisy relationships of different types. For example, we do not want noisy linear relationships to drive strong sinusoidal relationships from the top of the list. Equitability is difficult to formalize for associations in general but has a clear interpretation in the basic case of functional relationships: An equitable statistic should give similar scores to functional relationships with similar R2 values (given sufficient sample size).

Here, we describe an exploratory data analysis tool, the maximal information coefficient (MIC), that satisfies these two heuristic properties. We establish MIC’s generality through proofs, show its equitability on functional relationships through simulations, and observe that this translates into intuitively equitable behavior on more general associations. Furthermore, we illustrate that MIC gives rise to a larger family of statistics, which we refer to as MINE, or maximal information-based nonparametric exploration. MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity. We demonstrate the application of MIC and MINE to data sets in health, baseball, genomics, and the human microbiota. (footnotes omitted)

As you can imagine the line:

MINE statistics can be used not only to identify interesting associations, but also to characterize them according to properties such as nonlinearity and monotonicity.

caught my eye.
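To get a feel for what a grid-based dependence score does, here is a minimal Python sketch. It is not the authors’ MIC algorithm (MIC maximizes a normalized mutual information over many grid resolutions and placements); this toy version uses a single fixed grid, but it shows why such a score reacts to nonlinear as well as linear structure.

# Toy grid-based dependence score in the spirit of MIC (not the published
# algorithm): bin both variables on one fixed grid, estimate mutual
# information, and normalize so the score falls between 0 and 1.
import numpy as np

def grid_dependence(x, y, bins=8):
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()                    # joint distribution on the grid
    px = pxy.sum(axis=1, keepdims=True)            # marginal of x
    py = pxy.sum(axis=0, keepdims=True)            # marginal of y
    nz = pxy > 0
    mi = (pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum()
    return mi / np.log2(bins)                      # MI is at most log2(bins)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
print(grid_dependence(x, x ** 2 + rng.normal(0, 0.05, 1000)))  # strong, nonlinear
print(grid_dependence(x, rng.uniform(-1, 1, 1000)))            # near zero

On this data the noisy parabola scores far higher than the independent pair, which is the flavor of behavior the paper’s generality property asks for.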

I usually don’t post until the evening but this looks very important. I wanted everyone to have a chance to grab the data and software before the weekend.

New acronyms:

MIC – maximal information coefficient

MINE – maximal information-based nonparametric exploration

Good thing they chose acronyms we would not be likely to confuse with other usages. 😉

Full citation:

Science 16 December 2011:
Vol. 334 no. 6062 pp. 1518-1524
DOI: 10.1126/science.1205438

December 15, 2011

EMC Greenplum puts a social spin on big data

Filed under: BigData,Chorus,Facebook,Hadoop — Patrick Durusau @ 7:53 pm

EMC Greenplum puts a social spin on big data

From the post:

Greenplum, the analytics division of EMC, has announced new software that lets data analysts explore all their organization’s data and share interesting findings and data sets Facebook-style among their colleagues. The product is called Chorus, and it wraps around EMC’s Greenplum Database and Hadoop distribution, making all that data available for the data team work with.

The pitch here is about unifying the analytic database and Hadoop environments and making it as easy and collaborative as possible to work with data, since EMC thinks a larger percentage of employees will have to figure out how to analyze business data. Plus, because EMC doesn’t have any legacy database or business intelligence products to protect, the entire focus of the Greenplum division is on providing the best big-data experience possible.

From the Chorus product page:

Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from anywhere in the organization. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data. Customers deploy Chorus to create a self-service agile analytic infrastructure; teams can create workspaces on the fly with self-service provisioning, and then instantly start creating and sharing insights.

Chorus breaks down the walls between all of the individuals involved in the data science team and empowers everyone who works with your data to more easily collaborate and derive insight from that data.

Note to EMC Greenplum: If you want people to at least consider products, don’t hide them so that searching is necessary to find them. Just an FYI.

The Resources page is pretty thin, but better than the blah-blah “more information” page. It could have more details, perhaps a demo version?

A button that says “Contact Sales” makes me lose interest real quick. I don’t need some software salesperson pinging me during an editing cycle to ask whether I have installed the “free” software yet and am I ready to order. Buying software really should be on my schedule, not theirs. Yes?

Neo4jPHP

Filed under: Neo4j,PHP — Patrick Durusau @ 7:52 pm

Neo4jPHP by Josh Adell.

From the webpage:

PHP Wrapper for the Neo4j graph database REST interface

In-depth documentation and examples: http://github.com/jadell/neo4jphp/wiki

API documentation: http://jadell.github.com/neo4jphp

And from the introduction:

The goal of Neo4jPHP is to provide you with access to all the functionality of the Neo4j REST API via PHP. It does not provide a one-to-one correspondence with the REST API calls; instead, the REST interface is abstracted away so that you can concentrate on modelling your application’s domain in nodes and relationships. Neo4jPHP provides an API that is both intuitive and flexible, and it takes advantage of “under-the-hood” performance enhancements, such as caching and lazy-loading.

I just scanned through the documentation but it looks fairly clean and well-written. I pushed this out to the XML4Lib list but if you know of other library lists, please share it.
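For a sense of what is being abstracted away, here is a rough sketch of the kind of raw REST calls the wrapper wraps, written in Python with the requests library. The paths assume Neo4j’s classic /db/data REST interface of that era; treat the endpoint details as illustrative rather than authoritative.

# Rough sketch of the raw Neo4j REST calls that Neo4jPHP hides behind its API.
# Endpoint paths assume the classic Neo4j 1.x "/db/data" interface; adjust for
# your server version.
import requests

BASE = "http://localhost:7474/db/data"

def create_node(props):
    r = requests.post(f"{BASE}/node", json=props)
    r.raise_for_status()
    return r.json()["self"]            # URL identifying the new node

def relate(from_url, to_url, rel_type):
    r = requests.post(f"{from_url}/relationships",
                      json={"to": to_url, "type": rel_type})
    r.raise_for_status()
    return r.json()["self"]

alice = create_node({"name": "Alice"})
bob = create_node({"name": "Bob"})
relate(alice, bob, "KNOWS")

Neo4jPHP hides those URLs behind node and relationship objects, plus the caching and lazy loading mentioned in the introduction.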

Neo4j JDBC driver

Filed under: JDBC,Neo4j — Patrick Durusau @ 7:51 pm

Neo4j JDBC driver

From the webpage:

This is a first attempt at creating a JDBC driver for the graph database Neo4j. While Neo4j is a graph database, and JDBC is based on the relational paradigm, this driver provides a way to bridge this gap.

This is done by introducing type nodes in the graph, which are directly related to the root node by the relationship TYPE. Each type node has a property “type” with its name (i.e. “tablename”), and HAS_PROPERTY relationships to nodes that represent the properties that the node can have (i.e. “columns”). For each instance of this type (i.e. “row”) there is a relationship from the instance to the type node via the IS_A relationship. By using this structure the JDBC driver can mimic a relational database, and provide a means to execute queries against the Neo4j server.

Now that isn’t something you see every day! 😉
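To make the description above concrete, here is a plain-Python mock-up of that layout, with dictionaries standing in for Neo4j nodes. The identifiers and helper functions are mine, for illustration only, not part of the driver.

# Plain-Python mock-up of the layout the driver description implies:
# a type node per "table", HAS_PROPERTY edges to its "columns", and
# IS_A edges from each instance ("row") back to its type node.
graph = {"nodes": {}, "edges": []}   # edges: (from_id, rel_type, to_id)

def add_node(node_id, **props):
    graph["nodes"][node_id] = props
    return node_id

def add_edge(src, rel, dst):
    graph["edges"].append((src, rel, dst))

root = add_node("root")
person = add_node("type:person", type="person")             # the "table"
add_edge(root, "TYPE", person)
for col in ("name", "email"):                               # the "columns"
    add_edge(person, "HAS_PROPERTY", add_node(f"prop:{col}", name=col))

row = add_node("person:1", name="Ada", email="ada@example.org")   # a "row"
add_edge(row, "IS_A", person)

# A SELECT over "person" then becomes: follow IS_A edges into the type node.
rows = [graph["nodes"][s] for (s, rel, d) in graph["edges"]
        if rel == "IS_A" and d == person]
print(rows)   # [{'name': 'Ada', 'email': 'ada@example.org'}]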

What if there were a GrJDBC driver? A Graph JDBC driver? One that views tables, rows, columns, column headers, cells, and values as graph nodes with defined properties, read from a configuration file that identifies some database:table?

Extending the recovery of investment in large relational clusters by endowing them with graph-like capabilities (dare I say topic map like capabilities?) would be a real plus in favor of adoption. Not to mention that in read-only mode, you could demonstrate it with the client’s data.

Contrast that with all the stammering from your competition about the need to convert, etc.

I will poke around because it seems like something like that has been done but it was a long time ago. I seem to remember it wasn’t a driver but a relational database built as a graph. The same principles should apply. If I find it I will post a link (if online) or a citation to the hard copy.

Google Map Maker Opens Its Editing Tools To Everyone

Filed under: Mapping,Maps — Patrick Durusau @ 7:50 pm

Google Map Maker Opens Its Editing Tools To Everyone By Jon Mitchell.

From the post:

Google announced a major redesign of Google Map Maker today. This is the tool that allows anyone to propose edits to the live Google map, so that locals can offer more detail than Google’s own teams can provide. The new tools offer simple ways to add and edit places, roads and paths, as well as reviewing the edits of others.

That peer review element is key to Google Maps’ new direction. In September, Google rearranged the Map Maker review process, deputizing regional expert reviewers to expand its capacity to handle crowd-sourced edits. Today’s new tools take that a step further, allowing anyone to review proposed edits before they’re incorporated into the live map.

Is there a lesson for crowd-sourced topic maps here?

Or do we have to go through the painful cycles of peer review + editors, only to eventually find that the impact on quality is nearly nil? At least for public maps. Speciality maps, where you have to at least know the domain, may, emphasis on may, be a different issue.

If you are a professional in a field, consider how many “peer-reviewed” articles from twenty (20) years ago are still cited today? They were supposed to be the best papers to be read at a conference or published in your flagship journal. Yes?

Some still are cited. Now that’s peer review. But it took twenty years to kick in.

I suspect the real issue for most topic maps is going to be too few contributors and not too many of the unwashed.

Mapping, like vocabularies, is a question of who gets to decide.

Topincs 5.7.0

Filed under: Topic Map Software,Topic Maps,Topincs — Patrick Durusau @ 7:49 pm

Topincs 5.7.0

From the webpage:

Description

This version offers a bundle of new features to make it easy for the developer to create tailored views for users with minimal coding effort:

  • Up to this Topincs version all statements made in a form were validated independent from each other. With compound constraints this has an end. By using tiny JavaScript snippets arbitrary validation rules can be formulated.
  • Customizable context menus on topic pages offer tailored actions that mean more to the user than the generic edit button. The context menu is by default to the left of the page on the opposite side of all generic functions. Forms can be entered with bound values inferred from the context (time, subject, …). This new feature bridges the gap from the generic web database to web application.
  • It is now possible to freeze topics in the user interface and the API.

Apart from these core features a number of smaller improvements and changes were made, most notably the support for SSL was verified.

Download here.

You would think software authors would not depend upon ragged bloggers to supply the download links for their software. 😉

That it would be the first thing out of their mouths: Download Topincs HERE! or something like that.

Maybe it is just me. With every release I have to think about how to get back to the downloads page.

Do take a look!

AlgLab: An Open Laboratory for Experiments On Algorithms

Filed under: Algorithms — Patrick Durusau @ 7:48 pm

AlgLab: An Open Laboratory for Experiments On Algorithms

From the webpage:

Welcome to the open laboratory for experiments on algorithms. Computer scientists experiment on algorithms for many reasons, including:

  • To understand their fundamental properties.
  • To compare several algorithms to choose the best one for application.
  • To tune an algorithm to perform well in a given context.

This web site contains laboratories for conducting experiments on algorithms. It is a companion site for A Guide to Experimental Algorithmics, by Catherine McGeoch (Cambridge University Press).

If you didn’t get the brochure, go to: www.cambridge.org/us/compsci11. Publication date is 2012.

Given the large number of similarity measures, I suspect that experiments with similarity algorithms are going to be a “hot” area of research as well. No one measure is going to be the “best” one for all uses.

What is unknown (at this point) is which factors may be indicators or break-points for choosing one similarity algorithm over another.
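As a tiny illustration of why no single measure wins everywhere, here is a sketch (in Python, with made-up example documents) where Jaccard similarity over token sets and cosine similarity over term counts give noticeably different answers for the same pair.

# Two common similarity measures disagreeing on the same pair of "documents":
# Jaccard compares token sets, cosine compares term-count vectors.
import math

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cosine(a, b):
    vocab = sorted(set(a) | set(b))
    va = [a.count(t) for t in vocab]
    vb = [b.count(t) for t in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (math.sqrt(sum(x * x for x in va)) * math.sqrt(sum(x * x for x in vb)))

doc1 = "topic maps merge subjects across vocabularies".split()
doc2 = "topic maps topic maps merge merge merge subjects".split()
print(jaccard(doc1, doc2), cosine(doc1, doc2))   # roughly 0.67 vs. 0.77

Jaccard only sees which tokens occur; cosine also weights repetition, so the “right” choice depends on whether frequency matters for your notion of sameness.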

Raven DB – Stable Build – 573

Filed under: NoSQL,RavenDB — Patrick Durusau @ 7:45 pm

Raven DB – Stable Build – 573

From the email:

We now have a new stable build (finally).

It got delayed because of the new UI, and we *still have* new UI features that you’ll probably like that are going to show up on the unstable build, because I decided that enough is enough. We had almost two months without a real stable build, and we have had major work to improve things.

All our production stuff is now running 573. Here are all the new stuff:

Major:

  • The new UI is in the stable build
  • Optimized indexing – will not index documents that can be pre-filtered
  • Optimizing deletes
  • Reduced memory usage

New features:

  • Logs are available over the http API and using the new UI
  • Optimized handling of the server for big documents by streaming documents, rather than copying them
  • Updated to json.net 4.0.5
  • adding a way to control the capitalization of document keys
  • Added “More Like This” bundle
  • Licensing status is now reported in the UI.
  • Provide an event to notify about changes in failover status
  • Adding support for incremental backups
  • Allow nested queries to be optimized by the query optimizer
  • Use less memory on 32 bits systems
  • Raven.Backup executable
  • Much better interactive mode
  • Supporting projecting of complex paths
  • Support Count, Length on json paths
  • Allow to configure multi tenant idle times
  • Adding command line option to get all configuration documentation
  • Properly handle the scenario where we are unloading the domain / exiting without shutting down the database.
  • Will now push unstable versions to nuget as well

Something nice for the Windows side of the house.

SQL Database-as-a-Service

Filed under: Marketing,PostgreSQL,Topic Maps — Patrick Durusau @ 7:44 pm

SQL Database-as-a-Service

Documentation

Just starting the documentation but two quick thoughts:

First, most conventionally, this could be the back-end to a topic map server. Despite having started off many years ago in server administration (or perhaps because of it), I don’t think server configuration/management should be an “additional” duty for mission-critical development/support staff. It is too easy to hire server management services, which can provide maintenance/support that no small firm could afford locally.

Second, a bit more unconventionally, this could be an illustration for a Topic-Map-as-a-Service. Think about it. If, instead of the mish-mash that is Wikipedia, you had a topic map of facts that were supplemented by (read: mapped to) records from various public reporting agencies, it could be interesting to “press” a local topic map against it to acquire more recent data.

True, there are the public record services, but they only give you person-by-person records and not the relationships between them. Not to mention that if you are inventive, you could create some very interesting topic maps (intersections of records).

Imagine the stir that a topic map of local license plates spotted at motels would cause. Rather than offering free access, since most people would only be interested in one license plate in particular, show one plate each hour, at some random time, along with where it was seen (not the date). Sell advertising for the page where you offer the “free” sneak peek. I suspect you had better have some load expansion capacity.

Ambiguity in the Cloud

Filed under: Cloud Computing,Marketing,Topic Maps,Uncategorized — Patrick Durusau @ 7:43 pm

If you are interested at all in cloud computing and its adoption, you need to read US Government Cloud Computing Technology Roadmap Volume I Release 1.0 (Draft). I know, a title like that is hardly inviting. But read it anyway. Part of a three volume set, for the other volumes see: NIST Cloud Computing Program.

Would you care to wager, out of ten (10) requirements, how many cited a need for interoperability that is presently lacking due to different understandings and different terminology, in other words, ambiguity?

Good decision.

The answer? 8 out of 10 requirements cited by NIST have interoperability as a component.

The plan from NIST is to develop a common model, which will be a useful exercise, but how do we discuss differing terminologies until we can arrive at a common one?

Or allow for discussion of previous SLAs, for example, after we have all moved on to a new terminology?

If you are looking for a “hot” topic that could benefit from the application of topic maps (as opposed to, say, choir programs at your local church during the Great Depression), this could be the one. One of those is a demonstration of a commercial-grade technology; the other is at best a local access channel offering. You pick which is which.

December 14, 2011

IBM and Drug Companies Donate Data

Filed under: Cheminformatics,Dataset — Patrick Durusau @ 7:46 pm

IBM Contributes Data to the National Institutes of Health to Speed Drug Discovery and Cancer Research Innovation

From the post:

In collaboration with AstraZeneca, Bristol-Myers Squibb, DuPont and Pfizer, IBM is providing a database of more than 2.4 million chemical compounds extracted from about 4.7 million patents and 11 million biomedical journal abstracts from 1976 to 2000. The announcement was made at an IBM forum on U.S. economic competitiveness in the 21st century, exploring how private sector innovations and investment can be more easily shared in the public domain.

Excellent news and kudos to IBM and its partners for making the information available!

Now it is up to you to find creative ways to explore, connect up, analyze the data across other information sets.

My first question would be what was mentioned besides chemicals in the biomedical journal abstracts? Care to make an association to represent that relationship?

Why? Well, for example, exposure to raw benzene, a by-product of oil refining, can produce symptoms nearly identical to those of leukemia. Where would you encounter such a substance? Well, try living in Nicaragua for more than a decade, where every day the floors are cleaned with raw benzene. Of course, in the States, doctors don’t check for exposure to banned substances. Cases like that.

BTW, the data is already up, see: PubChem. Follow the links to the interface and click on “structures.” Not my area but the chemical structures are interesting enough that I may have to get a chemistry book for Christmas so I can have some understanding of what I am seeing.

That is probably the best part of being interested in semantic integration: it cuts across all fields, and new discoveries await with every turn of the page.

Cloudera Manager 3.7 released

Filed under: Hadoop,MapReduce — Patrick Durusau @ 7:46 pm

Cloudera Manager 3.7 released

From the post:

Cloudera Manager 3.7, a major new version of Cloudera’s Management applications for Apache Hadoop, is now available. Cloudera Manager Free Edition is a free download, and the Enterprise edition of Cloudera Manager is available as part of the Cloudera Enterprise subscription.

Cloudera Manager 3.7 includes several new features and enhancements:

  • Automated Hadoop Deployment – Cloudera Manager 3.7 allows you to install the complete Hadoop stack in minutes. We’ve now upgraded Cloudera Manager with the easy installation we first introduced in version 3.6 of SCM Express. (SCM Express is now replaced by Cloudera Manager Free Edition.).
  • Centralized Management UI – Version 3.5 of the Cloudera Management Suite included distinct modules for Resource Management, Activity Monitoring and Service and Configuration Management. In Cloudera Manager 3.7, all of these feature sets are now integrated into one centralized browser-based administration console.
  • Service & Configuration Management – We added several new configuration wizards to guide you in properly configuring HDFS and HBase host deployments, adding new hosts on demand, and adding/restarting services as needed. Cloudera Manager 3.7 now also manages Oozie and Hue.
  • Service Monitoring – Cloudera Manager monitors the health of your key Hadoop services—HDFS, HBase, MapReduce—and displays alerts on suspicious or bad health. For example, to determine the health of HDFS, Cloudera Manager measures the percentage of corrupt, missing, or under-replicated blocks. Cloudera Manager also checks if the NameNode is swapping memory or spending too much time in Garbage Collection, and whether HDFS has enough free space. Trends in relevant metrics can be visualized through time-series charts.
  • Log Search – You can search through all logs for Hadoop services across the whole cluster. You can also view results filtered by service, role, host, search phrase and log event severity.
  • Events and Alerts – Cloudera Manager proactively reports on important events such as the change in a service’s health, detection of a log message of appropriate severity, or the slowness (or failure) of a job. Cloudera Manager aggregates the events for easy filtering and viewing, and you can configure Cloudera Manager to send email alerts.
  • Global Time Control – You can view the state of your system for any time period in the past. Combined with health state, events and log information, this feature serves as a powerful diagnostic tool.
  • Role-based Administration – Cloudera Manager 3.7 supports two types of users: admin users, who can change configs and execute commands and workflows; and read-only users, who can only monitor the system.
  • Configuration versioning and Audit trails – You can view a complete history of configuration changes with user annotations. You can roll-back to previous configuration states.
  • Activity Monitoring – The Activity Monitoring feature includes several performance and scale improvements.
  • Operational Reports – The ‘Resource Manager’ feature in the Cloudera Management Suite 3.5 is now in Cloudera Manager’s ‘Reports’ feature. You can visualize disk usage by user, group, and directory; you can track MapReduce activity on the cluster by job, or by user.
  • Support Integration – We’ve improved the Cloudera support experience by adding a feature that lets you send a snapshot of your cluster state to our support team for expedited resolution.
  • Cloudera Manager Free Edition and 1-click Upgrade – The Free Edition of Cloudera Manager includes a subset of the features described above. After you install Cloudera Manager Free Edition, you can easily upgrade to the Enterprise edition by entering a license key. Your data will be preserved as the Cloudera Manager wizard guides you through the upgrade.

You can download the new Cloudera Manager 3.7 at: https://ccp.cloudera.com/display/SUPPORT/Downloads . Check it out. We look forward to your feedback.

P.S. : We’re hiring! Visit: http://www.cloudera.com/company/careers

Makes you wonder where we will be a year from now. Not just with Hadoop but with algorithms for graphs, functional data structures, etc.

Care to make any forecasts?

A Task-based Model of Search

Filed under: Modeling,Search Behavior,Search Interface,Searching — Patrick Durusau @ 7:46 pm

A Task-based Model of Search by Tony Russell-Rose.

From the post:

A little while ago I posted an article called Findability is just So Last Year, in which I argued that the current focus (dare I say fixation) of the search community on findability was somewhat limiting, and that in my experience (of enterprise search, at least), there are a great many other types of information-seeking behaviour that aren’t adequately accommodated by the ‘search as findability’ model. I’m talking here about things like analysis, sensemaking, and other problem-solving oriented behaviours.

Now, I’m not the first person to have made this observation (and I doubt I’ll be the last), but it occurs to me that one of the reasons the debate exists in the first place is that the community lacks a shared vocabulary for defining these concepts, and when we each talk about “search tasks” we may actually be referring to quite different things. So to clarify how I see the landscape, I’ve put together the short piece below. More importantly, I’ve tried to connect the conceptual (aka academic) material to current design practice, so that we can see what difference it might make if we had a shared perspective on these things. As always, comments & feedback welcome.

High marks for a start on what are complex and intertwined issues.

Not so much that we will reach a common vocabulary, but so that we can be clearer about where we get confused when moving from one paradigm to another.

Stop Mining Data!

Filed under: Data Mining — Patrick Durusau @ 7:46 pm

Stop Mining Data! by Matthew Hurst.

The title caught my attention, particularly given that Matthew Hurst was saying it!

From the post:

In some recent planning and architectural discussions I’ve become aware of the significant difference between reasoning about data and reasoning about the world that the data represents.

Before reading his post, care to guess what entity is “…reasoning about the world that the data represents”?

Go on! Take a chance! 😉

I don’t know if his use of “record linkage” was in the technical sense or not. Will have to ask.

Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

Filed under: Data Mining,Social Networks — Patrick Durusau @ 7:45 pm

Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

From the post:

Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I determined how many approved friends each borrower had.

Moral of this story: Pick better friends. 😉

Question: Has anyone done the same sort of analysis on arrest/conviction records? Include known children in the social network as well.

What other information would you want to bind into the social network?
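For anyone who wants to repeat that sort of tabulation on their own data, here is a minimal sketch of the grouping described in the post. The CSV layout (columns borrower_id, approved, approved_friend_count) is my assumption for illustration, not Prosper’s actual schema.

# Sketch of the tabulation described above: approval rate grouped by the
# number of approved friends. The CSV layout is assumed, not Prosper's.
import csv
from collections import defaultdict

totals = defaultdict(lambda: [0, 0])    # friend_count -> [approved, total]

with open("prosper_borrowers.csv", newline="") as f:
    for row in csv.DictReader(f):
        n_friends = int(row["approved_friend_count"])
        totals[n_friends][0] += int(row["approved"])   # 1 = approved, 0 = declined
        totals[n_friends][1] += 1

for n_friends in sorted(totals):
    approved, total = totals[n_friends]
    print(f"{n_friends} approved friends: {approved / total:.1%} approval rate")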

A TinkerPop Story

Filed under: Blueprints,Frames,Furnace,Gremlin,Pipes,Rexster,TinkerPop — Patrick Durusau @ 7:45 pm

A TinkerPop Story

From the post:

In a time long, long right now and a place far, far within, there exists a little green gremlin named…well, Gremlin. Gremlin lives in a place known as TinkerPop. For those who think of a “place” as some terrestrial surface coating a sphere that is circling one of the many massive fiery nuclear reactors in the known universe, TinkerPop is that, yet at the same time, a wholly different type of place indeed.

In a day of obscure (are there any other kind?) errors and annoyances, this is an absolute delight!

Highly recommended!

Nutch Tutorial: Supplemental III

Filed under: Nutch,Search Engines,Searching — Patrick Durusau @ 7:45 pm

Apologies for the diversion in Nutch Tutorial: Supplemental II.

We left off last time with a safe way to extract the URLs from the RDF text without having to parse the XML and without having to expand the file onto the file system. And we produced a unique set of URLs.

We still need a random set of URLs; 1,000 was the amount mentioned in the Nutch Tutorial at Option 1.

Since we did not parse the RDF, we can’t use the subset option for org.apache.nutch.tools.DmozParser.

So, back to the Unix command line and our file with 3838759 lines in it, each with a unique URL.

Let’s do this a step at a time and we can pipe it all together below.

First, our file is: dmoz.urls.gz, so we expand it with gunzip:

gunzip dmoz.urls.gz

Results in dmoz.urls

Then we run the shuf command, which randomly shuffles the lines in the file:

shuf dmoz.urls > dmoz.shuf.urls

Remember that the > operator redirects the output to another file.

Now the lines are in random order. But it is still the full set of URLs.

So we run the head command to take the first 1,000 lines off of our now randomly sorted file:

head -1000 dmoz.shuf.urls > dmoz.shuf.1000.urls

So now we have a file with 1,000 randomly chosen URLs from our DMOZ source file.

Here is how to do all that in one line:

gunzip -c dmoz.urls.gz | shuf | head -1000 > dmoz.shuf.1000.urls

BTW, in case you are worried about the randomness of your set (so that we are not all hitting the same servers with our test installations), don’t be.

I ran shuf twice in a row on my set of URLs and then ran diff, which reported the first 100 lines were in a completely different order.

BTW, to check yourself on the extracted set of 1,000 URLs, run the following:

wc -l dmoz.shuf.1000.urls

Result should be 1000.

The wc command prints newline, word, and byte counts. With the -l option, it prints only the newline count.

In case you don’t have the shuf command on your system, I would try:

sort -R dmoz.urls > dmoz.sort.urls

as a substitute for shuf dmoz.urls > dmoz.shuf.urls

Hillary Mason (source of the sort suggestion) has collected more ways to extract one line (not exactly our requirement but you can be creative) at: How to get a random line from a file in bash.
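If your system has neither shuf nor a sort that supports -R, a short Python fallback (a sketch, using the file names from the steps above) does the same sampling in one pass:

# Fallback sampler for systems without shuf or sort -R: pick 1,000 URLs
# uniformly at random from the DMOZ URL list, reading the gzip directly.
import gzip
import random

with gzip.open("dmoz.urls.gz", "rt") as f:
    urls = f.read().splitlines()

sample = random.sample(urls, 1000)      # sampling without replacement

with open("dmoz.shuf.1000.urls", "w") as out:
    out.write("\n".join(sample) + "\n")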

I am still having difficulties with one way to use Nutch/Solr, so we will cover the “other” path, the working one, tomorrow. It looks like a bug between versions and I haven’t found the correct Java class to copy over at this point. Not that a tutorial would mention that sort of thing. 😉

Network Graph Visualizer

Filed under: Collaboration,Networks,Software,Visualization — Patrick Durusau @ 7:44 pm

Network Graph Visualizer

I ran across this at Github while tracking the progress of a project.

Although old hat (2008), I thought it worth pointing out as a graph that has one purpose, to keep developers informed of each other’s activities in a collaborative environment, and it does that very well.

I suspect there is a lesson there for topic map software (or even software in general).

agamemnon – update

Filed under: agamemnon,Cassandra — Patrick Durusau @ 7:44 pm

agamemnon – update

When I last looked at agamemnon, it was not part of the globusonline.org project.

Nor do I recall support for RDF or the extended example of how to use it with your code.

Sorry; agamemnon, for those of you who don’t recall, is a library that enables you to use Cassandra as a graph database.

Think about the speed and scalability of Cassandra for a moment and you will see the potential for such a matchup.
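To see why the pairing is attractive, here is a conceptual sketch, in plain Python, of a property graph laid out the way a Cassandra-style wide-row store encourages: one row per node, with outgoing edges stored as columns on that row. This is only a mental model; it is not agamemnon’s actual storage layout or API.

# Conceptual sketch of a property graph on Cassandra-style wide rows
# (plain dictionaries here; not agamemnon's actual layout or API).
vertices = {}        # node_key -> {property: value}
edges = {}           # node_key -> {(rel_type, target_key): {property: value}}

def add_vertex(key, **props):
    vertices[key] = props
    edges.setdefault(key, {})

def add_edge(src, rel_type, dst, **props):
    edges[src][(rel_type, dst)] = props

def outgoing(src, rel_type):
    return [dst for (rel, dst) in edges.get(src, {}) if rel == rel_type]

add_vertex("paper:1", title="Detecting Novel Associations")
add_vertex("author:reshef", name="D. Reshef")
add_edge("author:reshef", "WROTE", "paper:1", year=2011)
print(outgoing("author:reshef", "WROTE"))   # ['paper:1']

Because a node’s outgoing edges live on its own row, a traversal step costs a single row read, which is where Cassandra’s write throughput and horizontal scaling pay off.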

The Shape of Things – SHAPES 1.0

Filed under: Conferences,Dimensions,Semantics,Shape — Patrick Durusau @ 7:44 pm

The Shape of Things – SHAPES 1.0

Proceedings of the First Interdisciplinary Workshop on SHAPES, Karlsruhe, Germany, September 27, 2011. Edited by: Janna Hastings, Oliver Kutz, Mehul Bhatt, Stefano Borgo

If you have ever thought of “shape” as being a simple issue, consider the abstract from “Shape is a Non-Quantifiable Physical Dimension” by Ingvar Johansson:

In the natural-scientific community it is often taken for granted that, sooner or later, all basic physical property dimensions can be quantified and turned into a kind-of-quantity; meaning that all their possible determinate properties can be put in a one-to-one correspondence with the real numbers. By using some transfinite mathematics, the paper shows this tacit assumption to be wrong. Shape is a very basic property dimension; but, since it can be proved that there are more possible kinds of determinate shapes than real numbers, shape cannot be quantified. There will never be a shape scale the way we have length and temperature scales. This is the most important conclusion, but more is implied by the proof. Since every n-dimensional manifold has the same cardinality as the real number line, all shapes cannot even be represented in a three-dimensional manifold the way perceivable colors are represented in so-called color solids.

If shape, which exists in metric space, has these issues, that casts a great deal of doubt on mapping semantics, which exists in non-metric space, into a “…one-to-one correspondence with the real numbers.”

Don’t you think?

We can make simplifying assumptions about semantics and make such mappings, but we need to be aware that that is what is happening.

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus

Filed under: Corpus Linguistics,Dataset — Patrick Durusau @ 11:00 am

Creating a Live, Public Short Message Service Corpus: The NUS SMS Corpus by Tao Chen, Min-Yen Kan.

Abstract:

Short Message Service (SMS) messages are largely sent directly from one person to another from their mobile phones. They represent a means of personal communication that is an important communicative artifact in our current digital era. As most existing studies have used private access to SMS corpora, comparative studies using the same raw SMS data has not been possible up to now. We describe our efforts to collect a public SMS corpus to address this problem. We use a battery of methodologies to collect the corpus, paying particular attention to privacy issues to address contributors’ concerns. Our live project collects new SMS message submissions, checks their quality and adds the valid messages, releasing the resultant corpus as XML and as SQL dumps, along with corpus statistics, every month. We opportunistically collect as much metadata about the messages and their sender as possible, so as to enable different types of analyses. To date, we have collected about 60,000 messages, focusing on English and Mandarin Chinese.

A unique and publicly available corpus of material.

Your average marketing company might not have an SMS corpus for you to work with but I can think of some other organizations that do. 😉 Train on this one to win your spurs.

December 13, 2011

ACM RecSys 2011 Workshop on Novelty and Diversity in Recommender Systems

Filed under: Diversity,Novelty,Recommendation — Patrick Durusau @ 9:55 pm

DiveRS 2011 – ACM RecSys 2011 Workshop on Novelty and Diversity in Recommender Systems

From the conference page:

Most research and development efforts in the Recommender Systems field have been focused on accuracy in predicting and matching user interests. However there is a growing realization that there is more than accuracy to the practical effectiveness and added-value of recommendation. In particular, novelty and diversity have been identified as key dimensions of recommendation utility in real scenarios, and a fundamental research direction to keep making progress in the field.

Novelty is indeed essential to recommendation: in many, if not most scenarios, the whole point of recommendation is inherently linked to a notion of discovery, as recommendation makes most sense when it exposes the user to a relevant experience that she would not have found, or thought of by herself –obvious, however accurate recommendations are generally of little use.

Not only does a varied recommendation provide in itself for a richer user experience. Given the inherent uncertainty in user interest prediction –since it is based on implicit, incomplete evidence of interests, where the latter are moreover subject to change–, avoiding a too narrow array of choice is generally a good approach to enhance the chances that the user is pleased by at least some recommended item. Sales diversity may enhance businesses as well, leveraging revenues from market niches.

It is easy to increase novelty and diversity by giving up on accuracy; the challenge is to enhance these aspects while still achieving a fair match of the user’s interests. The goal is thus generally to enhance the balance in this trade-off, rather than just a diversity or novelty increase.

DiveRS 2011 aims to gather researchers and practitioners interested in the role of novelty and diversity in recommender systems. The workshop seeks to advance towards a better understanding of what novelty and diversity are, how they can improve the effectiveness of recommendation methods and the utility of their outputs. We aim to identify open problems, relevant research directions, and opportunities for innovation in the recommendation business. The workshop seeks to stir further interest for these topics in the community, and stimulate the research and progress in this area.

The abstract from “Fusion-based Recommender System for Improving Serendipity” by Kenta Oku and Fumio Hattori reads:

Recent work has focused on new measures that are beyond the accuracy of recommender systems. Serendipity, which is one of these measures, is defined as a measure that indicates how the recommender system can find unexpected and useful items for users. In this paper, we propose a Fusion-based Recommender System that aims to improve the serendipity of recommender systems. The system is based on the novel notion that the system finds new items, which have the mixed features of two user-input items, produced by mixing the two items together. The system consists of item-fusion methods and scoring methods. The item-fusion methods generate a recommendation list based on mixed features of two user-input items. Scoring methods are used to rank the recommendation list. This paper describes these methods and gives experimental results.

Interested yet? 😉

