May « 2012 « Another Word For It

May 24, 2012

Lima on Networks

Filed under: Complex Networks,Complexity,Graphs,Networks — Patrick Durusau @ 2:10 pm

I saw a mention of RSA Animate – The Power of Networks by Manuel Lima over at Flowing Data.

A high-speed chase through ideas, but the artistry of the presentation and presenter makes it hold together quite nicely.

Manuel makes the case that the organization of information is more complex than trees. In fact, he makes a good case for networks being a better model.

If that bothers you, you might want to cut Manuel some slack or perhaps even support the “network” (singular) model.

There are those of us who don’t think a single network is sufficient.

😉

Resources to review before viewing the video:

Science and Complexity – Warren Weaver (1948 – reprint): The paper that Manuel cites in his presentation.

Wikipedia – Complexity: Not bad as Wikipedia entries go. At least a starting point.


Web sequence diagrams

Filed under: Authoring Topic Maps,Collaboration,Graphics,Visualization — Patrick Durusau @ 9:45 am

Web sequence diagrams

I ran across this while looking for information on Lucene indexing.

It may be that I am confusing the skill of the author with the utility of the interface (which may be commonly available via other sources) but I was impressed enough that I wanted to point it out.

It does seem a bit pricey ($99 for two users) but on the other hand, developing good documentation is (should be) a team-based task. This would be a good way to ensure a common understanding of sequences of operations.

Are there similar tools you would recommend for team-based activities?

Thinking that authoring a topic map is very much a team activity. From domain experts who vet content to UI experts who create and test interfaces to experts who load and maintain content servers and others.

Keeping a common sense of purpose and interdependence (team effort) goes a long way to a successful project conclusion.


May 23, 2012

Merging Market News – 23 May 2012

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:20 pm

On the merging market front, the need for merging between different IT systems, I read happy news at:

451 Research delivers market sizing estimates for NoSQL, NewSQL and MySQL ecosystem by Matthew Aslett.

From the post:

NoSQL and NewSQL database technologies pose a long-term competitive threat to MySQL’s position as the default database for Web applications, according to a new report published by 451 Research.

The report, MySQL vs. NoSQL and NewSQL: 2011-2015, examines the competitive dynamic between MySQL and the emerging NoSQL non-relational, and NewSQL relational database technologies.

It concludes that while the current impact of NoSQL and NewSQL database technologies on MySQL is minimal, they pose a long-term competitive threat due to their adoption for new development projects. The report includes market sizing and growth estimates, with the key findings as follows:

You can get a copy of the report if you like but the important theme is that different IT vocabularies and approaches are going to be in play.

Which means translation costs between systems are going to skyrocket and be repeated with every IT spasm or change.

Unless you are hired to address integration/migration problems with topic maps, of course.

On the database front, I would say things look pretty bright for topic maps!

PS: Any thoughts on how the collapse of Greece or its becoming a failed state is going to impact the merging market?


1 Billion Pages Visited In 2012

Filed under: ClueWeb2012,Data Source,Lemur Project — Patrick Durusau @ 6:08 pm

The ClueWeb12 project reports:

The Lemur Project is creating a new web dataset, tentatively called ClueWeb12, that will be a companion or successor to the ClueWeb09 web dataset. This new dataset is expected to be ready for distribution in June 2012. Dataset construction consists of crawling the web for about 1 billion pages, web page filtering, and organization into a research-ready dataset.

…

The crawl was initially seeded with 2,820,500 unique URLs. This list was generated by taking the 10 million ClueWeb09 urls that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kingdom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clark, University of Waterloo, provided 5,950 seeds specific to travel sites.

A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, URLBlacklist.com, which was downloaded on 2012-02-03. The crawler blacklist consists of urls in the malware, phishing, spyware, virusinfected, filehosting and filesharing categories. Also included in the blacklist is a small number (currently less than a dozen) of sites that opted out of the crawl.

…

The crawled web pages will be filtered to remove certain types of pages, for example, pages that a text classifier identifies as non-English, pornography, or spam. The dataset will contain a file that identifies each url that was removed and why it was removed. The web graph will contain all pages visited by the crawler, and will include information about redirected links.

The crawler captures an average of 10-15 million pages (and associated images, etc) per day. Its progress is documented in a daily progress report.
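The seed selection described in the quote (take the highest-PageRank pages, then drop the ones most likely to be spam) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Page` record, its field names, and reading "top 90%" as a percentile score of 10 or higher are hypothetical, and none of this is Lemur Project code.

```python
from typing import NamedTuple


class Page(NamedTuple):
    url: str
    pagerank: float       # higher is better
    spam_percentile: int  # assumed 0-99 score; higher = less likely to be spam


def select_seeds(pages, top_n, min_spam_percentile=10):
    """Keep the top_n pages by PageRank, then drop those below the spam cutoff."""
    by_rank = sorted(pages, key=lambda p: p.pagerank, reverse=True)[:top_n]
    return [p.url for p in by_rank if p.spam_percentile >= min_spam_percentile]


# Made-up example data, not taken from ClueWeb09.
pages = [
    Page("http://a.example/", 0.9, 95),
    Page("http://b.example/", 0.8, 3),   # high PageRank but likely spam
    Page("http://c.example/", 0.7, 50),
    Page("http://d.example/", 0.1, 99),  # clean but low PageRank
]

seeds = select_seeds(pages, top_n=3)
print(seeds)
```

Note the order of operations: the spam cut is applied after the PageRank cut, so a clean but low-ranked page never makes it into the seed list.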

Are there any search engine ads along the lines of “X billion pages crawled”?


White House launches new digital government strategy

Filed under: Government Data — Patrick Durusau @ 2:41 pm

White House launches new digital government strategy by Alex Howard.

From the post:

There’s a long history of people who have tried to transform the United States federal government through better use of information technology and data. It extends back to the early days of Alexander Hamilton’s ledgers of financial transaction, continues through information transmitted through telegraph, radio, telephone, and comes up to the introduction of the Internet, which has been driving dreams of better e-government for decades.

Vivek Kundra, the first U.S. chief information officer, and Aneesh Chopra, the nation’s first chief technology officer, were chosen by President Barack Obama to try to bring the federal government’s IT infrastructure and process into the 21st century, closing the IT gap that had opened between the private sector and public sector.

Today, President Obama issued a presidential memorandum on building a 21st century digital government.

In this memorandum, the president directs each major federal agency in the United States to make two key services that American citizens depend upon available on mobile devices within the next 12 months and to make “applicable” government information open and machine-readable by default. President Obama directed federal agencies to do two specific things: comply with the elements of the strategy by May 23, 2013 and to create a “/developer” page on every major federal agency’s website.

Thought you might find some good marketing quotes for your products or services in the article or the presidential memorandum.

I do have to wince when I read:

For far too long, the American people have been forced to navigate a labyrinth of information across different Government programs in order to find the services they need.

Obviously it has been a while since President Obama has called a tech support line. My experiences recently have been good but then also very few. There is probably a relationship there.

There is going to be a lot of IT churn if not actual change so dust off your various proposals and watch for agency calls for assistance.

Don’t forget to offer topic map based solutions for agencies that want to find data once and not time after time.


Forecasting: principles and practice

Filed under: Business Intelligence,Forecasting — Patrick Durusau @ 2:23 pm

Forecasting: principles and practice: An online textbook by Rob J Hyndman and George Athanasopoulos.

From the preface:

Welcome to our new online textbook on forecasting. This book is intended as a replacement for Makridakis, Wheelwright and Hyndman (Wiley 1998).

The entire book is available online and free-of-charge. Of course, we won’t make much money doing this, but textbooks never make much money anyway — the publishers make all the money. We’d rather create something that is widely used and useful, than have large publishers profit from our efforts.

Eventually a print version of the book will be available to purchase on Amazon, but not until a few more chapters are written.

This textbook is intended to provide a comprehensive introduction to forecasting methods and present enough information about each method for readers to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details.

The book is written for three audiences: (1) people finding themselves doing forecasting in business when they may not have had any formal training in the area; (2) undergraduate students studying business; (3) MBA students doing a forecasting elective. We use it ourselves for a second-year subject for students undertaking a Bachelor of Commerce degree at Monash University, Australia.

Should be a useful resource for learning the forecasting “lingo” in a business context. Or for learning forecasting for that matter.

The middle chapters on regression, as the authors point out, are unfinished but they hope to have the book complete by the end of 2012.

It could be a really nice gesture on our part if we all read a chapter or so and suggested corrections or improvements to the prose.


New UMBEL Release Gains schema.org, GeoNames Capabilities

Filed under: Geographic Data,GeoNames,Schema.org,UMBEL — Patrick Durusau @ 2:12 pm

New UMBEL Release Gains schema.org, GeoNames Capabilities by Mike Bergman.

From the post:

We are pleased to announce the release of version 1.05 of UMBEL, which now has linkages to schema.org [6] and GeoNames [1]. UMBEL has also been split into ‘core’ and ‘geo’ modules. The resulting smaller size of UMBEL ‘core’ — now some 26,000 reference concepts — has also enabled us to create a full visualization of UMBEL’s content graph.

Mapping to schema.org

The first notable change in UMBEL v. 1.05 is its mapping to schema.org. schema.org is a collection of schema (usable as HTML tags) that webmasters can use to markup their pages in ways recognized by major search providers. schema.org was first developed and organized by the major search engines of Bing, Google and Yahoo!; later Yandex joined as a sponsor. Now many groups are supporting schema.org and contributing vocabularies and schema.

You will appreciate the details of the writeup and like the visualization. Quite impressive!

PS: As if you didn’t know:

http://umbel.org/

This is the official Web site for the UMBEL Vocabulary and Reference Concept Ontology (namespace: umbel). UMBEL is the Upper Mapping and Binding Exchange Layer, designed to help content interoperate on the Web.


Clustering is difficult only when it does not matter

Filed under: Clustering — Patrick Durusau @ 9:23 am

Clustering is difficult only when it does not matter by Amit Daniely, Nati Linial, Michael Saks.

Abstract:

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners’ perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well.

We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of “good clustering”. We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task.

Considering that clustering is a first step towards merging, you will find the following encouraging:

From the practitioner’s viewpoint, “clustering is either easy or pointless”; that is, whenever the input admits a good clustering, finding it is feasible. Our analysis provides some support to this view.

I would caution that the authors are working with metric spaces.

It isn’t clear to me that clustering based on values in non-metric spaces would share the same formal characteristics.

Comments or pointers to work on clustering in non-metric spaces?


May 22, 2012

Happy Go Lucky Identification/Merging?

Filed under: Identity,Merging — Patrick Durusau @ 3:32 pm

MIT News: New mathematical framework formalizes oddball programming techniques

From the post:

Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”

I like that for making topic maps scale, “…skip a bunch of steps….”

Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.

We are often mistaken about facts, faces, identifications in semantic terminology.

Why think we can build programs or machines that can do better?

Let’s stop rolling the identification stone up the hill.

Ask “how accurate does the identification/merging need to be?”

The answer for aiming a missile is probably different from the answer for sorting emails in a discovery process.
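To make “skip a bunch of steps” concrete: loop perforation, as the MIT piece describes it, amounts to running only some of a loop's iterations and accepting an approximate answer. Here is a minimal sketch in Python, assuming nothing fancier than a mean over a list. The function names and the `stride` parameter are mine for illustration; this is not code from Rinard's group.

```python
def mean_exact(values):
    """Visit every element."""
    total = 0.0
    for v in values:
        total += v
    return total / len(values)


def mean_perforated(values, stride=4):
    """Visit only every `stride`-th element and average what was seen."""
    total = 0.0
    count = 0
    for i in range(0, len(values), stride):  # the perforated loop
        total += values[i]
        count += 1
    return total / count


# Made-up data: 100,000 values cycling through 0..99.
data = [float(i % 100) for i in range(100_000)]

exact = mean_exact(data)
approx = mean_perforated(data, stride=4)
print(exact, approx, abs(exact - approx))
```

The perforated loop does a quarter of the work. Whether the error it introduces is acceptable is exactly the “how accurate does it need to be?” question, and the answer depends on the data as much as on the stride.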

If you believe in hyperlinks:

Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012) Beijing, China June 2012

From Martin Rinard’s publication page.

Has other interesting reading.


MongoSF Highlights

Filed under: Conferences,MongoDB — Patrick Durusau @ 2:57 pm

I am on the Mongo mailing list and so got the monthly news about MongoDB, which included a list of highlights from MongoSF:

MongoDB and Hadoop, Steve Francia, 10gen
MongoDB for Analytics, John Nunemaker, Github
High Availability with MongoDB, Greg Brockman, Stripe
MongoDB Schema Design: Insights and Tradeoffs, Montse Medina, Jetlore
Backup Strategies, Tony Tam, Wordnik
MongoDB at Craigslist: One Year Later, Jeremy Zawodny, Craigslist
Scaling the MapMyFitness Platform, Chris Mertz, MapMyFitness

Except in the email the links had all the tracking trash that marketing types seem to think is important.

I visited the 10gen site and harvested the direct links for your convenience. I didn’t insert tracking trash for my blog.

Enjoy!

PS: It would really be nice to get emails that include the tracking trash, if you insist, but also clean links that can be forwarded to others or used in blog posts: real information-type activities. Not to single out 10gen, I see it every day. From people who should know better.

PPS: There are more presentations to view at: Featured Presentations.


Health Care Cost Institute

Filed under: Data Source,Health care — Patrick Durusau @ 2:37 pm

Health Care Cost Institute

I can’t give you a clean URL but on Monday (21 May 2012), the Washington Post ran a story on the Health Care Cost Institute, which had the following quotes:

This morning a new nonprofit called the Health Care Cost Institute will roll out a database of 5 billion health insurance claims (all stripped of the individual health plan’s identity, to address privacy concerns).

…

This is the first study to use the HCCI data, although more are in the works. Gaynor has been inundated with about 130 requests from health policy researchers to use the database. While his team sifts through those, three approved studies are already tackling big health policy questions.

…

“There is immense interest in gaining access,” says HCCI executive director David Newman. “We’re having trouble keeping up with that.” (emphasis added)

Sorry, that went by a little fast. The data has already been scrubbed so why the choke point of the Health Care Cost Institute on the data?

Spin it up to one or more clouds that support free public storage for data sets of public interest.

The problem of sorting through access requests is solved.

Just maybe researchers will want to address other questions, ones that aren’t necessarily about costs. And/or combine this data with other data. Like data on local pollution. (Although you would need historical data to make that work.)

Mapping this data set to other data sets could only magnify its importance.

Many thanks are owed to the Health Care Cost Institute for securing the data set.

But our thanks should not include electing the HCCI as censor of uses of this data set.


SQL Azure Labs Posts

Filed under: Azure Marketplace,Microsoft,SQL,Windows Azure,Windows Azure Marketplace — Patrick Durusau @ 10:36 am

Roger Jennings writes in Recent Articles about SQL Azure Labs and Other Value-Added Windows Azure SaaS Previews: A Bibliography:

I’ve been concentrating my original articles for the past six months or so on SQL Azure Labs, Apache Hadoop on Windows Azure and SQL Azure Federations previews, which I call value-added offerings. I use the term value-added because Microsoft doesn’t charge for their use, other than Windows Azure compute, storage and bandwidth costs or SQL Azure monthly charges and bandwidth costs for some of the applications, such as Codename “Cloud Numerics” and SQL Azure Federations.

As of 22 May 2012, there are forty-four (44) posts in the following categories:

Windows Azure Marketplace DataMarket plus Codenames “Data Hub” and “Data Transfer” from SQL Azure Labs
Apache Hadoop on Windows Azure from the SQL Server Team
Codename “Cloud Numerics” from SQL Azure Labs
Codename “Social Analytics” from SQL Azure Labs
Codename “Data Explorer” from SQL Azure Labs
SQL Azure Federations from the SQL Azure Team

If you need quick guides and/or incentives to use Windows Azure, try these on for size.


Uncertainty Principle for Serendipity?

Filed under: Analytics,Data Integration — Patrick Durusau @ 10:22 am

Curt Monash writes in Cool analytic stories:

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

How will we “know” when better data display/mining techniques enable more serendipity?

Anecdotal stories about serendipity abound.

Measuring serendipity requires knowing: (rate of serendipitous discoveries x importance of serendipitous discoveries)/ opportunity for serendipitous discoveries.

Need to add in a multiplier effect for the impact that one serendipitous discovery may have in creating opportunities for other serendipitous discoveries (a serendipitous criticality point) and probably some other things I have overlooked.
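Written out as code, the measure so far looks something like the sketch below. The units are whatever you choose, and treating the multiplier as a plain scaling factor on the whole expression is my assumption; the argument names are hypothetical.

```python
def serendipity_score(rate, importance, opportunity, multiplier=1.0):
    """(rate x importance) / opportunity, scaled by an assumed multiplier.

    rate        -- serendipitous discoveries per unit of time
    importance  -- average importance assigned to a discovery
    opportunity -- opportunities for discovery in the same period
    multiplier  -- assumed scaling for discoveries that trigger further ones
    """
    if opportunity <= 0:
        raise ValueError("opportunity must be positive")
    return multiplier * (rate * importance) / opportunity


# Made-up numbers, purely to show the arithmetic.
baseline = serendipity_score(rate=3, importance=2.0, opportunity=12)
with_chain = serendipity_score(rate=3, importance=2.0, opportunity=12, multiplier=1.5)
print(baseline, with_chain)
```

Even this toy version shows the measurement problem: every input is something we would have to estimate, and the importance of a discovery may not be known until long after it is made.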

What would you add to the equation?

Realizing that we may be staring at the “right” answer and never realize it.

How’s that for an uncertainty principle?


May 21, 2012

Call for Papers: PLoS Text Mining Collection

Filed under: Data Mining,Text Mining — Patrick Durusau @ 7:15 pm

Call for Papers: PLoS Text Mining Collection by Camron Assadi.

From the post:

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

Highly recommend that you follow up on this publication opportunity.

I am less certain that: “…the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model.”

Don’t recall seeing any research on a connection between a lack of Open Access and failure of text mining to accelerate research.

CiteSeer and arXiv have long been freely available in full text. If research were going to leap forward from open access, the opportunity has been present.

Open access does advance research and discovery but it isn’t a magic bullet. Accelerating and enhancing research is going to require more than simply indexing literature. A lot more.

Comments (3)

A Look At Google BigQuery

Filed under: Business Intelligence,Google BigQuery — Patrick Durusau @ 6:55 pm

A Look At Google BigQuery

Chris Webb writes:

Over the years I’ve written quite a few posts about Google’s BI capabilities. Google never seems to get mentioned much as a BI tools vendor but to me it’s clear that it’s doing a lot in this area and is consciously building up its capabilities; you only need to look at things like Fusion Tables (check out these recently-added features), Google Refine and of course Google Docs to see that it’s pursuing a self-service, information-worker-led vision of BI that’s very similar to the one that Microsoft is pursuing with PowerPivot and Data Explorer.

Earlier this month Google announced the launch of BigQuery and I decided to take a look. Why would a Microsoft BI loyalist like me want to do this, you ask? Well, there are a number of reasons:

Looks like an even-handed report to me.

See what you think about it and BigQuery.


Solr 4 preview: SolrCloud, NoSQL, and more

Filed under: Lucene,NoSQL,Solr,SolrCloud — Patrick Durusau @ 10:32 am

Solr 4 preview: SolrCloud, NoSQL, and more

From the post:

The first alpha release of Solr 4 is quickly approaching, bringing powerful new features to enhance existing Solr powered applications, as well as enabling new applications by further blurring the lines between full-text search and NoSQL.

The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. Distributed indexing with no single points of failure has been designed from the ground up for near real-time (NRT), and NoSQL features such as realtime-get, optimistic locking, and durable updates.

We’ve incorporated Apache ZooKeeper, the rock-solid distributed coordination project that is immune to issues like split-brain syndrome that tend to plague other hand-rolled solutions. ZooKeeper holds the Solr configuration, and contains the cluster meta-data such as hosts, collections, shards, and replicas, which are core to providing an elastic search capability.

When a new node is brought up, it will automatically be assigned a role such as becoming an additional replica for a shard. A bounced node can do a quick “peer sync” by exchanging updates with its peers in order to bring itself back up to date. New nodes, or those that have been down too long, recover by replicating the whole index of a peer while concurrently buffering any new updates.

Run, don’t walk, to learn about the new features for Solr 4.

You won’t be disappointed.

Interested to see the “…blurring [of] the lines between full-text search and NoSQL.”

Would be even more interested to see the “…blurring of indexing and data/data formats.”

That is to say that data, along with its format, is always indexed in digital media.

So why can’t I see the data as a table, as a graph, as a …., depending upon my requirements?

No ETL, JVD – Just View Differently.
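A toy version of what I mean, in Python: the same records, untouched, seen once as table rows and once as a graph. The records, the field names and the two view functions are hypothetical, and a real system would build these views from the index rather than from an in-memory list.

```python
# One set of records; neither view copies or transforms the stored data.
records = [
    {"id": "t1", "name": "Solr", "depends_on": "t2"},
    {"id": "t2", "name": "Lucene", "depends_on": None},
    {"id": "t3", "name": "SolrCloud", "depends_on": "t1"},
]


def as_table(recs, columns):
    """View the records as rows with the requested columns."""
    return [tuple(r[c] for c in columns) for r in recs]


def as_graph(recs, source="id", target="depends_on"):
    """View the same records as an adjacency list."""
    edges = {}
    for r in recs:
        edges.setdefault(r[source], [])
        if r[target] is not None:
            edges[r[source]].append(r[target])
    return edges


table = as_table(records, ["id", "name"])
graph = as_graph(records)
print(table)
print(graph)
```

The point of the sketch is only that the choice of view is made at read time, by the reader, and not baked in by an extract-transform-load step.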

Suspect I will have to wait a while for that, but in the meantime, enjoy Solr 4 alpha.


My Favorite Graphs

Filed under: Data Mining,Graphs — Patrick Durusau @ 9:49 am

My Favorite Graphs by Nina Zumel

From the post:

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. – William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.

I rather like that: “…can [we] see something that would have been harder to see otherwise or that could not have been seen at all.”

A good criterion for all data mining techniques or approaches.

May 20, 2012

1940 US Census Indexing Progress Report—May 18, 2012

Filed under: Census Data,Indexing — Patrick Durusau @ 6:49 pm

1940 US Census Indexing Progress Report—May 18, 2012

From the post:

We’re finishing our 7th week of indexing and we are a breath away from having 40% of the entire collection indexed. I hear from so many people words of amazement at the things this indexing community has accomplished. In 7 weeks we’ve collectively indexed more than 55 million names. It is truly amazing. With 111,612 indexers now signed up to index and arbitrate, we have a formidable team making some great things happen. Let’s keep up the great work.

It is a popular data set but that isn’t the whole story.

What do you think are the major factors that contribute to their success?


Finding Waldo, a flag on the moon and multiple choice tests, with R

Filed under: Graphics,Image Processing,Image Recognition,R — Patrick Durusau @ 6:28 pm

Finding Waldo, a flag on the moon and multiple choice tests, with R by Arthur Charpentier.

From the post:

I have to admit, first, that finding Waldo has been a difficult task. And I did not succeed. Neither could I correctly spot his shirt (because actually, it was what I was looking for). You know, that red-and-white striped shirt. I guess it should have been possible to look for Waldo’s face (assuming that his face does not change) but I still have problems with size factor (and resolution issues too). The problem is not that simple. At the http://mlsp2009.conwiz.dk/ conference, a prize was offered for writing an algorithm in Matlab. And one can even find Mathematica codes online. But most of those algorithms are based on the idea that we look for similarities with Waldo’s face, as described in problem 3 on http://www1.cs.columbia.edu/~blake/‘s webpage. You can find papers on that problem, e.g. Friendly & Kwan (2009) (based on statistical techniques, but Waldo is here a pretext to discuss other issues actually), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

Not sure how often you will want to find Waldo but then you may not be looking for Waldo.

Tipped off to this post by Simply Statistics.


…Commenting on Legislation and Court Decisions

Filed under: Annotation,Law,Legal Informatics — Patrick Durusau @ 6:16 pm

Anderson Releases Prototype System Enabling Citizens to Comment on Legislation and Court Decisions

Legalinformatics brings news that:

Kerry Anderson of the African Legal Information Institute (AfricanLII) has released a prototype of a new software system enabling citizens to comment on legislation, regulations, and court decisions.

There are several initiatives like this one, which is encouraging from the perspective of crowd-sourcing data for annotation.


Talend Updates

Filed under: Data Integration,Talend — Patrick Durusau @ 1:52 pm

Talend updates data tools to 5.1.0

From the post:

Talend has updated all the applications that run on its Open Studio unified platform to version 5.1.0. Talend’s Open Studio is an Eclipse-based environment that hosts the company’s Data Integration, Big Data, Data Quality, MDM (Master Data Management) and ESB (Enterprise Service Bus) products. The system allows a user to, using the Data Integration as an example, use a GUI to define processes that can extract data from the web, databases, files or other resources, process that data, and feed it on to other systems. The resulting definition can then be compiled into a production application.

In the 5.1.0 update, Open Studio for Data Integration has, according to the release notes, been given enhanced XML mapping and support for XML documents in its SOAP, JMS, File and Mom components. A new component has also been added to help manage Kerberos security. Open Studio for Data Quality has been enhanced with new ways to apply an analysis on multiple files, and the ability to drill down through business rules to see the invalid, as well as valid, records selected by the rules.

Upgrading following a motherboard failure so I will be throwing the latest version of software on the new box.

Comments or suggestions on the Talend updates?

Comments Off

Crash Course in Erlang

Filed under: Erlang,Functional Programming — Patrick Durusau @ 9:12 am

Crash Course in Erlang by Knut Hellan.

Knut writes:

This is a summary of a talk I held Monday May 14 2012 at an XP Meetup in Trondheim. It is meant as a teaser for listeners to play with Erlang themselves.

First, some basic concepts. Erlang has a form of constant called atom that is defined on first use. They are typically used as enums or symbols in other languages. Variables in Erlang are [im]mutable so assigning a new value to an existing variable is not allowed. (emphasis added)

Not so much an introduction as a tease to get you to learn more Erlang.

Some typos but look upon those as a challenge to verify what you are reading.

I may copy this post “as is” and use it as a “critical reading/research” assignment for my class.

Then have the students debate their corrections.

That could be a very interesting exercise in not taking everything you read on blind faith: how do you verify what you have read and, in the process, evaluate that material as well?

Do you develop a sense of trust for some sources as being “better” than others? Are there ones you turn to by default?

Comments Off

May 19, 2012

Developing Your Own Solr Filter

Filed under: Lucene,Solr — Patrick Durusau @ 7:45 pm

Developing Your Own Solr Filter

Rafał Kuć writes:

Sometimes Lucene and Solr out-of-the-box functionality is not enough. When such a time comes, we need to extend what Lucene and Solr give us and create our own plugin. In today’s post I’ll try to show how to develop a custom filter and use it in Solr.

Assumptions

Let’s assume that we need a filter that would allow us to reverse every word we have in a given field. So, if the input is “solr.pl” the output would be “lp.rlos”. It’s not the hardest example, but for the purpose of this entry it will be enough. One more thing – I decided to omit describing how to set up your IDE, how to compile your code, build a jar and stuff like that. We will only focus on the code.

Template for creating your own Solr filter.
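To make the filter's intended behavior concrete, here is a minimal sketch in Python of the word-reversing logic only. It is not the Lucene/Solr filter API and not Rafał's code; the function name `reverse_tokens` is hypothetical, and splitting on whitespace is an assumption standing in for whatever tokenizer the field actually uses.

```python
def reverse_tokens(text):
    """Reverse the characters of each token while keeping token order.

    Assumption: tokens are separated by whitespace. A real Solr field
    would get its tokens from the tokenizer configured in the schema.
    """
    return " ".join(token[::-1] for token in text.split())


# The example from the post: "solr.pl" becomes "lp.rlos".
print(reverse_tokens("solr.pl"))
```

In a real filter the same reversal would be applied to each term as it passes through the analysis chain rather than to a whole string at once.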

I persist in thinking that as “big data” arrives, the potential for ETL is going to decline. Where will you put your “big data” while processing it?

Much more likely to index “big data” in place and perform operations on the indexes to extract a subset of your “big data.”

So in terms of matching up data from different schemas or formats, what sort of filters will you be using?

Comments Off

Searching For An Honest Engineer

Filed under: Google Knowledge Graph,RDF,Semantic Web — Patrick Durusau @ 7:28 pm

Sean Golliher needs to take his lantern to search for an honest engineer at the W3C.

Sean writes in Google Just Hi-jacked the Semantic Web Vocabulary:

Google announced they’re rolling out new enhancements to their search technology and they’re calling it the “Knowledge Graph.” For those involved in the Semantic Web Google’s “Knowledge Graph” is nothing new. After watching the video, and reading through the announcements, the Google engineers are giving the impression, to those familiar with this field, that they have created something new and innovative.

While it’s commendable that Google is improving search it’s interesting to note the direct translations of Google’s “new language” to the existing semantic web vocabulary. Normally engineers and researchers quote, or at least reference, the original sources of their ideas. One can’t help but notice that the semantic web isn’t mentioned in any of Google’s announcements. After watching the different reactions from the semantic web community I found that many took notice of the language Google used and how the ideas from the semantic web were repackaged as “new” and discovered by Google.

Did you know that the W3C invented the ideas for:

Knowledge Graph
Relationships Between things
Naming things Better (Taxonomy?)
Objects/Entities
Ambiguous Language (Semantics?)
Connecting Things
discover new, and relevant, things you like (Serendipity?)
meaning (Semantic?)
graph (RDF?)
things (URIs (Linked Data)?)
real-world entities and their relationships to one another: things (Linked Data?)

Really? Semantic, serendipity, graph, relationships between real-world entities?

All invented by the W3C and/or carefully crediting prior work.

Right.

Good luck with your search Sean.

Comments Off

Hands-on examples of legal search

Filed under: e-Discovery,Law,Legal Informatics,Searching — Patrick Durusau @ 7:04 pm

Hands-on examples of legal search by Michael J. Bommarito II.

From the post:

I wanted to share with the group some of my recent work on search in the legal space. I have been developing products and service models, but I thought many of the experiences or guides could be useful to you. I would love to share some of this work to help foster a “hacker” community in which we might collaborate on projects.

The first few posts are based on Amazon’s CloudSearch service. CloudSearch, as the name suggests, is a “cloud-based” search service. Once you decide what and how you would like to search, Amazon handles procuring the underlying infrastructure, scaling to required capacity, stemming, stop-wording, building indices, etc. For those of you who do not have access to “search appliances” or labor to configure products like Solr, this offers an excellent opportunity.

Pointers to several posts by Michael that range from searching U.S. Supreme Court decisions to email archives to statutory law.

From law to eDiscovery, something for everybody!

Comments Off

Popescu by > 1100 Words

Filed under: Hadoop — Patrick Durusau @ 6:49 pm

Possible Hadoop Trajectories

In the red corner, using 1245 words to trash Hadoop, are Michael Stonebraker and Jeremy Kepner.

In the blue corner, using 82 words to show the challengers need to follow Hadoop more closely, is Alex Popescu.

Sufficient ignorance can make any technology indistinguishable from hype.

Comments Off

Apache HCatalog 0.4.0 Released

Filed under: Hadoop,HCatalog — Patrick Durusau @ 5:02 pm

Apache HCatalog 0.4.0 Released by Alan Gates.

From the post:

In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

For those of you that are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

From the HCatalog documentation (0.4.0):

HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

Being curious about a reference to partitions having the capacity to be multidimensional, I set off looking for information on supported data types and found:

The table shows how Pig will interpret the HCatalog data type.

HCatalog Data Type	Pig Data Type
primitives (int, long, float, double, string)	int, long, float, double, string to chararray
map (key type should be string, valuetype must be string)	map
List<any type>	bag
struct<any type fields>	tuple
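The mapping in the table can be sketched as a simple lookup. This is an illustration in Python, not HCatalog or Pig code: the names `HCAT_TO_PIG` and `pig_type` are hypothetical, and it assumes the table means that primitives keep their names except for string, which Pig reads as chararray.

```python
# Hypothetical lookup table built from the HCatalog 0.4.0 documentation table.
# Assumption: int, long, float and double map to themselves; string maps to chararray.
HCAT_TO_PIG = {
    "int": "int",
    "long": "long",
    "float": "float",
    "double": "double",
    "string": "chararray",
    "map": "map",
    "list": "bag",
    "struct": "tuple",
}


def pig_type(hcat_type):
    """Return the Pig type for an HCatalog type such as 'List<int>'."""
    # Drop any type parameters ("<...>") and normalize case before the lookup.
    base = hcat_type.split("<", 1)[0].strip().lower()
    if base not in HCAT_TO_PIG:
        raise ValueError("no Pig mapping listed for " + repr(hcat_type))
    return HCAT_TO_PIG[base]


print(pig_type("string"))
print(pig_type("List<int>"))
```

Types the table does not list raise an error here rather than guessing at a mapping.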

The Hadoop ecosystem is evolving at a fast and furious pace!

Comments Off
