Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

May 24, 2012

Lima on Networks

Filed under: Complex Networks,Complexity,Graphs,Networks — Patrick Durusau @ 2:10 pm

I saw a mention of RSA Animate – The Power of Networks by Manuel Lima over at Flowing Data.

A high-speed chase through ideas, but the artistry of the presentation and the presenter makes it hold together quite nicely.

Manuel makes the case that the organization of information is more complex than trees. In fact, he makes a good case for networks being a better model.

If that bothers you, you might want to cut Manuel some slack or perhaps even support the “network” (singular) model.

There are those of us who don’t think a single network is sufficient.

😉

Resources to review before viewing the video:

Science and Complexity – Warren Weaver (1948 – reprint): The paper that Manuel cites in his presentation.

Wikipedia – Complexity: Not bad as Wikipedia entries go. At least a starting point.

Web sequence diagrams

Filed under: Authoring Topic Maps,Collaboration,Graphics,Visualization — Patrick Durusau @ 9:45 am

Web sequence diagrams

I ran across this while looking for information on Lucene indexing.

It may be that I am confusing the skill of the author with the utility of the interface (which may be commonly available via other sources) but I was impressed enough that I wanted to point it out.

It does seem a bit pricey ($99 for two users), but on the other hand, developing good documentation is (or should be) a team-based task. This would be a good way to ensure a common understanding of sequences of operations.

Are there similar tools you would recommend for team-based activities?

I think authoring a topic map is very much a team activity: from domain experts who vet content, to UI experts who create and test interfaces, to the experts who load and maintain content servers, and others.

Keeping a common sense of purpose and interdependence (team effort) goes a long way toward a successful project conclusion.

May 23, 2012

Merging Market News – 23 May 2012

Filed under: Marketing,Topic Maps — Patrick Durusau @ 6:20 pm

On the merging market front (the need for merging between different IT systems), I read happy news at:

451 Research delivers market sizing estimates for NoSQL, NewSQL and MySQL ecosystem by Matthew Aslett.

From the post:

NoSQL and NewSQL database technologies pose a long-term competitive threat to MySQL’s position as the default database for Web applications, according to a new report published by 451 Research.

The report, MySQL vs. NoSQL and NewSQL: 2011-2015, examines the competitive dynamic between MySQL and the emerging NoSQL non-relational, and NewSQL relational database technologies.

It concludes that while the current impact of NoSQL and NewSQL database technologies on MySQL is minimal, they pose a long-term competitive threat due to their adoption for new development projects. The report includes market sizing and growth estimates, with the key findings as follows:

You can get a copy of the report if you like but the important theme is that different IT vocabularies and approaches are going to be in play.

Which means translation costs between systems are going to skyrocket and be repeated with every IT spasm or change.

Unless, of course, you are hired to address integration/migration problems with topic maps.

On the database front, I would say things look pretty bright for topic maps!

PS: Any thoughts on how the collapse of Greece or its becoming a failed state is going to impact the merging market?

1 Billion Pages Visited In 2012

Filed under: ClueWeb2012,Data Source,Lemur Project — Patrick Durusau @ 6:08 pm

The ClueWeb12 project reports:


The Lemur Project is creating a new web dataset, tentatively called ClueWeb12, that will be a companion or successor to the ClueWeb09 web dataset. This new dataset is expected to be ready for distribution in June 2012. Dataset construction consists of crawling the web for about 1 billion pages, web page filtering, and organization into a research-ready dataset.


The crawl was initially seeded with 2,820,500 unique URLs. This list was generated by taking the 10 million ClueWeb09 URLs that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kingdom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clark, University of Waterloo, provided 5,950 seeds specific to travel sites.


A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, URLBlacklist.com, which was downloaded on 2012-02-03. The crawler blacklist consists of URLs in the malware, phishing, spyware, virusinfected, filehosting and filesharing categories. Also included in the blacklist is a small number (currently less than a dozen) of sites that opted out of the crawl.


The crawled web pages will be filtered to remove certain types of pages, for example, pages that a text classifier identifies as non-English, pornography, or spam. The dataset will contain a file that identifies each url that was removed and why it was removed. The web graph will contain all pages visited by the crawler, and will include information about redirected links.


The crawler captures an average of 10-15 million pages (and associated images, etc.) per day. Its progress is documented in a daily progress report.

Are there any search engine ads claiming: X billion pages crawled?

White House launches new digital government strategy

Filed under: Government Data — Patrick Durusau @ 2:41 pm

White House launches new digital government strategy by Alex Howard.

From the post:

There’s a long history of people who have tried to transform the United States federal government through better use of information technology and data. It extends back to the early days of Alexander Hamilton’s ledgers of financial transaction, continues through information transmitted through telegraph, radio, telephone, and comes up to the introduction of the Internet, which has been driving dreams of better e-government for decades.

Vivek Kundra, the first U.S. chief information officer, and Aneesh Chopra, the nation’s first chief technology officer, were chosen by President Barack Obama to try to bring the federal government’s IT infrastructure and process into the 21st century, closing the IT gap that had opened between the private sector and public sector.

Today, President Obama issued a presidential memorandum on building a 21st century digital government.

In this memorandum, the president directs each major federal agency in the United States to make two key services that American citizens depend upon available on mobile devices within the next 12 months and to make “applicable” government information open and machine-readable by default. President Obama directed federal agencies to do two specific things: comply with the elements of the strategy by May 23, 2013, and create a “/developer” page on every major federal agency’s website.

Thought you might find some good marketing quotes for your products or services in the article or the presidential memorandum.

I do have to wince when I read:

For far too long, the American people have been forced to navigate a labyrinth of information across different Government programs in order to find the services they need.

Obviously it has been a while since President Obama has called a tech support line. My experiences recently have been good but then also very few. There is probably a relationship there.

There is going to be a lot of IT churn, if not actual change, so dust off your various proposals and watch for agency calls for assistance.

Don’t forget to offer topic map based solutions for agencies that want to find data once and not time after time.

Forecasting: principles and practice

Filed under: Business Intelligence,Forecasting — Patrick Durusau @ 2:23 pm

Forecasting: principles and practice: An online textbook by Rob J Hyndman and George Athanasopoulos.

From the preface:

Welcome to our new online textbook on forecasting. This book is intended as a replacement for Makridakis, Wheelwright and Hyndman (Wiley 1998).

The entire book is available online and free-of-charge. Of course, we won’t make much money doing this, but textbooks never make much money anyway — the publishers make all the money. We’d rather create something that is widely used and useful, than have large publishers profit from our efforts.

Eventually a print version of the book will be available to purchase on Amazon, but not until a few more chapters are written.

This textbook is intended to provide a comprehensive introduction to forecasting methods and present enough information about each method for readers to use them sensibly. We don’t attempt to give a thorough discussion of the theoretical details behind each method, although the references at the end of each chapter will fill in many of those details.

The book is written for three audiences: (1) people finding themselves doing forecasting in business when they may not have had any formal training in the area; (2) undergraduate students studying business; (3) MBA students doing a forecasting elective. We use it ourselves for a second-year subject for students undertaking a Bachelor of Commerce degree at Monash University, Australia.

Should be a useful resource for learning the forecasting “lingo” in a business context. Or for learning forecasting for that matter.

The middle chapters on regression, as the authors point out, are unfinished, but they hope to have the book complete by the end of 2012.

It could be a really nice gesture on our part if we all read a chapter or so and suggested corrections or improvements to the prose.

New UMBEL Release Gains schema.org, GeoNames Capabilities

Filed under: Geographic Data,GeoNames,Schema.org,UMBEL — Patrick Durusau @ 2:12 pm

New UMBEL Release Gains schema.org, GeoNames Capabilities by Mike Bergman.

From the post:

We are pleased to announce the release of version 1.05 of UMBEL, which now has linkages to schema.org [6] and GeoNames [1]. UMBEL has also been split into ‘core’ and ‘geo’ modules. The resulting smaller size of UMBEL ‘core’ — now some 26,000 reference concepts — has also enabled us to create a full visualization of UMBEL’s content graph.

Mapping to schema.org

The first notable change in UMBEL v. 1.05 is its mapping to schema.org. schema.org is a collection of schema (usable as HTML tags) that webmasters can use to markup their pages in ways recognized by major search providers. schema.org was first developed and organized by the major search engines of Bing, Google and Yahoo!; later Yandex joined as a sponsor. Now many groups are supporting schema.org and contributing vocabularies and schema.

You will appreciate the details of the writeup and like the visualization. Quite impressive!

PS: As if you didn’t know:

http://umbel.org/

This is the official Web site for the UMBEL Vocabulary and Reference Concept Ontology (namespace: umbel). UMBEL is the Upper Mapping and Binding Exchange Layer, designed to help content interoperate on the Web.

Clustering is difficult only when it does not matter

Filed under: Clustering — Patrick Durusau @ 9:23 am

Clustering is difficult only when it does not matter by Amit Daniely, Nati Linial, Michael Saks.

Abstract:

Numerous papers ask how difficult it is to cluster data. We suggest that the more relevant and interesting question is how difficult it is to cluster data sets that can be clustered well. More generally, despite the ubiquity and the great importance of clustering, we still do not have a satisfactory mathematical theory of clustering. In order to properly understand clustering, it is clearly necessary to develop a solid theoretical basis for the area. For example, from the perspective of computational complexity theory the clustering problem seems very hard. Numerous papers introduce various criteria and numerical measures to quantify the quality of a given clustering. The resulting conclusions are pessimistic, since it is computationally difficult to find an optimal clustering of a given data set, if we go by any of these popular criteria. In contrast, the practitioners’ perspective is much more optimistic. Our explanation for this disparity of opinions is that complexity theory concentrates on the worst case, whereas in reality we only care for data sets that can be clustered well.

We introduce a theoretical framework of clustering in metric spaces that revolves around a notion of “good clustering”. We show that if a good clustering exists, then in many cases it can be efficiently found. Our conclusion is that contrary to popular belief, clustering should not be considered a hard task.

Considering that clustering is a first step towards merging, you will find the following encouraging:

From the practitioner’s viewpoint, “clustering is either easy or pointless” – that is, whenever the input admits a good clustering, finding it is feasible. Our analysis provides some support to this view.

I would caution that the authors are working with metric spaces.

It isn’t clear to me that clustering based on values in non-metric spaces would share the same formal characteristics.
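To make the worry concrete, here is a toy sketch of my own (not from the paper): greedy agglomerative clustering driven by an arbitrary dissimilarity over tag sets. Nothing in the merging loop relies on the triangle inequality, which is exactly why guarantees proved for metric spaces may not carry over.

```python
# Toy sketch: agglomerative clustering driven by an arbitrary dissimilarity.
# Nothing below relies on the triangle inequality, so the loop runs the same
# whether or not the "space" is metric. Illustrative only, not from the paper.

def dissim(a, b):
    """1 - overlap coefficient on tag sets (not a metric in general)."""
    return 1.0 - len(a & b) / min(len(a), len(b))

def cluster(items, threshold=0.6):
    """Greedy single-link merging until no pair is closer than threshold."""
    clusters = [[x] for x in items]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dissim(a, b) for a in clusters[i] for b in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i].extend(clusters.pop(j))

docs = [{"solr", "lucene"}, {"lucene", "search"}, {"topic", "maps"}, {"maps", "merging"}]
print(cluster(docs))  # two clusters: the Solr tags and the topic map tags
```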

Comments or pointers to work on clustering in non-metric spaces?

May 22, 2012

Happy Go Lucky Identification/Merging?

Filed under: Identity,Merging — Patrick Durusau @ 3:32 pm

MIT News: New mathematical framework formalizes oddball programming techniques

From the post:

Two years ago, Martin Rinard’s group at MIT’s Computer Science and Artificial Intelligence Laboratory proposed a surprisingly simple way to make some computer procedures more efficient: Just skip a bunch of steps. Although the researchers demonstrated several practical applications of the technique, dubbed loop perforation, they realized it would be a hard sell. “The main impediment to adoption of this technique,” Imperial College London’s Cristian Cadar commented at the time, “is that developers are reluctant to adopt a technique where they don’t exactly understand what it does to the program.”

I like that for making topic maps scale, “…skip a bunch of steps….”
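For the flavor of loop perforation, a toy sketch of my own (not the MIT system): process only every k-th item and settle for an approximately correct answer.

```python
# Toy illustration of loop perforation: skip steps, accept an approximate answer.
# This conveys the flavor of the idea only; it is not the MIT implementation.

def mean_exact(values):
    return sum(values) / len(values)

def mean_perforated(values, skip=4):
    # Visit only every `skip`-th element: roughly a quarter of the work
    # (for skip=4) in exchange for an approximate result.
    sampled = values[::skip]
    return sum(sampled) / len(sampled)

data = [float(i % 97) for i in range(1000000)]
print(mean_exact(data))        # exact mean
print(mean_perforated(data))   # close, for a fraction of the work
```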

Topic maps, the semantic web and similar semantic ventures are erring on the side of accuracy.

We are often mistaken about facts, faces, and (in semantic terminology) identifications.

Why think we can build programs or machines that can do better?

Let’s stop rolling the identification stone up the hill.

Ask “how accurate does the identification/merging need to be?”

The answer for aiming a missile is probably different from the answer for sorting emails in a discovery process.

If you believe in hyperlinks:

Proving Acceptability Properties of Relaxed Nondeterministic Approximate Programs, Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin Rinard, Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2012), Beijing, China, June 2012.

From Martin Rinard’s publication page.

Has other interesting reading.

MongoSF Highlights

Filed under: Conferences,MongoDB — Patrick Durusau @ 2:57 pm

I am on the Mongo mailing list and so got the monthly news about MongoDB, which included a list of highlights from MongoSF:

Except that in the email the links had all the tracking trash that marketing types seem to think is important.

I visited the 10gen site and harvested the direct links for your convenience. I didn’t insert tracking trash for my blog.

Enjoy!

PS: It would really be nice to get emails that include the tracking trash if you insist, but also clean links that can be forwarded to others, used in blog posts, and put to other real information activities. Not to single out 10gen; I see it every day, from people who should know better.

PPS: There are more presentations to view at: Featured Presentations.

Health Care Cost Institute

Filed under: Data Source,Health care — Patrick Durusau @ 2:37 pm

Health Care Cost Institute

I can’t give you a clean URL but on Monday (21 May 2012), the Washington Post ran a story on the Health Care Cost Institute, which had the following quotes:

This morning a new nonprofit called the Health Care Cost Institute will roll out a database of 5 billion health insurance claims (all stripped of the individual health plan’s identity, to address privacy concerns).

This is the first study to use the HCCI data, although more are in the works. Gaynor has been inundated with about 130 requests from health policy researchers to use the database. While his team sifts through those, three approved studies are already tackling big health policy questions.

“There is immense interest in gaining access,” says HCCI executive director David Newman. “We’re having trouble keeping up with that.” (emphasis added)

Sorry, that went by a little fast. The data has already been scrubbed, so why make the Health Care Cost Institute a choke point on the data?

Spin it up to one or more clouds that support free public storage for data sets of public interest.

The problem of sorting through access requests is solved.

Just maybe researchers will want to address other questions, ones that aren’t necessarily about costs. And/or combine this data with other data. Like data on local pollution. (Although you would need historical data to make that work.)

Mapping this data set to other data sets could only magnify its importance.

Many thanks are owed to the Health Care Cost Institute for securing the data set.

But our thanks should not include electing the HCCI as censor of uses of this data set.

SQL Azure Labs Posts

Filed under: Azure Marketplace,Microsoft,SQL,Windows Azure,Windows Azure Marketplace — Patrick Durusau @ 10:36 am

Roger Jennings writes in Recent Articles about SQL Azure Labs and Other Value-Added Windows Azure SaaS Previews: A Bibliography:

I’ve been concentrating my original articles for the past six months or so on SQL Azure Labs, Apache Hadoop on Windows Azure and SQL Azure Federations previews, which I call value-added offerings. I use the term value-added because Microsoft doesn’t charge for their use, other than Windows Azure compute, storage and bandwidth costs or SQL Azure monthly charges and bandwidth costs for some of the applications, such as Codename “Cloud Numerics” and SQL Azure Federations.

As of 22 May 2012, there are forty-four (44) posts in the following categories:

  • Windows Azure Marketplace DataMarket plus Codenames “Data Hub” and “Data Transfer” from SQL Azure Labs
  • Apache Hadoop on Windows Azure from the SQL Server Team
  • Codename “Cloud Numerics” from SQL Azure Labs
  • Codename “Social Analytics” from SQL Azure Labs
  • Codename “Data Explorer” from SQL Azure Labs
  • SQL Azure Federations from the SQL Azure Team

If you need quick guides and/or incentives to use Windows Azure, try these on for size.

Uncertainty Principle for Serendipity?

Filed under: Analytics,Data Integration — Patrick Durusau @ 10:22 am

Curt Monash writes in Cool analytic stories:

There are several reasons it’s hard to confirm great analytic user stories. First, there aren’t as many jaw-dropping use cases as one might think. For as I wrote about performance, new technology tends to make things better, but not radically so. After all, if its applications are …

… all that bloody important, then probably people have already been making do to get it done as best they can, even in an inferior way.

Further, some of the best stories are hard to confirm; even the famed beer/diapers story isn’t really true. Many application areas are hard to nail down due to confidentiality, especially but not only in such “adversarial” domains as anti-terrorism, anti-spam, or anti-fraud.

How will we “know” when better data display/mining techniques enable more serendipity?

Anecdotal stories about serendipity abound.

Measuring serendipity requires knowing: (rate of serendipitous discoveries × importance of serendipitous discoveries) / opportunities for serendipitous discoveries.

We would need to add a multiplier effect for the impact that one serendipitous discovery may have in creating opportunities for other serendipitous discoveries (a serendipitous criticality point), and probably some other things I have overlooked.
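As a back-of-the-envelope sketch in code (the names and the cascade multiplier are mine, not a settled definition):

```python
# Back-of-the-envelope sketch of the serendipity measure above.
# Names and the cascade multiplier are illustrative, not a settled definition.

def serendipity_score(rate, importance, opportunities, cascade_multiplier=1.0):
    """(rate x importance) / opportunities, scaled by a multiplier for
    discoveries that open the door to further discoveries."""
    if opportunities == 0:
        return 0.0
    return cascade_multiplier * (rate * importance) / opportunities

# Example: 3 serendipitous finds of average importance 0.7 out of 200 chances,
# one of which triggered follow-on discoveries (multiplier 1.5).
print(serendipity_score(3, 0.7, 200, cascade_multiplier=1.5))
```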

What would you add to the equation?

All while realizing that we may be staring at the “right” answer and never recognize it.

How’s that for an uncertainty principle?

May 21, 2012

Call for Papers: PLoS Text Mining Collection

Filed under: Data Mining,Text Mining — Patrick Durusau @ 7:15 pm

Call for Papers: PLoS Text Mining Collection by Camron Assadi.

From the post:

The Public Library of Science (PLoS) seeks submissions in the broad field of text-mining research for a collection to be launched across all of its journals in 2013. All submissions submitted before October 30th, 2012 will be considered for the launch of the collection. Please read the following post for further information on how to submit your article.

The scientific literature is exponentially increasing in size, with thousands of new papers published every day. Few researchers are able to keep track of all new publications, even in their own field, reducing the quality of scholarship and leading to undesirable outcomes like redundant publication. While social media and expert recommendation systems provide partial solutions to the problem of keeping up with the literature, systematically identifying relevant articles and extracting key information from them can only come through automated text-mining technologies.

Research in text mining has made incredible advances over the last decade, driven through community challenges and increasingly sophisticated computational technologies. However, the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model. As Open Access publishing yields an ever-growing archive of unrestricted full-text articles, text mining will play an increasingly important role in drilling down to essential research and data in scientific literature in the 21st century scholarly landscape.

As part of its commitment to realizing the maximal utility of Open Access literature, PLoS is launching a collection of articles dedicated to highlighting the importance of research in the area of text mining. The launch of this Text Mining Collection complements related PLoS Collections on Open Access and Altmetrics (forthcoming), as well as the recent release of the PLoS Application Programming Interface, which provides an open API to PLoS journal content.

Highly recommend that you follow up on this publication opportunity.

I am less certain that: “…the promise of text mining to accelerate and enhance research largely has not yet been fulfilled, primarily since the vast majority of the published scientific literature is not published under an Open Access model.”

I don’t recall seeing any research on a connection between a lack of Open Access and the failure of text mining to accelerate research.

CiteSeer and arXiv have long been freely available in full text. If research were going to leap forward from open access alone, the opportunity has been there for years.

Open access does advance research and discovery but it isn’t a magic bullet. Accelerating and enhancing research is going to require more than simply indexing literature. A lot more.

A Look At Google BigQuery

Filed under: Business Intelligence,Google BigQuery — Patrick Durusau @ 6:55 pm

A Look At Google BigQuery

Chris Webb writes:

Over the years I’ve written quite a few posts about Google’s BI capabilities. Google never seems to get mentioned much as a BI tools vendor but to me it’s clear that it’s doing a lot in this area and is consciously building up its capabilities; you only need to look at things like Fusion Tables (check out these recently-added features), Google Refine and of course Google Docs to see that it’s pursuing a self-service, information-worker-led vision of BI that’s very similar to the one that Microsoft is pursuing with PowerPivot and Data Explorer.

Earlier this month Google announced the launch of BigQuery and I decided to take a look. Why would a Microsoft BI loyalist like me want to do this, you ask? Well, there are a number of reasons:

Looks like an even-handed report to me.

See what you think about it and BigQuery.

How do things go viral? Information diffusion in social networks.

Filed under: Social Networks,Viral — Patrick Durusau @ 10:53 am

How do things go viral? Information diffusion in social networks by Maksim Tsvetovat.

May 22, 2012 – 10 AM Pacific Time, Webcast

From the post:

“Going viral” is a holy grail of internet marketing — but beside the well-known memes and viral campaigns, there is a slower and quieter process of information diffusion. In fact, information diffusion is at the root of “viral nature” of some information. In this webcast, we will talk about the viral nature of information, adoption of attitudes and memes, and the way social networks evolve at the same time as people’s attitudes and desires. We will demonstrate some of these principles using models built in Python.

Admit it or not, we all want other people to use software or ideas that we like. If nothing else (like sales income) it provides validation.

Having a bunch of people like our stuff is even more validation.

Watch the video. Maybe it will work for you!
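For a sense of what “models built in Python” can look like in this space, here is a minimal independent-cascade sketch, a toy of mine rather than anything from Tsvetovat’s webcast:

```python
# Minimal independent-cascade diffusion on a toy network: each newly "infected"
# node gets one chance to pass the meme to each neighbor with probability p.
# Illustrative only -- not the models from the webcast.
import random

def simulate_cascade(neighbors, seeds, p=0.1, rng_seed=42):
    rng = random.Random(rng_seed)
    adopted = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for nb in neighbors.get(node, ()):
                if nb not in adopted and rng.random() < p:
                    adopted.add(nb)
                    next_frontier.append(nb)
        frontier = next_frontier
    return adopted

net = {0: [1, 2], 1: [2, 3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(simulate_cascade(net, seeds=[0], p=0.5))  # which nodes adopted the meme
```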

Solr 4 preview: SolrCloud, NoSQL, and more

Filed under: Lucene,NoSQL,Solr,SolrCloud — Patrick Durusau @ 10:32 am

Solr 4 preview: SolrCloud, NoSQL, and more

From the post:

The first alpha release of Solr 4 is quickly approaching, bringing powerful new features to enhance existing Solr powered applications, as well as enabling new applications by further blurring the lines between full-text search and NoSQL.

The largest set of features goes by the development code-name “Solr Cloud” and involves bringing easy scalability to Solr. Distributed indexing with no single points of failure has been designed from the ground up for near real-time (NRT), and NoSQL features such as realtime-get, optimistic locking, and durable updates.

We’ve incorporated Apache ZooKeeper, the rock-solid distributed coordination project that is immune to issues like split-brain syndrome that tend to plague other hand-rolled solutions. ZooKeeper holds the Solr configuration, and contains the cluster meta-data such as hosts, collections, shards, and replicas, which are core to providing an elastic search capability.

When a new node is brought up, it will automatically be assigned a role such as becoming an additional replica for a shard. A bounced node can do a quick “peer sync” by exchanging updates with its peers in order to bring itself back up to date. New nodes, or those that have been down too long, recover by replicating the whole index of a peer while concurrently buffering any new updates.

Run, don’t walk, to learn about the new features for Solr 4.

You won’t be disappointed.

Interested to see the “…blurring [of] the lines between full-text search and NoSQL.”

Would be even more interested to see the “…blurring of indexing and data/data formats.”

That is to say that data, along with its format, is always indexed in digital media.

So why can’t I see the data as a table, as a graph, as a …., depending upon my requirements?

No ETL, JVD – Just View Differently.
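To make that concrete, a toy of my own (not an existing tool): one small set of indexed records, two views, no transformation step in between. The record layout and names are made up for illustration.

```python
# Toy sketch of "JVD - Just View Differently": one data set, two views
# (table-like and graph-like), with no ETL step in between.
# The record layout and names are illustrative only.

records = [
    {"id": 1, "title": "Solr 4 preview", "links_to": [2]},
    {"id": 2, "title": "SolrCloud design", "links_to": []},
]

def as_table(recs):
    """Tabular view: one (id, title) row per record."""
    return [(r["id"], r["title"]) for r in recs]

def as_graph(recs):
    """Graph view: (source, target) edges derived from the same records."""
    return [(r["id"], target) for r in recs for target in r["links_to"]]

print(as_table(records))   # [(1, 'Solr 4 preview'), (2, 'SolrCloud design')]
print(as_graph(records))   # [(1, 2)]
```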

I suspect I will have to wait a while for that, but in the meantime, enjoy the Solr 4 alpha.

My Favorite Graphs

Filed under: Data Mining,Graphs — Patrick Durusau @ 9:49 am

My Favorite Graphs by Nina Zumel

From the post:

The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all. – William Cleveland, The Elements of Graphing Data, Chapter 2

In this article, I will discuss some graphs that I find extremely useful in my day-to-day work as a data scientist. While all of them are helpful (to me) for statistical visualization during the analysis process, not all of them will necessarily be useful for presentation of final results, especially to non-technical audiences.

I tend to follow Cleveland’s philosophy, quoted above; these graphs show me — and hopefully you — aspects of data and models that I might not otherwise see. Some of them, however, are non-standard, and tend to require explanation. My purpose here is to share with our readers some ideas for graphical analysis that are either useful to you directly, or will give you some ideas of your own.

I rather like that: “…can [we] see something that would have been harder to see otherwise or that could not have been seen at all.”

A good criterion for all data mining techniques or approaches.

You will like the graphs as well.

May 20, 2012

1940 US Census Indexing Progress Report—May 18, 2012

Filed under: Census Data,Indexing — Patrick Durusau @ 6:49 pm

1940 US Census Indexing Progress Report—May 18, 2012

From the post:

We’re finishing our 7th week of indexing and we are a breath away from having 40% of the entire collection indexed. I hear from so many people words of amazement at the things this indexing community has accomplished. In 7 weeks we’ve collectively indexed more than 55 million names. It is truly amazing. With 111,612 indexers now signed up to index and arbitrate, we have a formidable team making some great things happen. Let’s keep up the great work.

It is a popular data set, but that isn’t the whole story.

What do you think are the major factors that contribute to their success?

Finding Waldo, a flag on the moon and multiple choice tests, with R

Filed under: Graphics,Image Processing,Image Recognition,R — Patrick Durusau @ 6:28 pm

Finding Waldo, a flag on the moon and multiple choice tests, with R by Arthur Charpentier.

From the post:

I have to admit, first, that finding Waldo has been a difficult task. And I did not succeed. Neither could I correctly spot his shirt (because actually, it was what I was looking for). You know, that red-and-white striped shirt. I guess it should have been possible to look for Waldo’s face (assuming that his face does not change) but I still have problems with size factor (and resolution issues too). The problem is not that simple. At the http://mlsp2009.conwiz.dk/ conference, a prize was offered for writing an algorithm in Matlab. And one can even find Mathematica codes online. But most of those algorithms are based on the idea that we look for similarities with Waldo’s face, as described in problem 3 on http://www1.cs.columbia.edu/~blake/‘s webpage. You can find papers on that problem, e.g. Friendly & Kwan (2009) (based on statistical techniques, but Waldo is here a pretext to discuss other issues actually), or more recently (but more complex) Garg et al. (2011) on matching people in images of crowds.

Not sure how often you will want to find Waldo but then you may not be looking for Waldo.

Tipped off to this post by Simply Statistics.

…Commenting on Legislation and Court Decisions

Filed under: Annotation,Law,Legal Informatics — Patrick Durusau @ 6:16 pm

Anderson Releases Prototype System Enabling Citizens to Comment on Legislation and Court Decisions

Legalinformatics brings news that:

Kerry Anderson of the African Legal Information Institute (AfricanLII) has released a prototype of a new software system enabling citizens to comment on legislation, regulations, and court decisions.

There are several initiatives like this one, which is encouraging from the perspective of crowd-sourcing data for annotation.

Talend Updates

Filed under: Data Integration,Talend — Patrick Durusau @ 1:52 pm

Talend updates data tools to 5.1.0

From the post:

Talend has updated all the applications that run on its Open Studio unified platform to version 5.1.0. Talend’s Open Studio is an Eclipse-based environment that hosts the company’s Data Integration, Big Data, Data Quality, MDM (Master Data Management) and ESB (Enterprise Service Bus) products. The system allows a user to, using the Data Integration as an example, use a GUI to define processes that can extract data from the web, databases, files or other resources, process that data, and feed it on to other systems. The resulting definition can then be compiled into a production application.

In the 5.1.0 update, Open Studio for Data Integration has, according to the release notes, been given enhanced XML mapping and support for XML documents in its SOAP, JMS, File and MOM components. A new component has also been added to help manage Kerberos security. Open Studio for Data Quality has been enhanced with new ways to apply an analysis on multiple files, and the ability to drill down through business rules to see the invalid, as well as valid, records selected by the rules.

I am upgrading following a motherboard failure, so I will be throwing the latest version of the software on the new box.

Comments or suggestions on the Talend updates?

Top 10 challenging problems in data mining

Filed under: Data Mining — Patrick Durusau @ 1:12 pm

Top 10 challenging problems in data mining by Sandro Saitta (March 27, 2008)

I mention the date of this post because the most recent response to it was four days ago, May 15, 2012.

I should write a post that gets comments that long after publication!

Sandro writes:

In a previous post, I wrote about the top 10 data mining algorithms, a paper that was published in Knowledge and Information Systems. The “selective” process is the same as the one that has been used to identify the most important (according to answers of the survey) data mining problems. The paper by Yang and Wu has been published (in 2006) in the International Journal of Information Technology & Decision Making. The paper contains the following problems (in no specific order):

  • Developing a unifying theory of data mining
  • Scaling up for high dimensional data and high speed data streams
  • Mining sequence data and time series data
  • Mining complex knowledge from complex data
  • Data mining in a network setting
  • Distributed data mining and mining multi-agent data
  • Data mining for biological and environmental problems
  • Data Mining process-related problems
  • Security, privacy and data integrity
  • Dealing with non-static, unbalanced and cost-sensitive data
It’s a little over five years later.

Same list? Different list?

BTW, the 2006 article by Yang and Wu, along with slides, can be found at: 10 Challenging Problems in Data Mining Research

The full citation of the article is:

Qiang Yang and Xindong Wu (Contributors: Pedro Domingos, Charles Elkan, Johannes Gehrke, Jiawei Han, David Heckerman, Daniel Keim, Jiming Liu, David Madigan, Gregory Piatetsky-Shapiro, Vijay V. Raghavan, Rajeev Rastogi, Salvatore J. Stolfo, Alexander Tuzhilin, and Benjamin W. Wah), 10 Challenging Problems in Data Mining Research, International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

While searching for this paper I encountered:

Xindong Wu’s Publications in Data Mining and Machine Learning

Pick any paper at random and you are likely to learn something new.

    An Example of Social Network Analysis with R using Package igraph

    Filed under: igraph,Networks,R,Social Networks — Patrick Durusau @ 10:33 am

    An Example of Social Network Analysis with R using Package igraph by Yanchang Zhao.

    From the post:

    This post presents an example of social network analysis with R using package igraph.

    The data to analyze is Twitter text data of @RDataMining used in the example of Text Mining, and it can be downloaded as file “termDocMatrix.rdata” at the Data webpage. Putting it in a general scenario of social networks, the terms can be taken as people and the tweets as groups on LinkedIn, and the term-document matrix can then be taken as the group membership of people. We will build a network of terms based on their co-occurrence in the same tweets, which is similar with a network of people based on their group memberships.

    I like the re-use of traditional social network analysis with tweets.

    And the building of a network of terms based on co-occurrence.
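The same co-occurrence idea fits in a few lines of Python (the post itself uses R and igraph); the tiny term-document matrix below is made up for illustration:

```python
# Term co-occurrence network from a term-document matrix: cooc = M * M^T.
# A Python restatement of the idea; the post itself uses R and igraph.
import numpy as np

terms = ["data", "mining", "r", "examples"]
termdoc = np.array([
    [1, 1, 0, 1],   # data
    [1, 1, 0, 0],   # mining
    [0, 1, 1, 1],   # r
    [0, 0, 1, 1],   # examples
])

cooc = termdoc @ termdoc.T        # term-by-term co-occurrence counts
np.fill_diagonal(cooc, 0)         # drop self-loops

edges = [(terms[i], terms[j], int(cooc[i, j]))
         for i in range(len(terms))
         for j in range(i + 1, len(terms))
         if cooc[i, j] > 0]
print(edges)                      # weighted edges of the term network
```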

It may or may not serve your purposes, but:

    If you don’t look, you won’t see.

    Crash Course in Erlang

    Filed under: Erlang,Functional Programming — Patrick Durusau @ 9:12 am

    Crash Course in Erlang by Knut Hellan.

    Knut writes:

    This is a summary of a talk I held Monday May 14 2012 at an XP Meetup in Trondheim. It is meant as a teaser for listeners to play with Erlang themselves.

    First, some basic concepts. Erlang has a form of constant called atom that is defined on first use. They are typically used as enums or symbols in other languages. Variables in Erlang are [im]mutable so assigning a new value to an existing variable is not allowed. (emphasis added)

    Not so much an introduction as a tease to get you to learn more Erlang.

    Some typos but look upon those as a challenge to verify what you are reading.

    I may copy this post “as is” and use it as a “critical reading/research” assignment for my class.

    Then have the students debate their corrections.

That could be a very interesting exercise in not taking everything you read on blind faith: how do you verify what you have read and, in the process, evaluate that material as well?

    Do you develop a sense of trust for some sources as being “better” than others? Are there ones you turn to by default?

    May 19, 2012

    Developing Your Own Solr Filter

    Filed under: Lucene,Solr — Patrick Durusau @ 7:45 pm

    Developing Your Own Solr Filter

    Rafał Kuć writes:

Sometimes Lucene and Solr out-of-the-box functionality is not enough. When such a time comes, we need to extend what Lucene and Solr give us and create our own plugin. In today’s post I’ll try to show how to develop a custom filter and use it in Solr.

    Assumptions

Let’s assume that we need a filter that would allow us to reverse every word we have in a given field. So, if the input is “solr.pl” the output would be “lp.rlos”. It’s not the hardest example, but for the purpose of this entry it will be enough. One more thing – I decided to omit describing how to set up your IDE, how to compile your code, build the jar and stuff like that. We will only focus on the code.

    Template for creating your own Solr filter.

I persist in thinking that as “big data” arrives, the potential for ETL is going to decline. Where will you put your “big data” while processing it?

It is much more likely that we will index “big data” in place and perform operations on the indexes to extract a subset of it.

    So in terms of matching up data from different schemas or formats, what sort of filters will you be using?

    Searching For An Honest Engineer

    Filed under: Google Knowledge Graph,RDF,Semantic Web — Patrick Durusau @ 7:28 pm

Sean Golliher needs to take his lantern to search for an honest engineer at the W3C.

    Sean writes in Google Just Hi-jacked the Semantic Web Vocabulary:

    Google announced they’re rolling out new enhancements to their search technology and they’re calling it the “Knowledge Graph.” For those involved in the Semantic Web Google’s “Knowledge Graph” is nothing new. After watching the video, and reading through the announcements, the Google engineers are giving the impression, to those familiar with this field, that they have created something new and innovative.

While it’s commendable that Google is improving search, it’s interesting to note the direct translations of Google’s “new language” to the existing semantic web vocabulary. Normally engineers and researchers quote, or at least reference, the original sources of their ideas. One can’t help but notice that the semantic web isn’t mentioned in any of Google’s announcements. After watching the different reactions from the semantic web community I found that many took notice of the language Google used and how the ideas from the semantic web were repackaged as “new” and discovered by Google.

    Did you know that the W3C invented the ideas for:

    • Knowledge Graph
    • Relationships Between things
    • Naming things Better (Taxonomy?)
    • Objects/Entities
    • Ambiguous Language (Semantics?)
    • Connecting Things
    • discover new, and relevant, things you like (Serendipity?)
    • meaning (Semantic?)
    • graph (RDF?)
    • things (URIs (Linked Data)?)
    • real-world entities and their relationships to one another: things (Linked Data?)

    ?

    Really? Semantic, serendipity, graph, relationships between real-world entities?

    All invented by the W3C and/or carefully crediting prior work.

    Right.

    Good luck with your search Sean.

    Hands-on examples of legal search

    Filed under: e-Discovery,Law,Legal Informatics,Searching — Patrick Durusau @ 7:04 pm

    Hands-on examples of legal search by Michael J. Bommarito II.

    From the post:

    I wanted to share with the group some of my recent work on search in the legal space. I have been developing products and service models, but I thought many of the experiences or guides could be useful to you. I would love to share some of this work to help foster a “hacker” community in which we might collaborate on projects.

    The first few posts are based on Amazon’s CloudSearch service. CloudSearch, as the name suggests, is a “cloud-based” search service. Once you decide what and how you would like to search, Amazon handles procuring the underlying infrastructure, scaling to required capacity, stemming, stop-wording, building indices, etc. For those of you who do not have access to “search appliances” or labor to configure products like Solr, this offers an excellent opportunity.

Pointers to several posts by Michael that range from searching U.S. Supreme Court decisions and email archives to statutory law.

    From law to eDiscovery, something for everybody!

    Popescu by > 1100 Words

    Filed under: Hadoop — Patrick Durusau @ 6:49 pm

    Possible Hadoop Trajectories

    In the red corner, using 1245 words to trash Hadoop, are Michael Stonebraker and Jeremy Kepner.

In the blue corner, using 82 words to show the challengers need to follow Hadoop more closely, is Alex Popescu.

    Sufficient ignorance can make any technology indistinguishable from hype.

    Apache HCatalog 0.4.0 Released

    Filed under: Hadoop,HCatalog — Patrick Durusau @ 5:02 pm

    Apache HCatalog 0.4.0 Released by Alan Gates.

    From the post:

    In case you didn’t see the news, I wanted to share the announcement that HCatalog 0.4.0 is now available.

    For those of you that are new to the project, HCatalog provides a metadata and table management system that simplifies data sharing between Apache Hadoop and other enterprise data systems. You can learn more about the project on the Apache project site.

    From the HCatalog documentation (0.4.0):

    HCatalog is a table and storage management layer for Hadoop that enables users with different data processing tools – Pig, MapReduce, and Hive – to more easily read and write data on the grid. HCatalog’s table abstraction presents users with a relational view of data in the Hadoop distributed file system (HDFS) and ensures that users need not worry about where or in what format their data is stored – RCFile format, text files, or sequence files.

    HCatalog supports reading and writing files in any format for which a SerDe can be written. By default, HCatalog supports RCFile, CSV, JSON, and sequence file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.

    Being curious about a reference to partitions having the capacity to be multidimensional, I set off looking for information on supported data types and found:

The table shows how Pig will interpret each HCatalog data type:

HCatalog Data Type                                            Pig Data Type
primitives (int, long, float, double, string)                 int, long, float, double, string to chararray
map (key type should be string, value type must be string)    map
List<any type>                                                 bag
struct<any type fields>                                        tuple

    The Hadoop ecosystem is evolving at a fast and furious pace!
