How Hadoop HDFS Works – Cartoon
Explanations don’t have to be complicated; they just have to be clear. Here is a clear one about HDFS. (Click on the image for a full-size view.)
First saw this on myNoSQL.
FOSDEM 2012 – Free and Open Software Developers’ European Meeting – 4-5 February 2012
From the webpage:
Not coming to FOSDEM? Want to participate anyway? Do not despair! This year, the FOSDEM team is proud to announce the availability of streaming video from a select number of our rooms.
Thanks to the support of Fluendo, we’ll be able to provide you with Ogg Theora and WebM versions of our streams.
Details on where and how to access these streams will be posted later as FOSDEM draws nearer. Watch this space!
The social aspects and random ideas/discussions at conferences are real values that need to continue.
But in tight budgetary times, videos (live or recorded) of presentations need to become the norm for all conferences.
An update on Apache Hadoop 1.0 from Cloudera by Charles Zedlewski.
From the post:
Some users & customers have asked about the most recent release of Apache Hadoop, v1.0: what’s in it, what it followed and what it preceded. To explain this we should start with some basics of how Apache projects release software:
By and large, in Apache projects new features are developed on a main codeline known as “trunk.” Occasionally very large features are developed on their own branches with the expectation they’ll later merge into trunk. While new features usually land in trunk before they reach a release, there is not much expectation of quality or stability. Periodically, candidate releases are branched from trunk. Once a candidate release is branched it usually stops getting new features. Bugs are fixed and after a vote, a release is declared for that particular branch. Any member of the community can create a branch for a release and name it whatever they like.
About as clear an explanation of the Apache process and current state of Hadoop releases as is possible, given the facts Charles had to work with.
Still, for the average Cloudera user, I think something along the lines of:
There has been some confusion over the jump from 0.2* versions of Hadoop to a release of Hadoop 1.0 at Apache.
You have not missed various 0.3* and later releases!
Like political candidates, Apache releases can call themselves anything they like. The Hadoop project leaders decided to call a recent release Hadoop 1.0. Given the confusion this caused, maybe we will see more orderly naming in the future. Maybe not.
If you have CDH3, then you have all the features of the recent “Hadoop 1.0” and have had them for almost one year. (If you don’t have CDH3, you may wish to consider upgrading.)
(then conclude with)
[T]he CDH engineering team is comprised of more than 20 engineers that are committers and PMC members of the various Apache projects who can shape the innovation of the extended community into a single coherent system. It is why we believe demonstrated leadership in open source contribution is the only way to harness the open innovation of the Apache Hadoop ecosystem.
would have been better.
People are looking for a simple explanation with some reassurance that all is well.
Searching relational content with Lucene’s BlockJoinQuery
Mike McCandless writes:
Lucene’s 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.
Most search engines can’t directly index relational content, as documents in the index logically behave like a single flat database table. Yet, relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have a separate list of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files.
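The underlying problem is easy to see without any search engine at all. When parent and child records are flattened into one document, a query can match across children that never co-occurred. A toy sketch in Python (no Lucene involved, just the logic):

```python
# Two job listings for one company, flattened into a single "document".
# Flattening merges child fields, so a query can match across listings.

flat_doc = {
    "company": "Acme",
    "skill": ["java", "python"],   # from listing 1 and listing 2
    "years": [5, 1],               # from listing 1 and listing 2
}

# Query: a listing requiring java with only 1 year of experience.
# The flat document "matches" even though no single listing does.
matches_flat = "java" in flat_doc["skill"] and 1 in flat_doc["years"]
print(matches_flat)  # True -- a false positive

# Keeping children as separate sub-documents preserves the join.
nested_doc = {
    "company": "Acme",
    "listings": [
        {"skill": "java", "years": 5},
        {"skill": "python", "years": 1},
    ],
}

matches_nested = any(
    listing["skill"] == "java" and listing["years"] == 1
    for listing in nested_doc["listings"]
)
print(matches_nested)  # False -- correct
```

Index-time join keeps the child documents as a contiguous block next to their parent, which is what makes the correct behavior cheap at query time.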
Mike covers how to index relational content with Lucene 3.4.0 as well as the current limitations on that relational indexing. Current work is projected to resolve some of those limitations.
This feature will be immediately useful in a number of contexts.
Even more promising is the development of thinking about indexing as more than term -> document. Both sides of that operator need more granularity.
ISBRA 2012 : International Symposium on Bioinformatics Research and Applications
Dates:
When: May 21, 2012 – May 23, 2012
Where: Dallas, Texas
Submission Deadline: Feb 6, 2012
Notification Due: Mar 5, 2012
Final Version Due: Mar 15, 2012
From the call for papers:
The International Symposium on Bioinformatics Research and Applications (ISBRA) provides a forum for the exchange of ideas and results among researchers, developers, and practitioners working on all aspects of bioinformatics and computational biology and their applications. Submissions presenting original research are solicited in all areas of bioinformatics and computational biology, including the development of experimental or commercial systems.
Geoff
This link is broken. New URL (and somewhat different explanatory text): http://nigelsmall.com/geoff
Nigel Small writes:
Geoff is a declarative notation for representing graph data within concise human-readable text, designed specifically with Neo4j in mind. It can be used to store snapshots within a flat file or to transmit data changes over a network stream.
A Geoff data set or file consists of a sequence of rules. Each rule comprises a descriptor and an optional set of data held as key:value pairs. The descriptor is a sequence of tokens, somewhat similar to the notation used in the Cypher query language and can designate additive or subtractive requirements for nodes and relationships as well as manipulations to index entries.
This looks like it will repay close study fairly quickly. More to follow.
wikistream
From the about page:
wikistream is an experimental visualization of realtime edits in major language Wikipedias. Every time someone updates or creates a Wikipedia article you will see it ever so briefly in this list. And if someone uploads an image file to the Wikimedia Commons you should see the image background update.
If you’d like to pause the display at any time just hit ‘p’. To continue scrolling hit ‘p’ again. If you hover over the link you should see the comment associated with the change.
Hopefully wikistream provides a hint of just how active the community is around Wikipedia. wikistream was created to help recognize the level of involvement of folks around the world, who are actively engaged in making Wikipedia the amazing resource that it is.
wikistream was also an excuse to experiment with node.js and redis. node connects to the Wikimedia IRC server, joins the wikipedia channels and pushes updates into redis via pub/sub, which the node webapp is then able to deliver into your browser through the magic of socket.io. If you are curious, want to add/suggest something, or run the app yourself, check out the code on GitHub.
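As a sketch of that plumbing, here is what the consuming side of such a pipeline can look like in Python with redis-py. The channel name and message fields below are hypothetical; wikistream’s actual keys may differ:

```python
# Minimal sketch of a wikistream-style consumer: subscribe to a redis
# pub/sub channel and hand edits to clients. The channel name
# "wikipedia" and the message fields are guesses, not wikistream's.
import json
import redis

r = redis.Redis(host="localhost", port=6379)
pubsub = r.pubsub()
pubsub.subscribe("wikipedia")

for message in pubsub.listen():
    if message["type"] != "message":
        continue  # skip subscription confirmations
    edit = json.loads(message["data"])
    # wikistream pushes these to browsers via socket.io;
    # here we just print the page title and edit comment.
    print(edit.get("page"), "-", edit.get("comment"))
```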
I’m not sure if this is eye-candy or something different.
Could be a testing stream (low-speed) for unpredictable content to be integrated into a larger whole.
Could be navigational exercise of data or those editing the data.
Could be filtering exercise.
Could be identification exercise, across multiple natural languages.
Or, could be something completely different. What do you think?
Spark: Cluster Computing with Working Sets
From the post:
One of the aspects you can’t miss even as you just begin reading this paper is the strong scent of functional programming that the design of Spark bears. The use of FP idioms is quite widespread across the architecture of Spark, such as the ability to restore a partition by applying a closure block, operations such as reduce and map/collect, distributed accumulators, etc. It would suffice to say that it is a very functional system. Pun intended!
Spark is written in Scala and is well suited for the class of applications that reuse a working set of data across multiple parallel operations. It claims to outperform Hadoop by 10x in iterative machine learning jobs, and has been tried successfully to interactively query a 39 GB dataset with sub-second response time!
It is built on top of Mesos, a resource management infrastructure that lets multiple parallel applications share a cluster in a fine-grained manner and provides an API for applications to launch tasks on a cluster.
Developers write a driver program that orchestrates various parallel operations. Spark’s programming model provides two abstractions to work with large datasets: resilient distributed datasets and parallel operations. In addition, it supports two kinds of shared variables.
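The paper’s examples are in Scala; as a rough sketch of the same programming model using the later Python API (which postdates the paper — the HDFS path and numbers here are hypothetical):

```python
# Sketch of Spark's model: a cached working set reused across parallel
# operations, plus an accumulator (one kind of shared variable).
from pyspark import SparkContext

sc = SparkContext("local[*]", "working-set-demo")
bad_lines = sc.accumulator(0)  # shared variable: add-only counter

def parse(line):
    try:
        return [[float(x) for x in line.split()]]
    except ValueError:
        bad_lines.add(1)
        return []  # drop unparseable lines, but count them

# A resilient distributed dataset, kept in memory across operations.
points = sc.textFile("hdfs:///data/points.txt").flatMap(parse).cache()

# Repeated parallel operations scan the cached working set instead of
# re-reading from disk, which is where the iterative speedup comes from.
for i in range(10):
    positives = points.filter(lambda p: p[0] > 0).count()
    print("iteration", i, "positives:", positives)

print("unparseable lines:", bad_lines.value)
```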
If more technical papers had previews like this one, more technical papers would be read!
Interesting approach at first blush. I am not sure I make that much out of sub-second queries on a 39 GB dataset, as that is a physical memory issue these days. I do like the idea of sets of data, subject to repeated operations.
New: Spark Project Homepage.
Relevance Tuning and Competitive Advantage via Search Analytics
It must be all the “critical” evaluation of infographics I have been reading but I found myself wondering about the following paragraph:
This slide shows how Search Analytics can be used to help with A/B testing. Concretely, in this slide we see two Solr Dismax handlers selected on the right side. If you are not familiar with Solr, think of a Dismax handler as an API that search applications call to execute searches. In this example, each Dismax handler is configured differently and thus each of them ranks search hits slightly differently. On the graph we see the MRR (see Wikipedia page for Mean Reciprocal Rank details) for both Dismax handlers and we can see that the one corresponding to the blue line is performing much better. That is, users are clicking on search hits closer to the top of the search results page, which is one of several signals of this Dismax handler providing better relevance ranking than the other one. Once you have a system like this in place you can add more Dismax handlers and compare 2 or more of them at a time. As the result, with the help of Search Analytics you get actual, real feedback about any changes you make to your search engine. Without a tool like this, you cannot really tune your search engine’s relevance well and will be doing it blindly.
Particularly the line:
That is, users are clicking on search hits closer to the top of the search results page, which is one of several signals of this Dismax handler providing better relevance ranking than the other one.
Really?
Here is one way to test that assumption:
For any search, report as the #1 or #2 result “private cell-phone number for …” and fill in one of the top ten movie actresses for 2011. You can do better than that: make sure the cell-phone number is one that rings at your search analytics desk. Now see how many users are “…clicking on search hits closer to the top of the search results page….”
Are your results more relevant than a movie star?
Don’t get me wrong, search analytics are very important, but let’s not get carried away about what we can infer from largely opaque actions.
Some other questions: Did users find the information they needed? Can they make use of that information? Does that use improve some measurable or important aspect of the company business? Let’s broaden search analytics to make search results less opaque.
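That said, if you want to track the metric being graphed yourself, MRR is cheap to compute from a click log; a minimal sketch:

```python
# Mean Reciprocal Rank from a click log: for each query, take the
# rank of the first clicked result and average the reciprocals.
def mean_reciprocal_rank(first_click_ranks):
    """first_click_ranks: 1-based rank of the first click per query."""
    return sum(1.0 / r for r in first_click_ranks) / len(first_click_ranks)

# Handler A's users click nearer the top than handler B's.
print(mean_reciprocal_rank([1, 2, 1, 3]))  # ~0.71
print(mean_reciprocal_rank([4, 5, 2, 6]))  # ~0.28
```

The number is easy to produce; as argued above, what it actually tells you about relevance is the harder question.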
Mapping the Iowa caucus results: how it’s done with R
David Smith writes:
If you’ve been following the presidential primary process here in the US, you’ve probably seen many maps of the results of the Iowa caucuses by now (such as this infamous one from Fox News). But you might be interested to learn how such maps can be made using the R language.
BTW, David includes pointers to Offensive Politics, which self-describes as:
offensive politics uses technology and math to help progressives develop strategy, raise money and target voters to win elections.
A number of interesting projects and data sets that could be used with topic maps.
Other sources of political data, techniques or software?
Kaiser Fung writes:
Megan McArdle (The Atlantic) is starting a war on the infographics plague. (Here, infographics means infographics posters.) Excellent debunking, and absorbing reading.
It’s a long post. Her overriding complaint is that designers of these posters do not verify their data. The “information” shown on these charts is frequently inaccurate, and the interpretation is sloppy.
In the Trifecta checkup framework, this data deficiency breaks the link between the intent of the graphic and the (inappropriate) data being displayed. (Most infographics posters also fail to find the right chart type for the data being displayed.)
There are two reasons to read this post and then to follow up with Megan’s:
First, it may (no guarantees) sharpen your skills at detecting infographics that are misleading, fraudulent or simply wrong.
Second, if you want to learn how to make effective and misleading, fraudulent or simply wrong infographics, Megan’s article is a starting place with examples.
Nice article on predictive analytics in insurance
James Taylor writes:
Patrick Sugent wrote a nice article on A Predictive Analytics Arsenal in Claims magazine recently. The article is worth a read and, if this is a topic that interests you, check out our white paper on next generation claims systems or the series of blog posts on decision management in insurance that I wrote after I did a webinar with Deb Smallwood (an insurance industry expert quoted in the article).
The article is nice but I thought the white paper was better. Particularly this passage:
Next generation claims systems with Decision Management focus on the decisions in the claims process. These decisions are managed as reusable assets and made widely available to all channels, processes and systems via Decision Services. A decision-centric approach enables claims feedback and experience to be integrated into the whole product life cycle and brings the company’s know-how and expertise to bear at every step in the claims process.
At the heart of this new mindset is an approach for replacing decision points with Decision Services and improving business performance by identifying the key decisions that drive value in the business and improving on those decisions by leveraging a company’s expertise, data and existing systems.
Insurers are adopting Decision Management to build next generation claims systems that improve claims processes.
In topic map lingo, “next generation claims systems” are going to treat decisions as subjects that can be identified and re-used to improve the process.
Decisions are made every day in claims processing, but current systems don’t identify them as subjects, so re-use simply isn’t possible.
True enough, the proposal in the white paper does not allow for merging of decisions identified by others, but that doesn’t look like a requirement in their case. They need to be able to identify decisions they make and feed them back into their systems.
The other thing I liked about the white paper was the recognition that hard coding decision rules by IT is a bad idea. (full stop) You can take that one to the bank.
Of course, remember what James says about changes:
Most policies and regulations are written up as requirements and then hard-coded after waiting in the IT queue, making changes slow and costly.
But he omits that hard-coding empowers IT because any changes have to come to IT for implementation.
Making changes possible by someone other than IT will empower that someone else and diminish IT.
Who knows what, and when they get to know it, is a question of power.
Topic maps and other means of documentation/disclosure have the potential to shift balances of power in an organization.
May as well say that up front so we can start identifying the players, who will cooperate, who will resist. And experimenting with what might work as incentives to promote cooperation. Which can be measured just like you measure other processes in a business.
Cynthia Murrell of BeyondSearch writes:
Wired Enterprise gives us a glimpse into MapR, a new distribution for Apache Hadoop, in “Ex-Google Man Sells Search Genius to Rest of World.” The ex-Googler in this case is M.C. Srivas, who was so impressed with Google’s MapReduce platform that he decided to spread its concepts to the outside world.
Sounds great! So I head over to the MapR site and choose Unique Features of MapR Hadoop Distribution, where I find a list of unique features.
Maybe I am missing it. Do you see any Search Genius in that list?
MapR may have improved the usability/reliability of Hadoop, which is no small thing, but disappointing when looking for better search results.
Let’s represent the original Hadoop with this Wikipedia image: [image omitted]
and the MapR version of Hadoop with this Wikipedia image: [image omitted]
It is true that the MapR version has more unique features but none of them appear to relate to search.
I am sure that Hadoop cluster managers and others will be interested in MapR (as will some of the rest of us), as managers.
As searchers, we may have to turn somewhere else. Do you disagree?
PS: Cloudera has made more contributions to the Hadoop and Apache communities than I can list in a very long post. Keep that in mind when you see ill-mannered and juvenile sniping at their approach to Hadoop.
A first glimpse of how AIF is supporting interchange on the Argument Web
From the post:
Prototype development on infrastructure and basic tools has reached the point where we can get a first glimpse of how the Argument Web will support a wide range of argument-related practice online. The video shows how different argument analysis tools can interact with each other, and how tools for analysis can work in harmony with tools for argument authoring and debate.
All the software is currently available, and going through some final testing before release. Later on in January, we will open access to the AIF database, and the first set of import/export filters. Then in February, we will release a public beta of the first practical Argument Web tool: FireBack, a Firefox plugin for argublogging. Tools for debate, analysis and automated computation will then follow later in the Spring.
I must admit to being curious what “argublogging” looks like. I suspect it will have a remarkable resemblance to what we call “flame wars” on email discussion lists.
Jack Park, who forwarded this link, assures me that there are other forms of argumentation, sometimes using the term “dialogue.” I don’t doubt that to be true, but how common is it, in fact? I have my doubts.
If I were to watch any of the political “debates” for the U.S. presidential election, I would assure Jack that “debates” they were not. Incivility, lying, false factual claims, and non-responsiveness, all with the goal of saying what they came to say, would be a better characterization. And that is from just reading the newspaper accounts. (Easier to skim and so not to waste time on being misinformed by the candidates.) A Magic 8-Ball would be a better source of answers for public policy.
Distributed Indexing – SolrCloud
Not for the faint of heart but I noticed that progress is being made on distributed indexing for the SolrCloud project.
Whether you are a hard core coder or someone who is interested in using this feature (read feedback), now would be a good time to start paying attention to this work.
I added a new category for “Distributed Indexing” because this isn’t only going to come up for Solr. And I suspect there are aspects of “distributed indexing” that are going to be applicable to distributed topic maps as well.
Caching in HBase: SlabCache by Li Pi.
From the post:
The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, it’s very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in-memory caching to boost read performance.
However, despite the availability of high memory servers, the garbage collection algorithms available on production quality JDK’s have not caught up. Attempting to use large amounts of heap will result in the occasional stop-the-world pause that is long enough to cause stalled requests and timeouts, thus noticeably disrupting latency sensitive user applications.
Introduces management of the file system cache for those with loads and memory to justify and enable it.
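The essence of the slab approach is to pre-allocate big buffers once and serve fixed-size blocks out of them, so cached data stops churning the garbage collector. A toy sketch of the allocation scheme (illustrative only, not HBase’s implementation, and the sizes are made up):

```python
# Toy slab cache: allocate large slabs up front, then serve fixed-size
# blocks out of them with LRU eviction. Illustrates the idea behind
# SlabCache; not HBase's implementation, and the sizes are made up.
from collections import OrderedDict

SLAB_SIZE = 1 << 20    # 1 MB per slab
BLOCK_SIZE = 64 << 10  # 64 KB per cached block

class SlabCache:
    def __init__(self, num_slabs):
        self.slabs = [bytearray(SLAB_SIZE) for _ in range(num_slabs)]
        # Free list of (slab index, byte offset) slots.
        self.free = [(s, o * BLOCK_SIZE)
                     for s in range(num_slabs)
                     for o in range(SLAB_SIZE // BLOCK_SIZE)]
        self.index = OrderedDict()  # key -> (slab, offset, length), LRU

    def put(self, key, data):
        if len(data) > BLOCK_SIZE:
            return  # oversized blocks bypass the cache
        if key in self.index:  # reuse the slot on overwrite
            slab, off, _ = self.index.pop(key)
            self.free.append((slab, off))
        if not self.free:  # evict the least recently used block
            _, (slab, off, _) = self.index.popitem(last=False)
            self.free.append((slab, off))
        slab, off = self.free.pop()
        self.slabs[slab][off:off + len(data)] = data
        self.index[key] = (slab, off, len(data))

    def get(self, key):
        if key not in self.index:
            return None
        self.index.move_to_end(key)  # mark as recently used
        slab, off, length = self.index[key]
        return bytes(self.slabs[slab][off:off + length])

cache = SlabCache(num_slabs=4)
cache.put("block-1", b"some HFile block bytes")
print(cache.get("block-1"))
```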
Quite interesting work, particularly if you are ignoring the nay-sayers about the adoption of Hadoop and the Cloud in the coming year.
What the nay-sayers are missing is that yes, unimaginative mid-level managers and admins have no interest in Hadoop or the Cloud. What Hadoop and the Cloud present are opportunities that imaginative re-packagers and re-processing startups are going to use to provide new data streams and services.
Can’t ask startups that don’t exist yet why they have chosen to go with Hadoop and the Cloud.
That goes unnoticed by unimaginative commentators who reflect the opinions of uninformed managers, whose opinions are confirmed by the publication of the columns by unimaginative commentators. One of those feedback loops I mentioned earlier today.
Statistical Rules of Thumb, Part III – Always Visualize the Data
From the post:
As I perused Statistical Rules of Thumb again, as I do from time to time, I came across this gem. (note: I live in CA, so get no money from these amazon links).
Van Belle uses the term “Graph” rather than “Visualize”, but it is the same idea. The point is to visualize in addition to computing summary statistics. Summaries are useful, but can be deceiving; any time you summarize data you will lose some information unless the distributions are well behaved. The scatterplot, histogram, box and whiskers plot, etc. can reveal ways the summaries can fool you. I’ve seen these as well, especially variables with outliers or that are bi- or tri-modal.
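A quick way to convince yourself: two samples with nearly identical summary statistics, one of them bimodal, which only a graph reveals. A sketch:

```python
# Two samples with near-identical mean and standard deviation,
# one unimodal and one bimodal; only a plot reveals the difference.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
unimodal = rng.normal(0, 2.24, 10_000)
bimodal = np.concatenate([rng.normal(-2, 1, 5_000),
                          rng.normal(2, 1, 5_000)])

for name, x in [("unimodal", unimodal), ("bimodal", bimodal)]:
    print(f"{name}: mean={x.mean():.2f} sd={x.std():.2f}")  # both ~0, ~2.24

fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist(unimodal, bins=50)
axes[0].set_title("unimodal")
axes[1].hist(bimodal, bins=50)
axes[1].set_title("bimodal")
plt.show()
```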
What techniques do you use in visualizing topic maps? Such as hiding topics or associations? Or coloring schemes that appear to work better than others? Or do you integrate the information delivered by the topic map with other visualizations? Such as street maps, blueprints or floor plans?
Post seen at: Data Mining and Predictive Analytics
From the post:
Talend has been around for about 6 years and the original focus was on “democratizing” data integration – making it cheaper, easier, quicker and less maintenance-heavy. They originally wanted to build an open source alternative for data integration. In particular they wanted to make sure that there was a product that worked for smaller companies and smaller projects, not just for large data warehouse efforts.
Talend has 400 employees in 8 countries and 2,500 paying customers for their Enterprise product. Talend uses an “open core” philosophy where the core product is open source and the enterprise version wraps around this as a paid product. They have expanded from pure data integration into a broader platform with data quality and MDM and a year ago they acquired an open source ESB vendor and earlier this year released a Talend branded version of this ESB.
I have the Talend software but need to spend some time working through the tutorials, etc.
The review will be from a perspective of subject identity and re-use of subject identification.
It may help me to simply start posting as I work through the software rather than waiting to create an edited review of the whole. Which I could always fashion from the pieces if it looked useful.
Watch for the start of my review of Talend this next week.
Stan: A Bayesian Directed Graphical Model Compiler by Bob Carpenter.
I (Bob) am going to give a talk at the next NYC Machine Learning Meetup on 19 January 2012 at 7 PM:
There’s an abstract on the meetup site. The short story is that Stan’s a directed graphical model compiler (like BUGS) that uses adaptive Hamiltonian Monte Carlo sampling to estimate posterior distributions for Bayesian models.
The official version 1 release is coming up soon, but until then, you can check out our work in progress at:
- Google Code: Stan.
If you are in New York on the 19th of this month, please attend and post a note about the meeting.
Otherwise, play with “Stan” while we await the next release.
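If you want a feel for the sampler at Stan’s core in the meantime, the non-adaptive kernel of Hamiltonian Monte Carlo is short: simulate Hamiltonian dynamics with a leapfrog integrator, then accept or reject. A bare-bones sketch for a one-dimensional standard normal target (Stan itself adds adaptation and much more):

```python
# Bare-bones Hamiltonian Monte Carlo for a 1-D standard normal target.
# Stan's sampler is adaptive and far more sophisticated; this shows
# only the leapfrog-propose / Metropolis-accept kernel.
import numpy as np

def neg_log_p(q):
    return 0.5 * q * q        # -log N(0,1), up to a constant

def grad_neg_log_p(q):
    return q                  # d/dq of the above

def hmc_step(q, rng, step=0.1, n_leapfrog=20):
    p = rng.normal()                                   # fresh momentum
    q_new, p_new = q, p
    p_new -= 0.5 * step * grad_neg_log_p(q_new)        # half momentum step
    for i in range(n_leapfrog):
        q_new += step * p_new                          # full position step
        if i < n_leapfrog - 1:
            p_new -= step * grad_neg_log_p(q_new)      # full momentum step
    p_new -= 0.5 * step * grad_neg_log_p(q_new)        # final half step
    # Metropolis accept/reject on the change in total energy.
    h_old = neg_log_p(q) + 0.5 * p * p
    h_new = neg_log_p(q_new) + 0.5 * p_new * p_new
    return q_new if rng.random() < np.exp(h_old - h_new) else q

rng = np.random.default_rng(0)
samples, q = [], 0.0
for _ in range(5_000):
    q = hmc_step(q, rng)
    samples.append(q)
print(np.mean(samples), np.std(samples))  # ~0 and ~1
```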
The Variation Toolkit by Pierre Lindenbaum.
From the post:
During the last weeks, I’ve worked on an experimental C++ package named The Variation Toolkit (varkit). It was originally designed to provide some command lines equivalent to knime4bio but I’ve added more tools over time. Some of those tools are very simple-and-stupid ( fasta2tsv) , reinvent the wheel (“numericsplit“), are part of an answer to biostar, are some old tools (e.g. bam2wig) that have been moved to this package, but some others like “samplepersnp“, “groupbygene” might be useful to people.
The package is available at : http://code.google.com/p/variationtoolkit/.
See the post for documentation.
Why Free Services Are Undervalued
From the post:
Open source adherents take heed. I stumbled upon an interesting post where blogger Tyler Nichols lamented the way that customers mistreat and inherently devalue free services in the article “I am Done with the Freemium Business Model.”
According to the post, Nichols came to this opinion after creating a free Letter from Santa site over this Christmas holiday. Despite the 1,000,000 page views and 50,000 free Santa letters created, Nichols noticed that his customers refused to follow simple directions and flagged his follow-up thank-you letter as spam.
I didn’t see the FAQ but user requests for help may reflect on the UI design.
I think users need to see “free” software or information (think blogs, ;-)) as previews of what awaits paying customers.
Fractals in Science, Engineering and Finance (Roughness and Beauty) by Benoit B. Mandelbrot.
About the lecture:
Roughness is ubiquitous and a major sensory input of Man. The first step to measure and simulate it was provided by fractal geometry. Illustrative examples will be drawn from the sciences, engineering (the internet) and (more extensively) the variation of financial prices. The beauty of fractals, an unanticipated “premium,” helps in teaching and bridges some chasms between different aspects of knowing and feeling.
Mandelbrot summarizes his career as the pursuit of a theory of roughness.
Discusses the use of the eye as well as the ear in discovery (which I would call identification) of phenomena.
Have you listened to one of your subject identifications lately?
Are subject identifications rough? Or are they the smoothing of roughness?
Do your subjects have self-similarity?
Definitely worth your time.
First seen at: Benoît B. Mandelbrot: Fractals in Science, Engineering and Finance (Roughness and Beauty) over at Computational Legal Studies.
Network Analysis and Law: Introductory Tutorial @ Jurix 2011 Meeting
Slides from a tutorial given by Daniel Martin Katz at the Jurix 2011 meeting.
Runs 317 slides but is awash in links to resources and software.
You will either learn a lot about network analysis or, if you already know network analysis, be entertained and informed.
Saw it referenced at: http://computationallegalstudies.com/
The feedback economy: Companies that employ data feedback loops are poised to dominate their industries. By Alistair Croll.
From the post:
Military strategist John Boyd spent a lot of time understanding how to win battles. Building on his experience as a fighter pilot, he broke down the process of observing and reacting into something called an Observe, Orient, Decide, and Act (OODA) loop. Combat, he realized, consisted of observing your circumstances, orienting yourself to your enemy’s way of thinking and your environment, deciding on a course of action, and then acting on it.
[graphic omitted, but it is interesting. Go to Croll’s post to see it.]
The most important part of this loop isn’t included in the OODA acronym, however. It’s the fact that it’s a loop. The results of earlier actions feed back into later, hopefully wiser, ones. Over time, the fighter “gets inside” their opponent’s loop, outsmarting and outmaneuvering them. The system learns.
Boyd’s genius was to realize that winning requires two things: being able to collect and analyze information better, and being able to act on that information faster, incorporating what’s learned into the next iteration. Today, what Boyd learned in a cockpit applies to nearly everything we do.
Information is important but so is the use of information in the form of feedback.
But all systems, even information systems, generate feedback.
The question is: does your system (read topic map) hear feedback? Perhaps more importantly, does it adapt based upon the feedback it hears?
Katta – Lucene & more in the cloud
From the webpage:
Katta is a scalable, failure tolerant, distributed, data storage for real time access.
Katta serves large, replicated, indices as shards to serve high loads and very large data sets. These indices can be of different type. Currently implementations are available for Lucene and Hadoop mapfiles.
- Makes serving large or high load indices easy
- Serves very large Lucene or Hadoop Mapfile indices as index shards on many servers
- Replicate shards on different servers for performance and fault-tolerance
- Supports pluggable network topologies
- Master fail-over
- Fast, lightweight, easy to integrate
- Plays well with Hadoop clusters
- Apache Version 2 License
Now that the “new” has worn off of your holiday presents, ;-), something to play with over the weekend.
ReadMe: Software for Automated Content Analysis by Daniel Hopkins, Gary King, Matthew Knowles, and Steven Melendez.
From the homepage:
The ReadMe software package for R takes as input a set of text documents (such as speeches, blog posts, newspaper articles, judicial opinions, movie reviews, etc.), a categorization scheme chosen by the user (e.g., ordered positive to negative sentiment ratings, unordered policy topics, or any other mutually exclusive and exhaustive set of categories), and a small subset of text documents hand classified into the given categories. If used properly, ReadMe will report, normally within sampling error of the truth, the proportion of documents within each of the given categories among those not hand coded. ReadMe computes quantities of interest to the scientific community based on the distribution within categories but does so by skipping the more error-prone intermediate step of classifying individual documents. Other procedures are also included to make processing text easy.
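The clever step is worth seeing in miniature. The observed distribution of word-stem profiles in the unlabeled set is a mixture of the per-category profile distributions, so the category proportions can be solved for directly, with no per-document classification. A stripped-down sketch of that step (the real package also handles profile construction, subsampling and uncertainty):

```python
# Miniature version of the estimator behind ReadMe (Hopkins/King):
# P(profile) = P(profile | category) @ P(category), so estimate the
# conditional matrix from hand-coded documents and solve for the
# category proportions. The real package does much more than this.
import numpy as np
from scipy.optimize import nnls

def estimate_proportions(labeled_profiles, labels, unlabeled_profiles, k):
    """Profiles are small integer ids of word-stem patterns; labels 0..k-1."""
    n = max(labeled_profiles.max(), unlabeled_profiles.max()) + 1
    cond = np.zeros((n, k))  # P(profile | category), one column per category
    for c in range(k):
        counts = np.bincount(labeled_profiles[labels == c], minlength=n)
        cond[:, c] = counts / counts.sum()
    target = np.bincount(unlabeled_profiles, minlength=n)
    target = target / target.sum()  # observed P(profile), unlabeled set
    props, _ = nnls(cond, target)   # non-negative least squares
    return props / props.sum()      # renormalize to a distribution
```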
Just in case you tire of hand tagging documents before further processing for feeding into a topic map.
Quite interesting even if it doesn’t address the primary weaknesses in semantic annotation.
Semantic annotation presently aims almost entirely at data that already exists. Rather than ranting at the mountain of legacy data as too complex, large, difficult, etc., to adequately annotate, why not turn our attention to the present day creation of data?
Imagine if every copy of MS™ Word or OpenOffice, for every document produced today, did something as simple as inserting a metadata pointer to a vocabulary for that document. There could even be defaults for all the documents created by particular offices or divisions. Then, when search engines search those documents, they could use the declared vocabularies for search and disambiguation purposes.
ODF 1.2 already has that capacity and one hopes MS™ would follow that lead and use the same technique to avoid creating extra work for search engines.
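As a sketch of how little machinery the pointer itself would take: an ODF document is a zip archive, and ODF 1.2 permits RDF metadata in the package. The vocabulary and predicate URIs below are hypothetical, and a real implementation would also register the new file in the package manifests:

```python
# Sketch: add a vocabulary pointer to an existing ODF document.
# ODF files are zip archives; ODF 1.2 allows RDF metadata in the
# package. The predicate and vocabulary URIs here are hypothetical.
import shutil
import zipfile

VOCAB_RDF = """<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/vocab-declaration#">
  <rdf:Description rdf:about="">
    <ex:usesVocabulary
        rdf:resource="http://example.org/legal-department-vocabulary"/>
  </rdf:Description>
</rdf:RDF>
"""

shutil.copy("report.odt", "report-with-vocab.odt")
with zipfile.ZipFile("report-with-vocab.odt", "a") as odf:
    odf.writestr("vocab.rdf", VOCAB_RDF)
# A real implementation would also register vocab.rdf in
# META-INF/manifest.xml and manifest.rdf; omitted for brevity.
```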
That would not cover all data, and it would not even fully annotate all the data in those documents.
But it would be a start towards creating smarter documents, and creating them at the outset, at the instigation of their authors. The people who cared enough to author them are much better choices to declare their meanings.
As we develop better techniques, such as ReadMe and/or when ROI is present, we can then address legacy data issues.
General Purpose Computer-Assisted Clustering and Conceptualization by Justin Grimmer and Gary King.
Abstract:
We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimize a different objective function. Almost all are well defined, but how to determine before the fact which one, if any, will partition a given set of objects in an “insightful” or “useful” way for a given user is unknown and difficult, if not logically impossible. We develop a metric space of partitions from all existing cluster analysis methods applied to a given data set (along with millions of other solutions we add based on combinations of existing clusterings), and enable a user to explore and interact with it, and quickly reveal or prompt useful or insightful conceptualizations. In addition, although uncommon in unsupervised learning problems, we offer and implement evaluation designs that make our computer-assisted approach vulnerable to being proven suboptimal in specific data types. We demonstrate that our approach facilitates more efficient and insightful discovery of useful information than either expert human coders or many existing fully automated methods.
Despite my misgivings about metric spaces for semantics, the central theme, that clustering (dare I say merging?) cannot be determined in advance of some user viewing the data, makes sense to me. Not every user will want or perhaps even need to do interactive clustering, but I think this theme represents a substantial advance in this area.
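For concreteness, one standard way to turn partitions into a metric space (not necessarily the distance the paper constructs) is the variation of information between two clusterings; a sketch:

```python
# Variation of information: a metric on partitions of the same set.
# VI(X, Y) = H(X) + H(Y) - 2 * I(X; Y). A standard way to build a
# metric space of clusterings, though not necessarily the paper's.
import math
from collections import Counter

def variation_of_information(x, y):
    """x, y: cluster labels for the same n objects."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    entropy = lambda p: -sum(c / n * math.log(c / n) for c in p.values())
    mutual = sum(c / n * math.log((c / n) / ((px[a] / n) * (py[b] / n)))
                 for (a, b), c in pxy.items())
    return entropy(px) + entropy(py) - 2 * mutual

print(variation_of_information([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
print(variation_of_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 2*log(2)
```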
The publication appeared in the Proceeding of the National Academy of Sciences of the United States of America and the authors are from Stanford and Harvard, respectively. Institutions that value technical and scientific brilliance.
I-CHALLENGE 2012 : Linked Data Cup
Dates:
When: Sep 5, 2012 – Sep 7, 2012
Where: Graz, Austria
Submission Deadline: Apr 2, 2012
Notification Due: May 7, 2012
Final Version Due: Jun 4, 2012
From the call for submissions:
The yearly organised Linked Data Cup (formerly Triplification Challenge) awards prizes to the most promising innovation involving linked data. Four different technological topics are addressed: triplification, interlinking, cleansing, and application mash-ups. The Linked Data Cup invites scientists and practitioners to submit novel and innovative (5 star) linked data sets and applications built on linked data technology.
Although more and more data is triplified and published as RDF and linked data, the question arises how to evaluate the usefulness of such approaches. The Linked Data Cup therefore requires all submissions to include a concrete use case and problem statement alongside a solution (triplified data set, interlinking/cleansing approach, linked data application) that showcases the usefulness of linked data. Submissions that can provide measurable benefits of employing linked data over traditional methods are preferred.
Note that the call is not limited to any domain or target group. We accept submissions ranging from value-added business intelligence use cases to scientific networks to the longest tail of information domains. The only strict requirement is that the employment of linked data is very well motivated and also justified (i.e. we rank approaches higher that provide solutions, which could not have been realised without linked data, even if they lack technical or scientific brilliance). (emphasis added)
I don’t know what the submissions are going to look like but the conference organizers should get high marks for academic honesty. I don’t think I have ever seen anyone say:
we rank approaches higher that provide solutions, which could not have been realised without linked data, even if they lack technical or scientific brilliance
We have all seen challenges with qualifying requirements but I don’t recall any that would privilege lesser work because of a greater dependence on a requirement. Or at least that would publicly claim that was the contest policy. Have there been complaints from technically or scientifically brilliant approaches about judging in the past?
I will have to watch the submissions and results to see if technically or scientifically brilliant approaches get passed over in favor of lesser ones. If so, that will be a signal to all first-rate competitors to seek recognition elsewhere.
Sixth IEEE International Conference on Semantic Computing
Dates:
When: Sep 19, 2012 – Sep 21, 2012
Where: Palermo, Italy
Abstract Registration Due: Jul 15, 2012
Submission Deadline: May 4, 2012
Notification Due: Jun 28, 2012
From the call for papers:
Semantic Computing addresses technologies that facilitate the derivation of semantics from content and connecting semantics into knowledge, where “content” may be anything such as video, audio, text, conversation, process, program, device, behavior, etc.
The Sixth IEEE International Conference on Semantic Computing (ICSC2012) continues to foster the growth of a new research community. The conference builds on the success of the past ICSC conferences as an international forum for researchers and practitioners to present research that advances the state of the art and practice of Semantic Computing, as well as identifying emerging research topics and defining the future of the field. The event is located in Palermo, Italy, at the Piazza Borsa hotel. The technical program of ICSC2012 includes workshops, invited keynotes, paper presentations, panel discussions, industrial “show and tells”, demonstrations, and more. Submissions of high-quality papers describing mature results or on-going work are invited.
The main goal of the conference is to foster the dialog between experts in each sub-discipline. Therefore we especially encourage submissions of work that is interesting to multiple areas, such as multimodal approaches.
I have seen there are some slides from prior meetings. I would be curious to know what vocabulary analysis on the papers would show about the differences between experts over time. Are the participants coming closer to using a common vocabulary?