Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 8, 2012

Patent war

Filed under: Graphics,News,Patents,Visualization — Patrick Durusau @ 7:11 pm

Patent war by Nathan Yau.

Nathan points to research and visualizations by the New York Times of the ongoing patent war between Apple and Samsung.

An ideal outcome would be for the principals and their surrogates to be broken by litigation costs and technology patents rendered penny stock value by the litigation.

You can move the system towards that outcome by picking a patent and creating a topic map starting with that patent.

The more data the litigants have, the more they will think they need.

Let’s let them choke on it.

Are You Confused? (About MR2 and YARN?) Help is on the way!

Filed under: Hadoop,Hadoop YARN,MapReduce 2.0 — Patrick Durusau @ 6:51 pm

MR2 and YARN Briefly Explained by Justin Kestelyn.

Justin writes:

With CDH4 onward, the Apache Hadoop component introduced two new terms for Hadoop users to wonder about: MR2 and YARN. Unfortunately, these terms are mixed up so much that many people are confused about them. Do they mean the same thing, or not?

No, but see Justin’s post for the details. (He also points to a longer post with more details.)

Wolfram Data Summit 2012 Presentations [Elves and Hypergraphs = Topic Maps?]

Filed under: Combinatorics,Conferences,Data,Data Mining — Patrick Durusau @ 1:39 pm

Wolfram Data Summit 2012 Presentations

Presentations have been posted from the Wolfram Data Summit 2012:

I looked at:

“The Trouble with House Elves: Computational Folkloristics, Classification, and Hypergraphs” Timothy Tangherlini, Professor, UCLA; James Abello, Research Professor, DIMACS – Rutgers University

first. 😉

I would like to see a video of the presentation. Pointers, anyone?

It is as close as I can imagine to a topic map without using the phrase “topic map.”

Others?

Thursday, September 8

  • Presentation “Who’s Bigger? A Quantitative Analysis of Historical Fame” Steven Skiena, Professor, Stony Brook University
  • Presentation “Academic Data: A Funder’s Perspective” Myron Gutmann, Assistant Director, Social, Behavioral & Economic Sciences, National Science Foundation (NSF)
  • Presentation “Who Owns the Law?” Ed Walters, CEO, Fastcase, Inc.
  • Presentation “An Initiative to Improve Academic and Commercial Data Sharing in Cancer Research” Charles Hugh-Jones, Vice President, Head, Medical Affairs North America, Sanofi
  • Presentation “The Trouble with House Elves: Computational Folkloristics, Classification, and Hypergraphs” Timothy Tangherlini, Professor, UCLA; James Abello, Research Professor, DIMACS – Rutgers University
  • Presentation “Rethinking Digital Research” Kaitlin Thaney, Manager, External Partnerships, Digital Science
  • Presentation “Building and Learning from Social Networks” Chris McConnell, Principal Software Development Lead, Microsoft Research FUSE Labs
  • Presentation “Keeping Repositories in Synchronization: NISO/OAI ResourceSync Project” Todd Carpenter, Executive Director, NISO
  • Presentation “A New, Searchable SDMX Registry of Country-Level Health, Education, and Financial Data” Chris Dickey, Director, Research and Innovations, DevInfo Support Group
  • Presentation “Dryad’s Evolving Proof of Concept and the Metadata Hook” Jane Greenberg, Professor, School of Information and Library Science (SILS), University of North Carolina at Chapel Hill
  • Presentation “How the Associated Press Tabulates and Distributes Votes in US Elections” Brian Scanlon, Director of Election Services, The Associated Press
  • Presentation “How Open Is Open Data?” Ian White, President, Urban Mapping, Inc.
  • Presentation “No More Tablets of Stone: Enabling the User to Weight Our Data and Shape Our Research” Toby Green, Head of Publishing, Organisation for Economic Co-operation and Development (OECD)
  • Presentation “Sharing and Protecting Confidential Data: Real-World Examples” Timothy Mulcahy, Principal Research Scientist, NORC at the University of Chicago
  • Presentation “Language Models That Stimulate Creativity” Matthew Huebert, Programmer/Designer, BrainTripping
  • Presentation “The Analytic Potential of Long-Tail Data: Sharable Data and Reuse Value” Carole Palmer, Center for Informatics Research in Science & Scholarship, University of Illinois at Urbana-Champaign
  • Presentation “Evolution of the Storage Brain—Using History to Predict the Future” Larry Freeman, Senior Technologist, NetApp, Inc.

Friday, September 9

  • Presentation “Devices, Data, and Dollars” John Burbank, President, Strategic Initiatives, The Nielsen Company
  • Presentation “Pulling Structured Data Out of Unstructured” Greg Lindahl, CTO, blekko
  • Presentation “Mining Consumer Data for Insights and Trends” Rohit Chauhan, Group Executive, MasterCard Worldwide
  • Presentation “Data Quality and Customer Behavioral Modeling” Daniel Krasner, Chief Data Scientist, Sailthru/KFit Solutions
  • No presentation available. “Human-Powered Analysis with Crowdsourcing and Visualization” Edwin Chen, Data Scientist, Twitter
  • Presentation “Leveraging Social Media Data as Real-Time Indicators of X” Maria Singson, Vice President, Country and Industry Research & Forecasting, IHS; Chris Hansen, Director, IHS; Dan Bergstresser, Chief Economist, Janys Analytics
  • No presentation available. “Visualizations in Yelp” Jim Blomo, Engineering Manager, Data-Mining, Yelp
  • Presentation “The Digital Footprints of Human Activity” Stanislav Sobolevsky, MIT SENSEable City Lab
  • Presentation “Unleash Your Research: The Wolfram Data Repository” Matthew Day, Manager, Data Repository, Wolfram Alpha LLC
  • Presentation “Quantifying Online Discussion: Unexpected Conclusions from Mass Participation” Sascha Mombartz, Creative Director, Urtak
  • Presentation “Statistical Physics for Non-physicists: Obesity Spreading and Information Flow in Society” Hernán Makse, Professor, City College of New York
  • Presentation “Neuroscience Data: Past, Present, and Future” Chinh Dang, CTO, Allen Institute for Brain Science
  • Presentation “Finding Hidden Structure in Complex Networks” Yong-Yeol Ahn, Assistant Professor, Indiana University Bloomington
  • Presentation “Data Challenges in Health Monitoring and Diagnostics” Anthony Smart, Chief Science Officer, Scanadu
  • No presentation available. “Datascience Automation with Wolfram|Alpha Pro” Taliesin Beynon, Manager and Development Lead, Wolfram Alpha LLC
  • Presentation “How Data Science, the Web, and Linked Data Are Changing Medicine” Joanne Luciano, Research Associate Professor, Rensselaer Polytechnic Institute
  • Presentation “Unstructured Data and the Role of Natural Language Processing” Philip Resnik, Professor, University of Maryland
  • Presentation “A Framework for Measuring Social Quality of Content Based on User Behavior” Nanda Kishore, CTO, ShareThis, Inc.
  • Presentation “The Science of Social Data” Hilary Mason, Chief Scientist, bitly
  • Presentation “Big Data for Small Languages” Laura Welcher, Director of Operations, The Rosetta Project
  • Presentation “Moving from Information to Insight” Anthony Scriffignano, Senior Vice President, Worldwide Data & Insight, Dun and Bradstreet

PS: I saw this in Christophe Lalanne’s A bag of tweets / September 2012 and reformatted the page to make it easier to consult.

Visualizing.org

Filed under: Graphics,Visualization — Patrick Durusau @ 12:38 pm

Visualizing.org

From the about page:

Visualizing.org is a community of creative people making sense of complex issues through data and design… and a shared space and free resource to help you achieve this goal.

Definitely “a community of creative people.”

Whether sense is being made, I leave for you to decide. 😉

If you appreciate or produce visualizations, you will enjoy this site.

I discovered this by following links from Christophe Lalanne’s A bag of tweets / September 2012. The specific visualizations were impressive, but the home site even more so.

Are Expert Semantic Rules so 1980s?

In The Geometry of Constrained Structured Prediction: Applications to Inference and Learning of Natural Language Syntax, André Martins proposes advances in inference and learning for natural language processing. It is important work for that reason alone.

But in his introduction to recent (and rapid) progress in language technologies, the following text caught my eye:

So, what is the driving force behind the aforementioned progress? Essentially, it is the alliance of two important factors: the massive amount of data that became available with the advent of the Web, and the success of machine learning techniques to extract statistical models from the data (Mitchell, 1997; Manning and Schütze, 1999; Schölkopf and Smola, 2002; Bishop, 2006; Smith, 2011). As a consequence, a new paradigm has emerged in the last couple of decades, which directs attention to the data itself, as opposed to the explicit representation of knowledge (Abney, 1996; Pereira, 2000; Halevy et al., 2009). This data-centric paradigm has been extremely fruitful in natural language processing (NLP), and came to replace the classic knowledge representation methodology which was prevalent until the 1980s, based on symbolic rules written by experts. (emphasis added)

Are RDF, Linked Data, topic maps, and other semantic technologies caught in a 1980s “symbolic rules” paradigm?

Are we ready to make the same break that NLP did, what, thirty years ago now?
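To make the contrast concrete, here is a toy sketch of the data-centric approach in Python with scikit-learn (my choice of tools, not André’s): no expert-written symbolic rules, just a statistical model induced from labeled examples. The texts and labels are invented for illustration.

```python
# A minimal sketch of the data-centric paradigm: no hand-written symbolic
# rules, just a statistical model induced from (toy) labeled examples.
# Assumes scikit-learn; the texts and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the stock market fell sharply today",
    "investors worry about quarterly earnings",
    "the team won the championship game",
    "the striker scored twice in the final",
]
train_labels = ["finance", "finance", "sports", "sports"]

# The "knowledge" lives in the learned feature weights, not in expert rules.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["earnings fell after the game"]))
```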

To get started on the literature, consider André’s sources:

Abney, S. (1996). Statistical methods and linguistics. In The balancing act: Combining symbolic and statistical approaches to language, pages 1–26. MIT Press, Cambridge, MA.

A more complete citation: Steven Abney. Statistical Methods and Linguistics. In: Judith Klavans and Philip Resnik (eds.), The Balancing Act: Combining Symbolic and Statistical Approaches to Language. The MIT Press, Cambridge, MA. 1996. (Link is to PDF of Abney’s paper.)

Pereira, F. (2000). Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769):1239–1253.

I added a pointer to the Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences abstract for the article. You can see it at: Formal grammar and information theory: together again? (PDF file).

Halevy, A., Norvig, P., and Pereira, F. (2009). The unreasonable effectiveness of data. Intelligent Systems, IEEE, 24(2):8–12.

I added a pointer to the Intelligent Systems, IEEE abstract for the article. You can see it at: The unreasonable effectiveness of data (PDF file).

The Halevy article doesn’t have an abstract per se but the ACM reports one as:

Problems that involve interacting with humans, such as natural language understanding, have not proven to be solvable by concise, neat formulas like F = ma. Instead, the best approach appears to be to embrace the complexity of the domain and address it by harnessing the power of data: if other humans engage in the tasks and generate large amounts of unlabeled, noisy data, new algorithms can be used to build high-quality models from the data. [ACM]

That sounds like a challenge to me. You?

PS: I saw the pointer to this thesis at Christophe Lalanne’s A bag of tweets / September 2012.

Layout-aware text extraction from full-text PDF of scientific articles

Filed under: PDF,Text Extraction,Text Mining — Patrick Durusau @ 9:24 am

Layout-aware text extraction from full-text PDF of scientific articles by Cartic Ramakrishnan, Abhishek Patnia, Eduard Hovy and Gully APC Burns. (Source Code for Biology and Medicine 2012, 7:7 doi:10.1186/1751-0473-7-7)

Abstract:

Background

The Portable Document Format (PDF) is the most commonly used file format for online scientific publications. The absence of effective means to extract text from these PDF files in a layout-aware manner presents a significant challenge for developers of biomedical text mining or biocuration informatics systems that use published literature as an information source. In this paper we introduce the ‘Layout-Aware PDF Text Extraction’ (LA-PDFText) system to facilitate accurate extraction of text from PDF files of research articles for use in text mining applications.

Results

Our paper describes the construction and performance of an open source system that extracts text blocks from PDF-formatted full-text research articles and classifies them into logical units based on rules that characterize specific sections. The LA-PDFText system focuses only on the textual content of the research articles and is meant as a baseline for further experiments into more advanced extraction methods that handle multi-modal content, such as images and graphs. The system works in a three-stage process: (1) Detecting contiguous text blocks using spatial layout processing to locate and identify blocks of contiguous text, (2) Classifying text blocks into rhetorical categories using a rule-based method and (3) Stitching classified text blocks together in the correct order resulting in the extraction of text from section-wise grouped blocks. We show that our system can identify text blocks and classify them into rhetorical categories with Precision = 0.96, Recall = 0.89, and F1 = 0.91. We also present an evaluation of the accuracy of the block detection algorithm used in step 2. Additionally, we have compared the accuracy of the text extracted by LA-PDFText to the text from the Open Access subset of PubMed Central. We then compared this accuracy with that of the text extracted by the PDF2Text system, commonly used to extract text from PDF. Finally, we discuss preliminary error analysis for our system and identify further areas of improvement.

Conclusions

LA-PDFText is an open-source tool for accurately extracting text from full-text scientific articles. The release of the system is available at http://code.google.com/p/lapdftext/.
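LA-PDFText itself is a Java system. As a rough, Python-flavored illustration of what stage 1 (contiguous text block detection) involves, here is a minimal sketch using pdfminer.six as a stand-in, not the authors’ code; the file name is a placeholder.

```python
# Rough illustration of layout-aware block detection (stage 1 of the
# LA-PDFText pipeline) using pdfminer.six -- a substitute tool, not the
# system described in the paper. "article.pdf" is a placeholder path.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

blocks = []
for page_no, page in enumerate(extract_pages("article.pdf"), start=1):
    for element in page:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox  # spatial layout of the block
            blocks.append((page_no, (x0, y0, x1, y1), element.get_text().strip()))

# A later stage could classify blocks (title, abstract, methods, ...) by
# rules over position, font size, and keywords, then stitch them in order.
for page_no, bbox, text in blocks[:5]:
    print(page_no, bbox, text[:60])
```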

Scanning TOCs from a variety of areas can uncover goodies like this one.

What is the most recent “unexpected” paper or result outside your “field” that you have found?

Peer2ref: A new online tool for locating peer reviewers

Filed under: Peer Review,Searching — Patrick Durusau @ 9:06 am

Peer2ref: A new online tool for locating peer reviewers by Jack Cochrane.

From the post:

"Findings published in the peer reviewed journal BioData Mining…" A sentence like this instantly adds credibility to a scientific article. But it isn’t simply the name of a prestigious journal that assures readers of an article’s validity; it’s the knowledge that the research has been peer reviewed.

Peer review, the process by which scientists critically evaluate their colleagues’ methods and findings, has been essential to scientific discourse for centuries. In those early days of scientific research, with fewer journals and lower levels of specialization, scientists found it relatively easy to devote their time to assessing new findings. However as the pace of research has expanded, so too has the number of articles and the number of journals set up to publish them. Scientists, already faced with increasingly full to-do-lists, have struggled to keep up.

Exacerbating this problem is the specialization of many articles, which now come from increasingly narrow fields of research. This expansion of the body of scientific knowledge and the resulting compartmentalization of many research fields means that locating qualified peer reviewers can be a major challenge.

Jack points to software developed by Miguel A Andrade-Navarro et al that can help solve the problem of finding peer reviewers.

From his description of the software:

This allows users to search for authors and editors in specific fields using keywords related to the subject of an article, making Peer2ref highly effective at finding experts in narrow fields of research.

Does “narrow field of research” sound appropriate for a focused topic map effort?

Identifying the experts in an area would be a good first step.

I first saw this in Christophe Lalanne’s A bag of tweets / September 2012.

October 7, 2012

The Forgotten Mapmaker: Nokia… [Lessons for Semantic Map Making?]

Filed under: Mapping,Maps,Semantics — Patrick Durusau @ 7:57 pm

The Forgotten Mapmaker: Nokia Has Better Maps Than Apple and Maybe Even Google by Alexis C. Madrigal.

What’s Nokia’s secret? Twelve billion probe data points a month, including data from FedEx and other logistics companies.

Notice that the logistics companies are not collecting mapping data per se; they are delivering goods.

Nokia is building maps from data collected for another purpose, one completely routine and unrelated to map making.

Does that suggest something to you about semantic map making?

That we need to capture semantics as users travel through data for other purposes?

If I knew what those opportunities were I would have put them at the top of this post. Suggestions?

PS: Sam Hunting pointed me towards this article.

Gangnam Style Hadoop Learning

Filed under: Hadoop,Humor — Patrick Durusau @ 7:43 pm

Gangnam Style Hadoop Learning

Err, you will just have to see this one. It… defies description.

Not management appropriate, too many words. That would lead to questions.

Let’s start the week by avoiding management questions because of too many words in a video.

Broken Telephone Game of Defining Software and UI Requirements [And Semantics]

Filed under: Project Management,Requirements,Semantics — Patrick Durusau @ 7:33 pm

The Broken Telephone Game of Defining Software and UI Requirements by Martin Crisp.

Martin is writing in a UI context but the lesson he teaches is equally applicable to any part of software/project management. (Even U.S. federal government big data projects.)

His counsel is not one of despair; he outlines solutions that can lessen the impact of the broken telephone game.

But it is up to you to recognize the game that is afoot and to react accordingly.

From the post:

The broken telephone game is played all over the world. In it, according to Wikipedia, “one person whispers a message to another, which is passed through a line of people until the last player announces the message to the entire group. Errors typically accumulate in the retellings, so the statement announced by the last player differs significantly, and often amusingly, from the one uttered by the first.”

This game is also played inadvertently by a large number of organizations seeking to define software and UI requirements, using information passed from customers, to business analysts, to UI/UX designers, to developers and testers.

Here’s a typical example:

  • The BA or product owner elicits requirements from a customer and writes them down, often as a feature list and use cases.
  • The use cases are interpreted by the UI/UX team to develop UI mockups and storyboards.
  • Testing interprets the storyboards, mockups, and use cases to develop test cases.
  • Also, the developers will try to interpret the use cases, mockups, and storyboards to actually write the code.

As with broken telephone, at each handoff of information the original content is altered. The resulting approach includes a lot of re-work and escalating project costs due to combinations of the following:

  • Use cases don’t properly represent customer requirements.
  • UI/UX design is not consistent with the use cases.
  • Incorrect test cases create false bugs.
  • Missed test cases result in undiscovered bugs.
  • Developers build features that don’t meet customer needs.

The further down the broken telephone line the original requirements get, the more distorted they become. For this reason, UI storyboards, test cases, and code typically require a lot of reworking as requirements are misunderstood or improperly translated by the time they get to the UI and testing teams.

Emacs Rocks!

Filed under: Editor — Patrick Durusau @ 7:23 pm

Emacs Rocks!

A series of very short screencasts (the first one I saw was 1 minute, 54 seconds) on using Emacs.

A communication model to be emulated!

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012.

Parallel Coordinates [D3]

Filed under: D3,Graphics,Visualization — Patrick Durusau @ 6:11 pm

Parallel Coordinates [D3]

From the webpage:

A visual toolkit for multidimensional detectives.

Read the paper linked at “multidimensional detectives.”

What is at stake is visual exploration of data.

Are you the sort of person who sees patterns before others? Or patterns others miss even after you point them out?

You could have a profitable career in visual exploration of data. Seriously.

I first saw this at Christophe Lalanne’s A bag of tweets / September 2012.
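The toolkit at the link is built on D3. If you want to try the idea without leaving Python, here is a minimal sketch with pandas and matplotlib (my substitution, not the project’s code); the data frame is invented.

```python
# Minimal parallel coordinates sketch with pandas/matplotlib -- a substitute
# for the D3 toolkit at the link, not its code. The data frame is made up.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

df = pd.DataFrame({
    "mpg":   [21.0, 22.8, 18.7, 14.3, 32.4],
    "cyl":   [6, 4, 8, 8, 4],
    "hp":    [110, 93, 175, 245, 66],
    "wt":    [2.62, 2.32, 3.44, 3.57, 2.20],
    "class": ["mid", "small", "large", "large", "small"],
})

# One polyline per row, one vertical axis per column -- patterns show up as
# bundles of lines that move together across the axes.
parallel_coordinates(df, "class", colormap="viridis")
plt.show()
```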

Processing – 2nd Edition Beta

Filed under: Graphics,Processing,Visualization — Patrick Durusau @ 4:43 pm

Processing – 2nd Edition Beta

You know when you read:

THE [*******] SOFTWARE IS PROVIDED TO YOU “AS IS,” AND WE MAKE NO EXPRESS OR IMPLIED WARRANTIES WHATSOEVER WITH RESPECT TO ITS FUNCTIONALITY, OPERABILITY, OR USE, INCLUDING, WITHOUT LIMITATION, ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR INFRINGEMENT. WE EXPRESSLY DISCLAIM ANY LIABILITY WHATSOEVER FOR ANY DIRECT, INDIRECT, CONSEQUENTIAL, INCIDENTAL OR SPECIAL DAMAGES, INCLUDING, WITHOUT LIMITATION, LOST REVENUES, LOST PROFITS, LOSSES RESULTING FROM BUSINESS INTERRUPTION OR LOSS OF DATA, REGARDLESS OF THE FORM OF ACTION OR LEGAL THEORY UNDER WHICH THE LIABILITY MAY BE ASSERTED, EVEN IF ADVISED OF THE POSSIBILITY OR LIKELIHOOD OF SUCH DAMAGES.

By downloading the software from this page, you agree to the specified terms.

you are:

  • Downloading non-commercial beta software, or
  • Purchasing commercial release software.

Isn’t that odd? You would think the terms would be different.

The only difference from a commercial warranty is the assurance that they will come "get you" if you copy their software.

If you don’t know Processing:

Processing is an open source programming language and environment for people who want to create images, animations, and interactions. Initially developed to serve as a software sketchbook and to teach fundamentals of computer programming within a visual context, Processing also has evolved into a tool for generating finished professional work. Today, there are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning, prototyping, and production.

I first saw the notice of this release on Christophe Lalanne’s A bag of tweets / September 2012.

Stan (Bayesian Inference) [update]

Filed under: Bayesian Data Analysis,Bayesian Models,Statistics — Patrick Durusau @ 4:17 pm

Stan

From the webpage:

Stan is an open-source package for obtaining Bayesian inference using the No-U-Turn sampler, a variant of Hamiltonian Monte Carlo.

I first reported on a presentation: Stan: A (Bayesian) Directed Graphical Model Compiler last January when Stan was unreleased.

Following a link from Christophe Lalanne’s A bag of tweets / September 2012, I find the released version of the software!

Very cool!
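To give a feel for what Hamiltonian Monte Carlo does, here is a bare-bones sampler for a standard normal target in Python/NumPy. This is not Stan’s code, and Stan’s No-U-Turn sampler adds adaptive trajectory lengths on top of this; it is just a sketch of the underlying machinery.

```python
# Bare-bones Hamiltonian Monte Carlo for a standard normal target -- a toy
# illustration of the machinery Stan builds on, NOT Stan's code. NUTS adds
# adaptive path lengths and step sizes on top of the basic scheme below.
import numpy as np

def grad_neg_log_p(q):          # for N(0,1): U(q) = q^2/2, so dU/dq = q
    return q

def hmc_step(q, eps=0.1, L=20, rng=np.random.default_rng()):
    p = rng.normal()            # resample momentum
    q_new, p_new = q, p
    for _ in range(L):          # leapfrog integration of Hamiltonian dynamics
        p_new -= 0.5 * eps * grad_neg_log_p(q_new)
        q_new += eps * p_new
        p_new -= 0.5 * eps * grad_neg_log_p(q_new)
    # Metropolis accept/reject on the joint (position, momentum) energy
    h_old = 0.5 * q ** 2 + 0.5 * p ** 2
    h_new = 0.5 * q_new ** 2 + 0.5 * p_new ** 2
    return q_new if rng.random() < np.exp(h_old - h_new) else q

q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
print(np.mean(samples), np.std(samples))   # should be near 0 and 1
```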

Revisiting “Ranking the popularity of programming languages”: creating tiers

Filed under: Data Mining,Graphics,Statistics,Visualization — Patrick Durusau @ 4:05 pm

Revisiting “Ranking the popularity of programming languages”: creating tiers by Drew Conway.

From the post:

In a post on dataists almost two years ago, John Myles White and I posed the question: “How would you rank the popularity of a programming language?”.

From the original post:

One way to do so is to count the number of projects using each language, and rank those with the most projects as being the most popular. Another might be to measure the size of a language’s “community,” and use that as a proxy for its popularity. Each has their advantages and disadvantages. Counting the number of projects is perhaps the “purest” measure of a language’s popularity, but it may overweight languages based on their legacy or use in production systems. Likewise, measuring community size can provide insight into the breadth of applications for a language, but it can be difficult to distinguish among languages with a vocal minority versus those that actually have large communities.

So, we spent an evening at Princeton hacking around on Github and StackOverflow to get data on the number of projects and questions tagged, per programming language, respectively. The result was a scatter plot showing the linear relationship between these two measures. As with any post comparing programming languages, it was great bait for the Internet masses to poke holes in, and since then Stephen O’Grady at Redmonk has been re-running the analysis to show changes in the relative position of languages over time.

Today I am giving a talk at Monktoberfest on the importance of pursuing good questions in data science. As an example, I wanted to revisit the problem of ranking programming languages. For a long time I have been unsatisfied with the outcome of the original post, because the chart does not really address the original question about ranking.

I would not downplay the importance of Drew’s descriptive analysis.

Until you can describe something, it is really difficult to explain it. 😉
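Drew’s post has the real analysis. As a toy illustration of one way to go from two popularity measures to tiers (my construction, not his, with invented counts), something like this:

```python
# Toy sketch of turning two popularity measures (GitHub projects and Stack
# Overflow questions) into tiers -- my construction for illustration, not
# Drew Conway's actual method, and the counts below are invented.
import numpy as np

langs = {            # language: (github_projects, stackoverflow_questions)
    "JavaScript": (320000, 450000),
    "Python":     (250000, 300000),
    "Haskell":    (22000, 12000),
    "COBOL":      (900, 1500),
}

names = list(langs)
scores = np.array([np.log10(g) + np.log10(s) for g, s in langs.values()])

# Cut the combined log score into three equal-width tiers.
edges = np.linspace(scores.min(), scores.max(), 4)
tiers = np.digitize(scores, edges[1:-1])   # 0 = bottom tier, 2 = top tier

for name, score, tier in sorted(zip(names, scores, tiers), key=lambda t: -t[1]):
    print(f"{name:12s} score={score:5.2f} tier={tier}")
```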

Bigger Data and Smarter Scaling: Tickets Now Available [NY – 17th of Oct]

Filed under: BigData,News — Patrick Durusau @ 4:49 am

Bigger Data and Smarter Scaling: Tickets Now Available by Marci Windsheimer.

The New York Times, 15th floor
620 Eighth Avenue
New York, 10018

Wednesday, October 17, 2012 from 7:00 PM to 10:00 PM (EDT)

Description:

The third TimesOpen event of 2012 introduces us to the age of more. More data, more users, more everything. We’ll take a look at what’s being done with this wealth of information and how sites and apps are handling unprecedented volume.

The Major League Baseball schedule isn’t set for the 17th of October but you can always record the game if it conflicts with the meeting. 😉

New Congressional Data Available for Free Bulk Download: Bill Data 1973- , Members 1789-

Filed under: Government,Government Data,Law - Sources,Legal Informatics — Patrick Durusau @ 4:28 am

New Congressional Data Available for Free Bulk Download: Bill Data 1973- , Members 1789-

Via Legal Informatics, news that congressional bill data (1973– ) and member data (1789– ) are now available for free bulk download.

Of interest if you like U.S. history and/or recent events.

What other data would you combine with the data you find here?

October 6, 2012

It takes time: A remarkable example of delayed recognition

Filed under: Marketing,Peirce,Statistics — Patrick Durusau @ 6:27 pm

It takes time: A remarkable example of delayed recognition by Ben Van Calster. (Van Calster, B. (2012), It takes time: A remarkable example of delayed recognition. J. Am. Soc. Inf. Sci.. doi: 10.1002/asi.22732)

Abstract:

The way in which scientific publications are picked up by the research community can vary. Some articles become instantly cited, whereas others go unnoticed for some time before they are discovered or rediscovered. Papers with delayed recognition have also been labeled “sleeping beauties.” I briefly discuss an extreme case of a sleeping beauty. Peirce’s short note in Science in 1884 shows a remarkable increase in citations since around 2000. The note received less than 1 citation per year in the decades prior to 2000, 3.5 citations per year in the 2000s, and 10.4 in the 2010s. This increase was seen in several domains, most notably meteorology, medical prediction research, and economics. The paper outlines formulas to evaluate a binary prediction system for a binary outcome. This citation increase in various domains may be attributed to a widespread, growing research focus on mathematical prediction systems and the evaluation thereof. Several recently suggested evaluation measures essentially reinvented or extended Peirce’s 120-year-old ideas.

I would call your attention to the last line of the abstract:

Several recently suggested evaluation measures essentially reinvented or extended Peirce’s 120-year-old ideas.

I take that to mean that, with better curation of ideas, perhaps we would invent different ideas?

The paper ends:

To conclude, the simple ideas presented in Peirce’s note have been reinvented and rediscovered several decades or even more than a century later. It is fascinating that we arrive at ideas presented more than a century ago, and that Peirce’s ideas on the evaluation of predictions have come to the surface regularly across time and discipline. A saying, attributed to Ivan Pavlov, goes: “If you want new ideas, read old books.”

What old books are you going to read this weekend?

PS: Just curious. What search terms would you use, other than the author’s name and article title, to ensure that you could find this article again? What about information across the various fields cited in the article to find related information?
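For the curious: the evaluation idea in Peirce’s 1884 note is what forecasters now call the Peirce (or Hanssen–Kuipers) skill score, hit rate minus false alarm rate. A minimal sketch, with invented counts:

```python
# The evaluation idea from Peirce's 1884 note as it is usually stated today:
# the Peirce (Hanssen-Kuipers) skill score for a binary forecast of a binary
# outcome is hit rate minus false alarm rate. The counts below are invented.
def peirce_skill_score(hits, misses, false_alarms, correct_negatives):
    hit_rate = hits / (hits + misses)
    false_alarm_rate = false_alarms / (false_alarms + correct_negatives)
    return hit_rate - false_alarm_rate

# 1.0 = perfect discrimination, 0.0 = no better than chance.
print(peirce_skill_score(hits=42, misses=8, false_alarms=15, correct_negatives=85))
```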

mol2chemfig, a tool for rendering chemical structures… [Hard Copy Delivery of Topic Map Content]

Filed under: Cheminformatics,Graphics,Visualization — Patrick Durusau @ 5:02 pm

mol2chemfig, a tool for rendering chemical structures from molfile or SMILES format to LaTeX code by Eric K Brefo-Mensah and Michael Palmer.

Abstract:

Displaying chemical structures in LaTeX documents currently requires either hand-coding of the structures using one of several LaTeX packages, or the inclusion of finished graphics files produced with an external drawing program. There is currently no software tool available to render the large number of structures available in molfile or SMILES format to LaTeX source code. We here present mol2chemfig, a Python program that provides this capability. Its output is written in the syntax defined by the chemfig TeX package, which allows for the flexible and concise description of chemical structures and reaction mechanisms. The program is freely available both through a web interface and for local installation on the user’s computer. The code and accompanying documentation can be found at http://chimpsky.uwaterloo.ca/mol2chemfig.

Is there a presumption that topic map delivery systems are limited to computers?

Or that components in topic map interfaces have to snap, crackle or pop with every mouse-over?

While computers enable scalable topic map processing, processing should not be confused with delivery of topic map content.

If you are delivering information about chemical structures from a topic map into hard copy, you are likely to find this a useful tool.

Information is Beautiful Awards – The Results Are In!

Filed under: Graphics,Visualization — Patrick Durusau @ 4:20 pm

Information is Beautiful Awards – The Results Are In! by David McCandless.

From the post:

Last night, at a packed party venue in London, we announced the winners of the inaugural Information is Beautiful Awards. Thank you to all our amazing judges, supporters, staff and our ever generous sponsors Kantar. And the biggest high-five to the 1000+ talented people who courageously entered their work.

Apologies for not seeing this sooner (was posted on 28 September 2012).

I don’t pretend to have any graphic talent at all but I do appreciate well done visualizations.

Be forewarned, you can lose some serious time enjoying these awards! (You may learn something in the process. I suspect I have.)

Forbes: “Tokutek Makes Big Data Dance”

Filed under: BigData,Fractal Trees,MariaDB,MySQL,TokuDB,Tokutek — Patrick Durusau @ 4:04 pm

Forbes: “Tokutek Makes Big Data Dance” by Lawrence Schwartz.

From the post:

Recently, our CEO, John Partridge had a chance to talk about novel database technologies for “Big Data” with Peter Cohan of Forbes.

According to the article, “Fractal Tree indexing is helping organizations analyze big data more efficiently due to its ability to improve database efficiency thanks to faster ‘database insertion speed, quicker input/output performance, operational agility, and data compression.’” As a start-up based on “the first algorithm-based breakthrough in the database world in 40 years,” Tokutek is following in the footsteps of firms such as Google and RSA, which also relied on novel algorithm advances as core to their technology.

To read the full article, and to see how Tokutek is helping companies tackle big data, see here.

I would ignore Peter Cohan’s mistakes about the nature of credit card processing. You don’t wait for the “ok” on your account balance.

Remember What if all transactions required strict global consistency? by Matthew Aslett of the 451 Group? Eventual consistency works right now.

I would have picked “hot schema” changes as a feature to highlight but that might not play as well with a business audience.

Webinar: Introduction to TokuDB v6.5 (Oct. 10, 2012)

Filed under: Fractal Trees,Indexing,MariaDB,MySQL,TokuDB — Patrick Durusau @ 3:37 pm

Webinar: Introduction to TokuDB v6.5

From the post:

TokuDB® is a proven solution that scales MySQL® and MariaDB® from GBs to TBs with unmatched insert and query speed, compression, replication performance and online schema flexibility. Tokutek’s recently launched TokuDB v6.5 delivers all of these features and more, not just for HDDs, but also for flash memory.

Date: October 10th
Time: 2 PM EST / 11 AM PST
REGISTER TODAY

TokuDB v6.5:

  • Stores 10x More Data – TokuDB delivers 10x compression without any performance degradation. Users can therefore take advantage of much greater amounts of available space without paying more for additional storage.
  • Delivers High Insertion Speed – TokuDB Fractal Tree® indexes continue to change the game with huge insertion rates and greater scalability. Our latest release delivers an order of magnitude faster insertion performance than the competition, ideal for applications that must simultaneously query and update large volumes of rapidly arriving data (e.g., clickstream analytics).
  • Allows Hot Schema Changes — Hot column addition/deletion/rename/resize provides the ability to add/drop/change a column to a database without taking the database offline, enabling database administrators to redefine or add new fields with no downtime.
  • Extends Wear Life for Flash – TokuDB’s proprietary Fractal Tree indexing writes fewer, larger blocks which reduces overall wear, and more efficiently utilizes the FTL (Flash Translation Layer). This extends the life of flash memory by an order of magnitude for many applications.

This webinar covers TokuDB features, latest performance results, and typical use cases.

You have seen the posts about fractal indexing! Now see the demos!
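If you are wondering what a “hot” column addition looks like from the client side, here is a minimal sketch using mysql-connector-python. The connection details and table are placeholders, and whether the change really happens online is up to the storage engine, not the client.

```python
# Minimal sketch of a "hot" column addition from the client side, using
# mysql-connector-python. The connection details and table name are
# placeholders; whether the change is applied online depends on the storage
# engine (TokuDB advertises this as a no-downtime operation).
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="clicks"
)
cur = conn.cursor()

# Ordinary DDL -- with TokuDB the claim is that readers and writers keep
# running while the column is added.
cur.execute("ALTER TABLE events ADD COLUMN referrer VARCHAR(255) NULL")
conn.commit()

cur.close()
conn.close()
```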

Applying Parallel Prediction to Big Data

Filed under: Hadoop,Mahout,Oracle,Pig,Weather Data,Weka — Patrick Durusau @ 3:20 pm

Applying Parallel Prediction to Big Data by Dan McClary (Principal Product Manager for Big Data and Hadoop at Oracle).

From the post:

One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.

Playing Weatherman

I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.

For the problem at hand, that may be overkill and we can get it solved in an easier way, without understanding Mahout. Something becomes apparent on thinking about the problem: I don’t want my climate model for San Francisco to include the weather data from Providence, RI. Weather is a local problem and we want to model it locally. Therefore what we need is many models across different subsets of data. For the purpose of example, I’d like to model the weather on a state-by-state basis. But if I have to build 50 models sequentially, tomorrow’s weather will have happened before I’ve got a national forecast. Fortunately, this is an area where Pig shines.

Two quick observations:

First, Dan makes my point about your needing the “right” data, which may or may not be the same thing as “big data.” Decide what you want to do before you reach for big iron and data.

Second, I never hear references to the “weatherman” without remembering: “you don’t need to be a weatherman to know which way the wind blows.” (link to the manifesto) If you prefer a softer version, Subterranean Homesick Blues by Bob Dylan.
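The heart of Dan’s approach is simple: group observations by state and fit a small model per group, letting Pig run the groups in parallel through a Jython UDF. Here is the same idea shown serially in plain Python, with invented data, just to make the shape of the computation obvious; it is a sketch, not Dan’s code.

```python
# The core of the approach: group observations by state and fit a small
# model per group. In Dan's setup the grouping is done by Pig (GROUP ... BY
# state) and a Jython UDF like fit_one() runs on each group in parallel;
# here it is shown serially on made-up data for illustration.
from collections import defaultdict

def fit_one(rows):
    """Toy 'model': mean temperature per state (stand-in for a real learner)."""
    temps = [t for _, t in rows]
    return sum(temps) / len(temps)

observations = [            # (state, temperature) -- invented sample data
    ("CA", 18.0), ("CA", 21.5), ("RI", 9.0), ("RI", 7.5), ("KS", 14.0),
]

by_state = defaultdict(list)
for state, temp in observations:
    by_state[state].append((state, temp))

models = {state: fit_one(rows) for state, rows in by_state.items()}
print(models)
```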

Federal Government Big Data Potential and Realities

Filed under: BigData,Government,Government Data — Patrick Durusau @ 3:01 pm

Federal Government Big Data Potential and Realities (Information Week)

From the post:

Big data has enormous potential in the government sector, though little in the way of uptake and strategy at this point, according to a new report from tech industry advocacy non-profit TechAmerica Foundation.

Leaders of TechAmerica’s Federal Big Data Commission on Wednesday unveiled “Demystifying Big Data: A Practical Guide to Transforming the Business of Government.” The 39-page report provides big data basics like definitions and IT options, as well as potentials for deeper data value and government policy talks. Rife in strategy and pointers more than hard numbers on the impact of existing government data initiatives, the report pointed to big data’s “potential to transform government and society itself” by way of cues from successful data-driven private sector enterprises.

“Unfortunately, in the federal government, daily practice frequently undermines official policies that encourage sharing of information both within and among agencies and with citizens. Furthermore, decision-making by leaders in Congress and the Administration often is accomplished without the benefit of key information and without using the power of Big Data to model possible futures, make predictions, and fundamentally connect the ever increasing myriad of dots and data available,” the report’s authors wrote.

…(a while later)

The report recommended a five-step path to moving ahead with big data initiatives:

  1. Define the big data business opportunity.
  2. Assess existing and needed technical capabilities.
  3. Select a deployment pattern based on the velocity, volume and variety of data involved.
  4. Deploy the program “with an eye toward flexibility and expansion.”
  5. Review program against ROI, government policy and user needs.

Demystifying Big Data: A Practical Guide to Transforming the Business of Government (report, PDF file). TechAmerica Foundation Big Data Commission (homepage).

The report is well worth your time but I would be cautious about the assumption that all data problems are “big data” problems.

My pre-big data strategy steps would be:

  1. Define the agency mission.
  2. Define the tasks necessary to accomplish #1.
  3. Define the role of data processing, any data processing, in meeting the tasks specified in #2.
  4. Evaluate the relevance of “big data” to the data processing defined in #3. (this is the equivalent of #1 in the commission report)

Unspecified notions about an agency’s mission, the tasks to accomplish it, the relevance of data processing to those tasks, and finally the relevance of “big data” will result in disappointing and dysfunctional “Big Data” projects.

“Big data,” its potential, the needs of government, and its citizens, however urgent, are not reasons to abandon traditional precepts of project management.

Deciding on a solution, read “big data techniques,” before you understand and agree upon the problem to be solved, is a classic mistake.

Let’s not make it, again.

Perseus Gives Big Humanities Data Wings

Filed under: Humanities,Marketing,Semantics — Patrick Durusau @ 1:23 pm

Perseus Gives Big Humanities Data Wings by Ian Armas Foster.

From the post:

“How do we think about the human record when our brains are not capable of processing all the data in isolation?” asked Professor Gregory Crane of students in a lecture hall at the University of Kansas.

But when he posed this question, Crane wasn’t referencing modern big data to a bunch of computer science majors. Rather, he was discussing data from ancient texts with a group of those studying the humanities (and one computer science major).

Crane, a professor of classics, adjunct professor of computer science, and chair of Technology and Entrepreneurship at Tufts University, spoke about the efforts of the Perseus Project, a project whose goals include storing and analyzing ancient texts with an eye toward building a global humanities model.

(video omitted)

The next step in humanities is to create what Crane calls “a dialogue among civilizations.” With regard to the study of humanities, it is to connect those studying classical Greek with those studying classical Latin, Arabic, and even Chinese. Like physicists want to model the universe, Crane wants to model the progression of intelligence and art on a global scale throughout human history.

… (a bit later)

Surprisingly, the biggest barrier is not actually the amount of space occupied by the data of the ancient texts, but rather the language barriers. Currently, the Perseus Project covers over a trillion words, but those words are split up into 400 languages. To give a specific example, Crane presented a 12th century Arabic document. It was pristine and easily readable—to anyone who can read ancient Arabic.

Substitute “semantic” for “language” in “language barriers” and I think the comment is right on the mark.

Assuming that you could read the “12th century Arabic document” and understand its semantics, where would you record your reading to pass it along to others?

Say you spot the name of a well-known 12th century figure. Must every reader duplicate your feat of reading and understanding the document to make that same discovery?

Or can we preserve your “discovery” for other readers?

Topic maps anyone?

ReFr: A New Open-Source Framework for Building Reranking Models

Filed under: Natural Language Processing,Ranking — Patrick Durusau @ 1:09 pm

ReFr: A New Open-Source Framework for Building Reranking Models by Dan Bikel and Keith Hall.

From the post:

We are pleased to announce the release of an open source, general-purpose framework designed for reranking problems, ReFr (Reranker Framework), now available at: http://code.google.com/p/refr/.

Many types of systems capable of processing speech and human language text produce multiple hypothesized outputs for a given input, each with a score. In the case of machine translation systems, these hypotheses correspond to possible translations from some sentence in a source language to a target language. In the case of speech recognition, the hypotheses are possible word sequences of what was said derived from the input audio. The goal of such systems is usually to produce a single output for a given input, and so they almost always just pick the highest-scoring hypothesis.

A reranker is a system that uses a trained model to rerank these scored hypotheses, possibly inducing a different ranked order. The goal is that by employing a second model after the fact, one can make use of additional information not available to the original model, and produce better overall results. This approach has been shown to be useful for a wide variety of speech and natural language processing problems, and was the subject of one of the groups at the 2011 summer workshop at Johns Hopkins’ Center for Language and Speech Processing. At that workshop, led by Professor Brian Roark of Oregon Health & Science University, we began building a general-purpose framework for training and using reranking models. The result of all this work is ReFr.

An interesting software package and you are going to pick up some coding experience as well.
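If “reranking” is a new term, here is a toy illustration (not ReFr itself, and the weights and hypotheses are invented): a first-stage system emits scored hypotheses with extra features, and a second, separately trained model re-scores them, possibly changing the order.

```python
# Toy illustration of reranking (not ReFr itself): a first-stage system emits
# scored hypotheses with extra features; a second, separately trained linear
# model re-scores them and may change the ranked order. The weights and
# hypotheses below are invented.
hypotheses = [
    # (text, first_stage_score, {feature: value})
    ("recognize speech",   -2.1, {"lm_score": -4.0, "length": 2}),
    ("wreck a nice beach", -1.9, {"lm_score": -7.5, "length": 4}),
]

rerank_weights = {"first_stage": 1.0, "lm_score": 0.6, "length": -0.1}

def rerank_score(text, base, feats):
    s = rerank_weights["first_stage"] * base
    s += sum(rerank_weights.get(name, 0.0) * val for name, val in feats.items())
    return s

reranked = sorted(hypotheses, key=lambda h: rerank_score(*h), reverse=True)
print(reranked[0][0])   # the reranker's top hypothesis
```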

Follow The Data – FEC Campaign Data Challenge

Filed under: Cypher,FEC,Government,Government Data,Graphs,Neo4j — Patrick Durusau @ 5:53 am

Follow The Data – FEC Campaign Data Challenge by Andreas Kollegger.

Take the challenge and you may win a pass to Graph Connect, November 5 & 6 in San Francisco. (Closes 11 October 2012.)

In politics, people are often advised to “follow the money” to understand the forces influencing decisions. As engineers, we know we can do that and more by following the data.

Inspired by some innovative work by Dave Fauth, a Washington DC data analyst, we arranged a workshop to use FEC Campaign data that had been imported into Neo4j.

….

With the data imported, and a basic understanding of the domain model, we then challenged people to write Cypher queries to answer the following questions:

  1. All presidential candidates for 2012
  2. Most mythical presidential candidate
  3. Top 10 Presidential candidates according to number of campaign committees
  4. Find President Barack Obama
  5. Lookup Obama by his candidate ID
  6. Find Presidential Candidate Mitt Romney
  7. Lookup Romney by his candidate ID
  8. Find the shortest path of funding between Obama and Romney
  9. List the 10 top individual contributions to Obama
  10. List the 10 top individual contributions to Romney

Pointers to the data and hints await at Andreas’ post.
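As a warm-up, here is a hypothetical sketch of question 9 run from Python. The node labels, relationship types, and property names are my guesses about the import, not the workshop’s actual schema; check the actual data model before trusting the query. The connection details are placeholders as well.

```python
# Hypothetical sketch of question 9 ("top 10 individual contributions to
# Obama") from Python. The labels, relationship types, and property names
# are assumptions about how the FEC data was imported -- check the actual
# schema from the workshop. Connection URI and credentials are placeholders.
from neo4j import GraphDatabase

query = """
MATCH (i:Individual)-[c:CONTRIBUTED_TO]->(cand:Candidate {name: 'OBAMA, BARACK'})
RETURN i.name AS contributor, c.amount AS amount
ORDER BY amount DESC
LIMIT 10
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(query):
        print(record["contributor"], record["amount"])
driver.close()
```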

October 5, 2012

Storing Topic Map Data at $136/TB

Filed under: Data,Storage — Patrick Durusau @ 3:30 pm

Steve Streza describes his storage system in My Giant Hard Drive: Building a Storage Box with FreeNAS.

At his prices, about $136/TB for 11 TB of storage.

Large enough for realistic simulations of data mining or topic mapping. When you want to step up to production, spin up services on one of the clouds.

Not sure it will last you several years as Steve projects but it should last long enough to be worth the effort.

From the post:

For many years, I’ve had a lot of hard drives being used for data storage. Movies, TV shows, music, apps, games, backups, documents, and other data have been moved between hard drives and stored in inconsistent places. This has always been the cheap and easy approach, but it has never been really satisfying. And with little to no redundancy, I’ve suffered a non-trivial amount of data loss as drives die and files get lost. Now, I’m not alone to have this problem, and others have figured out ways of solving it. One of the most interesting has been in the form of a computer dedicated to one thing: storing data, and lots of it. These computers are called network-attached storage, or NAS, computers. A NAS is a specialized computer that has lots of hard drives, a fast connection to the local network, and…that’s about it. It doesn’t need a high-end graphics card, or a 20-inch monitor, or other things we typically associate with computers. It just sits on the network and quietly serves and stores files. There are off-the-shelf boxes you can buy to do this, such as machines made by Synology or Drobo, and you can assemble one yourself for the job.

I’ve been considering making a NAS for myself for over a year, but kept putting it off due to expense and difficulty. But a short time ago, I finally pulled the trigger on a custom assembled machine for storing data. Lots of it; almost 11 terabytes of storage, in fact. This machine is made up of 6 hard drives, and is capable of withstanding a failure on two of them without losing a single file. If any drives do fail, I can replace them and keep on working. And these 11 terabytes act as one giant hard drive, not as 6 independent ones that have to be organized separately. It’s an investment in my storage needs that should grow as I need it to, and last several years.

Memobot

Filed under: Clojure,Data Structures,Redis — Patrick Durusau @ 2:53 pm

Memobot

From the webpage:

Memobot is a data structure server written in clojure. It speaks Redis protocol, so any standard redis client can work with it.

For those interested in data structures, Clojure, or both.
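Since Memobot speaks the Redis protocol, a stock Redis client should be able to talk to it. A minimal sketch with redis-py; the host and port are assumptions about where your Memobot instance is listening, and which commands it implements is up to Memobot.

```python
# Because Memobot speaks the Redis wire protocol, a standard Redis client can
# talk to it. Host and port are assumptions about where Memobot is listening.
import redis

r = redis.StrictRedis(host="localhost", port=6379)
r.set("greeting", "hello from memobot")
print(r.get("greeting"))          # b'hello from memobot'
r.lpush("queue", "a", "b", "c")   # list operations, if the server supports them
print(r.lrange("queue", 0, -1))
```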

JugglingDB

Filed under: Database,ORM — Patrick Durusau @ 2:48 pm

JugglingDB

From the webpage:

JugglingDB is cross-db ORM, providing common interface to access most popular database formats. Currently supported are: mysql, mongodb, redis, neo4j and js-memory-storage (yep, self-written engine for test-usage only). You can add your favorite database adapter, checkout one of the existing adapters to learn how, it’s super-easy, I guarantee.

For those of you communing with your favourite databases this weekend. 😉
