Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 17, 2012

Big Data in Education (Part 2 of 2)

Filed under: BigData,Education — Patrick Durusau @ 5:08 pm

Big Data in Education (Part 2 of 2) by James Locus.

From the post:

Big data analytics are coming to public education. In 2012, the US Department of Education (DOE) was part of a host of agencies to share a $200 million initiative to begin applying big data analytics to their respective functions. The DOE targeted its $25 million share of the budget toward efforts to understand how students learn at an individualized level. This segment reviews the efforts enumerated in the draft paper released by the DOE on their big data analytics.

The ultimate goal of incorporating big data analytics in education is to improve student outcomes – as determined by common metrics like end-of-grade testing, attendance, and dropout rates. Currently, the education sector’s application of big data analytics is to create “learning analytic systems” – here defined as a connected framework of data mining, modeling, and use-case applications.

The hope of these systems is to offer educators better, more accurate information to answer the “how” question in student learning. Is a student performing poorly because she is distracted by her environment? Does a failing mark on the end-of-year test mean that the student did not fully grasp the year’s material, or was she having an off day? Learning analytics can help provide information to help educators answer some of these tough, real world questions.

Not complete but a good start on the type of issues that data mining for education and educational measurement are going to have to answer.

As James points out, this has the potential to be a mega-market for big data analytics.

Traditional testing service heavyweights have been in the area for decades.

But one could argue they have documented the decline of education without having the ability to offer any solutions. (Ouch!)

Could be a telling argument as the only response thus far has been to require more annual testing and to punish schools for truthful results.

Contrast that solution with weekly tests in various subjects that are lightweight and provide rapid feedback to the teacher, so the teacher can address any issues, call in additional resources, the parents, etc. That would be “big data” but also “useful big data.”

Assuming that schools and teachers are provided with the resources to teach “our most precious assets” rather than punished for our failure to support schools and teachers properly.

Scalia and Garner on legal interpretation

Filed under: Language,Law — Patrick Durusau @ 4:50 pm

Scalia and Garner on legal interpretation by Mark Liberman.

Mark writes:

Antonin Scalia and Bryan Garner have recently (June 19) published Reading Law: The Interpretation of Legal Texts, a 608-page work in which, according to the publisher’s blurb, “all the most important principles of constitutional, statutory, and contractual interpretation are systematically explained”.

The post is full of pointers to additional materials both on this publication and notions of legal interpretation more generally.

A glimpse of why I think texts are so complex.

BTW, for the record, I disagree with both Scalia and the post-9/11 Stanley Fish on discovering the “meaning” of texts or authors, respectively. We can report our interpretation of a text, but that isn’t the same thing.

An interpretation is a report we may persuade others is useful for some purpose, agreeable with their prior beliefs or even consistent with their world view. But for all of that, it remains always our report, nothing more.

The claim of “plain meaning” of words or the “intention” of an author (Scalia, Fish respectively) is an attempt to either avoid moral responsibility for a report or to privilege a report as being more than simply another report. Neither one is particularly honest or useful.

In a marketplace of reports, acknowledged to be reports, we can evaluate, investigate, debate and even choose from among reports.

Scalia and Fish would both advantage some reports over others, probably for different reasons. But whatever their reasons, fair or foul, I prefer to meet all reports on even ground.

Configuring HBase Memstore: What You Should Know

Filed under: HBase — Patrick Durusau @ 4:30 pm

Configuring HBase Memstore: What You Should Know by Alex Baranau.

Alex gives three good reasons to care about HBase Memstore:

There are a number of reasons HBase users and/or administrators should be aware of what Memstore is and how it is used:

  • There are a number of configuration options for Memstore one can use to achieve better performance and avoid issues. HBase will not adjust settings for you based on usage pattern.
  • Frequent Memstore flushes can affect reading performance and can bring additional load to the system
  • The way Memstore flushes work may affect your schema design

Which reason is yours?

elasticsearch. The Company

Filed under: ElasticSearch,Lucene,Search Engines — Patrick Durusau @ 3:45 pm

elasticsearch. The Company

ElasticSearch needs no introduction to readers of this blog or really anyone active in the search “space.”

It was encouraging to hear that, after years of building an increasingly useful product, ElasticSearch has matured into a company.

With all the warm fuzzies that support contracts and such bring.

Sounds like they will demonstrate that the open source and commercial worlds aren’t, you know, incompatible.

It helps that they have a good product in which they have confidence and not a product that their PR/Sales department is pushing as a “good” product. The fear of someone “finding out” would make you really defensive in the latter case.

Looking forward to good fortune for ElasticSearch, its founders and anyone who wants to follow a similar model.

Searching Legal Information in Multiple Asian Languages

Filed under: Law,Legal Informatics,Search Engines — Patrick Durusau @ 2:42 pm

Searching Legal Information in Multiple Asian Languages by Philip Chung, Andrew Mowbray, and Graham Greenleaf.

Abstract:

In this article the Co-Directors of the Australasian Legal Information Institute (AustLII) explain the need for an open source search engine which can search simultaneously over legal materials in European languages and also in Asian languages, particularly those that require a ‘double byte’ representation, and the difficulties this task presents. A solution is proposed, the ‘u16a’ modifications to AustLII’s open source search engine (Sino) which is used by many legal information institutes. Two implementations of the Sino u16A approach, on the Hong Kong Legal Information Institute (HKLII), for English and Chinese, and on the Asian Legal Information Institute (AsianLII), for multiple Asian languages, are described. The implementations have been successful, though many challenges (discussed briefly) remain before this approach will provide a full multi-lingual search facility.
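Part of the “double byte” difficulty is easy to see in miniature: Chinese (and other CJK) text has no whitespace word boundaries, so a byte- or whitespace-oriented engine produces nothing useful to match on. One common workaround (my illustration of the general problem, not a description of the Sino u16a changes) is to index overlapping character bigrams over properly decoded Unicode text:

```python
# -*- coding: utf-8 -*-
# Toy character-bigram tokenizer for CJK text. Illustrates the indexing
# problem the paper addresses, not the Sino u16a implementation itself.

def cjk_bigrams(text):
    """Return overlapping character bigrams from decoded Unicode text."""
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]

doc = u"香港法例"          # example phrase: "Hong Kong legislation"
print(cjk_bigrams(doc))    # ['香港', '港法', '法例']
```

English terms still tokenize on whitespace, of course, which is why a single index over mixed European and Asian material is harder than either alone.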

If the normal run of legal information retrieval, across jurisdictions, vocabularies, etc., isn’t challenging enough, you can try your hand at cross-language retrieval with European and Asian languages, plus synonyms, etc.

😉

I would like to think the synonymy issue, which is noted as open by this paper, could be addressed in part through the use of topic maps. It would be an evolutionary solution, to be updated as our use and understanding of language evolves.

Any thoughts on Sino versus Lucene/Solr 4.0 (alpha, I know, but it won’t stay that way forever)?

I first saw this at Legal Informatics.

Proposed urn:lex codes for US materials in MLZ

Filed under: Law,Law - Sources,Legal Informatics — Patrick Durusau @ 2:25 pm

Proposed urn:lex codes for US materials in MLZ

From the post:

The MLZ styles rely on a urn:lex-like scheme for specifying the jurisdiction of primary legal materials. We will need to have at least a minimal set of jurisdiction codes in place for the styles to be functional. The scheme to be used for this purpose is the subject of this post.

The urn:lex scheme is used in MLZ for the limited purpose of identifying jurisdictional scope: it is not a full document identifier, and does not carry information on the issuing institution itself. Even within this limited scope, the MLZ scheme diverges from the examples provided by the Cornell LII Lexcraft pages, in that the “federal” level is expressed as a geographic scope (set off by a semicolon), rather than as a distinct category of jurisdiction (appended by a period).

Unfortunately, software isn’t designed to use existing identification systems.

On the other hand, computer identification systems started when computers were even dumber than they are now. Legacy issue I suppose.

If you are interested in “additional” legal identifier systems, or in the systems that use them, this should be of interest.

Or if you need to map such urn:lex codes to existing identifiers for the same materials. The ones used by people.
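A minimal sketch of what such a mapping layer could look like, with made-up jurisdiction codes and labels (consult the MLZ proposal for the real scheme):

```python
# Illustrative only: map urn:lex-style jurisdiction codes to identifiers
# people actually use. The codes and labels below are hypothetical.

JURISDICTION_MAP = {
    "us":         "United States (national materials)",
    "us;federal": "United States, federal jurisdiction",
    "us;ny":      "United States, State of New York",
}

def describe(code):
    """Return a human-readable label for a jurisdiction code, if known."""
    return JURISDICTION_MAP.get(code, "unknown jurisdiction: " + code)

print(describe("us;federal"))
```

In topic map terms, the code and the familiar identifier are just two identifiers for the same jurisdiction subject, so the mapping can grow without disturbing either naming system.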

I first saw this at Legal Informatics.

If you are in Kolkata/Pune, India…a request.

Filed under: Search Engines,Synonymy,Word Meaning,XML — Patrick Durusau @ 1:55 pm

No emails are given for the authors of: Identify Web-page Content meaning using Knowledge based System for Dual Meaning Words but their locations were listed as Kolkata and Pune, India. I would appreciate your pointing the authors to this blog as one source of information on topic maps.

The authors have re-invented a small part of topic maps to deal with synonymy using XSD syntax. Quite doable but I think they would be better served by either using topic maps or engaging in improving topic maps.

Reinvention is rarely a step forward.

Abstract:

The meaning of Web-page content plays a big role when producing a search result from a search engine. In most cases the Web-page meaning is stored in the title or meta-tag area, but those meanings do not always match the Web-page content. To overcome this situation we need to go through the Web-page content to identify the Web-page meaning. In cases where the Web-page content holds dual meaning words, it is really difficult to identify the meaning of the Web-page. In this paper, we introduce a new design and development mechanism for identifying the meaning of Web-page content which holds dual meaning words.
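To make the “dual meaning words” problem concrete, here is a toy disambiguation sketch of my own (the authors use an XSD-based knowledge base, not this): pick the sense of an ambiguous word whose indicator terms overlap most with the surrounding content.

```python
# Toy word-sense selection by context overlap. Illustrative only; the
# senses and indicator words below are invented for the example.

SENSES = {
    "jaguar": {
        "animal": {"cat", "jungle", "prey", "wildlife"},
        "car":    {"engine", "dealer", "luxury", "speed"},
    }
}

def disambiguate(word, context_words):
    """Return the sense whose indicators overlap most with the context."""
    best_sense, best_score = None, -1
    for sense, indicators in SENSES.get(word, {}).items():
        score = len(indicators & set(context_words))
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(disambiguate("jaguar", ["engine", "dealer", "price"]))  # 'car'
```

Each sense is, in effect, a separate subject with its own identity, which is exactly the distinction topic maps were designed to carry.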

Memory Efficient De Bruijn Graph Construction [Attn: Graph Builders, Chess Anyone?]

Filed under: Bioinformatics,Genome,Graphs,Networks — Patrick Durusau @ 10:44 am

Memory Efficient De Bruijn Graph Construction by Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, Xifeng Yan, and Subhash Suri.

Abstract:

Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the existing de novo assembly algorithms makes it challenging to get the assembly done for higher eukaryotes like mammals. In this work, we investigate the memory issue of constructing de Bruijn graph, a core task in leading assembly algorithms, which often consumes several hundreds of gigabytes memory for large genomes. We propose a disk-based partition method, called Minimum Substring Partitioning (MSP), to complete the task using less than 10 gigabytes memory, without runtime slowdown. MSP breaks the short reads into multiple small disjoint partitions so that each partition can be loaded into memory, processed individually and later merged with others to form a de Bruijn graph. By leveraging the overlaps among the k-mers (substring of length k), MSP achieves astonishing compression ratio: The total size of partitions is reduced from $\Theta(kn)$ to $\Theta(n)$, where $n$ is the size of the short read database, and $k$ is the length of a $k$-mer. Experimental results show that our method can build de Bruijn graphs using a commodity computer for any large-volume sequence dataset.

A discovery in one area of data processing can have a large impact in a number of others. I suspect that will be the case with the technique described here.

The use of substrings for compression and to determine the creation of partitions was particularly clever.
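To get a feel for the two ideas, here is a toy sketch (an illustration of the approach, not the authors’ MSP implementation): k-mers drawn from short reads become edges of a De Bruijn graph, and each k-mer is routed to a partition keyed by its minimum length-p substring, so overlapping k-mers usually land in the same partition. The paper’s compression comes from storing each partition’s overlapping k-mers as merged superstrings rather than individually.

```python
from collections import defaultdict

def kmers(read, k):
    """Yield all k-length substrings of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def min_substring(kmer, p):
    """Lexicographically smallest p-length substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))

def partition_and_build(reads, k=5, p=3):
    """Route k-mers to partitions by minimum substring, then collect
    De Bruijn edges ((k-1)-mer prefix -> (k-1)-mer suffix) per partition."""
    partitions = defaultdict(list)
    for read in reads:
        for km in kmers(read, k):
            partitions[min_substring(km, p)].append(km)
    edges = defaultdict(set)
    for bucket in partitions.values():
        for km in bucket:
            edges[km[:-1]].add(km[1:])
    return partitions, edges

parts, graph = partition_and_build(["ACGTACGTGG", "CGTACGTGGA"])
print("%d partitions, %d edges" %
      (len(parts), sum(len(v) for v in graph.values())))
```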

Software and data sets

Questions:

  1. What are the substring characteristics of your data?
  2. How would you use a De Bruijn graph with your data?

If you don’t know the answers to those questions, you might want to find out.

Additional Resources:

De Bruijn Graph (Wikipedia)

De Bruijn Sequence (Wikipedia)

How to apply de Bruijn graphs to genome assembly by Phillip E C Compeau, Pavel A Pevzner, and Glenn Tesler. Nature Biotechnology 29, 987–991 (2011) doi:10.1038/nbt.2023

And De Bruijn graphs/sequences are not just for bioinformatics: from the Chess Programming Wiki: De Bruijn Sequences. (Lots of pointers and additional references.)

July 16, 2012

Processing Public Data with R

Filed under: Environment,Government Data,R — Patrick Durusau @ 4:30 pm

Processing Public Data with R

From the post:

I use R aplenty in analysis and thought it might be worthwhile for some to see the typical process a relative newcomer goes through in extracting and analyzing public datasets

In this instance I happen to be looking at Canadian air pollution statistics.

The data I am interested in is available on the Ontario Ministry of Environment’s website. I have downloaded the hourly ozone readings from two weather stations (Grand Bend and Toronto West) for two years (2000 and 2011) which are available in several formats, including my preference, CSV. According to the 2010 annual report from the Ministry, the two selected represent the extremes in readings for that year.

I firstly set the directory in which the code and the associated datafiles will reside and import the data. I would normally load any R packages I will utilize at the head of the script (if not already in my start up file) but will hold off here until they are put to use.

I had to do a small amount of row deletion in the csv files so that only the readings data was included
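For comparison, roughly the same opening steps in Python/pandas (the file names, station labels and number of header rows here are made up; the real CSVs come from the Ministry site):

```python
import pandas as pd

# Hypothetical file names standing in for the downloaded hourly-ozone CSVs.
files = {"Grand Bend": "grand_bend_ozone_2011.csv",
         "Toronto West": "toronto_west_ozone_2011.csv"}

frames = []
for station, path in files.items():
    df = pd.read_csv(path, skiprows=5)  # skip header rows instead of hand-editing the file
    df["station"] = station
    frames.append(df)

ozone = pd.concat(frames, ignore_index=True)
print(ozone.head())
```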

A useful look at using R to manipulate public data.

Do you know of any articles on using R to output topic maps?

Experimenting with MapReduce 2.0

Filed under: Hadoop,MapReduce 2.0 — Patrick Durusau @ 4:23 pm

Experimenting with MapReduce 2.0 by Ahmed Radwan.

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Hadoop 0.23 and focused on the steps to setup a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.

It is important to note that, at the time of writing this blog post, MapReduce 2.0 is still Alpha, and it is not recommended to use it in production.

In the rest of this post, we shall first discuss the Client API, followed by configurations and testing considerations, and finally commenting on the new changes related to the Job History Server and Web Servlets. We will use the terms MR1 and MR2 to refer to MapReduce in Hadoop 1.0 and Hadoop 2.0, respectively.

How long MapReduce 2.0 remains in alpha is anyone’s guess. Suggest we start to learn about it before that status passes.

Graphing every idea in history

Filed under: Graphs,Humor,Networks — Patrick Durusau @ 3:20 pm

Graphing every idea in history by Nathan Yau.

I did a spot check and my idea about …., well, never mind, it wasn’t listed. (good thing)

Then I read that “every” idea meant only those in Wikipedia with an “influenced by” or “influences” field.

Started to breathe a little easier. 😉

Interesting work but think about the number of facts that you know. Facts that influence your opinions and judgements that aren’t captured in any “fact” database.

Data Mining In Excel: Lecture Notes and Cases (2005)

Filed under: Data Mining,Excel,Microsoft — Patrick Durusau @ 3:03 pm

Data Mining In Excel: Lecture Notes and Cases (2005) by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce.

From the introduction:

This book arose out of a data mining course at MIT’s Sloan School of Management. Preparation for the course revealed that there are a number of excellent books on the business context of data mining, but their coverage of the statistical and machine-learning algorithms that underlie data mining is not sufficiently detailed to provide a practical guide if the instructor’s goal is to equip students with the skills and tools to implement those algorithms. On the other hand, there are also a number of more technical books about data mining algorithms, but these are aimed at the statistical researcher, or more advanced graduate student, and do not provide the case-oriented business focus that is successful in teaching business students.

Hence, this book is intended for the business student (and practitioner) of data mining techniques, and its goal is threefold:

  1. To provide both a theoretical and practical understanding of the key methods of classification, prediction, reduction and exploration that are at the heart of data mining;
  2. To provide a business decision-making context for these methods;
  3. Using real business cases, to illustrate the application and interpretation of these methods.

An important feature of this book is the use of Excel, an environment familiar to business analysts. All required data mining algorithms (plus illustrative datasets) are provided in an Excel add-in, XLMiner. XLMiner offers a variety of data mining tools: neural nets, classification and regression trees, k-nearest neighbor classification, naive Bayes, logistic regression, multiple linear regression, and discriminant analysis, all for predictive modeling. It provides for automatic partitioning of data into training, validation and test samples, and for the deployment of the model to new data. It also offers association rules, principal components analysis, k-means clustering and hierarchical clustering, as well as visualization tools, and data handling utilities. With its short learning curve, affordable price, and reliance on the familiar Excel platform, it is an ideal companion to a book on data mining for the business student.

Somewhat dated, but remember there are lots of older copies of MS Office around. Not an inconsiderable market if you start to write something on using Excel to produce topic maps. Write for the latest version but I would have a version keyed to earlier versions of Excel as well.

I first saw this at KDNuggets.

UH Data Mining Hypertextbook

Filed under: Data Mining — Patrick Durusau @ 2:24 pm

UH Data Mining Hypertextbook by Professor Rakesh Verma and his students at U. of Houston.

From the contents:

Chapter 1. Decision Trees

This chapter provides an introduction to one of the major fields of data mining called classification. It also outlines some of the real world applications of classification tools and introduces the decision tree classifier that is widely used. What is a classifier? What is a decision tree? How does one construct a decision tree? These are just some of the questions answered in this chapter. Currently the novice (green) and intermediate (blue) tracks are active. More content will be added to this chapter over time.

Chapter 2. Association Analysis

In this chapter we explore another major field of data mining called association rules. Association Analysis focuses on discovering association rules which are interesting and useful hidden relationships that can be found in large data sets. This chapter is divided into various sections that explain the key concepts in Association Analysis and introduce you, the reader, to the basic algorithms used in generating Association Rules. Currently the novice (green) and intermediate (blue) tracks are active. More content will be added to this chapter over time.

Chapter 3. Visualization

In this chapter we take a step back from data mining algorithms and techniques and focus on the visualization of data. This step is crucial and normally takes place before any data mining algorithms or pre-processing techniques have been applied; it is useful because it helps us in some situations pinpoint which algorithms should be used in future analysis. This chapter is divided into three main sections: the first section introduces you, the reader, to visualization, the second defines general concepts that are pertinent and the third section explores a couple of visualization techniques. This chapter also includes a brief introduction to OLAP. More content will be added to this chapter over time.

Chapter 4. Cluster Analysis

In this chapter we pick up from where classification left off and delve a little bit deeper into the world of grouping data objects. Cluster analysis aims to group data objects based on the information that is available that describes the objects and their relationships. This chapter first introduces the concept of cluster analysis and its applications in the real world and then explores some of the popular clustering techniques such as the k-means clustering algorithm and agglomerative hierarchical clustering. More content will be added to this chapter over time.

Appendix 1. Includes direct links to Java Applets, online links to additional resources and a list of references

As citations are made to literature, the corresponding references are kept in this appendix. Links to this appendix accompany the citations.

Note that navigation is by drop-down menus at the top of pages, for the book and chapters. Pages have “next” links at the bottom. Not a problem, just something to get used to.
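As a quick taste of the clustering material in Chapter 4, a minimal k-means run with scikit-learn (my example, not the textbook’s):

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny synthetic data set: two obvious groups in two dimensions.
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the two centroids
```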

First saw this at KDNuggets.

Happy Birthday Hortonworks!

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 2:04 pm

Happy Birthday Hortonworks! by Eric Baldeschwieler.

From the post:

Last week was an important milestone for Hortonworks: our one year anniversary. Given all of the activity around Apache Hadoop and Hortonworks, it’s hard to believe it’s only been one year. In honor of our birthday, I thought I would look back to contrast our original intentions with what we delivered over the past year.

Hortonworks was officially announced at Hadoop Summit 2011. At that time, I published a blog on the Hortonworks Manifesto. This blog told our story, including where we came from, what motivated the original founders and what our plans were for the company. I wanted to address many of the important statements from this blog here:

Read the post in full to see Eric’s take on:

Hortonworks was formed to “accelerate the development and adoption of Apache Hadoop”. …

We are “committed to open source” and commit that “all core code will remain open source”. …

We will “make Apache Hadoop easier to install, manage and use”. …

We will “make Apache Hadoop more robust”. …

We will “make Apache Hadoop easier to integrate and extend”. …

We will “deliver an ever-increasing array of services aimed at improving the Hadoop experience and support in the growing needs of enterprises, systems integrators and technology vendors”. …

This has been a banner year for Hortonworks, the Hadoop ecosystem and everyone concerned with this rapidly developing area!

We are looking forward to the next year being more of same, except more so!

Spring Data: Modern Data Access for Enterprise Java

Filed under: Neo4j,Spring Data — Patrick Durusau @ 1:55 pm

Spring Data: Modern Data Access for Enterprise Java by Mark Pollack, Oliver Gierke, Thomas Risberg, Jon Brisbin, and Michael Hunger.

An Open Feedback Publishing System title from O’Reilly.

I encountered it when an automated search for Neo4j materials discovered Spring Data Neo4j.

I would bookmark the Spring Data website.

International BASP Frontiers Workshop 2013

Filed under: Astroinformatics,Biomedical,Conferences,Signal/Collect — Patrick Durusau @ 1:28 pm

International BASP Frontiers Workshop 2013

January 27th – February 1st, 2013 Villars-sur-Ollon (Switzerland)

The international biomedical and astronomical signal processing (BASP) Frontiers workshop was created to promote synergies between selected topics in astronomy and biomedical sciences, around common challenges for signal processing.

The 2013 workshop will concentrate on the themes of sparse signal sampling and reconstruction, for radio interferometry and MRI, but also open its floor to many other interesting hot topics in theoretical, astrophysical, and biomedical signal processing.

Signal processing is one form of “big data” and is rich in subjects, both in the literature and in the data.

Proceedings from the first BASP workshop are available. Be advised it is a 354 MB zip file. If you aren’t on an airport wifi, you can find those proceedings here.

July 15, 2012

Categorization of interestingness measures for knowledge extraction

Filed under: Knowledge Capture,Statistics — Patrick Durusau @ 7:57 pm

Categorization of interestingness measures for knowledge extraction by Sylvie Guillaume, Dhouha Grissa, and Engelbert Mephu Nguifo.

Abstract:

Finding interesting association rules is an important and active research field in data mining. The algorithms of the Apriori family are based on two rule extraction measures, support and confidence. Although these two measures have the virtue of being algorithmically fast, they generate a prohibitive number of rules most of which are redundant and irrelevant. It is therefore necessary to use further measures which filter uninteresting rules. Many synthesis studies were then realized on the interestingness measures according to several points of view. Different reported studies have been carried out to identify “good” properties of rule extraction measures and these properties have been assessed on 61 measures. The purpose of this paper is twofold. First to extend the number of the measures and properties to be studied, in addition to the formalization of the properties proposed in the literature. Second, in the light of this formal study, to categorize the studied measures. This paper leads then to identify categories of measures in order to help the users to efficiently select an appropriate measure by choosing one or more measure(s) during the knowledge extraction process. The properties evaluation on the 61 measures has enabled us to identify 7 classes of measures, classes that we obtained using two different clustering techniques.
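For readers who want the baseline measures in front of them while reading, here are support, confidence and lift for a rule A ⇒ B over a toy transaction set (lift being one of the many additional measures the paper surveys):

```python
# Support, confidence and lift for an association rule A => B
# over a toy list of market-basket transactions.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / float(len(transactions))

A, B = {"bread"}, {"milk"}
supp_ab = support(A | B)       # 0.60
conf = supp_ab / support(A)    # 0.75
lift = conf / support(B)       # ~0.94, i.e. a slightly negative association

print("support=%.2f confidence=%.2f lift=%.2f" % (supp_ab, conf, lift))
```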

It will take some time to run down the original papers, but in the meantime I am curious if:

  1. Anyone agrees or disagrees with the reduction of measures as having different names (page 10)?
  2. Anyone agrees or disagrees with the classification of measures into seven groups (pages 10-11)?

‘Sounds of Silence’ Proving a Hit: World’s Fastest Random Number Generator

Filed under: Random Numbers,Security — Patrick Durusau @ 4:26 pm

‘Sounds of Silence’ Proving a Hit: World’s Fastest Random Number Generator

From the post:

Researchers at The Australian National University have developed the fastest random number generator in the world by listening to the ‘sounds of silence’.

The researchers — Professor Ping Koy Lam, Dr Thomas Symul and Dr Syed Assad from the ANU ARC Centre of Excellence for Quantum Computation and Communication Technology — have tuned their very sensitive light detectors to listen to vacuum — a region of space that is empty.

Professor Lam said vacuum was once thought to be completely empty, dark, and silent until the discovery of the modern quantum theory. Since then scientists have discovered that vacuum is an extent of space that has virtual sub-atomic particles spontaneously appearing and disappearing.

It is the presence of these virtual particles that give rise to random noise. This ‘vacuum noise’ is omnipresent and may affect and ultimately pose a limit to the performances of fibre optic communication, radio broadcasts and computer operation.

“While it has always been thought to be an annoyance that engineers and scientists would like to circumvent, we instead exploited this vacuum noise and used it to generate random numbers,” Professor Lam said.

“Random number generation has many uses in information technology. Global climate prediction, air traffic control, electronic gaming, encryption, and various types of computer modelling all rely on the availability of unbiased, truly random numbers.

All the talk about security and trust reminded me of this post.

Just in case your topic map software needs random numbers for encryption or other purposes.

See: Quantum Random Number Generator for papers and a live random number feed.
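Pulling numbers from such a feed into an application is not hard; the sketch below uses a placeholder URL and response format, so check the ANU site for the actual API before relying on it:

```python
import json
import urllib2  # Python 2 standard library; use urllib.request on Python 3

# Placeholder endpoint and JSON layout -- hypothetical, not the real ANU API.
FEED_URL = "https://example.org/qrng?length=16&type=uint8"

def fetch_random_bytes():
    """Fetch a batch of random values from the (hypothetical) feed."""
    response = urllib2.urlopen(FEED_URL)
    payload = json.loads(response.read())
    return payload.get("data", [])

print(fetch_random_bytes())
```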

Assuming you “trust” some alphabet soup agency has not spoofed the IP address and has its own feed of pseudo-random numbers in place of the real one.

If not, you need to build your own quantum detector, assuming you “trust” the parts have not been altered to produce their “random” numbers.

If not, you could build your own parts, but only if you remember to wear your tin hat at all times to prevent/interfere with mind control efforts.

Trust is a difficult issue.

How To Use A Graph Database to Integrate And Analyze Relational Exports

Filed under: Graph Databases,Graphs,InfiniteGraph,RDBMS — Patrick Durusau @ 4:08 pm

How To Use A Graph Database to Integrate And Analyze Relational Exports by Todd Stavish.

From the post:

Graph databases can be used to analyze data from disparate datasources. In this use-case, three relational databases have been exported to CSV. Each relational export is ingested into its own sharded sub-graph to increase performance and avoid lock contention when merging the datasets. Unique keys overlap the datasources to provide the mechanism to link the subgraphs produced from parsing the CSV. A REST server is used to send the merged graph to a visualization application for analysis.
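The same pattern can be sketched without any particular graph database, using the csv module and networkx; the file and column names are invented, the point is the merge-on-shared-key step:

```python
import csv
from collections import defaultdict
import networkx as nx

# Hypothetical exports: each CSV is assumed to carry an "email" column
# that serves as the shared key across the three source databases.
exports = ["crm_export.csv", "billing_export.csv", "support_export.csv"]

graph = nx.Graph()
entity_sources = defaultdict(set)   # shared key -> which exports mention it

for source in exports:
    with open(source) as f:
        for row in csv.DictReader(f):
            entity = row["email"]                     # shared key -> one node
            record = source + ":" + row["record_id"]  # source-specific record node
            graph.add_edge(entity, record)
            entity_sources[entity].add(source)

merged = [e for e, srcs in entity_sources.items() if len(srcs) > 1]
print("%d entities appear in more than one export" % len(merged))
```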

I was cleaning out my pending posts file when I ran across this one.

Would be a good comparison case for my topic maps class.

Although I would have to do the installation work on a public-facing server and leave the class members to do the analysis/uploading.

Hmmm, perhaps split the class into teams, some using this method, some using more traditional record linkage and some using topic maps, all on the same data.

Suggestions on data sets that would highlight the differences? Or result in few differences at all? (I suspect both to be true, depending upon the data sets.)

Interactive Dynamics for Visual Analysis

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 3:57 pm

Interactive Dynamics for Visual Analysis by Jeffrey Heer and Ben Shneiderman.

From the article:

The increasing scale and availability of digital data provides an extraordinary resource for informing public policy, scientific discovery, business strategy, and even our personal lives. To get the most out of such data, however, users must be able to make sense of it: to pursue questions, uncover patterns of interest, and identify (and potentially correct) errors. In concert with data-management systems and statistical algorithms, analysis requires contextualized human judgments regarding the domain-specific significance of the clusters, trends, and outliers discovered in data.

Visualization provides a powerful means of making sense of data. By mapping data attributes to visual properties such as position, size, shape, and color, visualization designers leverage perceptual skills to help users discern and interpret patterns within data. [cite omitted] A single image, however, typically provides answers to, at best, a handful of questions. Instead, visual analysis typically progresses in an iterative process of view creation, exploration, and refinement. Meaningful analysis consists of repeated explorations as users develop insights about significant relationships, domain-specific contextual influences, and causal patterns. Confusing widgets, complex dialog boxes, hidden operations, incomprehensible displays, or slow response times can limit the range and depth of topics considered and may curtail thorough deliberation and introduce errors. To be most effective, visual analytics tools must support the fluent and flexible use of visualizations at rates resonant with the pace of human thought.

The goal of this article is to assist designers, researchers, professional analysts, procurement officers, educators, and students in evaluating and creating visual analysis tools. We present a taxonomy of interactive dynamics that contribute to successful analytic dialogues. The taxonomy consists of 12 task types grouped into three high-level categories, as shown in table 1: (1) data and view specification (visualize, filter, sort, and derive); (2) view manipulation (select, navigate, coordinate, and organize); and (3) analysis process and provenance (record, annotate, share, and guide). These categories incorporate the critical tasks that enable iterative visual analysis, including visualization creation, interactive querying, multiview coordination, history, and collaboration. Validating and evolving this taxonomy is a community project that proceeds through feedback, critique, and refinement.

This rocks! I missed it earlier this year but you should not miss it now! (BTW, if you see something interesting, post a note to patrick@durusau.net. I miss lots of interesting and important things. Share what you see with others!)

Two lessons I would draw from this article:

  1. Visual analysis, enabled by the number-crunching and display capabilities of modern computers, is just in its infancy, if that far along. This is a rich area for research and experimentation.
  2. There is no “correct” visualization for any data set. Only ones that give a particular analyst more or less insight into a given data set. What visualizations work for one task or user may not be appropriate for another.

Nutch 1.5/1.5.1 [Cloud Setup for Experiements?]

Filed under: Cloud Computing,Nutch,Search Engines — Patrick Durusau @ 3:41 pm

Before the release of Nutch 2.0, there was the release of Nutch 1.5 and 1.5.1.

From the 1.5 release note:

The 1.5 release of Nutch is now available. This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few. Please see the list of changes

http://www.apache.org/dist/nutch/CHANGES-1.5.txt

[WRONG URL – Should be: http://www.apache.org/dist/nutch/1.5/CHANGES-1.5.txt (version “/1.5/” missing from the path, took me a while to notice the nature of the problem.)]

made in this version for a full breakdown of the 50 odd improvements the release boasts. A full PMC release statement can be found below

http://nutch.apache.org/#07+June+2012+-+Apache+Nutch+1.5+Released

Apache Nutch is an open source web-search software project. Stemming from Apache Lucene, it now builds on Apache Solr adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika for HTML and an array of other document formats. Nutch can run on a single machine, but gains a lot of its strength from running in a Hadoop cluster. The system can be enhanced (e.g. other document formats can be parsed) using a highly flexible, easily extensible and thoroughly maintained plugin infrastructure.

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

And 1.5.1:

http://www.apache.org/dist/nutch/1.5.1/CHANGES.txt

Nutch is available in source and binary form (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/

Question: Would you put together some commodity boxes for local experimentation or would you spin up an installation in one of the clouds?

As hot as the summer promises to be near Atlanta, I am leaning towards the cloud route.

As I write that I can hear a close friend from the West Coast shouting “…trust, trust issues….” But I trust the local banking network, credit card, utilities, finance, police/fire, etc., with just as little reason as any of the “clouds.”

Not really even “trust,” I don’t even think about it. The credit card industry knows $X fraud is going to occur every year and it is a cost of liquid transactions. So they allow for it in their fees. They proceed in the face of known rates of fraud. How’s that for trust? 😉 Trusting fraud is going to happen.

Same will be true for the “clouds” and mechanisms will evolve to regulate the amount of exposure versus potential damage. I am going to be experimenting with non-client data so the worst exposure I have is loss of time. Perhaps some hard lessons learned on configuration/security. But hardly a reason to avoid the “clouds” and to incur the local hardware cost.

I was serious when I suggested governments should start requiring side by side comparison of hardware costs for local installs versus cloud services. I would call the major cloud services up and ask them for independent bids.

Would the “clouds” be less secure? Possibly, but I don’t think any of them allow Lady Gaga CDs on premises.

Apache Nutch v2.0 Release

Filed under: Nutch,Search Engines — Patrick Durusau @ 10:18 am

Apache Nutch v2.0 Release

From the post:

The Apache Nutch PMC are very pleased to announce the release of Apache Nutch v2.0. This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora™) for big data stores such as Apache Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, HDFS™, an in memory data store and various high profile SQL stores. After some two years of development Nutch v2.0 also offers all of the mainstream Nutch functionality and it builds on Apache Solr™ adding web-specifics, such as a crawler, a link-graph database and parsing support handled by Apache Tika™ for HTML and an array of other document formats. Nutch v2.0 shadows the latest stable mainstream release (v1.5.X) based on Apache Hadoop™ and covers many use cases from small crawls on a single machine to large scale deployments on Hadoop clusters. Please see the list of changes

http://www.apache.org/dist/nutch/2.0/CHANGES.txt made in this version for a full breakdown.

A full PMC release statement can be found below:

http://nutch.apache.org/#07+July+2012+-+Apache+Nutch+v2.0+Released

Nutch v2.0 is available in source (zip and tar.gz) from the following download page: http://www.apache.org/dyn/closer.cgi/nutch/2.0

The Ontology for Biomedical Investigations (OBI)

Filed under: Bioinformatics,Biomedical,Medical Informatics,Ontology — Patrick Durusau @ 9:40 am

The Ontology for Biomedical Investigations (OBI)

From the webpage:

The Ontology for Biomedical Investigations (OBI) project is developing an integrated ontology for the description of biological and clinical investigations. This includes a set of ‘universal’ terms, that are applicable across various biological and technological domains, and domain-specific terms relevant only to a given domain. This ontology will support the consistent annotation of biomedical investigations, regardless of the particular field of study. The ontology will represent the design of an investigation, the protocols and instrumentation used, the material used, the data generated and the type of analysis performed on it. Currently OBI is being built under the Basic Formal Ontology (BFO).

  • Develop an Ontology for Biomedical Investigations in collaboration with groups representing different biological and technological domains involved in Biomedical Investigations
  • Make OBI compatible with other bio-ontologies
  • Develop OBI using an open source approach
  • Create a valuable resource for the biomedical communities to provide a source of terms for consistent annotation of investigations

An ontology that will be of interest if you are integrating biomedical materials.

At least as a starting point.

My listings of ontologies, vocabularies, etc., are woefully incomplete for any field and represent, at best, starting points for your own, more comprehensive investigations. If you do find these starting points useful, please send pointers to your more complete investigations for any field.

Functional Genomics Data Society – FGED

Filed under: Bioinformatics,Biomedical,Functional Genomics — Patrick Durusau @ 9:29 am

Functional Genomics Data Society – FGED

While searching out the MAGE-TAB standard, I found:

The Functional Genomics Data Society – FGED Society, founded in 1999 as the MGED Society, advocates for open access to genomic data sets and works towards providing concrete solutions to achieve this. Our goal is to assure that investment in functional genomics data generates the maximum public benefit. Our work on defining minimum information specifications for reporting data in functional genomics papers have already enabled large data sets to be used and reused to their greater potential in biological and medical research.

We work with other organisations to develop standards for biological research data quality, annotation and exchange. We facilitate the creation and use of software tools that build on these standards and allow researchers to annotate and share their data easily. We promote scientific discovery that is driven by genome wide and other biological research data integration and meta-analysis.

Home of:

Along with links to other resources and collaborations.

ISA-TAB

Filed under: Bioinformatics,Biomedical,Genome — Patrick Durusau @ 9:06 am

ISA-TAB format page at SourceForge.

Where you will find:

ISA-TAB 1.0 – Candidate release (PDF file)

Example ISA-TAB files.

ISAValidator

Abstract from ISA-TAB 1.0:

This document describes ISA-TAB, a general purpose framework with which to capture and communicate the complex metadata required to interpret experiments employing combinations of technologies, and the associated data files. Sections 1 to 3 introduce the ISA-TAB proposal, describe the rationale behind its development, provide an overview of its structure and relate it to other formats. Section 4 describes the specification in detail; section 5 provides examples of design patterns.

ISA-TAB builds on the existing paradigm that is MAGE-TAB – a tab-delimited format to exchange microarray data. ISA-TAB necessarily maintains backward compatibility with existing MAGE-TAB files to facilitate adoption; conserving the simplicity of MAGE-TAB for simple experimental designs, while incorporating new features to capture the full complexity of experiments employing a combination of technologies. Like MAGE-TAB before it, ISA-TAB is simply a format; the decision on how to regulate its use (i.e. enforcing completion of mandatory fields or use of a controlled terminology) is a matter for those communities, which will implement the format in their systems and for which submission and exchange of minimal information is critical. In this case, an additional layer of constraints should be agreed and required on top of the ISA-TAB specification.

Knowledge of the MAGE-TAB format is required, on which see: MAGE-TAB.

As terminologies/vocabularies/ontologies evolve, ISA-TAB formatted files are a good example of targets for topic maps.

Researchers can continue their use of ISA-TAB formatted files undisturbed by changes in terminology, vocabulary or even ontology due to the semantic navigation layer provided by topic maps.

Or perhaps more correctly, one researcher or librarian can create a mapping of such changes that benefits all the other members of their lab.
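A minimal sketch of that navigation layer: scan a tab-delimited ISA-TAB study file and report cells whose terms have been superseded, using an external mapping. The mapping below is entirely hypothetical, and a topic map would carry subject identities rather than a flat dictionary, but the effect on the file is the same.

```python
import csv

# Hypothetical mapping from superseded terms to current preferred terms.
TERM_MAP = {
    "tumour biopsy": "tumor biopsy",
    "RNA-seq assay": "RNA sequencing assay",
}

def report_superseded(isatab_study_file):
    """Report cells in a tab-delimited study file that use superseded terms,
    without modifying the file itself."""
    with open(isatab_study_file) as f:
        for line_no, row in enumerate(csv.reader(f, delimiter="\t"), start=1):
            for col_no, value in enumerate(row, start=1):
                if value in TERM_MAP:
                    print("line %d, column %d: %r -> %r"
                          % (line_no, col_no, value, TERM_MAP[value]))

report_superseded("s_example_study.txt")   # hypothetical file name
```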

GigaScience

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:32 am

GigaScience

From the description:

GigaScience is a new integrated database and journal co-published in collaboration between BGI Shenzhen and BioMed Central, to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data.” BGI (formerly known as Beijing Genomics Institute) was founded in 1999 and has since become the largest genomic organization in the world and has a proven track record of innovative, high profile research.

To achieve its goals, GigaScience has developed a novel publishing format that integrates manuscript publication with a database that will provide DOI assignment to every dataset. Supporting the open-data movement, we require that all supporting data and source code be publically available in a suitable public repository and/or under a public domain CC0 license in the BGI GigaScience database. Using the BGI cloud as a test environment, we also consider open-source software tools/methods for the analysis or handling of large-scale data. When submitting a manuscript, please contact us if you have datasets or cloud applications you would like us to host. To maximize data usability submitters are encouraged to follow best practice for metadata reporting and are given the opportunity to submit in ISA-Tab format.

A new journal to watch. One of the early articles is accompanied by an 83 GB data file.

Doing a separate post on the ISA-Tab format.

While I write that, imagine a format that carries with it known subject mappings into the literature. Or references to subject mappings into the literature.

July 14, 2012

Journal of Proteomics & Bioinformatics

Filed under: Bioinformatics,Proteomics — Patrick Durusau @ 12:44 pm

Journal of Proteomics & Bioinformatics

From Aims and Scope:

Journal of Proteomics & Bioinformatics (JPB), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Bioinformatics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes. [The rest was boilerplate about open access so I didn’t bother repeating it.]

Another open access journal from Omics Publishing Group but this one has a publication history back to 2008.

Will look through the archive for material of interest.

Journal of Data Mining in Genomics and Proteomics

Filed under: Bioinformatics,Biomedical,Data Mining,Genome,Medical Informatics,Proteomics — Patrick Durusau @ 12:20 pm

Journal of Data Mining in Genomics and Proteomics

From the Aims and Scope page:

Journal of Data Mining in Genomics & Proteomics (JDMGP), a broad-based journal was founded on two key tenets: To publish the most exciting researches with respect to the subjects of Proteomics & Genomics. Secondly, to provide a rapid turn-around time possible for reviewing and publishing, and to disseminate the articles freely for research, teaching and reference purposes.

In today’s wired world information is available at the click of the button, courtesy of the Internet. JDMGP-Open Access gives a worldwide audience larger than that of any subscription-based journal in OMICS field, no matter how prestigious or popular, and probably increases the visibility and impact of published work. JDMGP-Open Access gives barrier-free access to the literature for research. It increases convenience, reach, and retrieval power. Free online literature is available for software that facilitates full-text searching, indexing, mining, summarizing, translating, querying, linking, recommending, alerting, “mash-ups” and other forms of processing and analysis. JDMGP-Open Access puts rich and poor on an equal footing for these key resources and eliminates the need for permissions to reproduce and distribute content.

A publication (among many) from the OMICS Publishing Group, which sponsors a large number of online publications.

Has the potential to be an interesting source of information. Not much in the way of back files but then it is a very young journal.

Text Mining Methods Applied to Mathematical Texts

Filed under: Indexing,Mathematics,Mathematics Indexing,Search Algorithms,Searching — Patrick Durusau @ 10:49 am

Text Mining Methods Applied to Mathematical Texts (slides) by Yannis Haralambous, Département Informatique, Télécom Bretagne.

Abstract:

Up to now, flexiform mathematical text has mainly been processed with the intention of formalizing mathematical knowledge so that proof engines can be applied to it. This approach can be compared with the symbolic approach to natural language processing, where methods of logic and knowledge representation are used to analyze linguistic phenomena. In the last two decades, a new approach to natural language processing has emerged, based on statistical methods and, in particular, data mining. This method, called text mining, aims to process large text corpora, in order to detect tendencies, to extract information, to classify documents, etc. In this talk I will present math mining, namely the potential applications of text mining to mathematical texts. After reviewing some existing works heading in that direction, I will formulate and describe several roadmap suggestions for the use and applications of statistical methods to mathematical text processing: (1) using terms instead of words as the basic unit of text processing, (2) using topics instead of subjects (“topics” in the sense of “topic models” in natural language processing, and “subjects” in the sense of various mathematical subject classifications), (3) using and correlating various graphs extracted from mathematical corpora, (4) use paraphrastic redundancy, etc. The purpose of this talk is to give a glimpse on potential applications of the math mining approach on large mathematical corpora, such as arXiv.org.
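Suggestion (2), topics instead of subjects, can be tried with off-the-shelf tooling; here is a minimal gensim sketch over a handful of made-up abstract snippets (real use would tokenize full arXiv abstracts and keep multi-word terms together, per suggestion (1)):

```python
from gensim import corpora, models

# Tiny, invented corpus standing in for arXiv.org abstracts.
docs = [
    "elliptic curve modular form galois representation".split(),
    "finite group character table representation".split(),
    "elliptic curve rational point height".split(),
    "random graph chromatic number threshold".split(),
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
for topic_id in range(2):
    print(lda.print_topic(topic_id))
```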

An invited presentation at CICM 2012.

I know Yannis from a completely different context and may comment on that in another post.

No paper but 50+ slides showing existing text mining tools can deliver useful search results, while waiting for a unified and correct index to all of mathematics. 😉

Varying semantics, as in all human enterprises, is an opportunity for topic map based assistance.

Conferences on Intelligent Computer Mathematics (CICM 2012)

Filed under: Conferences,Geometry,Knowledge Management,Mathematics,Mathematics Indexing — Patrick Durusau @ 10:34 am

Conferences on Intelligent Computer Mathematics (CICM 2012) (talks listing)

From the “general information” page:

As computers and communications technology advance, greater opportunities arise for intelligent mathematical computation. While computer algebra, automated deduction, mathematical publishing and novel user interfaces individually have long and successful histories, we are now seeing increasing opportunities for synergy among these areas.

The conference is organized by Serge Autexier (DFKI) and Michael Kohlhase (JUB), takes place at Jacobs University in Bremen and consists of five tracks

The overall programme is organized by the General Program Chair Johan Jeuring.

Which I located by following the conference reference in: An XML-Format for Conjectures in Geometry (Work-in-Progress)

A real treasure trove of research on searching, semantics, integration, focused on computers and mathematics.

Expect to see citations to work reported here and in other CICM proceedings.
