Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

August 20, 2012

Erlang Cheat Sheet [And Cheat Sheets in General]

Filed under: Erlang,Marketing — Patrick Durusau @ 8:07 am

Erlang Cheat Sheet

A fairly short (read: limited) cheat sheet on Erlang. Found at: http://www.cheatography.com/

The site hosts a number of cheat sheets and is in the process of creating a cheat sheet template.

Questions that come to mind:

  • Using a topic map to support a cheat sheet, what more would you expect to see? Links to fuller examples? Links to manuals? Links to sub-cheat sheets?
  • Have you seen any ontology cheat sheets? For coding consistency, that sounds like something that could be quite handy.
  • For existing ontologies, any research on frequency of use to support the creation of cheat sheets? (Would not waste space on “thing” for example. Too unlikely to bear mentioning.)

Deploying Neo4j Graph Database Server across AWS regions with Puppet

Filed under: Amazon Web Services AWS,Graphs,Neo4j,Networks — Patrick Durusau @ 7:36 am

Deploying Neo4j Graph Database Server across AWS regions with Puppet by Jussi Heinonen.

From the post:

It’s been more than a year now since I rolled out Neo4j Graph Database Server image in Amazon EC2.

In May 2011 the version of Neo4j was 1.3 and just recently guys at Neo Technologies published version 1.7.2 so I thought now is the time to revisit this exercise and make fresh AMIs available.

Last year I created Neo4j AMI manually in one region then copied it across to the remaining AWS regions. Due to the size of the AMI and the latency between regions this process was slow.

If you aren’t already familiar with AWS, perhaps this will be your incentive to take the plunge.

Learning Puppet and Neo4j is just a lagniappe.

ScalaNLP

Filed under: Scala,ScalaNLP — Patrick Durusau @ 7:21 am

ScalaNLP

From the homepage:

ScalaNLP is a suite of machine learning and numerical computing libraries.

ScalaNLP is the umbrella project for Breeze and Epic. Breeze is a set of libraries for machine learning and numerical computing. Epic (coming soon) is a high-performance statistical parser.

From the about page:

Breeze is a suite of Scala libraries for numerical processing, machine learning, and natural language processing. Its primary focus is on being generic, clean, and powerful without sacrificing (much) efficiency.

The library currently consists of several parts:

  • breeze-math: Linear algebra and numerics routines
  • breeze-process: Libraries for processing text and managing data pipelines.
  • breeze-learn: Machine Learning, Statistics, and Optimization.

Possible future releases:

  • breeze-viz: Visualization and plotting
  • breeze-fst: Finite state toolkit

Breeze is the merger of the ScalaNLP and Scalala projects, because one of the original maintainers is unable to continue development. The Scalala parts are largely rewritten.

Epic is a high-performance statistical parser written in Scala. It uses Expectation Propagation to build complex models without suffering the exponential runtimes one would get in a naive model. Epic is nearly state-of-the-art on the standard benchmark dataset in Natural Language Processing. We will be releasing Epic soon.

In case you are interested in the project's history, see the Scalala source.

A fairly new community, so drop by and say hello.

August 19, 2012

Collaborative filtering with GraphChi

Filed under: GraphChi,Graphs — Patrick Durusau @ 6:43 pm

Collaborative filtering with GraphChi by Danny Bickson.

From the post:

A couple of weeks ago I covered GraphChi by Aapo Kyrola in my blog.

Here is a quick tutorial for trying out GraphChi collaborative filtering. Currently it supports ALS (alternating least squares), SGD (stochastic gradient descent), bias-SGD (biased stochastic gradient descent) and SVD++, but I am soon going to implement several more algorithms.

If you are already experimenting with GraphChi, you will really like this post.
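GraphChi's toolkit is its own codebase, but if the algorithm names in that list are unfamiliar, here is a minimal NumPy sketch of what the plain SGD variant is doing (a toy example of the general technique, not GraphChi's code or API; all names and parameters are mine):

```python
import numpy as np

def sgd_mf(ratings, k=2, lr=0.01, reg=0.05, epochs=200, seed=0):
    """Plain SGD matrix factorization: approximate R by U @ V.T using observed cells only."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    V = 0.1 * rng.standard_normal((n_items, k))
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(ratings[u, i])]
    for _ in range(epochs):
        for idx in rng.permutation(len(observed)):
            u, i = observed[idx]
            err = ratings[u, i] - U[u] @ V[i]
            # One gradient step per observed rating, with L2 regularization on both factors.
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Toy 4-user x 3-item ratings matrix; NaN marks "not rated yet".
R = np.array([[5.0, 3.0, np.nan],
              [4.0, np.nan, 1.0],
              [np.nan, 1.0, 5.0],
              [1.0, np.nan, 4.0]])
U, V = sgd_mf(R)
print(np.round(U @ V.T, 2))  # predictions, including the previously missing cells
```

The bias-SGD and SVD++ variants Danny mentions add per-user and per-item bias terms (and implicit feedback) on top of this same update loop.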

Bi-directional semantic similarity….

Filed under: Bioinformatics,Biomedical,Semantics,Similarity — Patrick Durusau @ 6:32 pm

Bi-directional semantic similarity for gene ontology to optimize biological and clinical analyses by Sang Jay Bien, Chan Hee Park, Hae Jin Shim, Woongcheol Yang, Jihun Kim and Ju Han Kim.

Abstract:

Background Semantic similarity analysis facilitates automated semantic explanations of biological and clinical data annotated by biomedical ontologies. Gene ontology (GO) has become one of the most important biomedical ontologies with a set of controlled vocabularies, providing rich semantic annotations for genes and molecular phenotypes for diseases. Current methods for measuring GO semantic similarities are limited to considering only the ancestor terms while neglecting the descendants. One can find many GO term pairs whose ancestors are identical but whose descendants are very different and vice versa. Moreover, the lower parts of GO trees are full of terms with more specific semantics.

Methods This study proposed a method of measuring semantic similarities between GO terms using the entire GO tree structure, including both the upper (ancestral) and the lower (descendant) parts. Comprehensive comparison studies were performed with well-known information content-based and graph structure-based semantic similarity measures with protein sequence similarities, gene expression-profile correlations, protein–protein interactions, and biological pathway analyses.

Conclusion The proposed bidirectional measure of semantic similarity outperformed other graph-based and information content-based methods.

Makes me curious: what has the experience with direction and identification been for other ontologies?
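The paper's measure is more elaborate, but the core idea (score shared descendants as well as shared ancestors) is easy to sketch. Here is a toy version using Jaccard overlap in both directions over a made-up mini-ontology; the DAG, the weighting, and the combination rule are my assumptions, not the authors' formula:

```python
# Toy DAG: child -> set of parents. Term names are invented.
parents = {
    "A": set(), "B": {"A"}, "C": {"A"},
    "D": {"B"}, "E": {"B", "C"}, "F": {"C"}, "G": {"E"},
}

def ancestors(term):
    out, stack = set(), list(parents[term])
    while stack:
        p = stack.pop()
        if p not in out:
            out.add(p)
            stack.extend(parents[p])
    return out

def descendants(term):
    return {t for t in parents if term in ancestors(t)}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # two leaves count as fully similar "downward"

def bidirectional_sim(t1, t2, w=0.5):
    """Weighted mix of ancestor overlap (upper graph) and descendant overlap (lower graph)."""
    up = jaccard(ancestors(t1) | {t1}, ancestors(t2) | {t2})
    down = jaccard(descendants(t1), descendants(t2))
    return w * up + (1 - w) * down

print(bidirectional_sim("D", "E"))  # shares ancestors with E but few descendants
print(bidirectional_sim("B", "C"))  # same ancestors, partly different descendant sets
```

An ancestor-only measure would score B and C as nearly identical; including the descendant side is what separates them.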

Concept Annotation in the CRAFT corpus

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 4:47 pm

Concept Annotation in the CRAFT corpus by Michael Bada, Miriam Eckert, Donald Evans, Kristin Garcia, Krista Shipley, Dmitry Sitnikov, William A. Baumgartner, K. Bretonnel Cohen, Karin Verspoor, Judith A. Blake and Lawrence E Hunter. BMC Bioinformatics 2012, 13:161, doi:10.1186/1471-2105-13-161.

Abstract:

Background

Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.

Results

This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.

Conclusions

As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.

Lessons on what it takes to create a “gold standard” corpus to advance NLP application development.

What do you think the odds are of “high inter[author] agreement” in the absence of such planning and effort?

Sorry, I meant “high interannotator agreement.”

Guess we have to plan for “low inter[author] agreement.”

Suggestions?

Gold Standard (or Bronze, Tin?)

A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools by Karin M Verspoor, Kevin B Cohen, Arrick Lanfranchi, Colin Warner, Helen L Johnson, Christophe Roeder, Jinho D Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A Baumgartner, Michael Bada, Martha Palmer and Lawrence E Hunter. BMC Bioinformatics 2012, 13:207 doi:10.1186/1471-2105-13-207.

Abstract:

Background

We introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.

Results

Many biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.

Conclusions

The finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.

This is the article that I discovered and then worked my way to from BioNLP.

Important as a deeply annotated text corpus.

But also a reminder that human annotators created the “gold standard,” against which other efforts are judged.

If you are ill, do you want gold standard research into the medical literature (which involves librarians)? Or is bronze or tin standard research good enough?

PS: I will be going back to pick up the other resources as appropriate.

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

Filed under: Bioinformatics,Biomedical,Corpora,Natural Language Processing — Patrick Durusau @ 3:41 pm

CRAFT: THE COLORADO RICHLY ANNOTATED FULL TEXT CORPUS

From the Quick Facts:

  • 67 full text articles
  • >560,000 Tokens
  • >21,000 Sentences
  • ~100,000 concept annotations to 7 different biomedical ontologies/terminologies
    • Chemical Entities of Biological Interest (ChEBI)
    • Cell Type Ontology (CL)
    • Entrez Gene
    • Gene Ontology (biological process, cellular component, and molecular function)
    • NCBI Taxonomy
    • Protein Ontology
    • Sequence Ontology
  • Penn Treebank markup for each sentence
  • Multiple output formats available

Let’s see: 67 articles resulted in 100,000 concept annotations, or about 1,493 per article for seven (7) ontologies/terminologies.

Ready to test this mapping out in your topic map application?
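A quick back-of-the-envelope check of those averages (the even split across ontologies is my simplification; the real per-ontology counts certainly vary):

```python
articles = 67
annotations = 100_000        # "nearly 100,000" concept annotations in the 67-article subset
ontologies = 7

per_article = annotations / articles
print(f"~{per_article:.0f} annotations per article")               # ~1493
print(f"~{per_article / ontologies:.0f} per article per ontology")  # ~213
```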

BioNLP-Corpora

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:46 pm

BioNLP-Corpora

From the webpage:

BioNLP-Corpora is a repository of biologically and linguistically annotated corpora and biological datasets.

It is one of the projects of the BioNLP initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts.

There are many resources available for download at BioNLP-Corpora:

Like the guy says in the original Star Wars, “…almost there….”

In addition to pointing out really useful resources, I am following a path that arose from the discovery of one resource.

One more website and then the article I found that led to all the BioNLP* resources.

BioNLP

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 2:39 pm

BioNLP

From the homepage (worth repeating in full):

BioNLP is an initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts. There are many projects associated with BioNLP.

Projects

  • BioLemmatizer: a biomedical literature specific lemmatizer.
  • BioNLP-Corpora: a repository of biologically and linguistically annotated corpora and biomedical datasets. This project includes
    • Colorado Richly Annotated Full-Text Corpus (CRAFT)
    • PICorpus
    • GeneHomonym
    • Annotation Projects
    • MEDLINE Mining projects
    • Anaphora Corpus
    • TestSuite Corpora
  • BioNLP-UIMA: Unstructured Information Management Architecture (UIMA) components geared towards the use and evaluation of tools for biomedical natural language processing, including tools for our own OpenDMAP and MutationFinder use.
  • common: a library of utility code for common tasks
  • Knowtator: a Protege plug-in for text annotation.
  • medline-xml-parser: a code library containing an XML parser for the 2012 Medline XML distribution format
  • MutationFinder: an information extraction system for extracting descriptions of point mutations from free text.
  • OboAnalyzer: an analysis tool to detect OBO ontology terms that use different linguistic conventions for expressing similar semantics.
  • OpenDMAP: an ontology-driven, rule-based concept analysis and information extraction system
  • Parentheses Classifier: a classifier for the content of parenthesized text
  • Simple Semantic Classifier: a text classifier for OBO domains
  • uima-shims: a library of simple interfaces designed to facilitate the development of type-system-independent UIMA components

Analogies – Romans, Train, Dabbawalla, Oil Refinery, Laundry – and Hadoop

Filed under: Hadoop — Patrick Durusau @ 2:20 pm

Analogies – Romans, Train, Dabbawalla, Oil Refinery, Laundry – and Hadoop

Scroll down for the laundry analogy. It’s at least as accurate as, and more entertaining than, the IBM analogy. 😉

I encountered this at Hadoopsphere.com.

finding names in common crawl

Filed under: Common Crawl,Natural Language Processing — Patrick Durusau @ 1:34 pm

finding names in common crawl by Mat Kelcey.

From the post:

the central offering from common crawl is the raw bytes they’ve downloaded and, though this is useful for some people, a lot of us just want the visible text of web pages. luckily they’ve done this extraction as a part of post processing the crawl and it’s freely available too!

If you don’t know “common crawl,” now would be a good time to meet the project.

From their webpage:

Common Crawl is a non-profit foundation dedicated to building and maintaining an open crawl of the web, thereby enabling a new wave of innovation, education and research.

Mat gets you started by looking for names in the common crawl data set.
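Mat works from the extracted-text segments; as a warm-up on any plain text you have handy, a deliberately crude "capitalized bigram" name spotter might look like this (my toy heuristic, not Mat's pipeline):

```python
import re
from collections import Counter

def candidate_names(text):
    """Very naive: adjacent capitalized words that aren't common sentence-initial words."""
    stop = {"The", "A", "An", "In", "On", "And", "But", "If", "When"}
    tokens = re.findall(r"[A-Za-z][A-Za-z.'-]*", text)
    for first, second in zip(tokens, tokens[1:]):
        if (first[0].isupper() and second[0].isupper()
                and first not in stop and second not in stop):
            yield f"{first} {second}"

sample = ("Tim Berners-Lee spoke after Patrick Durusau. "
          "The Web changed how Patrick Durusau and others publish.")
print(Counter(candidate_names(sample)).most_common(3))
```

At common crawl scale you would run something like this inside a Hadoop or streaming job over the extracted-text files rather than in a single process.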

Java for graphics cards

Filed under: GPU,Java — Patrick Durusau @ 1:22 pm

Java for graphics cards

From the post:

Phil Pratt-Szeliga, a postgraduate at Syracuse University in New York, has released the source code of his Rootbeer GPU compiler on Github. The developer presented the software at the High Performance Computing and Communication conference in Liverpool in June. The slides from this presentation can be found in the documentation section of the Github directory.

Short summary of Phil Pratt-Szeliga’s GPU compiler.

Is it a waste to have GPU cycles lying around or is there some more fundamental issue at stake?

To what degree does chip architecture drive choices at higher levels of abstraction?

Suggestions of ways to explore that question?

August 18, 2012

Clockwork Raven uses humans to crunch your Big Data

Filed under: BigData,Mechanical Turk,Tweets — Patrick Durusau @ 4:21 pm

Clockwork Raven uses humans to crunch your Big Data (Powered by an army of twits) by Elliot Bentley.

Twitter, the folks with the friendly API ;-), have open sourced Clockwork Raven, a Twitter-based means of uploading small tasks to Mechanical Turk.

You can give users a full topic map editing/ontology creation tool (and train them in its use), or you can ask very precise questions and crunch the output.

Not appropriate for every task but I suspect good enough for a number of them.
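On the "crunch the output" side, the usual first step is to collect several judgments per item and keep the majority answer. A minimal sketch; the CSV column names are assumptions, not Clockwork Raven's actual export format:

```python
import csv
from collections import Counter, defaultdict

def majority_votes(path, item_col="item_id", answer_col="answer"):
    """Collapse multiple worker judgments per item down to the majority answer."""
    votes = defaultdict(Counter)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            votes[row[item_col]][row[answer_col]] += 1
    return {item: counts.most_common(1)[0] for item, counts in votes.items()}

# Hypothetical usage, assuming a results file exported from the task:
# for item, (answer, count) in majority_votes("raven_results.csv").items():
#     print(item, answer, f"({count} votes)")
```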

Creating Your First HTML 5 Web Page [HTML5 – Feature Freeze?]

Filed under: HTML,HTML5,WWW — Patrick Durusau @ 4:06 pm

Creating Your First HTML 5 Web Page by Michael Dorf.

From the post:

Whether you have been writing web pages for a while or you are new to writing HTML, the new HTML 5 elements are still within your reach. It is important to learn how HTML 5 works since there are many new features that will make your pages better and more functional. Once you get your first web page under your belt you will find that they are very easy to put together and you will be on your way to making many more.

To begin, take a look at this base HTML page we will be working with. This is just a plain-ol’ HTML page, but we can start adding HTML5 elements to jazz it up!

But that’s not why I am posting it here. 😉

A little later Michael says:

The new, simple DOCTYPE is much easier to remember and use than previous versions. The W3C is trying to stop versioning HTML so that backwards compatibility will become easier, so there are “technically” no more versions of HTML.

I’m not sure I follow on “…to stop versioning HTML so that backwards compatibility will become easier….”

Unless that means that HTML (5 I assume) is going into a feature/semantic freeze?

That would promote backwards compatibility but I am not sure it is a good solution.

Just curious if you have heard the same?

Comments?

Does Time Fix All? [And my response]

Filed under: Librarian/Expert Searchers,Library,WWW — Patrick Durusau @ 3:51 pm

Does Time Fix All? by Daniel Lemire, starts off:

As a graduate, finding useful references was painful. What the librarians had come up with were terrible time-consuming systems. It took an outsider (Berners-Lee) to invent the Web. Even so, the librarians were slow to adopt the Web and you could often see them warn students against using the Web as part of their research. Some of us ignored them and posted our papers online, or searched for papers online. Many, many years later, we are still a crazy minority but a new generation of librarians has finally adopted the Web.

What do you conclude from this story?

Whenever you point to a difficult systemic problem (e.g., it is time consuming to find references), someone will reply that “time fixes everything”. A more sophisticated way to express this belief is to say that systems are self-correcting.

Here is my response:

From above: “… What the librarians had come up with were terrible time-consuming systems. It took an outsider (Berners-Lee) to invent the Web….”

Really?

You mean the librarians who had been working on digital retrieval since the late 1940’s and subject retrieval longer than that? Those librarians?

With the web, every user repeats the search effort of others. Why isn’t repeating the effort of others a “terrible time-consuming system?”

BTW, Berners-Lee invented allowing 404s for hyperlinks. Significant because it lowered the overhead of hyperlinking enough to be practical. It was other CS types with high-overhead hyperlinking, not librarians.

Berners-Lee fixed hyperlinking maintenance, failed and continues to fail on IR. Or have you not noticed?

I won’t amplify my answer here but will wait to see what happens to my comment at Daniel’s blog.

Thesauri (Vocabularies – TemaTres)

Filed under: Thesaurus,Vocabularies — Patrick Durusau @ 2:37 pm

Thesauri (Vocabularies – TemaTres)

The TemaTres vocabulary server is important but even more so is its collection of one hundred and fifty vocabularies.

Send a note if you export your vocabulary to a topic map. Interested in examples of mappings between vocabularies.

TemaTres: the open source vocabulary server

Filed under: Thesaurus,Vocabularies — Patrick Durusau @ 2:27 pm

TemaTres: the open source vocabulary server

From the webpage:

This is the international site for examples and cases on TemaTres, an open source vocabulary server for managing controlled vocabularies, taxonomies and thesauri.

On this site you can find some resources about tools for knowledge management in digital spaces, TemaTres examples and some hosted vocabularies.

Said to export to:

Skos-Core, Zthes, TopicMap, Dublin Core, MADS, BS8723-5, RSS, SiteMap, txt

Looking at the documentation now.

Separate post coming on vocabularies at this site.

I first saw this at Beyond Search.

August 17, 2012

Forwarding Without Repeating: Efficient Rumor Spreading in Bounded-Degree Graphs

Filed under: Graphs,Messaging,P2P — Patrick Durusau @ 7:34 pm

Forwarding Without Repeating: Efficient Rumor Spreading in Bounded-Degree Graphs by Vincent Gripon, Vitaly Skachek, and Michael Rabbat.

Abstract:

We study a gossip protocol called forwarding without repeating (FWR). The objective is to spread multiple rumors over a graph as efficiently as possible. FWR accomplishes this by having nodes record which messages they have forwarded to each neighbor, so that each message is forwarded at most once to each neighbor. We prove that FWR spreads a rumor over a strongly connected digraph, with high probability, in time which is within a constant factor of optimal for digraphs with bounded out-degree. Moreover, on digraphs with bounded out-degree and bounded number of rumors, the number of transmissions required by FWR is arbitrarily better than that of existing approaches. Specifically, FWR requires O(n) messages on bounded-degree graphs with n nodes, whereas classical forwarding and an approach based on network coding both require ω(n) messages. Our results are obtained using combinatorial and probabilistic arguments. Notably, they do not depend on expansion properties of the underlying graph, and consequently the message complexity of FWR is arbitrarily better than classical forwarding even on constant-degree expander graphs, as n → ∞. In resource-constrained applications, where each transmission consumes battery power and bandwidth, our results suggest that using a small amount of memory at each node leads to a significant savings.

Interesting work that may lead to topic maps in resource-constrained environments.
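A toy, synchronous simulation of the FWR idea, where each node remembers which rumors it has already sent to each neighbor and never repeats one, might look like this (my sketch, not the authors' code; the paper analyzes a randomized protocol):

```python
from collections import defaultdict

def fwr_spread(graph, sources, rounds=10):
    """graph: node -> list of out-neighbors; sources: rumor -> originating node."""
    known = defaultdict(set)        # node -> rumors it has received
    sent = defaultdict(set)         # (node, neighbor) -> rumors already forwarded on that edge
    transmissions = 0
    for rumor, node in sources.items():
        known[node].add(rumor)
    for _ in range(rounds):
        for node in list(graph):
            for rumor in list(known[node]):
                for nbr in graph[node]:
                    if rumor not in sent[(node, nbr)]:   # never repeat a rumor to the same neighbor
                        sent[(node, nbr)].add(rumor)
                        known[nbr].add(rumor)
                        transmissions += 1
    return known, transmissions

ring = {i: [(i + 1) % 6] for i in range(6)}              # a bounded out-degree digraph
known, tx = fwr_spread(ring, {"r1": 0, "r2": 3})
print(tx)                                                # 12: each rumor crosses each edge once
print({n: sorted(r) for n, r in known.items()})
```

The per-edge memory in `sent` is the "small amount of memory at each node" the abstract refers to; it is what caps the transmission count at one per rumor per edge.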

edX – Fall 2012

Filed under: CS Lectures — Patrick Durusau @ 5:02 pm

edX – Fall 2012

Six courses for Fall 2012 have been announced by edX:

I got the email announcement but the foregoing are clean links (no marketing trash tracking (MTT)).

Where to get a BZR tree of the latest MySQL releases

Filed under: MySQL,Oracle — Patrick Durusau @ 4:46 pm

Where to get a BZR tree of the latest MySQL releases by Stewart Smith.

Sometimes, being difficult can develop into its own reward. Not always appreciated when it arrives, but a reward nonetheless.

Akiban Persistit

Filed under: Akiban Persistit — Patrick Durusau @ 4:40 pm

Akiban Persistit

From the webpage:

We have worked hard to make Akiban Persistit™ exceptionally fast, reliable, simple and lightweight. We hope you will enjoy learning more about it and using it.

Akiban Persistit is a key/value data storage library written in Java™. Key features include:

  • support for highly concurrent transaction processing with multi-version concurrency control
  • optimized serialization and deserialization mechanism for Java primitives and objects
  • multi-segment (compound) keys to enable a natural logical key hierarchy
  • support for long records (megabytes)
  • implementation of a persistent SortedMap
  • extensive management capability including command-line and GUI tools

This chapter briefly and informally introduces and demonstrates various Persistit features through examples. Subsequent chapters and the Javadoc API documentation provides a detailed reference guide to the product.

Jack Park sent a URL to a webinar on Persistit the other day.

What is your “transaction” strategy?

Character social networks in movies

Filed under: Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 4:28 pm

Character social networks in movies

Nathan Yau started the weekend early with:

We’ve seen a lot of network charts for Twitter, Facebook, and real people. Screw that. I want to see social networks for movie characters. That’s where Movie Galaxies comes in.

That’s really cool but what if you combined a topic map with a social graph?

So that everyone contributes their vision of the social graph at their office, homeroom, class, etc.

Then the social graphs get merged and can be viewed from different perspectives?
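A toy version of that merge, where a topic-map-style identity mapping says which local names refer to the same person before the graphs are combined (the names and the mapping are invented; assumes networkx is installed):

```python
import networkx as nx

office = nx.Graph([("P. Durusau", "Sam"), ("Sam", "Alex")])
classroom = nx.Graph([("Patrick D.", "Jordan"), ("Jordan", "Alex")])

# Topic-map-ish step: map each local name to a canonical identity first...
canon = {"P. Durusau": "Patrick Durusau", "Patrick D.": "Patrick Durusau"}
office = nx.relabel_nodes(office, canon)
classroom = nx.relabel_nodes(classroom, canon)

# ...so that composing the graphs merges the person, not just the labels.
merged = nx.compose(office, classroom)
print(sorted(merged.nodes()))
print(sorted(merged.edges()))
```

Without the identity mapping you get two disconnected "Patrick" nodes; with it, the office and classroom perspectives attach to the same node and can be filtered back out by edge provenance.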

Marching Hadoop to Windows

Filed under: Excel,Hadoop,Microsoft — Patrick Durusau @ 3:59 pm

Marching Hadoop to Windows

From the post:

Bringing Hadoop to Windows and the two-year development of Hadoop 2.0 are two of the more exciting developments brought up by Hortonworks’s Cofounder and CTO, Eric Baldeschwieler, in a talk before a panel at the Cloud 2012 Conference in Honolulu.

(video omitted)

The panel, which was also attended by Baldeschwieler’s Cloudera counterpart Amr Awadallah, focused on insights into the big data world, a subject Baldeschwieler tackled almost entirely with Hadoop. The eighteen-minute discussion also featured a brief history of Hadoop’s rise to prominence, improvements to be made to Hadoop, and a few tips to enterprising researchers wishing to contribute to Hadoop.

“Bringing Hadoop to Windows,” says Baldeschwieler “turns out to be a very exciting initiative because there are a huge number of users in Windows operating system.” In particular, the Excel spreadsheet program is a popular one for business analysts, something analysts would like to see integrated with Hadoop’s database. That will not be possible until, as Baldeschwieler notes, Windows is integrated into Hadoop later this year, a move that will also considerably expand Hadoop’s reach.

However, that announcement pales in comparison to the possibilities provided by the impending Hadoop 2.0. “Hadoop 2.0 is a pretty major re-write of Hadoop that’s been in the works for two years. It’s now in usable alpha form…The real focus in Hadoop 2.0 is scale and opening it up for more innovation.” Baldeschwieler notes that Hadoop’s rise has been result of what he calls “a happy accident” where it was being developed by his Yahoo team for a specific use case: classifying, sorting, and indexing each of the URLs that were under Yahoo’s scope.

Integration of Excel and Hadoop?

Is that going to be echoes of Unix – The Hole Hawg?

In Maps We Trust

Filed under: Mapping,Maps — Patrick Durusau @ 3:29 pm

In Maps We Trust by James Cheshire.

From the post:

Of all the different types of data visualisation, maps* seem to have the best reputation. I think people are much less likely to trust a pie chart, for example, than a map. In a sense, this is amazing given that all maps are abstractions from reality. They can never tell the whole truth and are nearly all based on data with some degree of uncertainty that will vary over large geographic areas. An extreme interpretation of this view is that all maps are wrong- in which case we shouldn’t bother making them. A more moderate view (and the one I take) is that maps are never perfect so we need to create and use them responsibly – not making them at all would make us worse off. This responsibility criterion is incredibly important because of the high levels of belief people have in maps. You have to ask: What are the consequences of the map you have made? Now that maps are easier than ever to produce, they risk losing their lofty status as some of the most trusted data visualisations if those making them stop asking themselves this tough question.

*here I mean maps that display non-navigational data.

I posted a response over at James's blog:

How do you identify “non-navigational data” in a map?

Your comment made me think of convention and some unconventional maps.

Any data rendered in relationship to other data can be used for “navigation.” Whether I intend to “navigate” as “boots on the ground” or between ideas.

Or to put it another way, who is to say what is or is not “non-navigational data?” The map maker or the reader/user of the map? Or what use is “better” for a map?

Great post!

Patrick

Curious, would you ask: “What are the consequences of the map you have made?”

Lucene.Net becomes top-level project at Apache

Filed under: .Net,C#,Lucene — Patrick Durusau @ 2:41 pm

Lucene.Net becomes top-level project at Apache

From the post:

Lucene.Net, the port of the Lucene search engine library to C# and .NET, has left the Apache incubator and is now a top-level project. The announcement on the project’s blog says that the Apache board voted unanimously to accept the graduation resolution. The vote confirms that Lucene.Net is healthy and that the development and governance of the project follows the tenets of the “Apache way”. The developers will now be moving the project’s resources from the current incubator site to the main apache.org site.

Various flavors of MS Windows account for roughly 80% of desktop operating systems in use.

What is the target for your next topic map app? (With or without Lucene.Net.)

NeoSocial: Connecting to Facebook with Neo4j

Filed under: Facebook,Neo4j,Social Graphs,Social Networks — Patrick Durusau @ 12:29 pm

NeoSocial: Connecting to Facebook with Neo4j by Max De Marzi.

From the post:

(Really cool graphic omitted – see below)

Social applications and Graph Databases go together like peanut butter and jelly. I’m going to walk you through the steps of building an application that connects to Facebook, pulls your friends and likes data and visualizes it. I plan on making a video of me coding it one line at a time, but for now let’s just focus on the main elements.

The application will have two major components:

  1. A web service that handles authentication and displaying of friends, likes, and so-on.
  2. A background job service that imports data from Facebook.

We will be deploying this application on Heroku and making use of the RedisToGo and Neo4j Add-ons.

A very good weekend project for Facebook and Neo4j.

I have a different solution when you have too many friends to make a nice graphic (> 50):

Get a bigger monitor. 😉

A CS Intro Do-Over?

Filed under: CS Lectures — Patrick Durusau @ 11:01 am

Intro Curriculum Update

Robert Harper describes (in part) the CS introduction “do-over” at Carnegie Mellon:

In previous posts I have talked about the new introductory CS curriculum under development at Carnegie Mellon. After a year or so of planning, we began to roll out the new curriculum in the Spring of 2011, and have by now completed the transition. As mentioned previously, the main purpose is to bring the introductory sequence up to date, with particular emphasis on introducing parallelism and verification. A secondary purpose was to restore the focus on computing fundamentals, and correct the drift towards complex application frameworks that offer the students little sense of what is really going on. (The poster child was a star student who admitted that, although she had built a web crawler the previous semester, she in fact has no idea how to build a web crawler.) A particular problem is that what should have been a grounding in the fundamentals of algorithms and data structures turned into an exercise in bureaucratic object-oriented nonsense, swamping the core content with piles of methodology of dubious value to beginning students. (There is a new, separate, upper-division course on oo methodology for students interested in this topic.) A third purpose was to level the playing field, so that students who had learned about programming on the street were equally as challenged, if not more so, than students without much or any such experience. One consequence would be to reduce the concomitant bias against women entering CS, many fewer of whom having prior computing experience than the men.

The solution was a complete do-over, jettisoning the traditional course completely, and starting from scratch. The most important decision was to emphasize functional programming right from the start, and to build on this foundation for teaching data structures and algorithms. Not only does FP provide a much more natural starting point for teaching programming, it is infinitely more amenable to rigorous verification, and provides a natural model for parallel computation. Every student comes to university knowing some algebra, and they are therefore familiar with the idea of computing by calculation (after all, the word algebra derives from the Arabic al jabr, meaning system of calculation). Functional programming is a generalization of algebra, with a richer variety of data structures and a richer set of primitives, so we can build on that foundation. It is critically important that variables in FP are, in fact, mathematical variables, and not some distorted facsimile thereof, so all of their mathematical intuitions are directly applicable. So we can immediately begin discussing verification as a natural part of programming, using principles such as mathematical induction and equational reasoning to guide their thinking. Moreover, there are natural concepts of sequential time complexity, given by the number of steps required to calculate an answer, and parallel time complexity, given by the data dependencies in a computation (often made manifest by the occurrences of variables). These central concepts are introduced in the first week, and amplified throughout the semester.

Competing CS courses present an unprecedented opportunity to compare and contrast teaching of CS materials.

And an opportunity to capture (and map) vocabulary shifts in the discipline.

August 16, 2012

Proximity Operators [LucidWorks]

Filed under: Lucene,LucidWorks,Query Language — Patrick Durusau @ 7:31 pm

Proximity Operators

From the webpage:

A proximity query searches for terms that are either near each other or occur in a specified order in a document rather than simply whether they occur in a document or not.

You will use some of these operators more than others, but having a bookmark to the documentation will prove useful.
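As a reminder of the most common case, standard Lucene query syntax expresses proximity as a slop value on a quoted phrase. A couple of hedged examples; the fields and terms are mine, not from the LucidWorks page:

```python
# Standard Lucene query syntax: "..."~N means the quoted terms may appear up to
# N positions apart (and, with enough slop, out of order).
exact_phrase = '"graph database"'          # terms must be adjacent, in order
near_query = '"graph database"~5'          # terms within 5 positions of each other
titles_only = 'title:"graph database"~5'   # same, restricted to a title field

for q in (exact_phrase, near_query, titles_only):
    print(q)
```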

Get Yourself a Linked Data Piece of WorldCat to Play With

Filed under: Library,RDF,WorldCat — Patrick Durusau @ 7:26 pm

Get Yourself a Linked Data Piece of WorldCat to Play With by Richard Wallis.

From the post:

You may remember my frustration a couple of months ago, at being in the air when OCLC announced the addition of Schema.org marked up Linked Data to all resources in WorldCat.org. Those of you who attended the OCLC Linked Data Round Table at IFLA 2012 in Helsinki yesterday, will know that I got my own back on the folks who publish the press releases at OCLC, by announcing the next WorldCat step along the Linked Data road whilst they were still in bed.

The Round Table was an excellent, very interactive session with Neil Wilson from the British Library, Emmanuelle Bermes from Centre Pompidou, and Martin Malmsten of the National Library of Sweden, which I will cover elsewhere. For now, you will find my presentation Library Linked Data Progress on my SlideShare site.

After we experimentally added RDFa embedded linked data, using Schema.org markup and some proposed Library extensions, to WorldCat pages, one of the questions I was most often asked was: where can I get my hands on some of this raw data?

We are taking the application of linked data to WorldCat one step at a time so that we can learn from how people use and comment on it. So at that time if you wanted to see the raw data the only way was to use a tool [such as the W3C RDFa 1.1 Distiller] to parse the data out of the pages, just as the search engines do.

So I am really pleased to announce that you can now download a significant chunk of that data as RDF triples. Especially in experimental form, providing the whole lot as a download would have been a bit of a challenge, even just in disk space and bandwidth terms. So which chunk to choose was a question. We could have chosen a random selection, but decided instead to pick the most popular, in terms of holdings, resources in WorldCat – an interesting selection in its own right.

To make the cut, a resource had to be held by more than 250 libraries. It turns out that almost 1.2 million fall in to this category, so a sizeable chunk indeed. To get your hands on this data, download the 1Gb gzipped file. It is in RDF n-triples form, so you can take a look at the raw data in the file itself. Better still, download and install a triplestore [such as 4Store], load up the approximately 80 million triples and practice some SPARQL on them.

That’s a nice sized collection of data. In any format.
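If you want a peek before standing up a triplestore, the n-triples dump is line-oriented, so you can stream it straight from the gzip file. A minimal sketch; the local filename is my assumption:

```python
import gzip
from collections import Counter

predicates = Counter()
with gzip.open("worldcat-most-held.nt.gz", "rt", encoding="utf-8") as f:  # hypothetical filename
    for line in f:
        if line.startswith("#"):
            continue
        parts = line.split(None, 2)       # n-triples: subject predicate object .
        if len(parts) == 3:
            predicates[parts[1]] += 1

for pred, count in predicates.most_common(10):
    print(count, pred)
```

A predicate count like this is a quick way to see which Schema.org properties dominate before you commit the ~80 million triples to a store and start writing SPARQL.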

But the next-to-last sentence of the post reads:

As I say in the press release, posted after my announcement, we are really interested to see what people will do with this data.

Déjà vu?

I think I have heard that question asked with other linked data releases. You? Pointers?

I first saw this at SemanticWeb.com.
