Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

December 22, 2011

Opening Up the Domesday Book

Filed under: Census Data,Dataset,Domesday Book,Geographic Data — Patrick Durusau @ 7:38 pm

Opening Up the Domesday Book by Sam Leon.

From the post:

Domesday Book might be one of the most famous government datasets ever created. Which makes it all the stranger that it’s not freely available online – at the National Archives, you have to pay £2 per page to download copies of the text.

Domesday is pretty much unique. It records the ownership of almost every acre of land in England in 1066 and 1086 – a feat not repeated in modern times. It records almost every household. It records the industrial resources of an entire nation, from castles to mills to oxen.

As an event, held in the traumatic aftermath of the Norman conquest, the Domesday inquest scarred itself deeply into the mindset of the nation – and one historian wrote that on his deathbed, William the Conqueror regretted the violence required to complete it. As a historical dataset, it is invaluable and fascinating.

In my spare time, I’ve been working on making Domesday Book available online at Open Domesday. In this, I’ve been greatly aided by the distinguished Domesday scholar Professor John Palmer, and his geocoded dataset of settlements and people in Domesday, created with AHRC funding in the 1990s.

I guess it really is all a matter of perspective. I have never thought of the Domesday Book as a “government dataset….” 😉

Certainly would make an interesting basis for a chronological topic map tracing the ownership and fate of “…almost every acre of land in England….”

Drs. Wood & Seuss Explain RDF in Two Minutes

Filed under: RDF,Semantic Web,Semantics — Patrick Durusau @ 7:38 pm

Drs. Wood & Seuss Explain RDF in Two Minutes by Eric Franzon.

From the post:

“How would you explain RDF to my grandmother? I still don’t get it…” a student recently asked of David Wood, CTO of 3Roundstones. Wood was speaking to a class called “Linked Data Ventures,” made up of students from the MIT Computer Science Department and the Sloan School of Business. He responded by creating a slide deck and subsequent video explaining the Resource Description Framework using the classic Dr. Seuss style of rhyming couplets and the characters Thing 1 and Thing 2.

I hope this student’s grandmother found this as enjoyable as I did. (Video after the jump).

This is a great explanation of RDF. You won’t be authoring RDF after the video, but you will have the basics.
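If you want one step beyond the video, the core idea is small enough to show in a few lines: RDF statements are subject/predicate/object triples. A minimal sketch using the rdflib Python library; the namespace and resource names are made up for illustration:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import FOAF

    EX = Namespace("http://example.org/")   # hypothetical namespace for the example

    g = Graph()
    # "Thing 1 knows Thing 2" as a triple: subject, predicate, object
    g.add((EX.thing1, FOAF.knows, EX.thing2))
    g.add((EX.thing1, FOAF.name, Literal("Thing 1")))

    print(g.serialize(format="turtle"))     # rdflib 6+ returns a string here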

Take this as a goad to come up with something similar for topic maps and other semantic technologies.

Riaknostic: diagnostic tools for Riak

Filed under: Riak,TMCL — Patrick Durusau @ 7:38 pm

Riaknostic: diagnostic tools for Riak

From the webpage:

Overview

Sometimes, things go wrong in Riak. How can you know what’s wrong? Riaknostic is here to help.

(example omitted)

Riaknostic, which is invoked via the above command, is a small suite of diagnostic checks that can be run against your Riak node to discover common problems and recommend how to resolve them. These checks are derived from the experience of the Basho Client Services Team as well as numerous public discussions on the mailing list, IRC room, and other online media.

Two things occur to me:

One, diagnostic checks are a good idea, particularly ones that can be extended by the community. Hopefully the error messages are more helpful than cryptic, but I will have to try it out to find out.

Two, what diagnostics would you write in TMCL as general diagnostics on a topic map? How would you discover what constraints to write as diagnostics in TMCL?
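As a thought experiment, here is a minimal sketch of the kind of extensible check suite I have in mind, in Python rather than Riaknostic’s Erlang, with purely illustrative topic map checks standing in for real TMCL constraints:

    CHECKS = []

    def check(fn):
        """Register a diagnostic check; community extensions just add more functions."""
        CHECKS.append(fn)
        return fn

    @check
    def topics_have_names(topic_map):
        nameless = [t for t in topic_map["topics"] if not t.get("names")]
        return f"{len(nameless)} topics have no name" if nameless else None

    @check
    def associations_have_players(topic_map):
        broken = [a for a in topic_map["associations"] if not a.get("roles")]
        return f"{len(broken)} associations have no role players" if broken else None

    def diagnose(topic_map):
        return [msg for c in CHECKS if (msg := c(topic_map)) is not None]

    print(diagnose({"topics": [{"names": ["Riak"]}, {}], "associations": []}))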

ProgrammableWeb – New APIs

Filed under: Data Source,Mashups — Patrick Durusau @ 6:40 pm

70 New APIs: Google Affiliate Network, Visual Search and Mobile App Sales Tracking by Wendell Santos.

In a post dated 18 December 2011, ProgrammableWeb reports:

This week we had 70 new APIs added to our API directory including an audio fingerprinting service, sentiment analysis and analytics service, affiliate marketing network, mobile app sales tracking service, visual search service and an eCommerce service. In addition we covered a “mobile engagement” platform adding revenue analytics to their service. Below are more details on each of these new APIs.

I have a question: ProgrammableWeb lists 4657 APIs (as of 22 December 2011, about 6:30 PM East Coast time) with seven (7) filters: Keywords, Category, Company, Protocols/Styles, Data Format, Date, and Managed By. How easy/hard is that to use? Care to guess where the break point will come in terms of ease of use?

For example, choosing “government” as a category results in 154 APIs, a very uneven listing that ranges from Leipzig city data to Brazilian election candidate information to words used in the U.S. Congress. Minimal organization by country would be nice.

December 21, 2011

Semantic Web Technologies and Social Searching for Librarians – No Buy

Filed under: Searching,Semantic Web,Social Media — Patrick Durusau @ 7:26 pm

Semantic Web Technologies and Social Searching for Librarians By Robin Fay and Michael Sauers.

I don’t remember recommending a no buy on any book on this blog, particularly one I haven’t read, but there is a first time for everything.

Yes, I haven’t read the book because it isn’t available yet.

How do I know to recommend no buy on Robin Fay and Michael Sauers’ “Semantic Web Technologies and Social Searching for Librarians”?

Let’s look at the evidence, starting with the overview:

There are trillions of bytes of information within the web, all of it driven by behind-the-scenes data. Vast quantities of information make it hard to find what’s really important. Here’s a practical guide to the future of web-based technology, especially search. It provides the knowledge and skills necessary to implement semantic web technology. You’ll learn how to start and track trends using social media, find hidden content online, and search for reusable online content, crucial skills for those looking to be better searchers. The authors explain how to explore data and statistics through WolframAlpha, create searchable metadata in Flickr, and give meaning to data and information on the web with Google’s Rich Snippets. Let Robin Fay and Michael Sauers show you how to use tools that will awe your users with your new searching skills.

So, having read this book, you will know:

  • the future of web-based technology, especially search
  • [the] knowledge and skills necessary to implement semantic web technology
  • [how to] start and track trends using social media
  • [how to] find hidden content online
  • [how to] search for reusable online content
  • [how to] explore data and statistics through WolframAlpha
  • [how to] create searchable metadata in Flickr
  • [how to] give meaning to data and information on the web with Google’s Rich Snippets

The other facts you need to consider?

6 x 9 | 125 pp. | $59.95

So, in 125 pages, call it 105, allowing for title page, table of contents and some sort of index, you are going to learn all those skills?

For about the same amount of money, you can get a copy of Modern Information Retrieval: The Concepts and Technology Behind Search by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, which covers only search in 944 pages.

I read a lot of discussion about teaching students to critically evaluate information that they read on the WWW.

Any institution that buys this book needs to implement critical evaluation of information training for its staff/faculty.

Lily 1.1 is out!

Filed under: Lily,NoSQL,Solr — Patrick Durusau @ 7:24 pm

Lily 1.1 is out

There is a lot to see here but I wanted to call your attention to:

Lily adds a high-level data model on top of HBase. Originally, the model was a simple list of fields stored within records, but we added some field types making that model a whole lot more interesting. A first addition is the RECORD value type. You can now store records inside records, which is useful to store structured data in fields. For indexing purposes, you can address sub-record data as if it were linked records, using dereferencing.

Is it just me or does it seem like a lot of software is being released just before the holidays? 😉

From the post:

Complex Field Types

Lily adds a high-level data model on top of HBase. Originally, the model was a simple list of fields stored within records, but we added some field types making that model a whole lot more interesting. A first addition is the RECORD value type. You can now store records inside records, which is useful to store structured data in fields. For indexing purposes, you can address sub-record data as if it were linked records, using dereferencing.

Two other cool new value types are LIST and PATH, which allow for far more flexible modeling than the previous multi-value and hierarchy field properties. At the schema level, we adopted a generics style of defining value types, for instance LIST<LIST<STRING>> defines a field that will contain a list of lists of strings. Finally, we also added a BYTEARRAY value type for raw data storage.

Conditional updates

If you’re familiar with multi-user environments you sure know about the problem of concurrent updates. For these situations, Lily now provides a lock-free, optimistic concurrency control feature we call conditional updates. The update and delete methods allow one to add a list of mutation conditions that need to be satisfied before the update or delete will be applied.

For concurrency control, you can require that the value of a field needs to be the same as when the record was read before the update.
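The mechanics are easiest to see in a toy example. This is not Lily’s Java API, just a sketch of the general conditional-update idea; the record store, field names and version counter are all invented for illustration:

    class ConcurrentUpdateError(Exception):
        pass

    store = {"rec1": {"title": "Lily 1.0", "_version": 7}}   # invented record store

    def conditional_update(record_id, new_fields, conditions):
        record = store[record_id]
        # apply the update only if every mutation condition still holds
        for field, expected in conditions.items():
            if record.get(field) != expected:
                raise ConcurrentUpdateError(f"{field} changed since it was read")
        record.update(new_fields)
        record["_version"] += 1
        return record

    # succeeds: the title is still what this client read earlier
    conditional_update("rec1", {"title": "Lily 1.1"}, {"title": "Lily 1.0"})

    # would fail now: another update already changed the title
    # conditional_update("rec1", {"title": "Lily 2.0"}, {"title": "Lily 1.0"})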

Test framework

Lily 1.1 ships with a toolchest for Java developers that want to run unit tests against an HBase/Lily application stack. The stack can be launched embedded or externally, with simple scripts straight out of the Lily distribution. You can also request a ‘state reset’, clearing a single node instance of Lily for subsequent test runs. Yes, you can now run Lily, HBase, Zookeeper, HDFS, Map/Reduce and Solr in a single VM, with a single command.

Server-side plugins

For the fearless Lily repository hacker, we offer two hooks to expand functionality of the Lily server process. There are decorators, which can intercept any CRUD operation for pre- or post-execution of side-effect operations (like modifying a field value before actually committing it).

Rowlog sharding

The global rowlog queue is now distributed across a pre-split table, with inserts and deletes going to several region servers. This will lead to superior performance on write- or update-heavy multi-node cluster setups.

API improvements

Our first customers (*waves to our French friends*) found our API to be a tad too verbose and suggested a Builder pattern approach. We listened and unveil a totally new (but optional) method-chaining Builder API for the Java API users.
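For readers unfamiliar with the pattern, method chaining simply means every setter returns the builder itself. A generic sketch of the idea, not Lily’s actual Builder API:

    class RecordBuilder:
        def __init__(self):
            self._fields = {}

        def field(self, name, value):
            self._fields[name] = value
            return self              # returning self is what enables chaining

        def build(self):
            return dict(self._fields)

    record = (RecordBuilder()
              .field("title", "Lily in Action")
              .field("tags", ["hbase", "search"])
              .build())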

Whirr-based cluster installer

For Lily Enterprise customers, we rewrote our cluster installer using Apache Whirr, being one of the first serious adopters of this exciting Cloud- and cluster management tool. Using this, installing Lily on many nodes becomes a breeze. Here’s a short movie showing off the new installer.

Performance

Thanks to better parallelization, Lily has become considerably faster. You can now comfortably throw more clients at one Lily cluster and see combined throughput scale fast.

All in all, Lily 1.1 was a great release to prepare. We hope you have as much fun using Lily 1.1 as we had building it. Check it out here: www.lilyproject.org.

Three New Splunk Developer Platform Offerings

Filed under: Java,Javascript,Python,Splunk — Patrick Durusau @ 7:24 pm

Three New Splunk Developer Platform Offerings

From the post:

Last week was a busy week for the Splunk developer platform team. We pushed live 2 SDKs within one hour! We are excited to announce the release of:

  • Java SDK Preview on GitHub. The Java SDK enables our growing base of customers to share and harness the core Splunk platform and the valuable data stored in Splunk across the enterprise. The SDK ships with a number of examples including an explorer utility that provides the ability to explore the components and configuration settings of a Splunk installation. Learn more about the Java SDK.
  • JavaScript SDK Preview on GitHub The JavaScript SDK takes big data to the web by providing developers with the ability to easily integrate visualizations into custom applications. Now developers can take the timeline view and charting capabilities of Splunk’s out-of-the-box web interface and include them in their custom applications. Additionally, with node.js support on the server side, developers can build end-to-end applications completely in JavaScript. Learn more about the JavaScript SDK.
  • Splunk Developer AMI. A developer-focused publicly available Linux Amazon Machine Image (AMI) that includes all the Splunk SDKs and Splunk 4.2.5. The Splunk Developer AMI will make it easier for developers to try the Splunk platform. To enhance the usability of the image, developers can sign up for a free developer license trial, which can be used with the AMI. Read our blog post to learn more about the developer AMI.

The delivery of the Java and JavaScript SDKs, coupled with our existing Python SDK (GitHub), reinforces our commitment to developer enablement by providing more language choice for application development, and putting the SDKs on the Splunk Linux AMI expedites the getting-started experience.

We are seeing tremendous interest in our developer community and customer base for Splunk to play a central role facilitating the ability to build innovative applications on top of a variety of data stores that span on-premises, cloud and mobile.

We are enabling developers to build complex Big Data applications for a variety of scenarios including:

  • Custom built visualizations
  • Reporting tool integrations
  • Big Data and relational database integrations
  • Complex event processing

Not to mention being just in time for the holidays! 😉

Seriously, tools to do useful work with “big data” are coming online. The question is going to be the skill with which they are applied.

Thoughts on ICDM (the IEEE conference on Data Mining)

Filed under: Data Mining,Graphs,Social Networks — Patrick Durusau @ 7:24 pm

Thoughts on ICDM I: Negative Results (part A) by Suresh Venkatasubramanian.

From (part A):

I just got back from ICDM (the IEEE conference on Data Mining). Data mining conferences are quite different from theory conferences (and much more similar to ML or DB conferences): there are numerous satellite events (workshops, tutorials and panels in this case), many more people (551 for ICDM, and that’s on the smaller side), and a wide variety of papers that range from SODA-ish results to user studies and industrial case studies.

While your typical data mining paper is still a string of techniques cobbled together without rhyme or reason (anyone for spectral manifold-based correlation clustering with outliers using MapReduce?), there are some general themes that might be of interest to an outside viewer. What I’d like to highlight here is a trend (that I hope grows) in negative results.

It’s not particularly hard to invent a new method for doing data mining. It’s much harder to show why certain methods will fail, or why certain models don’t make sense. But in my view, the latter is exactly what the field needs in order to give it a strong inferential foundation to build on (I’ll note here that I’m talking specifically about data mining, NOT machine learning – the difference between the two is left for another post).

From (part B):

Continuing where I left off on the idea of negative results in data mining, there was a beautiful paper at ICDM 2011 on the use of Stochastic Kronecker graphs to model social networks. And in this case, the key result of the paper came from theory, so stay tuned!

One of the problems that bedevils research in social networking is the lack of good graph models. Ideally, one would like a random graph model that evolves into structures that look like social networks. Having such a graph model is nice because

  • you can target your algorithms to graphs that look like this, hopefully making them more efficient
  • You can re-express an actual social network as a set of parameters to a graph model: it compacts the graph, and also gives you a better way of understanding different kinds of social networks: Twitter is a (0.8, 1, 2.5) and Facebook is a (1, 0.1, 0.5), and so on.
  • If you’re lucky, the model describes not just reality, but how it forms. In other words, the model captures the actual social processes that lead to the formation of a social network. This last one is of great interest to sociologists.

But there aren’t even good graph models that capture known properties of social networks. For example, the classic Erdos-Renyi (ER) model of a random graph doesn’t have the heavy-tailed degree distribution that’s common in social networks. It also doesn’t have a property that’s common to large social networks: densification, or the fact that even as the network grows, the diameter stays small (implying that the network seems to get denser over time).

Part C – forthcoming –

I am perhaps more sceptical of modeling than the author but this is a very readable and interesting set of blog posts. I will be posting Part C as soon as it appears.
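The point about degree distributions in part B is easy to check empirically. A minimal sketch with networkx, comparing an Erdos-Renyi graph to a preferential-attachment (Barabasi-Albert) graph, which does produce the heavy tail; the graph sizes and parameters here are arbitrary:

    import networkx as nx

    n = 5000
    er = nx.erdos_renyi_graph(n, p=4 / n)     # ER graph with mean degree ~4
    ba = nx.barabasi_albert_graph(n, m=2)     # preferential attachment, heavy-tailed

    def top_degrees(g, k=5):
        return sorted((d for _, d in g.degree()), reverse=True)[:k]

    print("ER top degrees:", top_degrees(er))   # all close to the mean
    print("BA top degrees:", top_degrees(ba))   # a few hubs, far above the mean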

Update: Thoughts on ICDM I: Negative results (part C)

From Part C:

If you come up with a better way of doing classification (for now let’s just consider classification, but these remarks apply to clustering and other tasks as well), you have to compare it to prior methods to see which works better. (Note: this is a tricky problem in clustering that my student Parasaran Raman has been working on; more on that later.)

The obvious way to compare two classification methods is how well they do compared to some ground truth (i.e., labelled data), but this is a one-parameter system, because by changing the threshold of the classifier (or if you like, translating the hyperplane around), you can change the false positive and false negative rates.

Now the more smug folks reading this are waiting with ‘ROC’ and ‘AUC’ at the tip of their tongues, and they’d be right! You can plot a curve of the false positive vs false negative rate and take the area under the curve (AUC) as a measure of the effectiveness of the classifier.

For example, if the y-axis measured increasing false negatives, and the x-axis measured increasing false positives, you’d want a curve that looked like an L with the apex at the origin, and a random classifier would look like the line x+y = 1. The AUC score would be zero for the good classifier and 0.5 for the bad one (there are ways of scaling this to be between 0 and 1).

The AUC is a popular way of comparing methods in order to balance the different error rates. It’s also attractive because it’s parameter-free and is objective: seemingly providing a neutral method for comparing classifiers independent of data sets, cost measures and so on.

But is it?
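Note that scikit-learn and most libraries use the equivalent true-positive-rate vs false-positive-rate convention, where a good classifier scores near 1 rather than near 0. A minimal sketch of the comparison the post describes, with synthetic labels and scores invented for illustration:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=2000)

    random_scores = rng.random(2000)                         # no signal at all
    good_scores = labels + 0.3 * rng.standard_normal(2000)   # noisy but informative

    print(roc_auc_score(labels, random_scores))   # ~0.5, the x + y = 1 diagonal
    print(roc_auc_score(labels, good_scores))     # close to 1.0, the "L"-shaped curve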

Self-Index based on LZ77 (thesis)

Filed under: Indexing,LZ77 — Patrick Durusau @ 7:23 pm

Self-Index based on LZ77 (thesis) by Sebastian Kreft, Gonzalo Navarro (advisor).

Abstract:

Domains like bioinformatics, version control systems, collaborative editing systems (wiki), and others, are producing huge data collections that are very repetitive. That is, there are few differences between the elements of the collection. This fact makes the compressibility of the collection extremely high. For example, a collection with all different versions of a Wikipedia article can be compressed to as little as 0.1% of its original space, using the Lempel-Ziv 1977 (LZ77) compression scheme.

Many of these repetitive collections handle huge amounts of text data. For that reason, we require a method to store them efficiently, while providing the ability to operate on them. The most common operations are the extraction of random portions of the collection and the search for all the occurrences of a given pattern inside the whole collection.

A self-index is a data structure that stores a text in compressed form and allows one to find the occurrences of a pattern efficiently. In addition, self-indexes can extract any substring of the collection, hence they are able to replace the original text. One of the main goals when using these indexes is to store them within main memory.

In this thesis we present a scheme for random text extraction from text compressed with a Lempel-Ziv parsing. Additionally, we present a variant of LZ77, called LZ-End, that efficiently extracts text using space close to that of LZ77.

The main contribution of this thesis is the first self-index based on LZ77/LZ-End and oriented to repetitive texts, which outperforms the state of the art (the RLCSA self-index) in many aspects. Finally, we present a corpus of repetitive texts, coming from several application domains. We aim at providing a standard set of texts for research and experimentation, hence this corpus is publicly available.

Despite the world economic woes and instability in a number of places, something comes along that makes my day! This will take a while to read but looks quite promising.
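To see why repetitive collections compress so well under LZ77, here is a naive (quadratic-time) sketch of the greedy factorization; a real implementation would use suffix structures, but the phrase count tells the story:

    def lz77_factorize(text):
        """Greedy LZ77 parse: each phrase is the longest earlier-occurring
        prefix of the remaining text, plus one fresh character."""
        phrases, i, n = [], 0, len(text)
        while i < n:
            best_len, best_pos = 0, -1
            for j in range(i):                      # naive search for the longest match
                length = 0
                while i + length < n and text[j + length] == text[i + length]:
                    length += 1
                if length > best_len:
                    best_len, best_pos = length, j
            next_char = text[i + best_len] if i + best_len < n else ""
            phrases.append((best_pos, best_len, next_char))
            i += best_len + 1
        return phrases

    version = "the cat sat on the mat. "
    collection = version * 50 + version.replace("cat", "dog")   # 51 near-identical "versions"
    print(len(collection), "characters ->", len(lz77_factorize(collection)), "phrases")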

Opaque Attribute Alignment

Filed under: Mapping,Ontology — Patrick Durusau @ 7:22 pm

Opaque Attribute Alignment by Jennifer Sleeman, Rafael Alonso, Hua Li, Art Pope, and Antonio Badia.

Abstract:

Ontology alignment describes a process of mapping ontological concepts, classes and attributes between different ontologies providing a way to achieve interoperability. While there has been considerable research in this area, most approaches that rely upon the alignment of attributes use label based string comparisons of property names. The ability to process opaque or non-interpreted attribute names is a necessary component of attribute alignment. We describe a new attribute alignment approach to support ontology alignment that uses density estimation as a means for determining alignment among objects. Using the combination of similarity hashing, Kernel Density Estimation (KDE) and cross entropy, we are able to show promising F-Measure scores using the standard Ontology Alignment Evaluation Initiative (OAEI) 2011 benchmark.

Just in case you run across different ontologies covering the same area, however unlikely that seems 10+ years after the appearance of the Semantic Web.
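The intuition is that two attributes with opaque names may still be the same attribute if their value distributions look alike. This is not the authors’ algorithm (which combines similarity hashing, KDE and cross entropy over the OAEI benchmark), only a toy illustration of the KDE-plus-cross-entropy piece on invented data:

    import numpy as np
    from scipy.stats import gaussian_kde

    rng = np.random.default_rng(1)
    attr_a = rng.normal(35, 10, size=500)      # values of, say, "attr_17" in ontology A
    attr_b = rng.normal(36, 11, size=400)      # values of "x_age" in ontology B
    attr_c = rng.exponential(30, size=450)     # values of "x_weight" in ontology B

    grid = np.linspace(0, 100, 200)

    def density(samples):
        d = gaussian_kde(samples)(grid)
        return d / d.sum()                     # normalize to a discrete distribution

    def cross_entropy(p, q, eps=1e-12):
        return -np.sum(p * np.log(q + eps))

    p = density(attr_a)
    print(cross_entropy(p, density(attr_b)))   # low: the distributions match well
    print(cross_entropy(p, density(attr_c)))   # higher: probably not the same attribute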

UseR! 2011 slides and videos – on one page

Filed under: Conferences,R,Statistics — Patrick Durusau @ 7:21 pm

UseR! 2011 slides and videos – on one page

From the post:

I was recently reminded that the wonderful team at Warwick University made sure to put online many of the slides (and some videos) of talks from the recent useR 2011 conference. You can browse through the talks by going between the timetables (where it will be the most updated, if more slides are added later), but I thought it might be more convenient for some of you to have the links to all the talks (with slides/videos) in one place.

I am grateful for all of the wonderful people who put their time into making such an amazing event (organizers, speakers, attendees), and also for the many speakers who made sure to share their talks/slides online for all of us to reference. I hope to see this open-slides trend continue in upcoming useR conferences…

Just in case you get a new R book over the holidays or even if you don’t, this is an amazing set of presentations. From business forecasting and medical imaging to social networks and modeling galaxies, something for everyone.

This looks like a very entertaining conference. Will watch for the announcement of next year’s conference.

MIT launches online learning initiative

Filed under: Education — Patrick Durusau @ 7:21 pm

MIT launches online learning initiative

From the post:

MIT today announced the launch of an online learning initiative internally called “MITx.” MITx will offer a portfolio of MIT courses through an online interactive learning platform that will:

  • organize and present course material to enable students to learn at their own pace
  • feature interactivity, online laboratories and student-to-student communication
  • allow for the individual assessment of any student’s work and allow students who demonstrate their mastery of subjects to earn a certificate of completion awarded by MITx
  • operate on an open-source, scalable software infrastructure in order to make it continuously improving and readily available to other educational institutions.

MIT expects that this learning platform will enhance the educational experience of its on-campus students, offering them online tools that supplement and enrich their classroom and laboratory experiences. MIT also expects that MITx will eventually host a virtual community of millions of learners around the world.

You may also be interested in What is MITx?, an FAQ that accompanied the press release.

It would be interesting to see the framework they release used to host short courses/training on Lucene, Hadoop, R, bigdata(R), topic maps, etc.

Reusable TokenStreams

Filed under: Lucene,Text Analytics — Patrick Durusau @ 7:21 pm

Reusable TokenStreams by Chris Male.

Abstract:

This white paper covers how Lucene’s text analysis system works today, providing an understanding of what a TokenStream is, what the differences between Analyzers, TokenFilters and Tokenizers are, and how reuse impacts the design and implementation of each of these components.

Useful treatment of Lucene’s text analysis features, which are still developing; more changes are promised (but left rather vague) for the future.

One feature of particular interest is the ability to associate geographic location data with terms deemed to represent locations.

Occurs to me that such a feature could also be used to annotate terms during text analysis to associate subject identifiers with those terms.

An application doesn’t have to “understand” that terms have different meanings so long as it can distinguish one from another based on annotations. (Or map them together despite different identifiers.)
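A toy sketch of that idea, attaching subject identifiers to tokens as they flow through an analysis chain. This is plain Python standing in for Lucene’s TokenFilter mechanism (not Lucene’s API), and the identifiers are made-up examples:

    SUBJECT_IDENTIFIERS = {                     # hypothetical subject identifiers
        "paris": "http://example.org/subject/paris-france",
        "lucene": "http://example.org/subject/apache-lucene",
    }

    def tokenize(text):
        for term in text.lower().split():
            yield {"term": term}

    def subject_identifier_filter(tokens):
        for token in tokens:
            psi = SUBJECT_IDENTIFIERS.get(token["term"])
            if psi:
                token["subject_identifier"] = psi   # annotation rides along with the token
            yield token

    for tok in subject_identifier_filter(tokenize("Lucene indexes documents about Paris")):
        print(tok)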

December 20, 2011

The R Journal, volume 3/2

Filed under: R — Patrick Durusau @ 8:28 pm

The R Journal, volume 3/2 (PDF file)

The R Journal homepage.

How Twitter Stores 250 Million Tweets a Day Using MySQL

Filed under: Design,MySQL — Patrick Durusau @ 8:27 pm

How Twitter Stores 250 Million Tweets a Day Using MySQL

From the post:

Jeremy Cole, a DBA Team Lead/Database Architect at Twitter, gave a really good talk at the O’Reilly MySQL conference: Big and Small Data at @Twitter, where the topic was thinking of Twitter from the data perspective.

One of the interesting stories he told was of the transition from Twitter’s old way of storing tweets using temporal sharding, to a more distributed approach using a new tweet store called T-bird, which is built on top of Gizzard, which is built using MySQL.

OK, so your Christmas wish wasn’t for a topic map with quite that level of input every day. 😉 You can still learn something about design of a robust architecture from this presentation.
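The core contrast in the talk is temporal sharding (all tweets from a time period on the same shard, so the newest shard takes all the write load) versus spreading writes by id. A toy sketch of the difference, not Twitter’s or Gizzard’s actual scheme:

    import hashlib
    from datetime import datetime

    SHARDS = 16

    def temporal_shard(created_at: datetime) -> str:
        # old approach: every write in a given month lands on the same shard
        return f"tweets_{created_at:%Y_%m}"

    def id_shard(tweet_id: int) -> int:
        # distributed approach: hash the id so writes spread across shards
        return int(hashlib.md5(str(tweet_id).encode()).hexdigest(), 16) % SHARDS

    print(temporal_shard(datetime(2011, 12, 20)))      # every December 2011 tweet -> one shard
    print(id_shard(123456789), id_shard(123456790))    # neighbouring ids scatter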

Lucene today, tomorrow and beyond

Filed under: Lucene — Patrick Durusau @ 8:26 pm

Lucene today, tomorrow and beyond

Presentation by Simon Willnauer, mostly about what Lucene doesn’t do, or doesn’t do well, today. He offers suggestions for possible evolution of Lucene, but the direction depends on the community. Exciting times look like they are going to continue!

Neo4j 1.6 M02 “Jörn Kniv” Brings Heroku Support

Filed under: Heroku,Neo4j — Patrick Durusau @ 8:25 pm

Neo4j 1.6 M02 “Jörn Kniv” Brings Heroku Support

From the post:

We have another milestone for you – 1.6.M02. As I’ve written before, we’re heavily into improving our infrastructure – our build, stress testing etc. But we have more: Faster and better Cypher and open beta on Heroku!

Heroku Public Beta

Our private beta on Heroku was going along just fine. We were getting positive feedback, tweaking provisioning and monitoring, and starting to feel comfortable about stepping into the cloud. Then Peter Neubauer showcased a great demo last week on how to get up and running on Heroku with Neo4j, topping it off with a Google Spreadsheets front-end. This clever hackery was even featured in the Heroku December newsletter.

It seemed time to finally allow other people to join the fun. So, we’re pleased to announce that the Neo4j Add-on is now in public beta on Heroku.

Way cool! Send a note of appreciation to the Neo4j team!

PURDUE Machine Learning Summer School 2011

Filed under: Machine Learning — Patrick Durusau @ 8:25 pm

PURDUE Machine Learning Summer School 2011

The coverage of the summer school is very impressive. The lecture titles and presenters were:

  • Machine Learning for Statistical Genetics by Karsten Borgwardt
  • Large-scale Machine Learning and Stochastic Algorithms by Leon Bottou
  • Divide and Recombine (D&R) for the Analysis of Big Data by William S. Cleveland
  • Privacy Issues with Machine Learning: Fears, Facts, and Opportunities by Chris Clifton
  • The MASH project. An open platform for the collaborative development of feature extractors by Francois Fleuret
  • Techniques for Massive-Data Machine Learning, with Application to Astronomy by Alex Gray
  • Mining Heterogeneous Information Networks by Jiawei Han
  • Machine Learning for a Rainy Day by Sergey Kirshner
  • Machine Learning for Discovery in Legal Cases by David D. Lewis
  • Classic and Modern Data Clustering by Marina Meilă
  • Modeling Complex Social Networks: Challenges and Opportunities for Statistical Learning and Inference by Jennifer Neville
  • Using Heat for Shape Understanding and Retrieval by Karthik Ramani
  • Learning Rhythm from Live Music by Christopher Raphael
  • Introduction to supervised, unsupervised and partially-supervised training algorithms by Dale Schuurmans
  • A Machine Learning Approach for Complex Information Retrieval Applications by Luo Si
  • A Short Course on Reinforcement Learning by Satinder Singh Baveja
  • Graphical Models for the Internet by Alexander Smola
  • Optimization for Machine Learning by S V N Vishwanathan
  • Survey of Boosting from an Optimization Perspective by Manfred K. Warmuth

Now that would be a summer school to remember!

Standard Measures in Genomic Studies

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:25 pm

Standard Measures in Genomic Studies

This news story caught my eye and leads off simply enough:

Standards can make our lives better. We have standards for manufacturing many items — from car parts to nuts and bolts — that improve the reliability and compatibility of all sorts of widgets that we use in our daily lives. Without them, many tasks would be difficult, a bit like trying to fit a square peg into a round hole. There are even standard measures to collect information from participants of large population genomic studies that can be downloaded for free from the Consensus Measures for Phenotype and eXposures (PhenX) Toolkit [phenxtoolkit.org]. However, researchers will only adopt such standard measures if they can be used easily.

That is why the NHGRI’s Office of Population Genomics has launched a new effort called the PhenX Real-world Implementation and Sharing (PhenX RISING) program. The National Human Genome Research Institute (NHGRI) has awarded nearly $900,000, with an additional $100,000 from NIH Office of Behavioral and Social Sciences Research (OBSSR), to seven investigators to use and evaluate the standards. Each investigator will incorporate a variety of PhenX measures into their ongoing genome-wide association or large population study. These researchers will also make recommendations as to how to fine-tune the PhenX Toolkit.

OK, good for them, or at least the researchers who get the grants, but what does that have to do with topic maps?

Just a bit further the announcement says:

GWAS have identified more than a thousand associations between genetic variants and common diseases such as cancer and heart disease, but the majority of the studies do not share standard measures. PhenX standard measures are important because they allow researchers to more easily combine data from different studies to see if there are overlapping genetic factors between or among different diseases. This ability will improve researchers’ understanding of disease and may eventually be used to assess a patient’s genetic risk of getting a disease such as diabetes or cancer and to customize treatment.

OK, so there are existing studies that don’t share standard measures, there will be more studies while the PhenX RISING program goes on that don’t share standard measures and there may be future studies while PhenX RISING is being adjusted that don’t share standard measures.

Depending upon the nature of the measures that are not shared and the importance of mapping between these non-shared standards, this sounds like fertile ground for topic map prospecting.

Talking Glossary of Genetic Terms

Filed under: Bioinformatics — Patrick Durusau @ 8:23 pm

Talking Glossary of Genetic Terms

From the webpage:

The National Human Genome Research Institute (NHGRI) created the Talking Glossary of Genetic Terms to help everyone understand the terms and concepts used in genetic research. In addition to definitions, specialists in the field of genetics share their descriptions of terms, and many terms include images, animation and links to related terms.

Getting Started:

Enter a search term or explore the list of terms by selecting a letter from the alphabet on the left and then select from the terms revealed. (A text-only version is available from here.)

The Talking Glossary

At the bottom of most pages in the Talking Glossary are links to help you get the most out of this glossary.

Linked information explains how to cite a term from the Glossary in a reference paper. Another link allows you to suggest a term currently not in the glossary that you feel would be a valuable addition. And there is a link to email any of the 200+ terms to a friend.

Useful resource, particularly the links to additional information.

The Mapping Dilemma

Filed under: Clojure — Patrick Durusau @ 8:23 pm

The Mapping Dilemma by David Nolen.

Description:

Almost all problems of Computer Science can be summarized as being some form of mapping dilemma, whether that means taking the results of proof theory and constructing an expressive type system or taking a graphical user experience and constructing a scalable model-view-controller architecture. Finding solutions to these challenges can be incredibly rewarding, yet as implementations solidify it often turns out we didn’t solve the mapping dilemma at all! The beautiful generality of the abstract idea becomes lost in the brittle specificity of concrete implementations.

In this talk we’ll discuss a library that attempts to solve the mapping dilemma – an optimizing pattern match compiler in the spirit of OCaml and Haskell targeting the Clojure programming language. The library attempts to provide the exact same general abstraction whether the user wishes to pattern match persistent data structures or the individual bits in a byte.

This controversial talk critiques our fundamental choice of tools, programming languages, and software methodologies from the perspective of how well they help us solve mapping dilemmas.

Not what I usually think of as the “mapping dilemma” but an interesting and possibly quite useful presentation nonetheless.
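The library Nolen describes is core.match for Clojure. For readers who have never met pattern matching on data structures, here is a rough flavor of the idea using Python 3.10+ structural pattern matching rather than Clojure; the data shapes are invented:

    def describe(node):
        match node:
            case {"type": "person", "name": str(name)}:
                return f"a person named {name}"
            case [x, y, *rest]:
                return f"a sequence starting {x}, {y} with {len(rest)} more items"
            case int() | float() as n if n < 0:
                return f"a negative number: {n}"
            case _:
                return "something else"

    print(describe({"type": "person", "name": "David"}))
    print(describe([1, 2, 3, 4, 5]))
    print(describe(-7))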

bigdata®

Filed under: bigdata®,NoSQL — Patrick Durusau @ 8:23 pm

bigdata®

Bryan Thompson, one of the creators of bigdata(R), was a member of the effort that resulted in the XTM syntax for topic maps.

If Bryan says it scales, it scales.

What I did not see was the ability to document mappings between data as representing the same subjects. Or the ability to query such mappings. Still, on further digging I may uncover something that works that way.

From the webpage:

This is a major version release of bigdata(R). Bigdata is a horizontally-scaled, open-source architecture for indexed data with an emphasis on RDF capable of loading 1B triples in under one hour on a 15 node cluster. Bigdata operates in both a single machine mode (Journal) and a cluster mode (Federation). The Journal provides fast scalable ACID indexed storage for very large data sets, up to 50 billion triples / quads. The federation provides fast scalable shard-wise parallel indexed storage using dynamic sharding and shard-wise ACID updates and incremental cluster size growth. Both platforms support fully concurrent readers with snapshot isolation.

Distributed processing offers greater throughput but does not reduce query or update latency. Choose the Journal when the anticipated scale and throughput requirements permit. Choose the Federation when the administrative and machine overhead associated with operating a cluster is an acceptable tradeoff to have essentially unlimited data scaling and throughput.

See [1,2,8] for instructions on installing bigdata(R), [4] for the javadoc, and [3,5,6] for news, questions, and the latest developments. For more information about SYSTAP, LLC and bigdata, see [7].

Starting with the 1.0.0 release, we offer a WAR artifact [8] for easy installation of the single machine RDF database. For custom development and cluster installations we recommend checking out the code from SVN using the tag for this release. The code will build automatically under eclipse. You can also build the code using the ant script. The cluster installer requires the use of the ant script.

You can download the WAR from:

http://sourceforge.net/projects/bigdata/

You can checkout this release from:

https://bigdata.svn.sourceforge.net/svnroot/bigdata/tags/BIGDATA_RELEASE_1_1_0

New features:

  • Fast, scalable native support for SPARQL 1.1 analytic queries;
  • 100% Java memory manager leverages the JVM native heap (no GC);
  • New extensible hash tree index structure.

Feature summary:

  • Single machine data storage to ~50B triples/quads (RWStore);

  • Clustered data storage is essentially unlimited;
  • Simple embedded and/or webapp deployment (NanoSparqlServer);
  • Triples, quads, or triples with provenance (SIDs);
  • Fast 100% native SPARQL 1.0 evaluation;
  • Integrated “analytic” query package;
  • Fast RDFS+ inference and truth maintenance;
  • Fast statement level provenance mode (SIDs).

Road map [3]:

  • Simplified deployment, configuration, and administration for clusters; and
  • High availability for the journal and the cluster.

(footnotes omitted)

PS: Jack Park brought this to my attention. I will have to download and play with it over the holidays.

Extreme Cleverness: Functional Data Structures in Scala

Filed under: Data Structures,Functional Programming,Scala — Patrick Durusau @ 8:22 pm

Extreme Cleverness: Functional Data Structures in Scala

From the description:

Daniel Spiewak shows how to create immutable data structures that support structural sharing, such as the singly-linked list, Banker’s Queue, 2-3 Finger Tree, Red-Black Tree, Patricia Trie, and Bitmapped Vector Trie.

Every now and again I see a presentation that is head and shoulders above even very good presentations. This is one of those.

The coverage of the Bitmapped Vector Trie merits your close attention. Amazing performance characteristics.
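If “structural sharing” is new to you, the simplest structure in the talk already shows it: prepending to an immutable linked list reuses the old list as its tail instead of copying it. A minimal Python sketch of the idea (the tries in the talk apply the same trick to wide branching nodes):

    from dataclasses import dataclass
    from typing import Any, Optional

    @dataclass(frozen=True)
    class Cons:
        head: Any
        tail: Optional["Cons"]

    def prepend(value, lst):
        # The new list points at the old one: no copying, nothing mutated.
        return Cons(value, lst)

    xs = prepend(3, prepend(2, prepend(1, None)))   # [3, 2, 1]
    ys = prepend(4, xs)                             # [4, 3, 2, 1]
    assert ys.tail is xs                            # xs is shared, not copied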

Satisfy yourself, see: http://github.com/djspiewak/extreme-cleverness

December 19, 2011

NoSQL Screencast: Building a StackOverflow Clone With RavenDB

Filed under: NoSQL,RavenDB — Patrick Durusau @ 8:11 pm

NoSQL Screencast: Building a StackOverflow Clone With RavenDB

Ayende and Justin cover:

  • Map/Reduce indexes
  • Modelling tags
  • Facets
  • Performance
  • RavenDB profiler

Entire project is on Github, just in case you want to review the code.

NoSQL Screencast: HBase Schema Design

Filed under: HBase,NoSQL — Patrick Durusau @ 8:11 pm

NoSQL Screencast: HBase Schema Design

From Alex Popescu’s post:

In this O’Reilly webcast, long time HBase developer and Cloudera HBase/Hadoop architect Lars George discusses the underlying concepts of the storage layer in HBase and how to model data in HBase for best possible performance.

You may know George from HBase: The Definitive Guide.

OSCAR4

Filed under: Cheminformatics,Data Mining — Patrick Durusau @ 8:11 pm

OSCAR4 Launch

From the webpage:

OSCAR (Open Source Chemistry Analysis Routines) is an open source extensible system for the automated annotation of chemistry in scientific articles. It can be used to identify chemical names, reaction names, ontology terms, enzymes and chemical prefixes and adjectives. In addition, where possible, any chemical names detected will be annotated with structures derived either by lookup, or name-to-structure parsing using OPSIN[1], or with identifiers from the ChEBI (‘Chemical Entities of Biological Interest’) ontology.

The current version of OSCAR, OSCAR4, focuses on providing a core library that facilitates integration with other tools. Its simple-to-use API is modularised to promote extension into other domains and allows for its use within workflow systems like Taverna[2] and U-Compare[3].

We will be hosting a launch on the 13th of April to discuss the new architecture as well as demonstrate some applications that use OSCAR. Tutorial sessions on how to use the new API will also be provided.

Archived videos from the launch are now online: http://sms.cam.ac.uk/collection/1130934

Just to put this into a topic map context, imagine that the annotation in question placed a term in an association with mappings to other data, data held by your employer and leased to researchers.

GENIA Project

Filed under: Bioinformatics — Patrick Durusau @ 8:11 pm

GENIA Project: Mining literature for knowledge in molecular biology.

From the webpage:

The GENIA project seeks to automatically extract useful information from texts written by scientists to help overcome the problems caused by information overload. We intend that while the methods are customized for application in the micro-biology domain, the basic methods should be generalisable to knowledge acquisition in other scientific and engineering domains.

We are currently working on the key task of extracting event information about protein interactions. This type of information extraction requires the joint effort of many sources of knowledge, which we are now developing. These include a parser, ontology, thesaurus and domain dictionaries as well as supervised learning models.

Be aware that the project uses the acronym “TM” for “text mining.” Anyone can clearly see that “TM” should be expanded to “topic map.” 😉 Just teasing.

GENIA has a corpus of texts and a number of tools for mining texts.

Visions of a semantic molecular future

Filed under: Bioinformatics,Biomedical — Patrick Durusau @ 8:10 pm

Visions of a semantic molecular future

I already have a post on the Journal of Cheminformatics but this looked like it needed a separate post.

This thematic issue arose from a symposium held in the Unilever Centre [for Molecular Science Informatics, Department of Chemistry, University of Cambridge] on 2011-01-15/17 to celebrate the career of Peter Murray-Rust. From the programme:

This symposium addresses the creativity of the maturing Semantic Web to the unrealized potential of Molecular Science. The world is changing and we are in the middle of many revolutions: Cloud computing; the Semantic Web; the Fourth Paradigm (data-driven science); web democracy; weak AI; pervasive devices; citizen science; Open Knowledge. Technologies can develop in months to a level where individuals and small groups can change the world. However science is hamstrung by archaic approaches to the publication, redistribution and re-use of information and much of the vision is (just) out of reach. Social, as well as technical, advances are required to realize the full potential. We’ve asked leading scientists to let their imagination explore the possible and show us how to get there.

This is a starting point for all of us – the potential of working with the virtual world of scientists and citizens, coordinated through organizations such as the Open Knowledge Foundation and continuing connection with the Cambridge academic community makes this one of the central points for my future.

The pages in this document represent vibrant communities of practice which are growing and are offered to the world as contributions to a semantic molecular future.

We have combined talks from the symposium with work from the Murray-Rust group into 15 articles.

Quickly, just a couple of the articles with abstracts to get you interested:

“Openness as infrastructure”
John Wilbanks Journal of Cheminformatics 2011, 3:36 (14 October 2011)

The advent of open access to peer reviewed scholarly literature in the biomedical sciences creates the opening to examine scholarship in general, and chemistry in particular, to see where and how novel forms of network technology can accelerate the scientific method. This paper examines broad trends in information access and openness with an eye towards their applications in chemistry.

“Open Bibliography for Science, Technology, and Medicine”
Richard Jones, Mark MacGillivray, Peter Murray-Rust, Jim Pitman, Peter Sefton, Ben O’Steen, William Waites Journal of Cheminformatics 2011, 3:47 (14 October 2011)

The concept of Open Bibliography in science, technology and medicine (STM) is introduced as a combination of Open Source tools, Open specifications and Open bibliographic data. An Openly searchable and navigable network of bibliographic information and associated knowledge representations, a Bibliographic Knowledge Network, across all branches of Science, Technology and Medicine, has been designed and initiated. For this large scale endeavour, the engagement and cooperation of the multiple stakeholders in STM publishing – authors, librarians, publishers and administrators – is sought.

It should be interesting when it is generally realized that the information people have hoarded over the years isn’t important. It is the human mind that perceives, manipulates, and draws conclusions from information that gives it any value at all.

