Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

July 10, 2012

Visualization Tools for Understanding Big Data

Filed under: BigData,Mapping,Maps,Visualization — Patrick Durusau @ 10:01 am

Visualization Tools for Understanding Big Data by James Cheshire.

From the post:

I recently co-wrote an editorial (download the full version here) with Mike Batty (UCL CASA) in which we explored some of the current issues surrounding the visualisation of large urban datasets. We were inspired to write it following the CASA Smart Cities conference and we included a couple of visualisations I have blogged here. Much of the day was devoted to demonstrating the potential of data visualisation to help us better understand our cities. Such visualisations would not have been possible a few years ago using desktop computers; their production has ballooned as a result of recent technological (and in the case of OpenData, political) advances.

In the editorial we argue that the many new visualisations, such as the map of London bus trips above, share much in common with the work of early geographers and explorers whose interests were in the description of often-unknown processes. In this context, the unknown has been the ability to produce a large-scale impression of the dynamics of London’s bus network. The pace of exploration is largely determined by technological advancement and handling big data is no different. However, unlike early geographic research, mere description is no longer a sufficient benchmark to constitute advanced scientific enquiry into the complexities of urban life. This point, perhaps, marks a distinguishing feature between the science of cities and the thousands of rapidly produced big data visualisations and infographics designed for online consumption. We are now in a position to deploy the analytical methods developed since geography’s quantitative revolution, which began half a century ago, to large datasets to garner insights into the process. Yet, many of these methods are yet to be harnessed for the latest datasets due to the rapidity and frequency of data releases and the technological limitations that remain in place (especially in the context of network visualisation). That said, the path from description to analysis is clearly marked and, within this framework, visualisation plays an important role in the conceptualisation of the system(s) of interest, thus offering a route into more sophisticated kinds of analysis.

Curious if you would say that topic maps as navigation artifacts are “descriptive” as opposed to “explorative?”

What would you suggest as a basis for “interactive” topic maps that present the opportunity for dynamic subject identification, associations and merging?

Intro to HBase Internals and Schema Design

Filed under: HBase,NoSQL,Schema — Patrick Durusau @ 9:21 am

Intro to HBase Internals and Schema Design by Alex Baranau.

You will be disappointed by the slide that reads:

HBase will not adjust cluster settings to optimal based on usage patterns automatically.

Sorry, but we just aren’t quite to drag-n-drop software that optimizes to arbitrary data without user intervention.

Not sure we could keep that secret from management very long in any case so perhaps all for the best.

Once you get over your chagrin at still having to work, you will find Alex’s presentation a high-level peek at the internals of HBase. Should be enough to get you motivated to learn more on your own. Not guaranteeing that, but that should be the average result.

Intro to HBase [Augmented Marketing Opportunities]

Filed under: HBase,NoSQL — Patrick Durusau @ 8:37 am

Intro to HBase by Alex Baranau.

Slides from a presentation on HBase that Alex gave for a meetup in New York City.

Fairly high level overview but one of the better ones.

Should leave you with a good orientation to HBase and its capabilities.

Just in case you are looking for a project, ;-), it would be interesting to take a slide deck like this one and point from it into the tutorials and documentation for the product.

I am thinking of the old “hub” document concept from HyTime, so you would not have to hard-code links in the source but could update them as newer material comes along.

Just in case you need some encouragement, think of every slide deck as an augmented marketing opportunity. Where you are leveraging not only the presentation but the other documentation and materials created by your group.

The Hadoop Ecosystem, Visualized in Datameer

Filed under: Cloudera,Datameer,Hadoop,Visualization — Patrick Durusau @ 8:28 am

The Hadoop Ecosystem, Visualized in Datameer by Rich Taylor.

From the post:

In our last post, Christophe explained why Datameer uses D3.js to power our Business Infographic™ designer. I thought I would follow up his post showing how we visualized the Hadoop ecosystem connections. First using only D3.js, and second using Datameer 2.0.

Visualizations of the Hadoop Ecosystem are colorful, amusing, instructive, but probably not useful per se.

What is useful is the demonstration that using Datameer 2.0 can drastically reduce the time required for you to make a visualization.

That leaves you with more time to explore and find visualizations that are useful, as opposed to visualizations made for the sake of visualization.

We can all think of network (“hairball” was the technical term used in a paper I read recently) visualizations that would be useful if we were super-boy/girl but otherwise, not so much.

I first saw this at Cloudera.

Data Mining In Excel: Lecture Notes and Cases

Filed under: Data Mining,Excel — Patrick Durusau @ 7:51 am

Data Mining In Excel: Lecture Notes and Cases by Yanchang Zhao.

Table of contents (270 page book)

  • Overview of the Data Mining Process
  • Data Exploration and Dimension Reduction
  • Evaluating Classification and Predictive Performance
  • Multiple Linear Regression
  • Three Simple Classification Methods
  • Classification and Regression Trees
  • Logistic Regression
  • Neural Nets
  • Discriminant Analysis
  • Association Rules
  • Cluster Analysis

You knew that someday all those Excel files would be useful! 😉 Well, today may be the day!

A bit dated, 2005, but should be a good starting place.

If you are interested in learning data mining in Excel cold, try comparing the 2005 capabilities of Excel to the current version and updating the text/examples.

Best way to learn it is to update and then teach it to others.

GNU C++ hash_set vs STL std::set: my notebook

Filed under: Deduplication,Hashing,Sets — Patrick Durusau @ 7:35 am

GNU C++ hash_set vs STL std::set: my notebook by Pierre Lindenbaum.

Pierre compares std::set from the C++ Standard Template Library to the non-standard GNU hash-based set, inserting and removing a set of random numbers.

The results may surprise you.

Worth investigating if you are removing duplicates post-query.
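
Pierre’s code is C++, but the hash-versus-ordered question behind it is easy to poke at yourself. Here is a rough Python analogue (not his benchmark) comparing a hash set with a bisect-maintained sorted list for post-query de-duplication:

    import bisect
    import random
    import timeit

    values = [random.randrange(1_000_000) for _ in range(20_000)]

    def dedup_hash(vals):
        seen = set()                       # hash-based, like GNU hash_set
        # set.add() returns None, so "not seen.add(v)" adds v and stays True.
        return [v for v in vals if v not in seen and not seen.add(v)]

    def dedup_sorted(vals):
        seen, out = [], []                 # kept in order, loosely like std::set
        for v in vals:
            i = bisect.bisect_left(seen, v)
            if i == len(seen) or seen[i] != v:
                seen.insert(i, v)          # O(n) insert; a real tree would be O(log n)
                out.append(v)
        return out

    print("hash  :", timeit.timeit(lambda: dedup_hash(values), number=5))
    print("sorted:", timeit.timeit(lambda: dedup_sorted(values), number=5))

Timings will vary with data size and machine, which is rather the point of running it yourself.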

MongoDB Installer for Windows Azure

Filed under: Azure Marketplace,Microsoft,MongoDB — Patrick Durusau @ 7:17 am

MongoDB Installer for Windows Azure by Doug Mahugh.

From the post:

Do you need to build a high-availability web application or service? One that can scale out quickly in response to fluctuating demand? Need to do complex queries against schema-free collections of rich objects? If you answer yes to any of those questions, MongoDB on Windows Azure is an approach you’ll want to look at closely.

People have been using MongoDB on Windows Azure for some time (for example), but recently the setup, deployment, and development experience has been streamlined by the release of the MongoDB Installer for Windows Azure. It’s now easier than ever to get started with MongoDB on Windows Azure!

If you are developing or considering developing with MongoDB, this is definitely worth a look. In part because it frees you to concentrate on software development and not running (or trying to run) a server farm. Different skill sets.

Another reason is that it levels the playing field with big IT firms that have server farms. You get the advantages of a server farm without the capital investment in one.

And as Microsoft becomes a bigger and bigger tent for diverse platforms and technologies, you have more choices. Choices for the changing requirements of your clients.

Not that I expect to see an Apple hanging from the Microsoft tree anytime soon but you can’t ever tell. Enough consumer demand and it could happen.

In the meantime, while we wait for better games and commercials, consider how you would power semantic integration in the cloud.

July 9, 2012

Graphity source code and wikipedia raw data

Filed under: Graphity,Graphs,Neo4j,Wikipedia — Patrick Durusau @ 3:25 pm

Graphity source code and wikipedia raw data is online (neo4j based social news stream framework) by René Pickhardt.

From the post:

8 months ago I posted the results of my research about fast retrieval of social news feeds and in particular my graph index graphity. The index is able to serve more than 12 thousand personalized social news streams per second in social networks with several million active users. I was able to show that the system is independent of the node degree or network size. Therefor it scales to graphs of arbitrary size.

Today I am pleased to anounce that our joint work was accepted as a full research paper at IEEE SocialCom conference 2012. The conference will take place in early September 2012 in Amsterdam. As promised before I will now open the source code of Graphity to the community. Its documentation could / and might be improved in future also I am sure that one is even able to use a better data structure for our implementation of the priority queue.

Still the attention from the developer community for Graphity was quite high so maybe the source code is of help to anyone. The source code consists of the entire evaluation framework that we used for our evaluation against other baselines which will also help anyone to reproduce our evaluation.

There are some nice things one can learn in setting up multithreading for time measurements and also how to set up a good logging mechanism.

Just in case you are interested in all the changes ever made to the German entries in Wikipedia.

That’s one use case. 😉

Deeply awesome work!

Please take a close look! This looks important!

Using Palantir to Explore Prescription Drug Safety

Filed under: Data Mining,Palantir — Patrick Durusau @ 2:58 pm

Using Palantir to Explore Prescription Drug Safety

From the post:

Drug safety is a serious concern in the United States with adverse drug events contributing to over 770,000 injuries and deaths per year. Cost estimates range from $1.5 to $5.6 billion annually. The FDA closely monitors these adverse events and releases communications and advisories depending on the severity and frequency of the events. The FDA released such a communication regarding the drug Simvastatin in June 2011. Simvastatin, which is used to treat hyperlipidemia, is one of the most heavily prescribed medications in the world, and nearly 100 million prescriptions were written for patients in 2010.

A canned demo but impressive none the less.

I have written asking for a link to the “community” version of the software. It is mentioned several times on the site but I have been unable to find the URL.

UDL Guidelines – Version 2.0: Principle I. Provide Multiple Means of Representation

Filed under: Marketing,Topic Maps — Patrick Durusau @ 2:22 pm

UDL Guidelines – Version 2.0: Principle I. Provide Multiple Means of Representation

From the webpage:

Learners differ in the ways that they perceive and comprehend information that is presented to them. For example, those with sensory disabilities (e.g., blindness or deafness); learning disabilities (e.g., dyslexia); language or cultural differences, and so forth may all require different ways of approaching content. Others may simply grasp information quicker or more efficiently through visual or auditory means rather than printed text. Also learning, and transfer of learning, occurs when multiple representations are used, because it allows students to make connections within, as well as between, concepts. In short, there is not one means of representation that will be optimal for all learners; providing options for representation is essential.

From the Universal Design for Learning (UDL) Center.

Have you ever noticed how people keep running across topic map issues? Different domains, different ways of talking about the problems, but the bottom line is that it comes down to different ways to identify the same subjects.

When they create solutions, they don’t always remember that the containers in their solutions are subjects too, subjects that may be identified differently by others. We create information silos, useful in their own domains, but unless the containers are treated as subjects, the silos are hard to share across domains.

Hard to share because, without a map between identifications, we can’t tell which container goes with which other container, or which subject with which subject.

We need to agree that we each keep our own identifications and use maps from one container/subject to the other.

So we benefit from each other instead of ignoring the riches gathered by others.
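
A toy illustration in Python of what such a map buys you: two silos identify the same subject under different keys, a small mapping records that the keys co-refer, and each side keeps its own identification. The identifiers and properties here are just illustrative:

    # Two information silos identify the same subject under different keys.
    silo_a = {"Avandia": {"maker": "GlaxoSmithKline"}}
    silo_b = {"rosiglitazone": {"drug_class": "thiazolidinedione"}}

    # The map between identifications: these keys name the same subject.
    same_subject = [("Avandia", "rosiglitazone")]

    merged = []
    for a_key, b_key in same_subject:
        subject = {"identifiers": {a_key, b_key}}   # both identifications are kept
        subject.update(silo_a.get(a_key, {}))
        subject.update(silo_b.get(b_key, {}))
        merged.append(subject)

    print(merged)
    # [{'identifiers': {'Avandia', 'rosiglitazone'},
    #   'maker': 'GlaxoSmithKline', 'drug_class': 'thiazolidinedione'}]

Neither silo had to change its keys; the map did the sharing.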

The UDL makes multiple modes of access (what we call subject mapping in topic maps) its Principle 1!

Makes sense. You want educational content to be re-used by many learners.

Now to explore how they realize Principle 1 in action. Hoping to start a conversation where topic maps will come up.

Hadoop Streaming Made Simple using Joins and Keys with Python

Filed under: Hadoop,Python,Stream Analytics — Patrick Durusau @ 10:48 am

Hadoop Streaming Made Simple using Joins and Keys with Python

From the post:

There are a lot of different ways to write MapReduce jobs!!!

Sample code for this post https://github.com/joestein/amaunet

I find streaming scripts a good way to interrogate data sets (especially when I have not worked with them yet or are creating new ones) and enjoy the lifecycle when the initial elaboration of the data sets lead to the construction of the finalized scripts for an entire job (or series of jobs as is often the case).

When doing streaming with Hadoop you do have a few library options. If you are a Ruby programmer then wukong is awesome! For Python programmers you can use dumbo and more recently released mrjob.

I like working under the hood myself and getting down and dirty with the data and here is how you can too.

Interesting post and good tips on data exploration. Can’t really query/process the unknown.

Suggestions of other data exploration examples? (Not so much processing the known but looking to “learn” about data sources.)
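
If you have not written one before, a minimal streaming mapper/reducer pair for a reduce-side join looks roughly like the sketch below. This is not Joe Stein’s amaunet code; the field layout and join-key position are invented for illustration.

    #!/usr/bin/env python
    # mapper.py: emit "join_key<TAB>rest" so Hadoop sorts and groups records by key.
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            # Invented layout: the first column of each dataset is the join key.
            print("\t".join(fields))

    #!/usr/bin/env python
    # reducer.py: rows with the same key arrive together; join them here.
    import sys

    def flush(key, rows):
        if key is not None and len(rows) > 1:       # key present more than once
            print(key + "\t" + " | ".join("\t".join(r) for r in rows))

    current, rows = None, []
    for line in sys.stdin:
        key, *rest = line.rstrip("\n").split("\t")
        if key != current:
            flush(current, rows)
            current, rows = key, []
        rows.append(rest)
    flush(current, rows)

Run it with your distribution’s streaming jar, passing the two scripts as -mapper and -reducer; Hadoop handles the sort-and-group between them.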

Verrier: Visualizations of the French Civil Code

Filed under: Law,Law - Sources,Visualization — Patrick Durusau @ 10:42 am

Verrier: Visualizations of the French Civil Code

Legal Informatics reports in part:

Jacques Verrier has posted two visualizations of the French Code civil:

Code civil – Cartographie [a video showing the evolution of the Code civil by means of network graphs]
Code civil des Français [a network graph of the structure of the Code civil linked to the full text of the code]

Being from the only “civilian” jurisdiction in the United States (Louisiana) and having practiced law there, I had to include this as a resource, in addition to it being a good illustration of visualizing important subject matter.

The post also points to a variety of visualizations of the United States case and statutory law.

TUSTEP is open source – with TXSTEP providing a new XML interface

Filed under: Text Analytics,Text Mining,TUSTEP/TXSTEP,XML — Patrick Durusau @ 9:15 am

TUSTEP is open source – with TXSTEP providing a new XML interface

I won’t recount how many years ago I first received email from Wilhelm Ott about TUSTEP. 😉

From the TUSTEP homepage:

TUSTEP is a professional toolbox for scholarly processing textual data (including those in non-latin scripts) with a strong focus on humanities applications. It contains modules for all stages of scholarly text data processing, starting from data capture and including information retrieval, text collation, text analysis, sorting and ordering, rule-based text manipulation, and output in electronic or conventional form (including typesetting in professional quality).

Since the title “big data” is taken, perhaps we should take “complex data” for texts.

If you are exploring textual data in any detail or with XML, you should take a look at the TUSTEP project and its new XML interface, TXSTEP.

Or consider contributing to the project as well.

Wilhelm Ott writes (in part):

We are pleased to announce that, starting with the release 2012, TUSTEP is available as open source software. It is distributed under the Revised BSD Licence and can be downloaded from www.tustep.org.

TUSTEP has a long tradition as a highly flexible, reliable, efficient suite of programs for humanities computing. It started in the early 70ies as a tool for supporting humanities projects at the University of Tübingen, relying on own funds of the University. From 1985 to 1989, a substantial grant from the Land Baden-Württemberg officially opened its distribution beyond the limits of the University and started its success as a highly appreciated research tool for many projects at about a hundred universities and academic institutions in the German speaking part of the world, represented since 1993 in the International TUSTEP User Group (ITUG). Reports on important projects relying on TUSTEP and a list of publications (includig lexicograpic works and critical editions) can be found on the tustep webpage.

TXSTEP, presently being developed in cooperation with Stuttgart Media University, offers a new XML-based user interface to the TUSTEP programs. Compared to the original TUSTEP commands, we see important advantages:

  • it will offer an up-to-date established syntax for scripting;
  • it will show the typical benefits of working with an XML editor, like content completion, highlighting, showing annotations, and, of course, verifying the code;
  • it will offer – to a certain degree – a self teaching environment by commenting on the scope of every step;
  • it will help to avoid many syntactical errors, even compared to the original TUSTEP scripting environment;
  • the syntax is in English, providing a more widespread usability than TUSTEP’s German command language.

At the TEI conference last year in Würzburg, we presented a first prototype to an international audience. We look forward to DH2012 in Hamburg next week where, during the Poster Session, a more enhanced version which already contains most of TUSTEPs functions will be presented. A demonstration of TXSTEPs functionality will include tasks which can not easily be performed by existing XML tools.

After the demo, you are invited to download a test version of TXSTEP to play with, to comment on it and to help make it a great and flexible tool for everyday – and complex – questions.

OK, I confess a fascination with complex textual analysis.

Modern Shape-Shifters

Filed under: BigData,Identity,Searching — Patrick Durusau @ 8:37 am

Someday, in the not too distant future, you will be able to tell your grandchildren about fixed data structures and values. How queries returned the results imagined by the architects of data systems. Back in the old days of “small data.”

Quite different from the scene imagined in Sifting Through a Trillion Electrons:

Because FastQuery is built on the FastBit bitmap indexing technology, Byna notes that researchers can search their data based on an arbitrary range of conditions that is defined by available data values. This essentially means that a researcher can now feasibly search a trillion particle dataset and sift out electrons by their energy values.

Researchers, not data architects, get to decide on the questions to pose.
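
Not FastQuery itself, but the shape of such a query, selecting records by an arbitrary researcher-defined range of conditions over the values, is easy to sketch in Python with NumPy (the arrays and thresholds below are made up):

    import numpy as np

    # Made-up particle data: one entry per electron.
    rng = np.random.default_rng(0)
    energy = rng.exponential(scale=2.0, size=1_000_000)   # invented energy values
    x_pos = rng.uniform(-1.0, 1.0, size=1_000_000)        # invented positions

    # A researcher-defined condition, not one baked in by a data architect:
    # electrons with energy above 5.0 in the right half of the domain.
    mask = (energy > 5.0) & (x_pos > 0.0)

    print(mask.sum(), "of", energy.size, "electrons match")
    selected = energy[mask]                                # sift them out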

Not hard to imagine that “small data” experiments too will be making their data available. In a variety of forms and formats.

Are you ready to consolidate those data sources based on your identification of subjects? Subjects both in content and in formalisms/structure?

To have data that shifts its shape depending upon the demands upon it?

Will you be a master of modern shape-shifters?

PS: Do read the “Trillion Electron” piece. A view of this year’s data processing options. Likely to be succeeded by technology X in the next year or so if the past is any guide.

Recommendations and how to measure the ROI with some metrics?

Filed under: Recommendation — Patrick Durusau @ 7:58 am

Recommendations and how to measure the ROI with some metrics ?

From the post:

We talked a lot about recommender systems, specially discussing the techniques and algorithms used to build and evaluate algorithmically those systems. But let’s discuss now how can we measure in quantitative terms how a social network or an on-line store can measure the return of investment (ROI) of a given recommendation.

The metrics used in recommender systems

We talk a lot about F1-measure, Accuracy, Precision, Recall, AUC, those buzzwords widely known by the machine learning researchers and data mining specialists. But do you know what is CTR, LOC, CER or TPR ? Let’s explain more about those metrics and how they can evaluate the quantitative benefits of a given recommendation.

Would you feel more comfortable if I said identification instead of recommendation?

Consider it done.

After all, a “recommendation” is some actor making a statement about an identified subject. Run of the mill stuff for a topic map.

The ROI question is whether there is some benefit to that statement + identification.

Assuming you are using a topic map or similar measures to track the source of a recommendation, you could begin to attach ROI to particular sources of recommendation.
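
For the simpler metrics the arithmetic is nothing exotic. A hedged sketch in Python, with toy numbers rather than anything from the post:

    def ctr(clicks, impressions):
        """Click-through rate: fraction of shown recommendations that were clicked."""
        return clicks / impressions if impressions else 0.0

    def precision_recall(recommended, relevant):
        """Precision and recall for one user's recommendation list."""
        recommended, relevant = set(recommended), set(relevant)
        hits = len(recommended & relevant)
        precision = hits / len(recommended) if recommended else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Toy numbers, purely illustrative.
    print(ctr(clicks=42, impressions=1000))                         # 0.042
    print(precision_recall(["a", "b", "c", "d"], ["b", "d", "e"]))  # (0.5, 0.666...)

Attach a source to each recommendation (the topic map part) and the same counters give you ROI per source.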

Stability as Illusion

Filed under: Information Theory,Visualization,Wikipedia — Patrick Durusau @ 5:28 am

In A Visual Way to See What is Changing Within Wikipedia, Jennifer Shockley writes:

Wikipedia is a go to source for quick answers outside the classroom, but many don’t realize Wiki is an ever evolving information source. Geekosystem’s article “Wikistats Show You What Parts Of Wikipedia Are Changing” provides a visual way to see what is changing within Wikipedia.

Is there any doubt that all of our information sources are constantly evolving?

Whether by edits to the sources or in our reading of those sources?

I wonder, have there been recall/precision studies done chronologically?

That is to say, studies of user evaluation of precision/recall on a given data set that repeat the evaluation with users at five (5) year intervals?

To learn if user evaluations of precision/recall change over time for the same queries on the same body of material?

My suspicion, without attributing a cause, is yes.

Suggestions or pointers welcome!

HCIR 2012 Challenge: People Search

Filed under: HCIR,Searching — Patrick Durusau @ 5:06 am

HCIR 2012 Challenge: People Search by Daniel Tunkelang.

From the post:

As we get ready for the Sixth Symposium on Human-Computer Interaction and Information Retrieval this October in Cambridge, MA, people around the world are working on their entries for the third HCIR Challenge.

Daniel reviews the results from HCIR 2010 (exploring news archives) and HCIR 2011 (information availability) and introduces the current challenge, people search.

The people search tasks include “hiring,” “assembling a conference program,” and “finding people to deliver patent research or expert testimony.”

I wonder if any of the disharmony relationship sites are going to sponsor teams?

July 8, 2012

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection

Filed under: Duplicates,Entity Resolution,Record Linkage — Patrick Durusau @ 4:59 pm

Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection by Peter Christen.

In the Foreword, William E. Winkler (U. S. Census Bureau and dean of record linkage), writes:

Within this framework of historical ideas and needed future work, Peter Christen’s monograph serves as an excellent compendium of the best existing work by computer scientists and others. Individuals can use the monograph as a basic reference to which they can gain insight into the most pertinent record linkage ideas. Interested researchers can use the methods and observations as building blocks in their own work. What I found very appealing was the high quality of the overall organization of the text, the clarity of the writing, and the extensive bibliography of pertinent papers. The numerous examples are quite helpful because they give real insight into a specific set of methods. The examples, in particular, prevent the researcher from going down some research directions that would often turn out to be dead ends.

I saw the alert for this volume today so haven’t had time to acquire and read it.

Given the high praise from Winkler, I expect it to be a pleasure to read and use.

Conditional Traversals With Gremlin

Filed under: Graphs,Gremlin,Traversal — Patrick Durusau @ 4:45 pm

Conditional Traversals With Gremlin by Max Lincoln.

An eligibility test that depends upon the ability to traverse to a particular node in the graph.

Reminded me of my musings on transient properties/edges.

Is not choosing an edge the same thing as the edge not being present? For all cases?

Max mentions that NoSQL Distilled says this use case isn’t the typical one for graphs.

My suggestion is to experiment and rely on your own requirements and experiences.
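
One cheap way to experiment, outside Gremlin, is to model eligibility as plain reachability with networkx; the graph contents below are invented:

    import networkx as nx

    g = nx.DiGraph()
    g.add_edge("alice", "gold_plan")         # alice holds a gold plan
    g.add_edge("gold_plan", "free_upgrade")  # gold plans reach the benefit node
    g.add_edge("bob", "basic_plan")          # bob's plan never reaches it

    def eligible(graph, user, benefit="free_upgrade"):
        """Eligibility as reachability: can we traverse from user to benefit?"""
        return graph.has_node(user) and nx.has_path(graph, user, benefit)

    print(eligible(g, "alice"))  # True
    print(eligible(g, "bob"))    # False

Whether a missing edge and an un-chosen edge should behave the same is then a modeling decision you can test directly.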

Authors have to paint with a very broad brush or their books would all look like the Oxford English Dictionary (OED). Fascinating but not for the faint of heart.


BTW, when looking up the reference for the Oxford English Dictionary, the wikipedia reference mentioned that:

The Dutch dictionary Woordenboek der Nederlandsche Taal, which has similar aims to the OED, is the largest and it took twice as long to complete.

I don’t read Dutch but the dictionary is reported to be available for free at: http://gtb.inl.nl/

If you read Dutch, please confirm/deny the report. I would like to send a little note along to the OED crowd about access as a public service. (Like they would care what I think. 😉 Still, doesn’t hurt to comment every now and again.)

Introducing Py2neo and Geoff

Filed under: Geoff,Neo4j,py2neo — Patrick Durusau @ 4:03 pm

Introducing Py2neo and Geoff by Nigel Small. (podcast)

From the description:

Py2neo has become a popular library for Python developers to drive Neo4j’s REST API. In this presentation for the Neo4j User Group, Nigel Small describes how Py2neo evolved and provides an introduction to how it is used. Nigel also explores Geoff, a textual format for storing and transmitting graph data (with a syntax heavily influenced by Neo4j’s Cypher language) and how it powers the Neo4j REPL.

If you want to read along, slides.

I don’t know the order of Neo4j, Py2neo and Geoff in a trifecta but I do know they make a very nice triplet.

Make tonight a movie night and catch this presentation.
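
If you would rather type along than just watch, here is a small sketch of the kind of thing Py2neo does. Caveat: this uses the later Graph/Node/Relationship API over Bolt, not necessarily the 1.x REST API Nigel presents, and the server URI and credentials are placeholders:

    from py2neo import Graph, Node, Relationship

    # Placeholder URI and credentials for a local Neo4j server.
    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

    alice = Node("Person", name="Alice")
    bob = Node("Person", name="Bob")
    graph.create(Relationship(alice, "KNOWS", bob))   # creates nodes and relationship

    # Cypher through the same connection.
    query = "MATCH (p:Person)-[:KNOWS]->(q:Person) RETURN p.name AS pn, q.name AS qn"
    for record in graph.run(query):
        print(record["pn"], "knows", record["qn"])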

MicrobeDB: a locally maintainable database of microbial genomic sequences

Filed under: Bioinformatics,Biomedical,Database,Genome,MySQL — Patrick Durusau @ 3:54 pm

MicrobeDB: a locally maintainable database of microbial genomic sequences by Morgan G. I. Langille, Matthew R. Laird, William W. L. Hsiao, Terry A. Chiu, Jonathan A. Eisen, and Fiona S. L. Brinkman. (Bioinformatics (2012) 28 (14): 1947-1948. doi: 10.1093/bioinformatics/bts273)

Abstract

Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.

Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/

No doubt a useful project but the article seems to be at war with itself:

Although many of these centers provide genomic data in a variety of static formats such as Genbank and Fasta, these are often inadequate for complex queries. To carry out these analyses efficiently, a relational database such as MySQL (http://mysql.com) can be used to allow rapid querying across many genomes at once. Some existing data providers such as CMR allow downloading of their database files directly, but these databases are designed for large web-based infrastructures and contain numerous tables that demand a steep learning curve. Also, addition of unpublished genomes to these databases is often not supported. A well known and widely used system is the Generic Model Organism Database (GMOD) project (http://gmod.org). GMOD is an open-source project that provides a common platform for building model organism databases such as FlyBase (McQuilton et al., 2011) and WormBase (Yook et al., 2011). GMOD supports a variety of options such as GBrowse (Stein et al., 2002) and a variety of database choices including Chado (Mungall and Emmert, 2007) and BioSQL (http://biosql.org). GMOD provides a comprehensive system, but for many researchers such a complex system is not needed.

On one hand, current solutions are “…often inadequate for complex queries” and just a few lines later, “…such a complex system is not needed.”

I have no doubt that using unfamiliar and complex table structures is a burden on any user. Not to mention lacking the ability to add “unpublished genomes” or fixing versions of data for analysis.

What concerns me is the “solution” being seen as yet another set of “local” options, which impedes the future use of the now “localized” data.

The issues raised here need to be addressed, but one-off solutions seem like a particularly poor choice.

Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases

Filed under: Meta-analysis,Sparse Data — Patrick Durusau @ 3:01 pm

Meta-Analysis of ‘Sparse’ Data: Perspectives from the Avandia Cases by Michael Finkelstein and Bruce Levin.

Abstract:

Combining the results of multiple small trials to increase accuracy and statistical power, a technique called meta-analysis has become well established and increasingly important in medical studies, particularly in connection with new drugs. When the data are sparse, as they are in many such cases, certain accepted practices, applied reflexively by researchers, may be misleading because they are biased and for other reasons. We illustrate some of the problems by examining a meta-analysis of the connection between the diabetes drug Avandia (rosiglitazone) and myocardial infarction that was strongly criticized as misleading, but led to thousands of lawsuits being filed against the manufacturer and the FDA acting to restrict access to the drug. Our scrutiny of the Avandia meta-analysis is particularly appropriate because it plays an important role in ongoing litigation, has been sharply criticized, and has been subject to a more searching review in court than meta-analyses of other drugs.

A good introduction to the issues of meta-analysis, where the stakes for drug companies can be quite high.

All clinical trials vary in some respects; the question with meta-analysis is whether the variance is large enough (enough heterogeneity) to make the meta-analysis invalid.

How would you measure heterogeneity, or perhaps an expert’s claim of heterogeneity?
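
One common answer is Cochran’s Q and the I² statistic derived from it. A minimal sketch in Python, using invented effect estimates and variances rather than the Avandia data:

    def cochran_q_and_i2(effects, variances):
        """Cochran's Q and I^2 for a fixed-effect meta-analysis.

        effects   : per-study effect estimates (e.g., log odds ratios)
        variances : per-study variances of those estimates
        """
        weights = [1.0 / v for v in variances]
        pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
        q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
        df = len(effects) - 1
        i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
        return q, i2

    # Invented studies: log odds ratios and their variances.
    effects = [0.35, 0.10, 0.55, -0.05, 0.20]
    variances = [0.04, 0.09, 0.06, 0.12, 0.05]
    q, i2 = cochran_q_and_i2(effects, variances)
    print("Q = %.2f, I^2 = %.1f%%" % (q, i2 * 100))

High I² (conventionally above 50% or 75%) is the usual red flag that studies may be too heterogeneous to pool, and sparse data makes both Q and I² unstable, which is part of the authors’ point.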

MaRC and SolrMaRC

Filed under: Librarian/Expert Searchers,Library,MARC,SolrMarc — Patrick Durusau @ 2:30 pm

MaRC and SolrMaRC by Owen Stephens.

From the post:

At the recent Mashcat event I volunteered to do a session called ‘making the most of MARC’. What I wanted to do was demonstrate how some of the current ‘resource discovery’ software are based on technology that can really extract value from bibliographic data held in MARC format, and how this creates opportunities for in both creating tools for users, and also library staff.

One of the triggers for the session was seeing, over a period of time, a number of complaints about the limitations of ‘resource discovery’ solutions – I wanted to show that many of the perceived limitations were not about the software, but about the implementation. I also wanted to show that while some technical knowledge is needed, some of these solutions can be run on standard PCs and this puts the tools, and the ability to experiment and play with MARC records, in the grasp of any tech-savvy librarian or user.

Many of the current ‘resource discovery’ solutions available are based on a search technology called Solr – part of a project at the Apache software foundation. Solr provides a powerful set of indexing and search facilities, but what makes it especially interesting for libraries is that there has been some significant work already carried out to use Solr to index MARC data – by the SolrMARC project. SolrMARC delivers a set of pre-configured indexes, and the ability to extract data from MARC records (gracefully handling ‘bad’ MARC data – such as badly encoded characters etc. – as well). While Solr is powerful, it is SolrMARC that makes it easy to implement and exploit in a library context.

SolrMARC is used by two open source resource discovery products – VuFind and Blacklight. Although VuFind and Blacklight have differences, and are written in different languages (VuFind is PHP while Blacklight is Ruby), since they both use Solr and specifically SolrMARC to index MARC records the indexing and search capabilities underneath are essentially the same. What makes the difference between implementations is not the underlying technology but the configuration. The configuration allows you to define what data, from which part of the MARC records, goes into which index in Solr.

Owen explains his excitement over these tools as:

These tools excite me for a couple of reasons:

  1. A shared platform for MARC indexing, with a standard way of programming extensions gives the opportunty to share techniques and scripts across platforms – if I write a clever set of bean shell scripts to calculate page counts from the 300 field (along the lines demonstrated by Tom Meehan in another Mashcat session), you can use the same scripts with no effort in your SolrMARC installation
  2. The ability to run powerful, but easy to configure, search tools on standard computers. I can get Blacklight or VuFind running on a laptop (Windows, Mac or Linux) with very little effort, and I can have a few hundred thousand MARC records indexed using my own custom routines and searchable via an interface I have complete control over

I like the “geek” appeal of #2, but creating value-add interfaces for the casual user is more likely to attract positive PR for a library.

As for #1, how uniform are the semantics of MARC fields?

I suspect physical data, page count, etc., are fairly stable/common, but what about more subjective fields? How would you test that proposition?
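
One way to test it, as a sketch: pull field 300 subfield $a from a batch of records and see how consistently the extent statement is expressed. This uses the pymarc library rather than SolrMARC’s BeanShell hooks, and “records.mrc” is a placeholder file name:

    import re
    from collections import Counter

    from pymarc import MARCReader   # assumes the pymarc library is installed

    patterns = Counter()

    with open("records.mrc", "rb") as fh:           # placeholder file of MARC records
        for record in MARCReader(fh):
            for field in record.get_fields("300"):  # physical description
                extent = field["a"] or ""           # subfield $a, e.g. "345 p. : ill."
                # Normalize arabic numerals so "345 p. : ill." and "120 p. : ill."
                # count as one pattern.
                patterns[re.sub(r"\d+", "N", extent).strip()] += 1

    for pattern, count in patterns.most_common(10):
        print("%6d  %s" % (count, pattern))

A short tail of patterns suggests stable semantics; a long messy tail answers the question the other way.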

TeX Live 2012 released

Filed under: TeX/LaTeX — Patrick Durusau @ 1:48 pm

TeX Live 2012 released

From the post:

Today, TeX Live 2012 has been released. TeX Live is a comprehensive TeX and LaTeX system for Windows, Mac OS X, Linux and other Unix systems. A special version called MacTeX 2012 is available for Mac OS X.

Changes compared to TeX Live 2011:

  • tlmgr supports updates from multiple network repositories.
  • The parameter \XeTeXdashbreakstate is set to 1 by default. This allows line breaks after em-dashes and en-dashes, which has always been the behavior of other engines and formats such as plain TeX, LaTeX, and LuaTeX. Explicitly set \XeTeXdashbreakstate to 0 for perfect line-break compatibility for existing XeTeX documents.
  • The output files generated by pdftex and dvips, among others, can now exceed 2gb.
  • The 35 standard PostScript fonts are included in the output of dvips by default.
  • In the restricted \write18 execution mode, set by default, mpost is now an allowed program.
  • A texmf.cnf file is also found in ../texmf-local, e.g., /usr/local/texlive/texmf-local/web2c/texmf.cnf, if it exists.
  • The updmap script reads a per-tree updmap.cfg instead of one global config.
  • Platforms: armel-linux and mipsel-linux added; sparc-linux and i386-netbsd are no longer in the main distribution, but are available for installation as custom binaries, along with a variety of other platforms.

Which reminded me, I need to renew my TeX Users Group membership.

Thought you might need a reminder as well.

I saw this mentioned by Kirk Lowery on Facebook of all places.

statistics.com The Institute for Statistics Education

Filed under: Education,R,Statistics — Patrick Durusau @ 10:22 am

statistics.com The Institute for Statistics Education

The spread of R made me curious about certification in R.

The first “hit” on the subject was statistics.com The Institute for Statistics Education.

From their homepage:

Certificate Programs

Programs in Analytics and Statistical Studies (PASS)

From in-depth clinical trial design and analysis to data mining skills that help you make smarter business decisions, our unique programs focus on practical applications and help you master the software skills you need to stay a step ahead in your field.

http://www.statistics.com/

  • Biostatistics – Epidemiology
  • Biostatistics – Controlled Trials
  • Business Analytics
  • Data Mining
  • Social Science
  • Environmental Science
  • Engineering Statistics
  • Using R

Not with the same group or even the same subject (NetWare several versions ago), but I have had good experiences with this type of program.

Self study is always possible and sometimes the only option.

But, a good instructor can keep your interest in a specific body of material long enough to earn a certification.

Suggestions of other certification programs that would be of interest to data miners, machine learning, big data, etc., worker bees?

PS: If the courses sound pricey, slide on over to the University of Washington 3 course certificate in computational finance. At a little over $10K for 9 months.

R Integration in Weka

Filed under: Data Mining,Machine Learning,R,Weka — Patrick Durusau @ 9:57 am

R Integration in Weka by Mark Hall.

From the post:

These days it seems like every man and his proverbial dog is integrating the open-source R statistical language with his/her analytic tool. R users have long had access to Weka via the RWeka package, which allows R scripts to call out to Weka schemes and get the results back into R. Not to be left out in the cold, Weka now has a brand new package that brings the power of R into the Weka framework.

Weka

In this section I briefly cover what the new RPlugin package for Weka >= 3.7.6 offers. This package can be installed via Weka’s built-in package manager.

Here is a list of the functionality implemented:

  • Execution of arbitrary R scripts in Weka’s Knowledge Flow engine
  • Datasets into and out of the R environment
  • Textual results out of the R environment
  • Graphics out of R in png format for viewing inside of Weka and saving to files via the JavaGD graphics device for R
  • A perspective for the Knowledge Flow and a plugin tab for the Explorer that provides visualization of R graphics and an interactive R console
  • A wrapper classifier that invokes learning and prediction of R machine learning schemes via the MLR (Machine Learning in R) library

The use of R appears to be spreading! (Oracle, SAP, Hadoop, just to name a few that come readily to mind.)

Where is it on your list of data mining tools?

I first saw this at DZone.

Collection of CSS3 Techniques and Tutorials

Filed under: CSS3,HTML5 — Patrick Durusau @ 9:23 am

Collection of CSS3 Techniques and Tutorials

From the post:

For today I have selected few fresh and useful CSS3 tutorials for your next project. CSS3 and HTML5 are the topics that you see and hear everyday and I think it is the right time to start diving into it. While there is no freedom in expressing our ideas in CSS3 prior to weak support in major browsers we want to be the first ones when it is fully supported right? Besides chasing the trends you will gain valuable experience in HTML5 and jQuery which will help you to solve your problems in your projects.

This techniques can be proficiently functioned using markup, HTML, and some improved properties of CSS3. It has many features which are not compatible with the old web browsers and hence it will require some present day internet browsers like Internet Explorer 7 & 8, Chrome, Safari and Firefox to use the CSS3. It can be used in developing the following techniques:

  • It can be used to create multiple backgrounds
  • Developing and drawing border images
  • Handling of opacity
  • Used in text- shadowing and box sizing
  • Used for support of columns of many different web layouts.

If we go back on time we will realize how far we have come from the time of Adobe’s Flash and JavaScript, which were used to create some cool designs. Then came the new version of CSS and it was CSS3 that has transformed the world of animation with its transition features. It is the markup language that has many other applications that can be used for designing web pages that are written in XHTML or HTML. Whereas on the other hand, CSS is made for primary separation of documents that are written in simple markup languages. The content accessibility provides elements which can differ from fonts to layouts and colors.

In this compilation you will find few tutorials on creating amazing transition effects, slideshows, navigation menus and much more.

However clever your models, searching/extraction protocols, etc., there will come a time when results have to be delivered.

Graphs, charts, visualizations, networks, static and interactive will play a role, but so will basic web design/delivery techniques.

You should be aware of what is possible, should you decide to out source the design task.

This collection illustrates the range of modern web presentations. Will also make you wonder about some of the web presentations you encounter.

I first saw this at DZone.

July 7, 2012

When are graphs coming to ESPN?

Filed under: Graphs,Networks — Patrick Durusau @ 6:16 pm

Sooner than you think.

Read: PageRank Algorithm Reveals Soccer Teams’ Strategies.

From the post:

Many readers will have watched the final of the Euro 2012 soccer championships on Sunday in which Spain demolished a tired Italian team by 4 goals to nil. The result, Spain’s third major championship in a row, confirms the team as the best in the world and one of the greatest in history.

So what makes Spain so good? Fans, pundits and sports journalists all point to Spain’s famous strategy of accurate quick-fire passing, known as the tiki-taka style. It’s easy to spot and fabulous to watch, as the game on Sunday proved. But it’s much harder to describe and define.

That looks set to change. Today, Javier Lopez Pena at University College London and Hugo Touchette at Queen Mary University of London reveal an entirely new way to analyse and characterise the performance of soccer teams and players using network theory.

They say their approach produces a quantifiable representation of a team’s style, identifies key individuals and highlights potential weaknesses.

Their idea is to think of each player as a node in a network and each pass as an edge that connects nodes. They then distribute the nodes in a way that reflects the playing position of each player on the pitch.

Add to the graph representation links to related resources and analysis: can you say topic map?

I wonder when they will think about adding the officials and what fouls they usually call on what players?


In full: A network theory analysis of football strategies

Abstract:

We showcase in this paper the use of some tools from network theory to describe the strategy of football teams. Using passing data made available by FIFA during the 2010 World Cup, we construct for each team a weighted and directed network in which nodes correspond to players and arrows to passes. The resulting network or graph provides a direct visual inspection of a team’s strategy, from which we can identify play pattern, determine hot-spots on the play and localize potential weaknesses. Using different centrality measures, we can also determine the relative importance of each player in the game, the `popularity’ of a player, and the effect of removing players from the game.
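
The construction in the abstract is easy to sketch with networkx. The players are real enough, but the pass counts below are invented, not the FIFA passing data used in the paper:

    import networkx as nx

    # Invented pass counts: (passer, receiver, completed passes).
    passes = [
        ("Casillas", "Ramos", 15), ("Ramos", "Xavi", 30),
        ("Xavi", "Iniesta", 45),   ("Iniesta", "Xavi", 40),
        ("Xavi", "Fabregas", 25),  ("Fabregas", "Iniesta", 20),
        ("Iniesta", "Torres", 10), ("Ramos", "Iniesta", 18),
    ]

    G = nx.DiGraph()
    for passer, receiver, count in passes:
        G.add_edge(passer, receiver, weight=count)

    # PageRank over the weighted, directed pass network: a rough measure of how
    # often play flows through each player.
    for player, score in sorted(nx.pagerank(G, weight="weight").items(),
                                key=lambda kv: -kv[1]):
        print("%-10s %.3f" % (player, score))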

I first saw this at Four short links: 4 July 2012 by Nat Torkington.

Natural Language Processing | Hub

Natural Language Processing | Hub

From the “about” page:

NLP|Hub is an aggregator of news about Natural Language Processing and other related topics, such as Text Mining, Information Retrieval, Linguistics or Machine Learning.

NLP|Hub finds, collects and arranges related news from different sites, from academic webs to company blogs.

NLP|Hub is a product of Cilenis, a company specialized in Natural Language Processing.

If you have interesting posts for NLP|Hub, or if you do not want NLP|Hub indexing your text, please contact us at info@cilenis.com

Definitely going on my short list of sites to check!

Introduction to Neo Technology and Neo4j [Questions of Scale?]

Filed under: Neo4j — Patrick Durusau @ 3:21 pm

Introduction to Neo Technology and Neo4j by Curt Monash.

Curt interviewed Emil Eifrem (CEO/cofounder), Johan Svensson (CTO/cofounder), and Philip Rathle (Senior Director of Products) for an overview of Neo Technology and Neo4j.

I would be sure to read the updates from Neo4j along with the main story.

There is one point that I would like to bring to the fore and that is whether adoption of Neo4j depends on its ability to scale.

While I agree with the supplemental answers given by Neo4j staff, my counter question would have been: Scale in the sense of what requirement?

Can a single Neo4j instance represent the entire Internet at the user level?

No, but only one project each at Google, Microsoft and the NSA have that as a requirement. They all have projects where Neo4j would be a good fit. (I can’t tell you about Neo4j good fit NSA projects. If I knew I could not say and if I said, they would kill me. You understand.)

Be specific, what are your scaling requirements?

When people talk about “web scale” or “internet scale,” they have lost sight of your requirements but are still trying to find your wallet.

I would run the other way.
