Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

July 29, 2012

National Cooperative Geologic Mapping Program

Filed under: Geologic Maps,Mapping,Maps — Patrick Durusau @ 4:56 am

National Cooperative Geologic Mapping Program

From this week’s Scout Report:

The National Cooperative Geologic Mapping Program (NCGMP) is “the primary source of funds for the production of geologic maps in the United States.” The NCGMP was created by the National Geologic Mapping Act of 1992 and its work includes producing surficial and bedrock geologic map coverage for the entire country. The program has partnered with a range of educational institutions, and this site provides access to many of the fruits of this partnership, along with educational materials. The place to start here is the What’s a Geologic Map? area. Here visitors can read a helpful article on this subject, authored by David R. Soller of the U.S. Geological Survey. Moving on, visitors can click on the National Geologic Map Database link. The database contains over 88,000 maps, along with a lexicon of geologic names, and material on the NCGMP’s upcoming mapping initiatives. Those persons with an interest in the organization of the NCGMP should look at the Program Components area. Finally, the Products-Standards area contains basic information on the technical standards and expectations for the mapping work.

More grist for your topic map mill!

July 28, 2012

Montage: An Astronomical Image Mosaic Engine

Filed under: Astroinformatics,Image Processing — Patrick Durusau @ 7:56 pm

Montage: An Astronomical Image Mosaic Engine

From the webpage:

Montage is a toolkit for assembling Flexible Image Transport System (FITS) images into custom mosaics.

Since I mentioned astronomical data earlier today I thought about including this for your weekend leisure time!
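
Montage itself is a toolkit of compiled modules driven from the command line, but if you want a feel for the FITS files it consumes before building a mosaic, here is a minimal Python sketch using astropy (a separate library, not part of Montage); the file name is hypothetical.

```python
# Inspect a FITS image before handing a directory of them to Montage.
# Assumes astropy is installed; "image1.fits" is a hypothetical file.
from astropy.io import fits

with fits.open("image1.fits") as hdul:
    hdul.info()                      # list the HDUs in the file
    header = hdul[0].header          # primary header: WCS and instrument keywords
    data = hdul[0].data              # image pixels as a NumPy array
    print(header.get("TELESCOP"), data.shape)
```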

Twitter Words Association Analysis

Filed under: Tweets,Visualization,Word Association,Word Cloud — Patrick Durusau @ 7:41 pm

Twitter Words Association Analysis by Gunjan Amit.

From the post:

Recently I came across the Twitter Spectrum tool from Jeff Clark. This tool is a modified version of the News Spectrum tool.

Here you can enter two topics and then analyse the associated words based on Twitter data. Blue and red represent the associated words of the two topics, whereas purple represents the common words.

You can click on any word to see the related tweets. The visualization is really awesome and you can easily analyze the data.

For example, I have taken “icici” and “hdfc” as two topics. Below is the Twitter spectrum based on these two topics:

Looks interesting as a “rough cut” or exploratory tool.

Predictive analytics might not have predicted the Aurora shooter

Filed under: News,Predictive Analytics — Patrick Durusau @ 7:24 pm

Predictive analytics might not have predicted the Aurora shooter by Robert L. Mitchell.

From the post:

Could aggressive data mining by law enforcement prevent heinous crimes, such as the recent mass murder in Aurora, CO., by catching killers before they can act?

The Aurora shooter certainly left a long trail of transactions. In the two months leading up to the crime he bought more than 6,000 rounds of ammunition, several guns, head-to-toe ballistic protective gear and accelerants and other chemicals used to build homemade explosives. These purchases were made from both online ecommerce sites and brick and mortar stores, and more than 50 packages were sent to his apartment, according to news reports.

Robert injects a note of sanity into recent discussions about data mining and the Aurora shooting by quoting Dean Abbott of Abbott Analytics as saying:

Much as we’d like to think we can solve the problem with technology, it turns out that there is no magic bullet. “Something like this could be valuable,” Abbott says. “I just don’t think it’s obvious that it would be fruitful.”

That would make a good movie script but not much else. (Oh, wait, there is such a movie, Minority Report.)

Predictive analytics are useful in the aggregate, but we already knew that from the Foundation Trilogy (or you could ask your local sociologist).

Exploring the Universe with Machine Learning

Filed under: Astroinformatics,Machine Learning — Patrick Durusau @ 6:59 pm

Exploring the Universe with Machine Learning by Bruce Berriman.

From the post:

A short while ago, I attended a webinar on the above topic by Alex Gray and Nick Ball. The traditional approach to analytics involves identifying which collections of data or collections of information follow sets of rules. Machine learning (ML) takes a very different approach by finding patterns and making predictions from large collections of data.

The post reviews the presentation, CANFAR + Skytree Webinar Presentation (video here).

Good way to broaden your appreciation for “big data.” Astronomy has been awash in “big data” for years.

The Coming Majority: Mainstream Adoption and Entrepreneurship [Cloud Gift Certificates?]

Filed under: Cloud Computing,Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 6:22 pm

The Coming Majority: Mainstream Adoption and Entrepreneurship by James Locus.

From the post:

Small companies, big data.

Big data is sometimes at odds with the business-savvy entrepreneur who wants to exploit its full potential. In essence, the business potential of big data is the massive (but promising) elephant in the room that remains invisible because the available talent necessary to take full advantage of the technology is difficult to obtain.

Inventing new technology for the platform is critical, but so too is making it easier to use.

The future of big data may not be a technological breakthrough by a select core of contributing engineers, but rather a platform that allows common, non-PhD holding entrepreneurs and developers to innovate. Some incredible progress has been made in Apache Hadoop with Hortonworks’ HDP (Hortonworks Data Platform) in minimizing the installation process required for full implementation. Further, the improved MapReduce v2 framework also greatly lowers the risk of adoption for businesses by expressly creating features designed to increase efficiency and usability (e.g. backward and forward compatibility). Finally, with HCatalog, the platform is opened up to integrate with new and existing enterprise applications.

What kinds of opportunities lie ahead when more barriers are eliminated?

You really do need a local installation of Hadoop for experimenting.

But at the same time, having a minimal cloud account where you can whistle up some serious computing power isn’t a bad idea either.

That would make an interesting “back to school” or “holiday” present for your favorite geek: a “gift certificate” for so many hours/cycles a month on a cloud platform.

BTW, what projects would you undertake if barriers of access and capacity were diminished, if not removed?

July 27, 2012

Probabilistic Data Structures for Web Analytics and Data Mining

Filed under: Data Mining,Probabilistic Data Structures,Web Analytics — Patrick Durusau @ 7:34 pm

Probabilistic Data Structures for Web Analytics and Data Mining by Ilya Katsov.

Speaking of scalability, consider:

Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and trade precision of the estimations for the memory consumption. These data structures can be used both as temporary data accumulators in query processing procedures and, perhaps more important, as a compact – sometimes astonishingly compact – replacement of raw data in stream-based computing.

For some subjects, we have probabilistic identifications, based upon data that is too voluminous or rapid to allow for a “definitive” identification.

The techniques introduced here will give you a grounding in data structures to deal with those situations. Interesting reading.
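
To make the idea concrete, here is a minimal sketch of my own of one classic member of this family, a Bloom filter: approximate set membership ("have we seen this visitor before?") in a fixed amount of memory, at the cost of a small false-positive rate. The sizing parameters are illustrative, not tuned.

```python
# Minimal Bloom filter: m bits, k hash functions derived from SHA-1.
import hashlib

class BloomFilter:
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)      # fixed memory, regardless of item count

    def _positions(self, item):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            h = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
bf.add("visitor-42")
print("visitor-42" in bf)   # True
print("visitor-99" in bf)   # almost certainly False (small false-positive rate)
```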

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

Days Five and Six of a Predictive Coding Narrative

Filed under: e-Discovery,Email,Prediction,Predictive Analytics — Patrick Durusau @ 3:23 pm

Days Five and Six of a Predictive Coding Narrative: Deep into the weeds and a computer mind-meld moment by Ralph Losey.

From the post:

This is my fourth in a series of narrative descriptions of an academic search project of 699,082 Enron emails and attachments. It started as a predictive coding training exercise that I created for Jackson Lewis attorneys. The goal was to find evidence concerning involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane. The third and fourth days are described in Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree.

In this fourth installment I continue to describe what I did in days five and six of the project. In this narrative I go deep into the weeds and describe the details of multimodal search. Near the end of day six I have an affirming hybrid multimodal mind-meld moment, which I try to describe. I conclude by sharing some helpful advice I received from Joseph White, one of Kroll Ontrack’s (KO) experts on predictive coding and KO’s Inview software. Before I launch into the narrative, a brief word about vendor experts. Don’t worry, it is not going to be a commercial for my favorite vendors; more like a warning based on hard experience.

You will learn a lot about predictive analytics and e-discovery from this series of posts, but these are the most important paragraphs I have read thus far:

When talking to the experts, be sure that you understand what they say to you, and never just nod in agreement when you do not really get it. I have been learning and working with new computer software of all kinds for over thirty years, and am not at all afraid to say that I do not understand or follow something.

Often you cannot follow because the explanation is so poor. For instance, often the words I hear from vendor tech experts are too filled with company specific jargon. If what you are being told makes no sense to you, then say so. Keep asking questions until it does. Do not be afraid of looking foolish. You need to be able to explain this. Repeat back to them what you do understand in your own words until they agree that you have got it right. Do not just be a parrot. Take the time to understand. The vendor experts will respect you for the questions, and so will your clients. It is a great way to learn, especially when it is coupled with hands-on experience.

Insisting that experts explain until you understand what is being said will help you avoid costly mistakes and make you more sympathetic to a client’s questions when you are the expert.

The technology and software for predictive coding will change beyond recognition in a few short years.

Demanding and giving explanations that “explain” is a skill that will last a lifetime.

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree

Filed under: e-Discovery,Email,Prediction,Predictive Analytics — Patrick Durusau @ 3:04 pm

Days Three and Four of a Predictive Coding Narrative: Where I find that the computer is free to disagree by Ralph Losey.

From the post:

This is the third in a series of detailed descriptions of a legal search project. The project was an academic training exercise for Jackson Lewis e-discovery liaisons conducted in May and June 2012. I searched a set of 699,082 Enron emails and attachments for possible evidence pertaining to involuntary employee terminations. The first day of search is described in Day One of a Predictive Coding Narrative: Searching for Relevance in the Ashes of Enron. The second day is described in Day Two of a Predictive Coding Narrative: More Than A Random Stroll Down Memory Lane.

The description of day-two was short, but it was preceded by a long explanation of my review plan and search philosophy, along with a rant in favor of humanity and against over-dependence on computer intelligence. Here I will just stick to the facts of what I did in days three and four of my search using Kroll Ontrack’s (KO) Inview software.

Interesting description of where Ralph and the computer disagree on relevant/irrelevant judgments about documents.

Unless I just missed it, Ralph is only told by the software what rating a document was given, not why the software arrived at that rating. Yes?

If you knew what terms drove a particular rating, it would be interesting to “comment out” those terms in a document to see the impact on its relevance rating.

Information Theory, Pattern Recognition, and Neural Networks

Filed under: Inference,Information Theory,Neural Networks,Pattern Recognition — Patrick Durusau @ 11:13 am

Information Theory, Pattern Recognition, and Neural Networks by David MacKay.

David MacKay’s lectures with slides on information theory, inference and neural networks. Spring/Summer of 2012.

Just in time for the weekend!

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

Anaconda: Scalable Python Computing

Filed under: Anaconda,Data Analysis,Machine Learning,Python,Statistics — Patrick Durusau @ 10:19 am

Anaconda: Scalable Python Computing

Easy, Scalable Distributed Data Analysis

Anaconda is a distribution that combines the most popular Python packages for data analysis, statistics, and machine learning. It has several tools for a variety of types of cluster computations, including MapReduce batch jobs, interactive parallelism, and MPI.

All of the packages in Anaconda are built, tested, and supported by Continuum. Having a unified runtime for distributed data analysis makes it easier for the broader community to share code, examples, and best practices — without getting tangled in a mess of versions and dependencies.

Good way to avoid dependency issues!

On scaling, I am reminded of a developer who designed a Python application to require upgrading for “heavy” use. Much to their disappointment, Python scaled under “heavy” use with no need for an upgrade. 😉

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

A Simple Topic Model

Filed under: Latent Dirichlet Allocation (LDA),Topic Models (LDA) — Patrick Durusau @ 9:54 am

A Simple Topic Model by Allen Beye Riddell.

From the post:

NB: This is an extended version of the appendix of my paper exploring trends in German Studies in the US between 1928 and 2006. In that paper I used a topic model (Latent Dirichlet Allocation); this tutorial is intended to help readers understand how LDA works.

Topic models typically start with two banal assumptions. The first is that in a large collection of texts there exist a number of distinct groups (or sources) of texts. In the case of academic journal articles, these groups might be associated with different journals, authors, research subfields, or publication periods (e.g. the 1950s and 1980s). The second assumption is that texts from different sources tend to use different vocabulary. If we are presented with an article selected from one of two different academic journals, one dealing with literature and another with archeology, and we are told only that the word “plot” appears frequently in the article, we would be wise to guess the article comes from the literary studies journal.1

A major obstacle to understanding the remaining details about how topic models work is that their description relies on the abstract language of probability. Existing introductions to Latent Dirichlet Allocation (LDA) tend to be pitched either at an audience already fluent in statistics or at an audience with minimal background.2 This being the case, I want to address an audience that has some background in probability and statistics, perhaps at the level of the introductory texts of Hoff (2009), Lee (2004), or Kruschke (2010).

A good walkthrough on using a topic model (Latent Dirichlet Allocation).
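
If you want to see the "different sources use different vocabulary" intuition from the excerpt in running code, here is a minimal sketch (my own toy corpus, not Riddell's data) using scikit-learn's LDA implementation; it assumes a reasonably recent scikit-learn.

```python
# Fit a two-topic LDA model on a toy corpus and print the top words per topic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the plot and the narrator shape the novel",
    "character dialogue and plot drive the story",
    "the excavation revealed pottery and burial sites",
    "stratigraphy dates the pottery from the site",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                      # document-term counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

terms = vec.get_feature_names_out()              # requires scikit-learn >= 1.0
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {', '.join(top)}")
```

With only four documents the estimates are noisy, but the literary terms and the archaeology terms should separate into the two topics, which is exactly the two-source assumption at work.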

I saw this in Christophe Lalanne’s Bag of Tweets for July 2012.

Computational Aspects of Social Networks (CASoN) [Conference]

Filed under: Conferences,Networks,Social Networks — Patrick Durusau @ 7:51 am

Computational Aspects of Social Networks (CASoN)

Important Dates:

Paper submission due: Aug. 15, 2012
Notification of paper acceptance: Sep. 15, 2012
Final manuscript due: Sep. 30, 2012
Registration and full payment due: Sep. 30, 2012
Conference date: Nov. 21-23, 2012

Conference venue: São Carlos, Brazil.

From the Call for Papers:

The International Conference on Computational Aspects of Social Networks (CASoN 2012) brings together an interdisciplinary venue for social scientists, mathematicians, computer scientists, engineers, computer users, and students to exchange and share their experiences, new ideas, and research results about all aspects (theory, applications and tools) of intelligent methods applied to Social Networks, and to discuss the practical challenges encountered and the solutions adopted.

Social networks provide a powerful abstraction of the structure and dynamics of diverse kinds of people or people-to-technology interaction. These social network systems are usually characterized by the complex network structures and rich accompanying contextual information. Recent trends also indicate the usage of complex network as a key feature for next generation usage and exploitation of the Web. This international conference on Computational Aspect of Networks is focused on the foundations of social networks as well as case studies, empirical, and other methodological works related to the computational tools for the automatic discovery of Web-based social networks. This conference provides an opportunity to compare and contrast the ethological approach to social behavior in animals (including the study of animal tracks and learning by members of the same species) with web-based evidence of social interaction, perceptual learning, information granulation, the behavior of humans and affinities between web-based social networks. The main topics cover the design and use of various computational intelligence tools and software, simulations of social networks, representation and analysis of social networks, use of semantic networks in the design and community-based research issues such as knowledge discovery, privacy and protection, and visualization.

We solicit original research and technical papers not published elsewhere. The papers can be theoretical, practical and application, and cover a broad set of intelligent methods, with particular emphasis on Social Network computing.

One of the more interesting aspects of social network study, at least to me, is the existence of social networks of researchers who are studying social networks. That implies the “subjects” of discussion have their origins in social networks.

Some approaches, I won’t name names, take “subjects” as given and never question their origins. That leads directly to fragile systems/ontologies because change isn’t taken into account.

Clearly, saying “stop” to change is insufficient; otherwise the many attempts to fix a standardized language in place would have succeeded long ago.

If you know of approaches that attempt to allow for change, I would appreciate a note.

PostgreSQL’s place in the New World Order

Filed under: Cloud Computing,Database,Heroku,PostgreSQL — Patrick Durusau @ 4:22 am

PostgreSQL’s place in the New World Order by Matthew Soldo.

Description:

Mainstream software development is undergoing a radical shift. Driven by the agile development needs of web, social, and mobile apps, developers are increasingly deploying to platforms-as-a-service (PaaS). A key enabling technology of PaaS is cloud-services: software, often open-source, that is consumed as a service and operated by a third-party vendor. This shift has profound implications for the open-source world. It enables new business models, increases emphasis on user-experience, and creates new opportunities.

PostgreSQL is an excellent case study in this shift. The PostgreSQL project has long offered one of the most reliable open source databases, but has received less attention than competing technologies. But in the PaaS and cloud-services world, reliability and open-ness become increasingly important. As such, we are seeing the beginning of a shift in adoption towards PostgreSQL.

The datastore landscape is particularly interesting because of the recent attention given to the so-called NoSQL technologies. Data is suddenly sexy again. This attention is largely governed by the same forces driving developers to PaaS, namely the need for agility and scalability in building modern apps. Far from being a threat to PostgreSQL, these technologies present an amazing opportunity for showing the way towards making PostgreSQL more powerful and more widely adopted.

The presentation sounds great, but alas, the slidedeck is just a slidedeck. 🙁

I do recommend it for the next to last slide graphic. Very cool!

(And it may be time to take another look at PostgreSQL as well.)

London 2012 Olympic athletes: the full list

Filed under: Data,Dataset — Patrick Durusau @ 4:10 am

London 2012 Olympic athletes: the full list

Simon Rogers of the Guardian reports scraping together the full list of Olympic athletes into a single data set.

Simon says:

We’ve just scratched the surface of this dataset – you can download it below. What can you do with it?

I would ask the question somewhat differently: Having the data set, what can you reliably add to it?

Aggregate data analysis is interesting but then so is aggregated data on the individual athletes.

PS: If you do something interesting with the data set, be sure to let the Guardian know.

July 26, 2012

How to Track Your Data: Rule-Based Data Provenance Tracing Algorithms

Filed under: Data,Provenance — Patrick Durusau @ 3:43 pm

How to Track Your Data: Rule-Based Data Provenance Tracing Algorithms by Zhang, Qing Olive; Ko, Ryan K L; Kirchberg, Markus; Suen, Chun-Hui; Jagadpramana, Peter; Lee, Bu Sung.

Abstract:

As cloud computing and virtualization technologies become mainstream, the need to be able to track data has grown in importance. Having the ability to track data from its creation to its current state or its end state will enable the full transparency and accountability in cloud computing environments. In this paper, we showcase a novel technique for tracking end-to-end data provenance, a meta-data describing the derivation history of data. This breakthrough is crucial as it enhances trust and security for complex computer systems and communication networks. By analyzing and utilizing provenance, it is possible to detect various data leakage threats and alert data administrators and owners; thereby addressing the increasing needs of trust and security for customers’ data. We also present our rule-based data provenance tracing algorithms, which trace data provenance to detect actual operations that have been performed on files, especially those under the threat of leaking customers’ data. We implemented the cloud data provenance algorithms into an existing software with a rule correlation engine, show the performance of the algorithms in detecting various data leakage threats, and discuss technically its capabilities and limitations.

Interesting work, but data provenance isn't solely a cloud computing or virtualization issue.

Consider the ongoing complaints in Washington, D.C. about who leaked what to whom and why.

All posturing to one side, that is a data provenance and subject identity issue.

The sort of thing where a topic map application could excel.

MongoDB 2.2.0-rc0

Filed under: MongoDB — Patrick Durusau @ 2:38 pm

MongoDB 2.2.0-rc0

The latest unstable release of MongoDB.

Release notes for 2.2.0-rc0.

Among the changes you will find:

  • Aggregation Framework
  • TTL Collections
  • Concurrency Improvements
  • Query Optimizer Improvements
  • Tag Aware Sharding

among others.
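
For a quick feel of two of these from Python, here is a hedged sketch; it assumes a local mongod running the 2.2 release candidate and a pymongo recent enough to expose these calls (the collection and field names are made up).

```python
# TTL collections and the aggregation framework from pymongo.
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["demo"]

# TTL collection: documents expire roughly 1 hour after their createdAt value.
db.sessions.create_index("createdAt", expireAfterSeconds=3600)
db.sessions.insert_one({"user": "alice", "createdAt": datetime.now(timezone.utc)})

# Aggregation framework: count events per user without writing map-reduce.
pipeline = [{"$group": {"_id": "$user", "events": {"$sum": 1}}}]
for row in db.events.aggregate(pipeline):
    print(row)
```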

Schema.org and One Hundred Years of Search

Filed under: Indexing,Searching,Text Mining,Web History — Patrick Durusau @ 2:13 pm

Schema.org and One Hundred Years of Search by Dan Brickley.

From the post:

Slides and video are already in the Web, but I wanted to post this as an excuse to plug the new Web History Community Group that Max and I have just started at W3C. The talk was part of the Libraries, Media and the Semantic Web meetup hosted by the BBC in March. It gave an opportunity to run through some forgotten history, linking Paul Otlet, the Universal Decimal Classification, schema.org and some 100 year old search logs from Otlet’s Mundaneum. Having worked with the BBC Lonclass system (a descendant of Otlet’s UDC), and collaborated with the Aida Slavic of the UDC on their publication of Linked Data, I was happy to be given the chance to try to spell out these hidden connections. It also turned out that Google colleagues have been working to support the Mundaneum and the memory of this early work, and I’m happy that the talk led to discussions with both the Mundaneum and Computer History Museum about the new Web History group at W3C.

Sounds like a great starting point!

But the intellectual history of indexing and search runs far deeper than one hundred years. Our current efforts are likely to profit from a deeper knowledge of our roots.

Network biology methods integrating biological data for translational science

Filed under: Bioinformatics,Text Mining — Patrick Durusau @ 1:35 pm

Network biology methods integrating biological data for translational science by Gurkan Bebek, Mehmet Koyutürk, Nathan D. Price, and Mark R. Chance. (Brief Bioinform (2012) 13 (4): 446-459. doi: 10.1093/bib/bbr075)

Abstract:

The explosion of biomedical data, both on the genomic and proteomic side as well as clinical data, will require complex integration and analysis to provide new molecular variables to better understand the molecular basis of phenotype. Currently, much data exist in silos and is not analyzed in frameworks where all data are brought to bear in the development of biomarkers and novel functional targets. This is beginning to change. Network biology approaches, which emphasize the interactions between genes, proteins and metabolites provide a framework for data integration such that genome, proteome, metabolome and other -omics data can be jointly analyzed to understand and predict disease phenotypes. In this review, recent advances in network biology approaches and results are identified. A common theme is the potential for network analysis to provide multiplexed and functionally connected biomarkers for analyzing the molecular basis of disease, thus changing our approaches to analyzing and modeling genome- and proteome-wide data.

Integrating as well as filtering data for various modeling purposes is standard topic map fare.

Looking forward to complex integration needs driving further development of topic maps!

Mining the pharmacogenomics literature—a survey of the state of the art

Filed under: Bioinformatics,Genome,Pharmaceutical Research,Text Mining — Patrick Durusau @ 1:23 pm

Mining the pharmacogenomics literature—a survey of the state of the art by Udo Hahn, K. Bretonnel Cohen, and Yael Garten. (Brief Bioinform (2012) 13 (4): 460-494. doi: 10.1093/bib/bbs018)

Abstract:

This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.

At thirty-six (36) pages and well over 200 references, this is going to take a while to digest.

Some questions to be thinking about while reading:

How are entity recognition issues same/different?

What techniques have you seen before? How different/same?

What other techniques would you suggest?

The Parsons Journal for Information Mapping (PJIM)

Filed under: Graphics,Mapping,Visualization — Patrick Durusau @ 12:40 pm

The Parsons Journal for Information Mapping (PJIM)

A publication of the Parsons Institute for Information Mapping (PIIM), which hosts this journal and other information mapping resources.

The journal has a rolling, open call for papers and projects, the closest one being the October 2012 issue:

Abstract due: August, 20, 2012

Final Submissions due: September 24, 2012

From the journal homepage:

The Parsons Journal for Information Mapping (PJIM) is an academic journal and online forum to promote research, writing, and digital execution of theories in the field of information mapping and its related disciplines. Our mission is to identify and disseminate knowledge about the fields of information mapping, information design, data visualization, information taxonomies/structures, data analytics, informatics, information systems, and user interface design.

PJIM focuses on both the theoretical and practical aspects of information visualization. With each issue, the Journal aims to present novel ideas and approaches that advance the field of Knowledge Visualization through visual, engineering, and cognitive methods.

We have a rolling, open call for submissions for original essays, academic manuscripts, interactive and non-interactive projects, and project documentation that address representation, processing, and communication of information. We encourage interdisciplinary thinking and approaches and are open to submissions regarding, but not limited to, the following disciplines:

  • Visual analysis and interpretation
  • Social, political, or economic discourse surrounding information, distribution and use
  • Cognition, thinking, and learning
  • Visual and perceptual literacy
  • Historical uses of information in imagery
  • Semiotics

Links to research papers and other resources await you at the PIIM homepage.

Understanding Apache Hadoop’s Capacity Scheduler

Filed under: Clustering (servers),Hadoop,MapReduce — Patrick Durusau @ 10:43 am

Understanding Apache Hadoop’s Capacity Scheduler by Arun Murthy

From the post:

As organizations continue to ramp the number of MapReduce jobs processed in their Hadoop clusters, we often get questions about how best to share clusters. I wanted to take the opportunity to explain the role of Capacity Scheduler, including covering a few common use cases.

Let me start by stating the underlying challenge that led to the development of Capacity Scheduler and similar approaches.

As organizations become more savvy with Apache Hadoop MapReduce and as their deployments mature, there is a significant pull towards consolidation of Hadoop clusters into a small number of decently sized, shared clusters. This is driven by the urge to consolidate data in HDFS, allow ever-larger processing via MapReduce and reduce operational costs & complexity of managing multiple small clusters. It is quite common today for multiple sub-organizations within a single parent organization to pool together Hadoop/IT budgets to deploy and manage shared Hadoop clusters.

Initially, Apache Hadoop MapReduce supported a simple first-in-first-out (FIFO) job scheduler that was insufficient to address the above use case.

Enter the Capacity Scheduler.

Shared Hadoop clusters?

So long as we don’t have to drop off our punch cards at the shared Hadoop cluster computing center, I suppose that’s OK.

😉

Just teasing.

Shared Hadoop clusters are more cost effective and make better use of your Hadoop specialists.

Why we build our platform on HDFS

Filed under: Cloudera,Hadoop,HDFS — Patrick Durusau @ 10:16 am

Why we build our platform on HDFS by Charles Zedlewski

Charles Zedlewski pushes the number of HDFS alternatives up to twelve:

It’s not often the case that I have a chance to concur with my colleague E14 over at Hortonworks but his recent blog post gave the perfect opportunity. I wanted to build on a few of E14’s points and add some of my own.

A recent GigaOm article presented 8 alternatives to HDFS. They actually missed at least 4 others. For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi). Appistry continues to market its HDFS alternative. I’m not sure if it’s released yet but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well. HP Ibrix has also supported the HDFS API for some years now.

The GigaOm article implies that the presence of twelve other vendors promoting alternatives must speak to some deficiencies in HDFS for what else would motivate so many offerings? This really draws the incorrect conclusion. I would ask this:

What can we conclude from the fact that there are:

Best links I have for the HDFS alternatives (for your convenience and additions):

  1. Appistry
  2. Cassandra (DataStax)
  3. Ceph (Inktank)
  4. Clustered Filesystem (CFS)
  5. Dispersed Storage Network (Cleversafe)
  6. GPFS (IBM)
  7. Ibrix
  8. Isilon (EMC)
  9. Lustre
  10. MapR File System
  11. NetApp Open Solution for Hadoop
  12. Parascale

Law Libraries, Government Transparency, and the Internet

Filed under: Government,Law,Library — Patrick Durusau @ 9:35 am

Law Libraries, Government Transparency, and the Internet by Daniel Schuman.

From the post:

This past weekend I was fortunate to attend the American Association of Law Libraries 105th annual conference. On Sunday morning, I gave a presentation to a special interest section entitled “Law Libraries, Government Transparency, and the Internet,” where I discussed the important role that law libraries can play in making the government more open and transparent.

The slides illustrate the range of legal material that is becoming available, material which is by definition difficult for the lay reader to access.

I see an important role for law libraries as curators who create access points for both professional as well as lay researchers.

I first saw this at Legal Informatics.

Understanding Indexing [Webinar]

Filed under: Database,Indexing — Patrick Durusau @ 8:12 am

Understanding Indexing [Webinar]

July 31st, 2012, 2PM EDT / 11AM PDT

From the post:

Three rules on making indexes around queries to provide good performance

Application performance often depends on how fast a query can respond and query performance almost always depends on good indexing. So one of the quickest and least expensive ways to increase application performance is to optimize the indexes. This talk presents three simple and effective rules on how to construct indexes around queries that result in good performance.

This webinar is a general discussion applicable to all databases using indexes and is not specific to any particular MySQL® storage engine (e.g., InnoDB, TokuDB®, etc.). The rules are explained using a simple model that does NOT rely on understanding B-trees, Fractal Tree® indexing, or any other data structure used to store the data on disk.

Indexing is one of those “overloaded” terms in information technologies.

Indexing can refer to:

  1. Database indexing
  2. Search engine indexing
  3. Human indexing

just to name a few of the more obvious uses.

To be sure, you need to be aware of, if not proficient at, all three, and this webinar should be a start on #1.
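
As a warm-up for the webinar, here is a small sketch of the general idea of building an index around a query (my toy example, not taken from the webinar), using SQLite only because it ships with Python; the principle is storage-engine agnostic.

```python
# Show, via EXPLAIN QUERY PLAN, the difference a composite index makes
# for a query that filters on two columns and sorts on a third.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, status TEXT, created_at TEXT, total REAL)")

query = ("SELECT total FROM orders "
         "WHERE customer_id = ? AND status = 'open' ORDER BY created_at")

def plan():
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (42,))]

print(plan())   # roughly: a SCAN of the table plus a temp B-tree for the ORDER BY

# Composite index: equality columns first, then the ORDER BY column.
conn.execute("CREATE INDEX idx_orders_cust_status_created "
             "ON orders (customer_id, status, created_at)")
print(plan())   # roughly: a SEARCH using the index, with no separate sort step
```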

PS: If you know of a more complete typology of indexing, perhaps with pointers into the literature, please give a shout!

July 25, 2012

London Olympics: download the full schedule as open data

Filed under: Data — Patrick Durusau @ 7:00 pm

London Olympics: download the full schedule as open data

From the Guardian, where so much useful data gathers.

The Case for Curation: The Relevance of Digest and Citator Results in Westlaw and Lexis

Filed under: Aggregation,Curation,Legal Informatics,LexisNexis,Westlaw — Patrick Durusau @ 6:51 pm

The Case for Curation: The Relevance of Digest and Citator Results in Westlaw and Lexis by Susan Nevelow Mart and Jeffrey Luftig.

Abstract:

Humans and machines are both involved in the creation of legal research resources. For legal information retrieval systems, the human-curated finding aid is being overtaken by the computer algorithm. But human-curated finding aids still exist. One of them is the West Key Number system. The Key Number system’s headnote classification of case law, started back in the nineteenth century, was and is the creation of humans. The retrospective headnote classification of the cases in Lexis’s case databases, started in 1999, was created primarily although not exclusively with computer algorithms. So how do these two very different systems deal with a similar headnote from the same case, when they link the headnote to the digesting and citator functions in their respective databases? This paper continues an investigation into this question, looking at the relevance of results from digest and citator search run on matching headnotes in ninety important federal and state cases, to see how each performs. For digests, where the results are curated – where a human has made a judgment about the meaning of a case and placed it in a classification system – humans still have an advantage. For citators, where algorithm is battling algorithm to find relevant results, it is a matter of the better algorithm winning. But no one algorithm is doing a very good job of finding all the relevant results; the overlap between the two citator systems is not that large. The lesson for researchers: know how your legal research system was created, what involvement, if any, humans had in the curation of the system, and what a researcher can and cannot expect from the system you are using.

A must read for library students and legal researchers.

For legal research, the authors conclude:

The intervention of humans as curators in online environments is being recognized as a way to add value to an algorithm’s results, in legal research tools as well as web-based applications in other areas. Humans still have an edge in predicting which cases are relevant. And the intersection of human curation and algorithmically-generated data sets is already well underway. More curation will improve the quality of results in legal research tools, and most particularly can be used to address the algorithmic deficit that still seems to exist where analogic reasoning is needed. So for legal research, there is a case for curation. [footnotes omitted]

The distinction between curation (human gathering of relevant material) and aggregation (machine gathering of potentially relevant material) looks quite useful.

Curation anyone?

I first saw this at Legal Informatics.

Beyond The Pie Chart

Filed under: Graphics,Interface Research/Design,Visualization — Patrick Durusau @ 6:31 pm

Beyond The Pie Chart: Creating new visualization tools to reveal treasures in the data by Hunter Whitney.

From the post:

The New Treasure Maps

If a picture’s worth a thousand words, what’s the value of an image representing a terabyte of data?

Much of the vast sea of data flowing around the world every day is left unexplored because the existing tools and charts can’t help us effectively navigate it. Data visualization, interactive infographics, and related visual representation techniques can play a key part in helping people find their way through the wide expanses of data now opening up. There’s a long history of depicting complex information in graphical forms, but the gusher of data now flowing from corporations, governments and scientific research requires more powerful and sophisticated visualization tools to manage it.

Just as a compass needle can give us direction in physical space, a chart line can direct our way through data. As effective as these simple lines may be, they can only take us so far. For many purposes, advanced data visualization methods may never replace Excel, but in our data-saturated world, they might well be the best tools for the job. UX designers can play a key role in creating these new tools and charts. In these treasure maps of data, perhaps UX marks the spot.

Start of what promises to be an interesting series of posts on visualization.

Using MySQL Full-Text Search in Entity Framework

Filed under: Full-Text Search,MySQL,Searching,Text Mining — Patrick Durusau @ 6:14 pm

Using MySQL Full-Text Search in Entity Framework

Another database/text search post not for the faint of heart.

MySQL database supports an advanced functionality of full-text search (FTS) and full-text indexing described comprehensively in the documentation:

We decided to meet the needs of our users willing to take advantage of the full-text search in Entity Framework and implemented the full-text search functionality in our Devart dotConnect for MySQL ADO.NET Entity Framework provider.
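
Under the covers this comes down to MySQL’s MATCH … AGAINST syntax. Here is a minimal sketch of the raw full-text query from Python (mine, not Devart’s Entity Framework code), against a hypothetical articles table that has a FULLTEXT index on (title, body).

```python
# Run a MySQL natural-language full-text search and rank by relevance score.
import mysql.connector

conn = mysql.connector.connect(user="demo", password="demo", database="demo")
cur = conn.cursor()

cur.execute(
    "SELECT id, title, "
    "       MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
    "FROM articles "
    "WHERE MATCH(title, body) AGAINST (%s IN NATURAL LANGUAGE MODE) "
    "ORDER BY score DESC LIMIT 10",
    ("topic maps", "topic maps"),
)
for article_id, title, score in cur:
    print(article_id, title, round(score, 3))

cur.close()
conn.close()
```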

Hard to say why Beyond Search picked up the Oracle post but left the MySQL one hanging.

I haven’t gone out and counted noses but I suspect there are a lot more installs of MySQL than Oracle 11g. Just my guess. Don’t buy or sell stock based on my guesses.
