Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

October 25, 2012

Health Design Challenge [$50K in Prizes – Deadline 30th Nov 2012]

Filed under: Challenges,Health care,Medical Informatics — Patrick Durusau @ 10:01 am

Health Design Challenge

More details at the site but:

ONC & VA invite you to rethink how the medical record is presented. We believe designers can use their talents to make health information patient-centered and improve the patient experience.

Being able to access your health information on demand can be lifesaving in an emergency situation, can help prevent medication errors, and can improve care coordination so everyone who is caring for you is on the same page. However, too often health information is presented in an unwieldy and unintelligible way that makes it hard for patients, their caregivers, and their physicians to use. There is an opportunity for talented designers to reshape the way health records are presented to create a better patient experience.

Learn more at http://healthdesignchallenge.com

The purpose of this effort is to improve the design of the medical record so it is more usable by and meaningful to patients, their families, and others who take care of them. This is an opportunity to take the plain-text Blue Button file and enrich it with visuals and a better layout. Innovators will be invited to submit their best designs for a medical record that can be printed and viewed digitally.

This effort will focus on the content defined by a format called the Continuity of Care Document (CCD). A CCD is a common template used to describe a patient’s health history and can be output by electronic medical record (EMR) software. Submitted designs should use the sections and fields found in a CCD. See http://blue-button.github.com/challenge/files/health-design-challenge-fields.pdf for CCD sections and fields.

Entrants will submit a design that:

  • Improves the visual layout and style of the information from the medical record
  • Makes it easier for a patient to manage his/her health
  • Enables a medical professional to digest information more efficiently
  • Aids a caregiver such as a family member or friend in his/her duties and responsibilities with respect to the patient

Entrants should be conscious of how the wide variety of personas will affect their design. Our healthcare system takes care of the following types of individuals:

  • An underserved inner-city parent with lower health literacy
  • A senior citizen that has a hard time reading
  • A young adult who is engaged with technology and mobile devices
  • An adult whose first language is not English
  • A patient with breast cancer receiving care from multiple providers
  • A busy mom managing her kids’ health and helping her aging parents

This is an opportunity for talented individuals to touch the lives of Americans across the country through design. The most innovative designs will be showcased in an online gallery and in a physical exhibit at the Annual ONC Meeting in Washington DC.

That should be enough to capture your interest.

Winners will be announced December 12, 2012.

Only the design is required, no working code.

Still, a topic map frame of mind may give you more options than other approaches.
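No working code is required, but it can help to get a feel for the raw material before sketching layouts. Below is a minimal, hypothetical Python sketch that splits a plain-text Blue Button export into named sections; the header convention it assumes (a single line of dashes, an all-caps title, more dashes) is a guess on my part, so check an actual sample export and the CCD field list before leaning on it.

    import re
    from collections import OrderedDict

    def split_blue_button_sections(text):
        """Split a plain-text Blue Button export into {section title: body} pairs.
        Assumes headers are single lines such as '----- MEDICATIONS -----';
        adjust the pattern after inspecting a real sample file."""
        header = re.compile(r"^-{3,}\s*([A-Z][A-Z /&]+?)\s*-{3,}\s*$", re.MULTILINE)
        sections = OrderedDict()
        matches = list(header.finditer(text))
        for i, match in enumerate(matches):
            start = match.end()
            end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
            sections[match.group(1).strip()] = text[start:end].strip()
        return sections

    if __name__ == "__main__":
        with open("sample_blue_button.txt") as f:   # hypothetical sample file
            for title, body in split_blue_button_sections(f.read()).items():
                print(title, "-", len(body.splitlines()), "lines")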

Exploiting Parallelism and Scalability (XPS) (NSF)

Filed under: Language,Language Design,Parallelism,Scalability — Patrick Durusau @ 4:53 am

Exploiting Parallelism and Scalability (XPS) (NSF)

From the announcement:

Synopsis of Program:

Computing systems have undergone a fundamental transformation from the single-processor devices of the turn of the century to today’s ubiquitous and networked devices and warehouse-scale computing via the cloud. Parallelism has become ubiquitous at many levels. The proliferation of multi- and many-core processors, ever-increasing numbers of interconnected high performance and data intensive edge devices, and the data centers servicing them, is enabling a new set of global applications with large economic and social impact. At the same time, semiconductor technology is facing fundamental physical limits and single processor performance has plateaued. This means that the ability to achieve predictable performance improvements through improved processor technologies has ended.

The Exploiting Parallelism and Scalability (XPS) program aims to support groundbreaking research leading to a new era of parallel computing. XPS seeks research re-evaluating, and possibly re-designing, the traditional computer hardware and software stack for today’s heterogeneous parallel and distributed systems and exploring new holistic approaches to parallelism and scalability. Achieving the needed breakthroughs will require a collaborative effort among researchers representing all areas– from the application layer down to the micro-architecture– and will be built on new concepts and new foundational principles. New approaches to achieve scalable performance and usability need new abstract models and algorithms, programming models and languages, hardware architectures, compilers, operating systems and run-time systems, and exploit domain and application-specific knowledge. Research should also focus on energy- and communication-efficiency and on enabling the division of effort between edge devices and clouds.

Full proposals due: February 20, 2013 (by 5 p.m. proposer’s local time).

I see the next wave of parallelism and scalability being based on language and semantics. Less so on more cores and better designs in silicon.

Not surprising since I work in languages and semantics every day.

Even so, a go-cart that exceeds 160 miles per hour (260 km/h) is still a go-cart.

Go beyond building a faster go-cart.

Consider language and semantics when writing your proposal for this program.

Cloudera’s Impala and the Semantic “Mosh Pit”

Filed under: Cloudera,Hadoop,Impala — Patrick Durusau @ 4:30 am

Cloudera’s Impala tool binds Hadoop with business intelligence apps by Christina Farr.

From the post:

In traditional circles, Hadoop is viewed as a bright but unruly problem child.

Indeed, it is still in the nascent stages of development. However the scores of “big data” startups that leverage Hadoop will tell you that it is here to stay.

Cloudera, the venture-backed startup that ushered the mainstream deployment of Hadoop, has unveiled a new technology at the Hadoop World, the data-focused conference in New York.

Its new product, known as “Impala”, addresses many of the concerns that large enterprises still have about Hadoop, namely that it does not integrate well with traditional business intelligence applications.

“We have heard this criticism,” said Charles Zedlewski, Cloudera’s VP of Product in a phone interview with VentureBeat. “That’s why we decided to do something about it,” he said.

Impala enables its users to store vast volumes of unwieldy data and run queries in HBase, Hadoop’s NoSQL database. What’s interesting is that it is built to maximise speed: it runs on top of Hadoop storage, but speaks to SQL and works with pre-existing drivers.

Legacy data is a well known concept.

Are we approaching the point of legacy applications? Applications that are too widely/deeply embedded in IT infrastructure to be replaced?

Or at least not replaced quickly?

The semantics of legacy data are known to be fair game for topic maps. Do the semantics of legacy applications offer the same possibilities?

Mapping the semantics of “legacy” applications, their ancestors and descendants, data, legacy and otherwise, results in a semantic mosh pit.

Some strategies for a semantic “mosh pit:”

  1. Prohibit it (we know the success rate on that option)
  2. Ignore it (costly but more “successful” than #1)
  3. Create an app on top of the legacy app (an error repeated isn’t an error, it’s following precedent)
  4. Sample it (but what are you missing?)
  5. Map it (being mindful of cost/benefit)

Which one are you going to choose?
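If you choose option 5, the core move is modest: record how each legacy system names the same subject and resolve records through that mapping, rather than rewriting the systems. A minimal sketch, with invented field names, might look like this:

    # Hypothetical mapping of legacy field names to shared subject identifiers.
    FIELD_MAP = {
        "cust_no":   "customer-id",      # legacy billing application
        "CUSTOMER":  "customer-id",      # data warehouse export
        "clientKey": "customer-id",      # newer web application
        "acct_bal":  "account-balance",
        "BALANCE":   "account-balance",
    }

    def normalize_record(record):
        """Re-key one legacy record onto the shared subject identifiers."""
        merged = {}
        for field, value in record.items():
            subject = FIELD_MAP.get(field, field)   # unmapped fields pass through
            merged.setdefault(subject, value)
        return merged

    print(normalize_record({"cust_no": "A-1017", "acct_bal": "42.50"}))
    # {'customer-id': 'A-1017', 'account-balance': '42.50'}

The point is not the five-line function but the mapping table: that is where the cost/benefit judgment in option 5 lives.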

October 24, 2012

Online Education- MongoDB and Oracle R Enterprise

Filed under: MongoDB,Oracle,R — Patrick Durusau @ 7:03 pm

Online Education- MongoDB and Oracle R Enterprise by Ajay Ohri.

Ajay brings news of two MongoDB online courses, one for developers and one for DBAs, and an Oracle offering on R.

The MongoDB classes started Monday (22nd of October) so you had better hurry to register.

JournalTOCs

Filed under: Data Source,Library,Library software,Publishing — Patrick Durusau @ 4:02 pm

JournalTOCs

Most publishers have TOC services for new issues of their journals.

JournalTOCs aggregates TOCs from publishers and maintains a searchable database of their TOC postings.

A database that is accessible via a free API, I should add.

The API should be a useful way to add journal articles to a topic map, particularly when you want to add selected articles and not entire issues.
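As a rough illustration (not an official client), pulling the latest article titles for one journal might look like the sketch below. The endpoint pattern and the registered-email parameter are assumptions from memory of the JournalTOCs API, so check the site's API documentation for the current URL, parameters and terms of use.

    import urllib.request
    import xml.etree.ElementTree as ET

    # Assumed endpoint: per-journal TOC feed, looked up by ISSN.
    API_URL = "http://www.journaltocs.ac.uk/api/journals/{issn}?output=articles&user={email}"

    def _child_text(element, name):
        """Text of the first child whose local name matches, ignoring namespaces."""
        for child in element:
            if child.tag.split("}")[-1] == name:
                return child.text
        return None

    def latest_articles(issn, email):
        """Return (title, link) pairs from a journal's TOC feed."""
        with urllib.request.urlopen(API_URL.format(issn=issn, email=email)) as response:
            root = ET.parse(response).getroot()
        items = [el for el in root.iter() if el.tag.split("}")[-1] == "item"]
        return [(_child_text(i, "title"), _child_text(i, "link")) for i in items]

    if __name__ == "__main__":
        # Example ISSN only; substitute a journal you actually follow.
        for title, link in latest_articles("0897-1277", "you@example.com"):
            print(title, "->", link)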

I am looking forward to using and exploring JournalTOCs.

Suggest you do the same.

Kurt Thomas on Security at Twitter and Everywhere

Filed under: BigData,Security,Tweets — Patrick Durusau @ 3:32 pm

Kurt Thomas on Security at Twitter and Everywhere by Marti Hearst.

From the post:

Kurt Thomas is a former Twitter engineer and a current PhD student at UC Berkeley who studies how the criminal underground conspires to make money via unintended uses of computer systems.

Lecture notes.

Focus is on underground economies that depend upon theft of data or compromise of access to data.

I suspect that if you started making money off a free service, that would be an “unintended use” as well.

GreenPlum Chorus

Filed under: Greenplum — Patrick Durusau @ 3:02 pm

GreenPlum Chorus

Do you know anything about GreenPlum’s Chorus?

I ask because I saw a blurb about it being open sourced and have been unable to find better documentation than:

Big Data Agility for your Data Science Team

Greenplum Chorus enables Big Data agility for your data science team. The first solution of its kind, Greenplum Chorus provides an analytic productivity platform that enables the team to search, explore, visualize, and import data from anywhere in the organization. It provides rich social network features that revolve around datasets, insights, methods, and workflows, allowing data analysts, data scientists, IT staff, DBAs, executives, and other stakeholders to participate and collaborate on Big Data. Customers deploy Chorus to create a self-service agile analytic infrastructure; teams can create workspaces on the fly with self-service provisioning, and then instantly start creating and sharing insights.

Chorus breaks down the walls between all of the individuals involved in the data science team and empowers everyone who works with your data to more easily collaborate and derive insight from that data.

That’s just gibberish.

That’s from: http://www.greenplum.com/products/chorus.

What I can find on the rest of the site isn’t any better:

The data sheet gives system requirements but no practical details about Chorus.

The GreenPlum Forum for Chorus has two posts.

The Chorus Project wiki isn’t helpful either.

I may be interested in contributing to the project, particularly documentation, but without knowing more, it’s hard to say.

The Data Science Community on Twitter

Filed under: Data Science,Graphs,Networks,Social Networks,Tweets,Visualization — Patrick Durusau @ 2:07 pm

The Data Science Community on Twitter

From the webpage:

659 Twitter accounts linked to data science, May 2012.

Linkage of Twitter accounts to display followers and following nodes.

That sounds so inadequate (and is).

You need to go see the page, play with it and then come back.

How was that? Impressive yes?

OK, how would that experience be different if you were using a topic map?

More/less information? Other display options?

It is an impressive piece of eye candy but I have a sense it could be so much more.

You?

Kaggle Digit Recognizer: A K-means attempt

Filed under: K-Means Clustering,K-Nearest-Neighbors,Machine Learning — Patrick Durusau @ 9:05 am

Kaggle Digit Recognizer: A K-means attempt by Michael Needham.

From the post:

Over the past couple of months Jen and I have been playing around with the Kaggle Digit Recognizer problem – a ‘competition’ created to introduce people to Machine Learning.

The goal in this competition is to take an image of a handwritten single digit, and determine what that digit is.

You are given an input file which contains multiple rows each containing 784 pixel values representing a 28×28 pixel image as well as a label indicating which number that image actually represents.

One of the algorithms that we tried out for this problem was a variation on the k-means clustering one whereby we took the values at each pixel location for each of the labels and came up with an average value for each pixel.

The results of machine learning are likely to be direct or indirect input into your topic maps.

Useful evaluation of that input will depend on your understanding of machine learning.
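As a back-of-the-envelope version of the averaging approach described in the post (not Michael and Jen's actual code), you can build one "average image" per digit label and classify a new image by its nearest average:

    import numpy as np

    def label_means(pixels, labels):
        """pixels: (n, 784) array of grey values; labels: (n,) array of digits 0-9.
        Returns a (10, 784) array holding the mean pixel image for each digit."""
        return np.vstack([pixels[labels == digit].mean(axis=0) for digit in range(10)])

    def predict(means, image):
        """Assign the digit whose mean image is closest in Euclidean distance."""
        return int(np.argmin(np.linalg.norm(means - image, axis=1)))

    # Tiny synthetic example; with the real Kaggle data you would load the full training file.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 255, size=(200, 784)).astype(float)
    y = rng.integers(0, 10, size=200)
    means = label_means(X, y)
    print("predicted", predict(means, X[0]), "actual", int(y[0]))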

October 23, 2012

Basics of JavaScript and D3 for R Users

Filed under: D3,Javascript,R — Patrick Durusau @ 10:58 am

Basics of JavaScript and D3 for R Users by Jerzy Wieczorek.

From the post:

Hadley Wickham, creator of the ggplot2 R package, has been learning JavaScript and its D3 library for the next iteration of ggplot2 (tentatively titled r2d3?)… so I suspect it’s only a matter of time before he pulls the rest of the R community along.

Below are a few things that weren’t obvious when I first tried reading JavaScript code and the D3 library in particular. (Please comment if you notice any errors.) Then there’s also a quick walkthrough for getting D3 examples running locally on your computer, and finally a list of other tutorials & resources. In a future post, we’ll explore one of the D3 examples and practice tweaking it.

Perhaps these short notes will help other R users get started more quickly than I did. Even if you’re a ways away from writing complex JavaScript from scratch, it can still be useful to take one of the plentiful D3 examples and modify it for your own purposes.

Just in case you don’t have time today to build clusters on EC2. 😉

Being mindful that delivery of content is what leads to sales.

Or, knowing isn’t the same as showing.

The first may make you feel important. The second may lead to sales.

Up to you.
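One practical footnote to the local walkthrough Jerzy mentions: most D3 examples load their data over XHR, so they need to be served over HTTP rather than opened straight from disk. A throwaway local server is enough; a minimal Python 3 version (the stock python -m http.server one-liner does the same thing):

    # Serve the current directory (index.html plus the D3 example's data files)
    # at http://localhost:8000/ so the browser's XHR requests succeed.
    import http.server
    import socketserver

    PORT = 8000
    with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
        print("Serving D3 examples at http://localhost:%d/" % PORT)
        httpd.serve_forever()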

Jurimetrics (Modern Uses of Logic in Law (MULL))

Filed under: Law,Legal Informatics,Logic,Semantics — Patrick Durusau @ 10:48 am

Jurimetrics (Modern Uses of Logic in Law (MULL))

From the about page:

Jurimetrics, The Journal of Law, Science, and Technology (ISSN 0897-1277), published quarterly, is the journal of the American Bar Association Section of Science & Technology Law and the Center for Law, Science & Innovation. Click here to view the online version of Jurimetrics.

Jurimetrics is a forum for the publication and exchange of ideas and information about the relationships between law, science and technology in all areas, including:

  • Physical, life and social sciences
  • Engineering, aerospace, communications and computers
  • Logic, mathematics and quantitative methods
  • The uses of science and technology in law practice, adjudication and court and agency administration
  • Policy implications and legislative and administrative control of science and technology.

Jurimetrics was first published in 1959 under the leadership of Layman Allen as Modern Uses of Logic in Law (MULL). The current name was adopted in 1966. Jurimetrics is the oldest journal of law and science in the United States, and it enjoys a circulation of more than 8,000, which includes all members of the ABA Section of Science & Technology Law.

I just mentioned this journal in Wyner et al.: An Empirical Approach to the Semantic Representation of Laws, but wanted to also capture its earlier title, Modern Uses of Logic in Law (MULL), because I am likely to search for it as well.

I haven’t looked at the early issues in some years but as I recall, they were quite interesting.

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Filed under: Language,Law,Legal Informatics,Machine Learning,Semantics — Patrick Durusau @ 10:37 am

Wyner et al.: An Empirical Approach to the Semantic Representation of Laws

Legal Informatics brings news of Dr. Adam Wyner’s paper, An Empirical Approach to the Semantic Representation of Laws, and quotes the abstract as:

To make legal texts machine processable, the texts may be represented as linked documents, semantically tagged text, or translated to formal representations that can be automatically reasoned with. The paper considers the latter, which is key to testing consistency of laws, drawing inferences, and providing explanations relative to input. To translate laws to a form that can be reasoned with by a computer, sentences must be parsed and formally represented. The paper presents the state-of-the-art in automatic translation of law to a machine readable formal representation, provides corpora, outlines some key problems, and proposes tasks to address the problems.

The paper originated at Project IMPACT.

If you haven’t looked at semantics and the law recently, this is a good opportunity to catch up.

I have only skimmed the paper and its references but am already looking for online access to early issues of Jurimetrics (a journal by the American Bar Association) that addressed such issues many years ago.

Should be fun to see what has changed and by how much. What issues remain and how they are viewed today.

Deploying a GraphLab/Spark/Mesos cluster on EC2

Filed under: Amazon Web Services AWS,Clustering (servers),GraphLab,Spark — Patrick Durusau @ 10:10 am

Deploying a GraphLab/Spark/Mesos cluster on EC2 by Danny Bickson.

From the post:

I got the following instructions from my collaborator Jay (Haijie Gu) who spent some time learning Spark cluster deployment and adapted those useful scripts to be used in GraphLab.

This tutorial will help you spawn a GraphLab distributed cluster, run alternating least squares task, collect the results and shutdown the cluster.

This tutorial is very new beta release. Please contact me if you are brave enough to try it out..

I haven’t seen any responses to Danny’s post. Is yours going to be the first one?

Linguistic Society of America (LSA)

Filed under: Language,Linguistics — Patrick Durusau @ 10:01 am

Linguistic Society of America (LSA)

The membership page says:

The Linguistic Society of America is the major professional society in the United States that is exclusively dedicated to the advancement of the scientific study of language. With nearly 4,000 members, the LSA speaks on behalf of the field of linguistics and also serves as an advocate for sound educational and political policies that affect not only professionals and students of language, but virtually all segments of society. Founded in 1924, the LSA has on many occasions made the case to governments, universities, foundations, and the public to support linguistic research and to see that our scientific discoveries are effectively applied.

Language and linguistics are important in the description of numeric data but even more so for non-numeric data.

Another avenue to sharpen your skills at both.

PS: I welcome your suggestions of other language/linguistic institutions and organizations. Even if our machines don’t understand natural language, our users do.

[T]he [God]father of Google Glass?

Filed under: BigData,Marketing,Privacy — Patrick Durusau @ 9:32 am

The original title is 3 Big Data Insights from the Grandfather of Google Glass. The post describes MIT Media Lab Professor Alex ‘Sandy’ Pentland as the “Grandfather of Google Glass.”

Let’s review Pentland’s three points to see if my title is more appropriate:

1) Big Data is about people.

SP: Big Data is principally about people, it’s not about RFID tags and things like that. So that immediately raises questions about privacy and data ownership.

I mean, this looks like a nightmare scenario unless there’s something that means that people are more in charge of their data and it’s not something that can be used to spy on them. Fortunately as a consequence of this discussion group at the World Economic Forum, we now have the Consumer Privacy Bill of Rights which says you control data about you. It’s not the phone company, it’s not the ad company. And interestingly what that does is it means that the data is more available because it’s more legitimate. People feel safer about using it.

I feel so much better knowing about the “Consumer Privacy Bill of Rights.” Don’t you?

With secret courts, imprisonment without formal charges, government sanctioned murder, torture, in the United States or at its behest, my data won’t be used against me.

You might want to read Leon Panetta Plays Chicken Little before you decide that the current administration, with its Consumer Privacy Bill of Rights has much concern for your privacy.

2) Cell phones are one of the biggest sources of Big Data. Smart phones are becoming universal remote controls.
….
Not so much in this country but in other parts of the world, your phone is the way you interface through the entire world. And so it’s also a window into what your choices are and what you do.

Having a single interface makes gathering intelligence a lot easier than hiring spies and collaborators.

Surveillance is cheaper in bulk quantities.

3) Big Data will be about moving past averages to understanding patterns at the individual level. Doing so will allow us to build a Periodic Table of human behavior.

SP: We’re moving past this sort of Enlightenment way of thinking in terms of markets and competition and big averages and asking, how can we make the information environment at the human level, at the individual level, work for everybody?

I see no signs of a lack of thinking in terms of markets and competition. Are Apple and Google competing? Are Microsoft and IBM competing? Are the various information gateways competing?

It is certainly the case that any of the aforementioned, and others, would like to have everyone as a consumer.

Equality as a consumer for information service providers isn’t that interesting to me.

You?

The universal surveillance that Pentland foresees does offer opportunities for topic maps.

Consider the testing of electronic identities tied to the universal interface, the cell phone.

For a fee, an electronic identity provider will build an electronic identity record tied to a cell phone with residential address, credit history, routine shopping entries, etc.

Topic maps can test how closely an identity matches other identities along a number of dimensions. (For seekers or hiders.)

The quoted post by: Conor Myhrvold and David Feinleib.

I first saw this at KDNuggets.

Up to Date on Open Source Analytics

Filed under: Analytics,MySQL,PostgreSQL,R,Software — Patrick Durusau @ 8:17 am

Up to Date on Open Source Analytics by Steve Miller.

Steve updates his Wintel laptop with the latest releases of open source analytics tools.

Steve’s list:

What’s on your list?

I first saw this mentioned at KDNuggets.

The Ultimate User Experience

Filed under: Image Recognition,Interface Research/Design,Marketing,Usability,Users — Patrick Durusau @ 4:55 am

The Ultimate User Experience by Tim R. Todish.

From the post:

Today, more people have mobile phones than have electricity or safe drinking water. In India, there are more cell phones than toilets! We all have access to incredible technology, and as designers and developers, we have the opportunity to use this pervasive technology in powerful ways that can change people’s lives.

In fact, a single individual can now create an application that can literally change the lives of people across the globe. With that in mind, I’m going to highlight some examples of designers and developers using their craft to help improve the lives of people around the world in the hope that you will be encouraged to find ways to do the same with your own skills and talents.

I may have to get a cell phone to get a better understanding of its potential when combined with topic maps.

For example, the “hot” night spots are well known in New York City. What if a distributed information network imaged guests as they arrived/left and maintained a real time map of images + locations (no names)?

That would make a nice subscription service, perhaps with faceted searching by physical characteristics.

Open Data vs. Private Data?

Filed under: Data,Open Data — Patrick Durusau @ 4:38 am

Why Government Should Care Less About Open Data and More About Data by Andrea Di Maio.

From the post:

Among the flurry of activities and deja-vu around open data that governments worldwide, in all tiers are pursuing to increase transparency and fuel a data economy, I found something really worth reading in a report that was recently published by the Danish government.

Good Basic Data for Everyone – A Driver for Growth and Efficiency” takes a different spin than many others by saying that:

Basic data is the core information authorities use in their day-to-day case processing. Basic data is e.g. data on individuals, businesses, properties, addresses and geography. This information, called basic data, is reused throughout the public sector. Reuse of high-quality data is an essential basis for public authorities to perform their tasks properly and efficiently. Basic data can include personal data.

While most of the categories are open data, the novelty is that for the first time personal and open data is seen for what it is, i.e. data. The document suggests the development of a Data Distributor, which would be responsible for conveying data from different data to its consumers, both inside and outside government. The document also assumes that personal data may be ultimately distributed via a common public-sector data distributor.

Besides what is actually written in the document, this opens the door for a much needed shift from service orientation to data orientation in government service delivery. Stating that data must flow freely across organizational boundaries, irrespective of the type of data (and of course within appropriate policy constraints) is hugely important to lay the foundations for effective integration of services and processes across agencies, jurisdictions, tiers and constituencies.

Combining this with some premises of the US Digital Strategy, which highlights an information layer distinct from a platform layer, which is in turn distinct from a presentation layer, one starts seeing a move toward the centrality of data, which may finally lead to the emergence of citizen data stores that would put control of service access and integration in the hand of individuals.

If there is novelty in the Danish approach, it is in the data being “open data.” That is, all citizens can draw equally on the “basic data” for whatever purpose.

Property records, geographic, geological and other maps, plus addresses were combined long ago in the United States as “private data.”

Despite the data being collected at taxpayer expense, private industry sells access to the collated public data.

Open data may provide businesses with collated public data at a lower cost, but at an expense borne by the public.

What is known as a false dilemma: we can buy back data the government collected on our behalf, or we can pay the government to collect and collate it for the few.


The “individual being in charge of their data” is too obvious a fiction to delay us here. It isn’t true now, and there are no signs it will become true. If you doubt that, try restricting the distribution of your credit report. Post a note when you accomplish that task.

When V = Volume [HST Telemetry Data]

Filed under: Astroinformatics,BigData — Patrick Durusau @ 3:54 am

Personal PCs have TB disk storage. A TB of RAM isn’t far behind. Multi-TBs of both are available in high-end appliances.

One solution when v = volume is to pump up the storage volume. But you can always find data sets that are “big data” for your current storage.

Fact is, “big data” has always outrun current storage. The question of how to store more data than convenient has been asked and answered before. I encountered one of those answers last night.

The abstract to the paper reads:

The Hubble Space Telescope (HST) generates on the order of 7,000 telemetry values, many of which are sampled at 1Hz, and with several hundred parameters being sampled at 40Hz. Such data volumes would quickly tax even the largest of processing facilities. Yet the ability to access the telemetry data in a variety of ways, and in particular, using ad hoc (i.e., no a priori fixed) queries, is essential to assuring the long term viability and usefulness of this instrument. As part of the recent NASA initiative to re-engineer HST’s ground control systems, a concept arose to apply newly available data warehousing technologies to this problem. The Space Telescope Science Institute was engaged to develop a pilot to investigate the technology and to create a proof-of-concept testbed that could be demonstrated and evaluated for operational use. This paper describes this effort and its results.

The authors framed their v = volume problem as:

Then there’s the sheer volume of the telemetry data. At its nominal format and rate, the HST generates over 3,000 monitored samples per second. Tracking each sample as a separate record would generate over 95 giga-records/year, or assuming a 16 year Life-of-Mission (LOM), 1.5 tera-records/LOM. Assuming a minimal 20 byte record per transaction yields 1.9 terabytes/year or 30 terabytes/LOM. Such volumes are supported by only the most exotic and expensive custom database systems made.

We may smile at the numbers now but this was 1998. As always, solutions were needed in the near term, not in a decade or two.

The authors did find a solution. Their v = 30 terabytes/LOM was reduced to v = 2.5 terabytes/LOM.

In the author’s words:

By careful study of the data, we discovered two properties that could significantly reduce this volume. First, instead of capturing each telemetry measurement, by only capturing when the measurement changed value – we could reduce the volume by almost 3-to-1. Second, we recognized that roughly 100 parameters changed most often (i.e., high frequency parameters) and caused the largest volume of the “change” records. By averaging these parameters over some time period, we could still achieve the necessary engineering accuracy while again reducing the volume of records. In total, we reduced the volume of data down to a reasonable 250 records/sec or approximately 2.5 terabytes/LOM.

Two obvious lessons for v = volume cases:

  • Capture only changes in values
  • Capture average for rapidly changing values over time (if meets accuracy requirements)

Less obvious lesson:

  • Study data carefully to understand its properties relative to your requirements.

Studying, understanding and capturing your understanding of your data will benefit you and subsequent researchers working with the same data.

Whether your v = volume is the same as mine or not.
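A minimal sketch of those two tactics, change-only capture plus windowed averaging for a handful of high-frequency parameters, is below. Parameter names, rates and the one-second window are invented for illustration, not taken from the HST paper.

    from collections import defaultdict

    HIGH_FREQUENCY = {"gyro_rate_x", "gyro_rate_y"}   # hypothetical fast parameters
    WINDOW_SECONDS = 1.0                              # averaging window for them

    def reduce_telemetry(samples):
        """samples: time-ordered iterable of (timestamp, parameter, value).
        Yields reduced records: change-only capture for ordinary parameters,
        windowed means for the high-frequency ones (final partial windows dropped)."""
        last_value = {}
        window_start = {}
        window_sum = defaultdict(float)
        window_count = defaultdict(int)

        for ts, param, value in samples:
            if param in HIGH_FREQUENCY:
                start = window_start.setdefault(param, ts)
                if ts - start >= WINDOW_SECONDS and window_count[param]:
                    yield (start, param, window_sum[param] / window_count[param])
                    window_start[param] = ts
                    window_sum[param] = 0.0
                    window_count[param] = 0
                window_sum[param] += value
                window_count[param] += 1
            elif last_value.get(param) != value:
                # Change-only capture: emit a record only when the value moves.
                last_value[param] = value
                yield (ts, param, value)

    # A flat-lined voltage sampled 50 times produces a single record.
    stream = [(t / 10.0, "bus_voltage", 28.0) for t in range(50)]
    print(list(reduce_telemetry(stream)))   # [(0.0, 'bus_voltage', 28.0)]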


Quotes are from: “A Queriable Repository for HST Telemetry Data, a Case Study in using Data Warehousing for Science and Engineering” by Joseph A. Pollizzi, III and Karen Lezon, Astronomical Data Analysis Software and Systems VII, ASP Conference Series, Vol. 145, 1998, Editors: R. Albrecht, R. N. Hook and H. A. Bushouse, pp.367-370.

There are other insights and techniques of interest in this article but I leave them for another post.

October 22, 2012

Whisper: Tracing the Propagation of Twitter Messages in Time and Space

Filed under: Graphics,Tweets,Visualization — Patrick Durusau @ 6:25 pm

Whisper: Tracing the Propagation of Twitter Messages in Time and Space by Andrew Vande Moere.

From the post:

Whisper [whisperseer.com] is a new data visualization technique that traces how Twitter messages propagate, in particular in terms of its temporal trends, its social and spatial extent, and its community response.

Subject of a paper at IEEE InfoVis/VisWeek 2012.

Where I found:

Whisper: Tracing the Spatiotemporal Process of Information Diffusion in Real Time by Nan Cao, Yu-Ru Lin, Xiaohua Sun, David Lazer, Shixia Liu, Huamin Qu.

Abstract:

When and where is an idea dispersed? Social media, like Twitter, has been increasingly used for exchanging information, opinions and emotions about events that are happening across the world. Here we propose a novel visualization design, Whisper, for tracing the process of information diffusion in social media in real time. Our design highlights three major characteristics of diffusion processes in social media: the temporal trend, social-spatial extent, and community response of a topic of interest. Such social, spatiotemporal processes are conveyed based on a sunflower metaphor whose seeds are often dispersed far away. In Whisper, we summarize the collective responses of communities on a given topic based on how tweets were retweeted by groups of users, through representing the sentiments extracted from the tweets, and tracing the pathways of retweets on a spatial hierarchical layout. We use an efficient flux line-drawing algorithm to trace multiple pathways so the temporal and spatial patterns can be identified even for a bursty event. A focused diffusion series highlights key roles such as opinion leaders in the diffusion process. We demonstrate how our design facilitates the understanding of when and where a piece of information is dispersed and what are the social responses of the crowd, for large-scale events including political campaigns and natural disasters. Initial feedback from domain experts suggests promising use for today’s information consumption and dispersion in the wild.

The videos at Andrew’s post are particularly impressive.

Monitoring tweets and their content appears to be a growing trend. Governments are especially interested in such techniques.

HBase Futures

Filed under: Hadoop,HBase,Hortonworks,Semantics — Patrick Durusau @ 2:28 pm

HBase Futures by Devaraj Das.

From the post:

As we have said here, Hortonworks has been steadily increasing our investment in HBase. HBase’s adoption has been increasing in the enterprise. To continue this trend, we feel HBase needs investments in the areas of:

  1. Reliability and High Availability (all data always available, and recovery from failures is quick)
  2. Autonomous operation (minimum operator intervention)
  3. Wire compatibility (to support rolling upgrades across a couple of versions at least)
  4. Cross data-center replication (for disaster recovery)
  5. Snapshots and backups (be able to take periodic snapshots of certain/all tables and be able to restore them at a later point if required)
  6. Monitoring and Diagnostics (which regionserver is hot or what caused an outage)

Probably just a personal prejudice but I would have mentioned semantics in that list.

You?

Spanner – …SQL Semantics at NoSQL Scale

Filed under: NoSQL,Spanner,SQL — Patrick Durusau @ 2:18 pm

Spanner – It’s About Programmers Building Apps Using SQL Semantics at NoSQL Scale by Todd Hoff.

From the post:

A lot of people seem to passionately dislike the term NewSQL, or pretty much any newly coined term for that matter, but after watching Alex Lloyd, Senior Staff Software Engineer Google, give a great talk on Building Spanner, that’s the term that fits Spanner best.

Spanner wraps the SQL + transaction model of OldSQL around the reworked bones of a globally distributed NoSQL system. That seems NewSQL to me.

As Spanner is a not so distant cousin of BigTable, the NoSQL component should be no surprise. Spanner is charged with spanning millions of machines inside any number of geographically distributed datacenters. What is surprising is how OldSQL has been embraced. In an earlier 2011 talk given by Alex at the HotStorage conference, the reason for embracing OldSQL was the desire to make it easier and faster for programmers to build applications. The main ideas will seem quite familiar:

  • There’s a false dichotomy between little complicated databases and huge, scalable, simple ones. We can have features and scale them too.
  • Complexity is conserved, it goes somewhere, so if it’s not in the database it’s pushed to developers.
  • Push complexity down the stack so developers can concentrate on building features, not databases, not infrastructure.
  • Keys for creating a fast-moving app team: ACID transactions; global Serializability; code a 1-step transaction, not 10-step workflows; write queries instead of code loops; joins; no user defined conflict resolution functions; standardized sync; pay as you go, get what you pay for predictable performance.

Spanner did not start out with the goal of becoming a NewSQL star. Spanner started as a BigTable clone, with a distributed file system metaphor. Then Spanner evolved into a global ProtocolBuf container. Eventually Spanner was pushed by internal Google customers to become more relational and application programmer friendly.

If you can’t stay for the full show, Todd provides a useful summary of the video. But if you have the time, take the time to enjoy the presentation!

Searching Big Data’s Open Source Roots

Filed under: BigData,Hadoop,Lucene,LucidWorks,Mahout,Open Source,Solr — Patrick Durusau @ 1:56 pm

Searching Big Data’s Open Source Roots by Nicole Hemsoth.

Nicole talks to Grant Ingersoll, Chief Scientist at LucidWorks, about the open source roots of big data.

No technical insights but a nice piece to pass along to the c-suite. Investment in open source projects can pay rich dividends. So long as you don’t need them next quarter. 😉

And a snapshot of where we are now, which is on the brink of new tools and capabilities in search technologies.

Accountability = “unintended consequences”? [Benghazi Cables]

Filed under: Government,Government Data,Topic Maps,Transparency — Patrick Durusau @ 1:43 pm

House Oversight Committee Chairman Darrell Issa (R-Calif.) is reported by the Huffington Post to have released “sensitive but unclassified” State Department cables that contained the names of Libyans working within the United States. (Benghazi Consulate Attack: Darrell Issa Releases Raw Libya Cables, Obama Administration Cries Foul)

Acrobat Reader says there are 121 pages in:

State Department Cables – Benghazi, Libya (created last Friday morning)

Not sure what that means.

What does the State Department mean by “unintended consequences”?

Do they mean…

  • Libyan or U.S. nationals may be held accountable for crimes in the U.S. or other countries?
  • consequences for Libyans who are working against the interest of their fellow Libyans?
  • consequences for Libyans who are favoring their friends and families in Libya, at the expense of other Libyans?
  • consequences for Libyans currying favor with the U.S. State Department?

If there are “unintended consequences,” it may be they are being held accountable for their actions.

Being held accountable is probably the reason the State Department shuns transparency.

Both for themselves and others.

Would mapping the Benghazi cables bring the House Oversight Committee closer to holding someone accountable for that attack?

Boy Scout Expulsions – Oil Drop Semantics

Data on decades of Boy Scout expulsions released by Nathan Yau.

Nathan points to an interactive map, a searchable list and downloadable data from the Los Angeles Times, drawn from Boy Scouts of America records on people expelled from the Scouts over suspicions of sexual abuse.

The LA Times has done a great job with this data set (and the story) but it also illustrates a limitation in current data practices.

All of these cases occurred in jurisdictions with laws against sexual abuse of children.

If a local sheriff or district attorney reads about this database, how do they tie it into their databases?

Not as simple as saying “topic map,” if that’s what you were anticipating.

Among the issues that would need addressing:

  • Confidentiality – Law enforcement and courts have their own rules about sharing data.
  • Incompatible System Semantics – The typical problem that is encountered in business enterprises, writ large. Every jurisdiction is likely to have its own rules, semantics and files.
  • Incompatible Data Semantics – Assuming systems talk to each other, the content and its semantics will vary from one jurisdiction to another.
  • Subjects Evading Identification – The subjects (sorry!) in question are trying to avoid identification.

You could get funding for a conference of police administrators to discuss how to organize additional meetings to discuss potential avenues for data sharing and get the DHS to fund a large screen digital TV (not for the meeting, just to have one). Consultants could wax and whine about possible solutions if someday you decided on one.

I have a different suggestion: Grab your records guru and meet up with an overlapping or neighboring jurisdiction’s data guru and one of their guys. For lunch.

Bring note pads and sample records. Talk about how you share information between officers (that is, you and your counterpart). Let the data gurus talk about how they can share data.

Stick to practical questions: how do you share data, and what does your data mean now? Make no global decisions, award no medals for attending, etc.

Do that once or twice a month for six months. Write down what worked, what didn’t work (just as important). Each of you picks an additional partner. Share what you have learned.

The documenting and practice at information sharing will be the foundation for more formal information sharing systems. Systems based on documented sharing practices, not how administrators imagine sharing works.

Think of it as “oil drop semantics.”

Start small and increase only as more drops are added.

The goal isn’t a uniform semantic across law enforcement but understanding what is being said. That understanding can be mapped into a topic map or other information sharing strategy. But understanding comes first, mapping second.

New version of Get-Another-Label available

Filed under: Crowd Sourcing,Mechanical Turk,oDesk,Semantics — Patrick Durusau @ 8:49 am

New version of Get-Another-Label available by Panos Ipeirotis.

From the post:

I am often asked what type of technique I use for evaluating the quality of the workers on Mechanical Turk (or on oDesk, or …). Do I use gold tests? Do I use redundancy?

Well, the answer is that I use both. In fact, I use the code “Get-Another-Label” that I have developed together with my PhD students and a few other developers. The code is publicly available on Github.

We have updated the code recently, to add some useful functionality, such as the ability to pass (for evaluation purposes) the true answers for the different tasks, and get back answers about the quality of the estimates of the different algorithms.

Panos continues his series on the use of crowd sourcing.

Just a thought experiment at the moment but could semantic gaps between populations be “discovered” by use of crowd sourcing?

That is, create tasks that require “understanding” some implicit semantic and then collect the answers.

There would be no “incorrect” answers, only answers that reflect differing perceptions of the semantics of the task.

A way to get away from using small groups of college students for such research? (Nothing against small groups of college students but they best represent small groups of college students. May need a broader semantic range.)
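For the thought experiment, the raw material is just the answer distribution per task: if the workers are otherwise competent but split on a task, the split itself is the signal of a semantic gap rather than of worker error. A small sketch of surfacing such tasks (leaving out the worker-quality weighting that Get-Another-Label actually does):

    from collections import Counter

    def disagreement_rate(answers):
        """Share of workers who disagree with the plurality answer."""
        counts = Counter(answers)
        plurality = counts.most_common(1)[0][1]
        return 1.0 - plurality / len(answers)

    def semantic_gap_candidates(task_answers, threshold=0.4):
        """task_answers: {task description: [worker answers]}.
        Tasks where a large minority dissents are candidates for genuine
        population-level disagreement rather than labeling error."""
        scores = {task: disagreement_rate(a) for task, a in task_answers.items() if a}
        return sorted((t for t, s in scores.items() if s >= threshold),
                      key=lambda t: -scores[t])

    data = {
        "Is 'soda' the same drink as 'pop'?": ["yes", "yes", "no", "no", "yes", "no"],
        "Is 2 + 2 equal to 4?":               ["yes", "yes", "yes", "yes", "yes", "yes"],
    }
    print(semantic_gap_candidates(data))   # only the soda/pop task surfaces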

REST::Neo4p – A Perl “OGM”

Filed under: Graphs,Neo4j,Perl — Patrick Durusau @ 4:13 am

REST::Neo4p – A Perl “OGM”

From the post:

This is a guest post by Mark A. Jensen, a DC area bioinformatics scientist. Thanks a lot Mark for writing the impressive Neo4j Perl library and taking the time to documenting it thoroughly.

You might call REST::Neo4p an “Object-Graph Mapping”. It uses the Neo4j REST API at its foundation to interact with Neo4j, but what makes REST::Neo4p “Perly” is the object oriented approach. Creating node, relationship, or index objects is equivalent to looking them up in the graph database and, for nodes and relationships, creating them if they are not present. Updating the objects by setting properties or relationships results in the same actions in the database, and returns errors when these actions are proscribed by Neo4j. At the same time, object creation attempts to be as lazy as possible, so that only the portion of the database you are working with is represented in memory.

The idea is that working with the database is accomplished by using Perl5 objects in a Perl person’s favorite way. Despite the modules’ “REST” namespace, the developer should almost never need to deal with the actual REST calls or the building of URLs herself. The design uses the amazingly complete and consistent self-describing information in the Neo4j REST API responses to keep URLs under the hood.

A start on a Perl interface to Neo4j.

I am sure comments, testing and suggestions are welcome.
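REST::Neo4p sits on top of the plain Neo4j REST API, so it may help to see what the underlying traffic looks like. Below is a rough Python equivalent of "create two nodes and relate them"; the /db/data/... paths follow my recollection of the Neo4j 1.x REST API, so verify them against your server's API root (and use the Perl module for anything real).

    import json
    import urllib.request

    BASE = "http://localhost:7474/db/data"   # assumed default Neo4j 1.x REST root

    def post_json(url, payload):
        """POST a JSON payload and return the decoded JSON response."""
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))

    # Creating a node is a POST of its properties; the reply carries the node's URI.
    alice = post_json(BASE + "/node", {"name": "Alice"})
    bob = post_json(BASE + "/node", {"name": "Bob"})

    # Relating them is a POST to the first node's relationship collection.
    post_json(alice["self"] + "/relationships", {"to": bob["self"], "type": "KNOWS"})

    print("created", alice["self"], "-[:KNOWS]->", bob["self"])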

A Strong ARM for Big Data [Semantics Not Included]

Filed under: BigData,HPC,Semantics — Patrick Durusau @ 4:00 am

A Strong ARM for Big Data (Datanami – Sponsored Content by Calxeda)

From the post:

Burgeoning data growth is one of the foremost challenges facing IT and businesses today. Multiple analyst groups, including Gartner, have reported that information volume is growing at a minimum rate of 59 percent annually. At the same time, companies increasingly are mining this data for invaluable business insight that can give them a competitive advantage.

The challenge the industry struggles with is figuring out how to build cost-effective infrastructures so data scientists can derive these insights for their organizations to make timely, more intelligent decisions. As data volumes continue their explosive growth and algorithms to analyze and visualize that data become more optimized, something must give.

Past approaches that primarily relied on using faster, larger systems just are not able to keep pace. There is a need to scale-out, instead of scaling-up, to help in managing and understanding Big Data. As a result, this has focused new attention on different technologies such as in-memory databases, I/O virtualization, high-speed interconnects, and software frameworks such as Hadoop.

To take full advantage of these network and software innovations requires re-examining strategies for compute hardware. For maximum performance, a well-balanced infrastructure based on densely packed, power-efficient processors coupled with fast network interconnects is needed. This approach will help unlock applications and open new opportunities in business and high performance computing (HPC). (emphasis added)

I like powerful hardware as much as the next person. Either humming within earshot or making the local grid blink when it comes online.

Still, hardware/software tools for big data need to come with the warning label: “Semantics not included.”

To soften the disappointment when big data appliances and/or software arrive and the bottom line stays the same, or gets worse.

Effective use of big data, the kind that improves your bottom line, requires semantics: your semantics.

HBase at Hortonworks: An Update [Features, Consumer Side?]

Filed under: Hadoop,HBase,Hortonworks — Patrick Durusau @ 3:37 am

HBase at Hortonworks: An Update by Devaraj Das.

From the post:

HBase is a critical component of the Apache Hadoop ecosystem and a core component of the Hortonworks Data Platform. HBase enables a host of low latency Hadoop use-cases; As a publishing platform, HBase exposes data refined in Hadoop to outside systems; As an online column store, HBase supports the blending of random access data read/write with application workloads whose data is directly accessible to Hadoop MapReduce.

The HBase community is moving forward aggressively, improving HBase in many ways. We are in the process of integrating HBase 0.94 into our upcoming HDP 1.1 refresh. This “minor upgrade” will include a lot of bug fixes (nearly 200 in number) and quite a few performance improvements and will be wire compatible with HBase 0.92 (in HDP 1.0).

The post concludes:

All of the above is just what we’ve been doing recently and Hortonworkers are only a small fraction of the HBase contributor base. When one factors in all the great contributions coming from across the Apache HBase community, we predict 2013 is going to be a great year for HBase. HBase is maturing fast, becoming both more operationally reliable and more feature rich.

When a technical infrastructure becomes “feature rich,” can “features” for consumer services/interfaces be far behind?

Delivering location-based coupons for lattes on a cellphone may seem like a “feature.” But we can do that with a man wearing a sandwich board.

A “feature” for the consumer needs to be more than digital imitation of an analog capability.

What consumer “feature(s)” would you offer based on new features in HBase?

October 21, 2012

7 John McCarthy Papers in 7 weeks – Prologue

Filed under: Artificial Intelligence,CS Lectures,Lisp — Patrick Durusau @ 6:28 pm

7 John McCarthy Papers in 7 weeks – Prologue by Carin Meier.

From the post:

In the spirit of Seven Languages in Seven Weeks, I have decided to embark on a quest. But instead of focusing on expanding my mindset with different programming languages, I am focusing on trying to get into the mindset of John McCarthy, father of LISP and AI, by reading and thinking about seven of his papers.

See Carin’s blog for progress so far.

I first saw this at John D. Cook’s The Endeavour.

How would you react to something similar for topic maps?
