Another Word For It
Patrick Durusau on Topic Maps and Semantic Diversity

October 5, 2012

Journal of Experimental Psychology: Applied

Filed under: Interface Research/Design,Language,Psychology — Patrick Durusau @ 2:16 pm

Journal of Experimental Psychology: Applied

From the website:

The mission of the Journal of Experimental Psychology: Applied® is to publish original empirical investigations in experimental psychology that bridge practically oriented problems and psychological theory.

The journal also publishes research aimed at developing and testing of models of cognitive processing or behavior in applied situations, including laboratory and field settings. Occasionally, review articles are considered for publication if they contribute significantly to important topics within applied experimental psychology.

Areas of interest include applications of perception, attention, memory, decision making, reasoning, information processing, problem solving, learning, and skill acquisition. Settings may be industrial (such as human–computer interface design), academic (such as intelligent computer-aided instruction), forensic (such as eyewitness memory), or consumer oriented (such as product instructions).

I browsed several recent issues of the Journal of Experimental Psychology: Applied while researching the Todd Rogers post. Fascinating stuff and some of it will find its way into interfaces or other more “practical” aspects of computer science.

Something to temper the focus on computer-facing work.

No computer has ever originated a purchase order or contract. Might not hurt to know something about the entities that do.

Dodging and Topic Maps: Can Run but Can’t Hide

Filed under: Debate,Language — Patrick Durusau @ 2:05 pm

We have all been angry during televised debates when the “other” candidate slips by difficult questions.

To the partisan viewer it looks like they are lying and the moderator is in cahoots with them. They never get called down for failing to answer the question.

How come?

Alix Spiegel had a great piece on NPR called: How Politicians Get Away With Dodging The Question that may point in the right direction.

Research by Todd Rogers (homepage) of the Harvard Kennedy School of Government demonstrates what is called a “pivot,” a point in an answer that starts to address the question and then switches to something the candidate wanted to say.

It is reported that pivots were used about 70% of the time in one set of presidential debates.

In a similar vein, see: The Art of the Dodge by Peter Saalfield in Harvard Magazine, March-April 2012. (Watch for the bad link to the Journal of Experimental Psychology; it should point to the Journal of Experimental Psychology: Applied.)

Or the original work:

The artful dodger: Answering the wrong question the right way. Rogers, Todd; Norton, Michael I. Journal of Experimental Psychology: Applied, Vol 17(2), Jun 2011, 139-147. doi: 10.1037/a0023439

Abstract:

What happens when speakers try to “dodge” a question they would rather not answer by answering a different question? In 4 studies, we show that listeners can fail to detect dodges when speakers answer similar—but objectively incorrect—questions (the “artful dodge”), a detection failure that goes hand-in-hand with a failure to rate dodgers more negatively. We propose that dodges go undetected because listeners’ attention is not usually directed toward a goal of dodge detection (i.e., Is this person answering the question?) but rather toward a goal of social evaluation (i.e., Do I like this person?). Listeners were not blind to all dodge attempts, however. Dodge detection increased when listeners’ attention was diverted from social goals toward determining the relevance of the speaker’s answers (Study 1), when speakers answered a question egregiously dissimilar to the one asked (Study 2), and when listeners’ attention was directed to the question asked by keeping it visible during speakers’ answers (Study 4). We also examined the interpersonal consequences of dodge attempts: When listeners were guided to detect dodges, they rated speakers more negatively (Study 2), and listeners rated speakers who answered a similar question in a fluent manner more positively than speakers who answered the actual question but disfluently (Study 3). These results add to the literatures on both Gricean conversational norms and goal-directed attention. We discuss the practical implications of our findings in the contexts of interpersonal communication and public debates. (PsycINFO Database Record (c) 2012 APA, all rights reserved)

Imagine an instant replay system for debates where pivot points are identified and additional data is mapped in.

Candidates would still try to dodge, but perhaps less successfully.

What’s so cool about elasticsearch?

Filed under: ElasticSearch,Search Engines — Patrick Durusau @ 6:04 am

What’s so cool about elasticsearch? by Luca Cavanna.

From the post:

Whenever there’s a new product out there and you start using it, suggest it to customers or colleagues, you need to be prepared to answer this question: “Why should I use it?”. Well, the answer could be as simple as “Because it’s cool!”, which of course is the case with elasticsearch, but then at some point you may need to explain why. I recently had to answer the question, “So what’s so cool about elasticsearch?”, that’s why I thought it might be worthwhile sharing my own answer in this blog.

It’s not a staid comparison piece but a partisan, “this is cool” piece.

You will find it both entertaining and informative. Good weekend reading.

It will give you something to have a strong opinion about (one way or the other) next Monday!

Parallella: A Supercomputer For Everyone

Filed under: Parallelism,Supercomputing — Patrick Durusau @ 5:40 am

Parallella: A Supercomputer For Everyone

For a $99 pledge you help make the Parallella computer a reality (and get one when produced).

  • Dual-core ARM A9 CPU
  • Epiphany Multicore Accelerator (16 or 64 cores)
  • 1GB RAM
  • MicroSD Card
  • USB 2.0 (two)
  • Two general purpose expansion connectors
  • Ethernet 10/100/1000
  • HDMI connection
  • Ships with Ubuntu OS
  • Ships with free open source Epiphany development tools that include C compiler, multicore debugger, Eclipse IDE, OpenCL SDK/compiler, and run time libraries.
  • Dimensions are 3.4” x 2.1”

Once completed, the Parallella computer should deliver up to 45 GHz of equivalent CPU performance on a board the size of a credit card while consuming only 5 Watts under typical work loads. Counting GHz, this is more horsepower than a high end server costing thousands of dollars and consuming 400W.

$99 to take a flyer on changing the fabric of supercomputing?

I’ll take that chance. How about you?

PS: Higher pledge amounts carry extra benefits, such as projected delivery of a beta version by December of 2012 ($5,000). Got a hard core geek on your holiday shopping list?

PPS: I first saw this at: Adapteva Launches Crowd-Source Funding for Its Floating Point Accelerator by Michael Feldman (HPC).

October 4, 2012

Could Cassandra be the first breakout NoSQL database?

Filed under: Cassandra,NoSQL — Patrick Durusau @ 4:58 pm

Could Cassandra be the first breakout NoSQL database? by Chris Mayer.

From the post:

Years of misunderstanding haven’t been kind to the NoSQL database. Aside from the confusing name (generally understood to mean ‘not only SQL’), there’s always been an air of reluctance from the enterprise world to move away from Oracle’s steady relational database, until there was a definite need to switch from tables to documents

The emergence of Big Data in the past few years has been the kickstart NoSQL distributors needed. Relational databases cannot cope with the sheer amount of data coming in and can’t provide the immediacy large-scale enterprises need to obtain information.

Open source offerings have been lurking in the background for a while, with the highly-tunable Apache Cassandra becoming a community favourite quickly. Emerging from the incubator in October 2011, Cassandra’s beauty lies in its flexible schema, its hybrid data model (lying somewhere between a key-value and tabular database) and also through its high availability. Being from the Apache Software Foundation, there’s also intrinsic links to the big data ‘kernel’ Apache Hadoop, and search server Apache Solr giving users an extra dimension to their data processing and storage.

Using NoSQL on cheap servers for processing and querying data is proving an enticing option for companies of all sizes, especially in combination with MapReduce technology to crunch it all.

One company that appears to be leading this data-driven charge is DataStax, who this week announced the completion of a $25 million C round of funding. Having already permeated the environments of some large companies (notably Netflix), the San Mateo startup are making big noises about their enterprise platform, melding the worlds of Cassandra and Hadoop together. Netflix is a client worth crowing about, with DataStax’s enterprise option being used as one of their primary data stores

Chris mentions some other potential players, MongoDB comes to mind, along with the Hadoop crowd.

I take the move from tables to documents as a symptom of a deeper issue.

Relational databases rely on normalization to achieve their performance and reliability. So what happens if data is too large or coming too quickly to be normalized?

Relational databases remain the weapon of choice for normalized data but that doesn’t mean they work well with “dirty” data.

“Dirty data,” as opposed to “documents,” seems to catch the real shift for which NoSQL solutions are better adapted.

Your results are only as good as the data, but you know that up front, not when you realize your “normalized” data wasn’t.

That has to be a sinking feeling.

YARN Meetup at Hortonworks on Friday, Oct 12

Filed under: Hadoop,Hadoop YARN,Hortonworks — Patrick Durusau @ 4:35 pm

YARN Meetup at Hortonworks on Friday, Oct 12 by Russell Jurney.

From the post:

Hortonworks is hosting an Apache YARN Meetup on Friday, Oct 12, to solicit feedback on the YARN APIs. We’ve talked about YARN before in a four-part series on YARN, parts one, two, three and four.

YARN, or “Apache Hadoop NextGen MapReduce,” has come a long way this year. It is now a full-fledged sub-project of Apache Hadoop and has already been deployed on a massive 2,000 node cluster at Yahoo. Many projects, both open source and otherwise, are porting to work on YARN, such as Storm and S4, and many of them are in fairly advanced stages. We also have several individuals implementing one-off or ad-hoc applications on YARN.

This meetup is a good time for YARN developers to catch up and talk more about YARN, its current status, and its medium-term and long-term roadmap.

OK, it’s probably too late to get cheap tickets but if you are in New York on the 12th of October, take advantage of the opportunity!

And please blog about the meeting, with a note to yours truly! I will post a link to your posting.

Adapting MapReduce for realtime apps

Filed under: Hadoop,MapReduce — Patrick Durusau @ 4:28 pm

Adapting MapReduce for realtime apps

From the post:

As much as MapReduce is popular, so much is the discussion to make it even better from a generalized approach to higher performance oriented approach. We will be discussing a few frameworks which have tried to adapt MapReduce further for higher performance orientation.

The first post in this series will discuss AMREF, an Adaptive MapReduce Framework designed for real time data intensive applications. (published in the paper Fan Zhang, Junwei Cao, Xiaolong Song, Hong Cai, Cheng Wu: AMREF: An Adaptive MapReduce Framework for Real Time Applications. GCC 2010: 157-162.)

If you are interested in squeezing more performance out of MapReduce, this looks like a good starting place.

GATE, NLTK: Basic components of Machine Learning (ML) System

Filed under: Machine Learning,Natural Language Processing,NLTK — Patrick Durusau @ 4:03 pm

GATE, NLTK: Basic components of Machine Learning (ML) System by Krishna Prasad.

From the post:

I am currently building a Machine Learning system. In this blog I want to capture the elements of a machine learning system.

My definition of a Machine Learning System is to take voice or text inputs from a user and provide relevant information. And over a period of time, learn the user behavior and provide him with better information. Let us hold on to this comment and dissect each element.

In the below example, we will consider only text input. Let us also assume that the text input will be a freeflowing English text.

  • As a 1st step, when someone enters a freeflowing text, we need to understand what is the noun, what is the verb, what is the subject and what is the predicate. For doing this we need a Parts of Speech analyzer (POS), for example “I want a Phone”. One of the components of Natural Language Processing (NLP) is POS.
  • For associating a relationship between a noun and a number, like “Phone greater than 20 dollars”, we need to run the sentence through a rule engine. The terminology used for this is Semantic Rule Engine
  • The 3rd aspect is the Ontology, wherein each noun needs to translate to a specific product or a place. For example, if someone says “I want a Bike” it should translate as “I want a Bicycle” and it should interpret that the company that manufactures a bicycle is BSA, or a Trac. We typically need to build a Product Ontology
  • Finally, if you have the buying patterns of a user and his friends in the system, we need a Recommendation Engine to give the user a proper recommendation

What would you add (or take away) to make the outlined system suitable as a topic map authoring assistant?

Feel free to add more specific requirements/capabilities.
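For the POS step in the first bullet above, a minimal sketch using NLTK might look like the following. It assumes NLTK is installed and its tokenizer/tagger models have been downloaded (e.g. via nltk.download()); the sentence is the one from the quoted post.

```python
# Minimal POS-tagging sketch with NLTK (assumes the tokenizer and tagger
# models have already been downloaded via nltk.download()).
import nltk

sentence = "I want a Phone"            # example sentence from the post
tokens = nltk.word_tokenize(sentence)  # ['I', 'want', 'a', 'Phone']
tagged = nltk.pos_tag(tokens)          # [(word, POS tag), ...]

print(tagged)
# Roughly: [('I', 'PRP'), ('want', 'VBP'), ('a', 'DT'), ('Phone', 'NN')]
# The noun ('Phone') is what the product ontology step would key on.
```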

I first saw this at DZone.

NASA Tournament Lab to Launch Big Data Challenge Series for U.S. Government Agencies

Filed under: BigData,Challenges,Contest — Patrick Durusau @ 3:34 pm

Big Data Challenge Series: NASA Tournament Lab to Launch Big Data Challenge Series for U.S. Government Agencies

Contest ends: Nov 12, 2012 05:00 PM EST

From the webpage:

NASA, the National Science Foundation (NSF), and the Department of Energy’s Office of Science, announced Oct. 3, 2012, the launch of the Big Data Challenge – a series of ideation competitions hosted through the NASA Tournament Lab (NTL). The Big Data Challenge series will apply the process of Open Innovation (OI) to the goal of conceptualizing new and novel approaches to utilizing “Big Data” information sets residing in various agency silos while remaining consistent with individual United States agencies missions related to the field of health, energy and earth sciences.

Competitors will be tasked with imagining analytical techniques and software tools that utilize Big Data from discrete government information domains and then describing how they may be shared as universal, cross-agency solutions that transcend the limitations of individual silos. The competition will be run by the NASA Tournament Lab (NTL), a collaboration between Harvard University and TopCoder, a competitive community of digital creators.

“The ability to create new applications and algorithms using diverse data sets is a key element of the NTL,” said Jason Crusan, Director of Advanced Exploration Systems at NASA’s Human Exploration and Operations Mission Directorate. “NASA is excited to see the results that open innovation can provide to these big data applications.”

You have to go to studio.topcoder.com and have a TopCoder account (but you have that already).

More than beer money and in time for the holiday season. Something to think about.

Learn to Speak DBA Slang

Filed under: Humor — Patrick Durusau @ 2:59 pm

Learn to Speak DBA Slang by Brent Ozar.

Too amusing to not pass along.

Given my interest in documentation, my favorite is:

Updating the last step in the Disaster Recovery Plan:…..

(see Brent’s post for the definition)

PostgreSQL Database Modeler

Filed under: Database,Modeling,PostgreSQL — Patrick Durusau @ 2:22 pm

PostgreSQL Database Modeler

From the readme file at github:

PostgreSQL Database Modeler, or simply pgModeler, is an open source tool for modeling databases that merges the classical concepts of entity-relationship diagrams with specific features that only PostgreSQL implements. pgModeler translates the models created by the user to SQL code and applies them onto database clusters from version 8.0 to 9.1.

Other modeling tools you have or are likely to encounter writing topic maps?

When the output of diverse modeling tools or diverse output from the same modeling tool needs semantic reconciliation, I would turn to topic maps.

I first saw this at DZone.

R for Business Analytics

Filed under: Analytics,R — Patrick Durusau @ 2:11 pm

R for Business Analytics by A. Ohri.

I haven’t seen this volume, yet, but have read and cited Ohri’s blog, Decision Stats, often enough to have high expectations!

From the publisher’s blurb:

R for Business Analytics looks at some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages. With this information the reader can select the packages that can help process the analytical tasks with minimum effort and maximum usefulness. The use of Graphical User Interfaces (GUI) is emphasized in this book to further cut down and bend the famous learning curve in learning R. This book is aimed to help you kick-start with analytics including chapters on data visualization, code examples on web analytics and social media analytics, clustering, regression models, text mining, data mining models and forecasting. The book tries to expose the reader to a breadth of business analytics topics without burying the user in needless depth. The included references and links allow the reader to pursue business analytics topics.

This book is aimed at business analysts with basic programming skills for using R for Business Analytics. Note the scope of the book is neither statistical theory nor graduate level research for statistics, but rather it is for business analytics practitioners. Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses. Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses. Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods. To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data. The R statistical software is the fastest growing analytics platform in the world, and is established in both academia and corporations for robustness, reliability and accuracy.

When you have seen it, please check back and post your comments.

Thanks!

Google Maps: A Prelude to Broader Predictive Search

Filed under: Interface Research/Design,Mapping,Maps — Patrick Durusau @ 2:01 pm

Google Maps: A Prelude to Broader Predictive Search by Stephen E. Arnold.

From the post:

Short honk. Google’s MoreThanaMap subsite signals an escalation in the map wars. You will want to review the information at www.morethanamap.com. The subsite presents the new look of Google’s more important features and services. The demonstrations are front and center. The focus is on visualization of mashed up data; that is, compound displays. The real time emphasis is clear as well. The links point to developers and another “challenge.” It is clear that Google wants to make it desirable for programmers and other technically savvy individuals to take advantage of Google’s mapping capabilities. After a few clicks, Google has done a good job of making clear that findability and information access shift a map from a location service to a new interface.

You really need to see the demos to appreciate what can be done with the Google Map API.

Although, I remember the flight from Atlanta to Gatwick (London) as being longer than it seems in the demo. 😉

Scalable Machine Learning with Hadoop (most of the time)

Filed under: Hadoop,Machine Learning,Mahout — Patrick Durusau @ 1:44 pm

Scalable Machine Learning with Hadoop (most of the time) by Grant Ingersoll. (slides)

Grant’s slides from a presentation on machine learning with Hadoop in Taiwan!

Not quite like being there but still useful.

And a reminder that I need to get a copy of Taming Text!

October 3, 2012

Every Lost episode visualized and recreated

Filed under: Entertainment,Marketing — Patrick Durusau @ 8:43 pm

Every Lost episode visualized and recreated by Nathan Yau.

From the post:

Santiago Ortiz visualized every episode of the show in the interactive Lostalgic. It’s a set of four views that shows character occurrences and relationships and the lines they said during various parts of each episode.

The first view, shown above, is a bar chart vertically arranged by time, where each row represents an act. A profile picture is shown whenever the corresponding character says something. The next two views, the network graph and co-occurrence matrix, show interactions between characters, and finally, if you want to relive it all over again, you can choose the reenactment, and the animation will cycle through the characters and scripts.

I have a confession to make before going any further: I have never seen an episode of “Lost.” You have been warned.

Despite my ignorance of the show, this appears to be a truly amazing project.

I am sure there are fans of other TV shows who would volunteer to do something similar for their favorite show.

Of course, I would like to see them use topic maps, if for no other reason than to enable decentralized work flow and diverse semantic viewpoints of the same content.

Can you imagine an American Bandstand project on Github?

What TV series would you spend this sort of time on?

CDH4.1 Now Released!

Filed under: Cloudera,Flume,Hadoop,HBase,HDFS,Hive,Pig — Patrick Durusau @ 8:28 pm

CDH4.1 Now Released! by Charles Zedlewski.

From the post:

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:

  • Quorum based storage – Quorum-based Storage for HDFS provides the ability for HDFS to store its own NameNode edit logs, allowing you to run a highly available NameNode without external storage or custom fencing.
  • Hive security and concurrency – we’ve fixed some long standing issues with running Hive. With CDH4.1, it is now possible to run a shared Hive instance where users submit queries using Kerberos authentication. In addition this new Hive server supports multiple users submitting queries at the same time.
  • Support for DataFu – the LinkedIn data science team was kind enough to open source their library of Pig UDFs that make it easier to perform common jobs like sessionization or set operations. Big thanks to the LinkedIn team!!!
  • Oozie workflow builder – since we added Oozie to CDH more than two years ago, we have often had requests to make it easier to develop Oozie workflows. The newly enhanced job designer in Hue enables users to use a visual tool to build and run Oozie workflows.
  • FlumeNG improvements – since its release, FlumeNG has become the backbone for some exciting data collection projects, in some cases collecting as much as 20TB of new event data per day. In CDH4.1 we added an HBase sink as well as metrics for monitoring as well as a number of performance improvements.
  • Various performance improvements – CDH4.1 users should experience a boost in their MapReduce performance from CDH4.0.
  • Various security improvements – CDH4.1 enables users to configure the system to encrypt data in flight during the shuffle phase. CDH now also applies Hadoop security to users who access the filesystem via a FUSE mount.

It’s releases like this that make me wish I spent more time writing documentation for software, if only to try out all the cool features with no real goal other than trying them out.

Enjoy!

Hunting Trolls with Neo4j!

Filed under: Citation Analysis,Citation Indexing,Graphs,Neo4j — Patrick Durusau @ 8:19 pm

Hunting Trolls with Neo4j! by Max De Marzi.

Max quotes from a video found by Alison Sparrow:

What we tried to do with it, is bypass any sort of keyword processing in order to find similar patents. The reason we’ve done this is to avoid the problems encountered by other systems that rely on natural language processing or semantic analysis simply because patents are built to avoid detection by similar keywords…we use network topology (specifically citation network topology) to mine the US patent database in order to predict similar documents.
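One simple way to get a feel for “similar by citation topology,” independent of the Neo4j demo itself, is bibliographic coupling: patents that cite many of the same earlier patents are scored as related. A toy sketch with made-up patent numbers (not the algorithm used in the video):

```python
# Toy sketch of bibliographic coupling: rank patents by how many citations
# they share with a query patent. Patent numbers and citations are made up;
# the demo in the video uses Neo4j's own traversal and scoring.

citations = {
    "US1111111": {"US0000001", "US0000002", "US0000003"},
    "US2222222": {"US0000002", "US0000003", "US0000004"},
    "US3333333": {"US0000009"},
}

def coupling_score(a, b):
    """Jaccard overlap of two patents' citation sets."""
    cited_a, cited_b = citations[a], citations[b]
    union = cited_a | cited_b
    return len(cited_a & cited_b) / len(union) if union else 0.0

query = "US1111111"
ranked = sorted((p for p in citations if p != query),
                key=lambda p: coupling_score(query, p), reverse=True)
for p in ranked:
    print(p, round(coupling_score(query, p), 2))
```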

The “note pad” in the demonstration would be more useful if it produced a topic map that merged results from other searchers.

Auto-magically creating associations based on data extracted from the patent database would be a nice feature as well.

Maybe I should get some sticky pads printed up with the logo: “You could be using a topic map!” 😉

(Let me know how many sticky pads you would like and I will get a quote for them.)

At or Near Final Calls on W3C Provenance

Filed under: HTML,Provenance — Patrick Durusau @ 7:48 pm

I saw a notice today about the ontology part of the W3C work on provenance. Some of it is at final call or nearly so. If you are interested, see:

  • PROV-DM, the PROV data model for provenance;
  • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model;
  • PROV-N, a notation for provenance aimed at human consumption;
  • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of PROV to RDF;
  • PROV-AQ, the mechanisms for accessing and querying provenance;
  • PROV-PRIMER, a primer for the PROV data model.

My first impression is the provenance work is more complex than HTML 3.2 and therefore unlikely to see widespread adoption. (You may want to bookmark that link. It isn’t listed on the HTML page at the W3C, even under obsolete versions.)

Big Learning with Graphs by Joey Gonzalez (Video Lecture)

Filed under: GraphLab,Graphs,Machine Learning — Patrick Durusau @ 5:00 am

Big Learning with Graphs by Joey Gonzalez by Marti Hearst.

From the post:

For those of you who follow the latest developments in the Big Data technology stack, you’ll know that GraphLab is the hottest technology for processing huge graphs in fast time. We got to hear the algorithms behind GraphLab 2 even before the OSDI crowd! Check it out:

Slides.

GraphLab homepage.

For when you want to move up to parallel graph processing.

News Reporting, Not Just DHS Fusion Centers, Ineffectual

Filed under: Intelligence,News,Security — Patrick Durusau @ 4:20 am

A report by the United States Senate, PERMANENT SUBCOMMITTEE ON INVESTIGATIONS, Committee on Homeland Security and Governmental Affairs, FEDERAL SUPPORT FOR AND INVOLVEMENT IN STATE AND LOCAL FUSION CENTERS (link to page with actual report), was described this way in the New York Times coverage:

One of the nation’s biggest domestic counterterrorism programs has failed to provide virtually any useful intelligence, according to Congressional investigators.

Their scathing report, to be released Wednesday, looked at problems in regional intelligence-gathering offices known as “fusion centers” that are financed by the Department of Homeland Security and created jointly with state and local law enforcement agencies.

The report found that the centers “forwarded intelligence of uneven quality — oftentimes shoddy, rarely timely, sometimes endangering citizens’ civil liberties and Privacy Act protections, occasionally taken from already published public sources, and more often than not unrelated to terrorism.”

The investigators reviewed 610 reports produced by the centers over 13 months in 2009 and 2010. Of these, the report said, 188 were never published for use within the Homeland Security Department or other intelligence agencies. Hundreds of draft reports sat for months, awaiting review by homeland security officials, making much of their information obsolete. And some of the reports appeared to be based on previously published information or facts that had long since been reported through the Federal Bureau of Investigation.

What is remarkable about a link to a page with the actual report?

After reading the New York Times article, I looked for a link in the article to the report. Nada. Zip. The null string. No link.

Searching over news reports from other major news outlets, same result.

Searching the US Senate, PERMANENT SUBCOMMITTEE ON INVESTIGATIONS website, at least as of 5:00 AM Eastern Standard time on October 3, 2012, fails to produce the report.

We aren’t lacking the “semantic web.”

There is a lack of linking to information sources. Links empower the reader to make their own judgements.

I expect “shoddy reporting” from the Department of Homeland Security. I don’t expect it from the New York Times. Or other major news outlets.

The report will be a “brief flash in the pan.” The news cycle will move onto the latest political gaffe or fraud, just as DHS folk move onto other ineffectual activities.

It would be nice to link up names, events, etc., from the report to past and future mentions of the same people and events.

Imagine Senator Levin asking: “This is your fifth appearance on questionable spending of government funds, in four separate agencies, under two different administrations?”

Accountability and transparency, a topic maps double shot.

October 2, 2012

Twitter Results Recipe with Gephi Garnish

Filed under: Gephi,Google Refine,Graphics,Tweets — Patrick Durusau @ 7:23 pm

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi by Tony Hirst.

From the post:

How can we get a quick snapshot of who’s talking to whom on Twitter in the context of a particular hashtag?

What follows is a detailed recipe with the answer to that question.
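Tony’s recipe does the heavy lifting with Google Refine. As a rough, hand-rolled alternative sketch, you can also pull the @-mentions out of tweet text yourself and write an edge list Gephi will import directly. The tweets below are hypothetical stand-ins; in practice they would come from a Twitter search on the hashtag of interest.

```python
# Rough sketch: turn tweets into a "who mentions whom" edge list for Gephi.
# The tweets are hypothetical; real ones would come from a Twitter search
# on the hashtag being studied.
import csv
import re

tweets = [
    {"user": "alice", "text": "@bob have you seen this? #neo4j"},
    {"user": "bob",   "text": "@alice @carol yes, great stuff #neo4j"},
]

mention = re.compile(r"@(\w+)")

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])  # edge-list headers Gephi recognizes
    for t in tweets:
        for target in mention.findall(t["text"]):
            writer.writerow([t["user"], target])
```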

JSONiq

Filed under: JSON,JSONiq,XQuery — Patrick Durusau @ 7:18 pm

JSONiq: The JSON Query Language

From the webpage:

JSONiq extends XQuery, a mature W3C standard, with native JSON support. Like XQuery and SQL, JSONiq is declarative: Expressions can nest with full composability.

Project, Filter, Join, Group… Like SQL, JSONiq can do all that. And it has many more features inherited from XQuery. JSONiq also inherits all XQuery builtin functions: date times, string manipulation, regular expressions, and more.

JSONiq is an expressive and highly optimizable language to query and update NoSQL stores. It enables developers to leverage the same productive high-level language across a variety of NoSQL products.

This came in over the nosql-discuss mailing list a day or so ago.

Sounds promising. Any early comments?

Neo4j 1.8 Release – Fluent Graph Literacy

Filed under: Cypher,Graphs,Neo4j — Patrick Durusau @ 7:08 pm

Neo4j 1.8 Release – Fluent Graph Literacy by Andreas Kollegger.

From the post:

Available immediately, Neo4j 1.8 offers a delightful experience for reading and writing graph data with the simple expressiveness of the Cypher language. Whether you’re just discovering the social power of Facebook’s Open Graph or are building your own Knowledge Graph for Master Data Management, speaking in graph is easy with Cypher. It is the key to making sense of data with Neo4j.

Another major milestone in Neo4j development!
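If you want to try Cypher without writing any Java, the 1.8 server also exposes it over the REST API. A minimal sketch, assuming a local server on the default port (the query is just an illustrative “show me a few nodes”):

```python
# Minimal sketch: send a Cypher query to a local Neo4j server through the
# REST endpoint. The server address and the example query are assumptions.
import json
import urllib.request

payload = {"query": "START n=node(*) RETURN n LIMIT 5"}
req = urllib.request.Request(
    "http://localhost:7474/db/data/cypher",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["columns"])
for row in result["data"]:
    print(row)
```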

Tapping the Data Deluge with R

Filed under: Data Mining,R — Patrick Durusau @ 4:31 pm

Tapping the Data Deluge with R by Jeffrey Breen.

Jeffrey points to slides and other resources for a presentation he made on accessing standard and not so standard sources of data with R.

HAcid: A lightweight transaction system for HBase

Filed under: HBase — Patrick Durusau @ 4:22 pm

HAcid: A lightweight transaction system for HBase

From the post:

HAcid is a client library that applications can use for operating multi-row transactions in HBase. Seems to be motivated by Google’s Percolator.

Link to the original paper

Apache Stanbol graduates to Top-Level Project

Filed under: Semantics,Stanbol — Patrick Durusau @ 4:15 pm

Apache Stanbol graduates to Top-Level Project

From the post:

The Apache Software Foundation (ASF) has announced that Apache Stanbol has graduated from project incubation. Stanbol is an open source Java stack designed to interface with a content management system (CMS) to enhance it with semantic information. With the elevation to a Top-Level Project, the ASF recognises that the project’s community has been “well-governed” according to the foundation’s principles and follows “The Apache Way” for running a project.

Stanbol is a modular collection of reasoning engines, content enhancers and components to manage rules and metadata for content fed into the framework, all wrapped with a RESTful API and orchestrated within an Apache Felix OSGi container. A CMS adapter allows the system to connect to content management systems from which it can extract data to use in evaluating and developing rules and annotations.

The RESTful API can then be used to provide semantic information for content from a different source based upon information the server has previously analysed. Stanbol is more of a collection of reusable components than a complete solution for semantic searching, however. It is designed to work alongside CMS systems and existing search software.

I suppose too much *nix experience has made me suspicious of “complete solutions” for anything. Components, particularly interchangeable ones, seem a lot more robust.
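If you want to poke at it, the enhancer is exposed over HTTP. A minimal sketch, assuming a local Stanbol launcher on its default port with the /enhancer endpoint enabled:

```python
# Minimal sketch: send plain text to a locally running Stanbol enhancer and
# read back the enhancements as JSON. The address, port and endpoint are
# the defaults assumed for the stock launcher.
import json
import urllib.request

text = b"Apache Stanbol graduated to a top-level project at the ASF."
req = urllib.request.Request(
    "http://localhost:8080/enhancer",
    data=text,
    headers={"Content-Type": "text/plain", "Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    enhancements = json.load(resp)

print(json.dumps(enhancements, indent=2)[:500])  # peek at the annotations
```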

Got big JSON? BigQuery expands data import for large scale web apps

Filed under: Big Query,BigData,Google BigQuery,JSON — Patrick Durusau @ 4:08 pm

Got big JSON? BigQuery expands data import for large scale web apps by Ryan Boyd, Developer Advocate.

From the post:

JSON is the data format of the web. JSON is used to power most modern websites, is a native format for many NoSQL databases hosting top web applications, and provides the primary data format in many REST APIs. Google BigQuery, our cloud service for ad-hoc analytics on big data, has now added support for JSON and the nested/repeated structure inherent in the data format.

JSON opens the door to a more object-oriented view of your data compared to CSV, the original data format supported by BigQuery. It removes the need for duplication of data required when you flatten records into CSV. Here are some examples of data you might find a JSON format useful for:

  • Log files, with multiple headers and other name-value pairs.
  • User session activities, with information about each activity occurring nested beneath the session record.
  • Sensor data, with variable attributes collected in each measurement.

Nested/repeated data support is one of our most requested features. And while BigQuery’s underlying infrastructure supports it, we’d only enabled it in a limited fashion through M-Lab’s test data. Today, however, developers can use JSON to get any nested/repeated data into and out of BigQuery.
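The duplication point is easy to see with the user-session example from the list above: a single nested record (one line of newline-delimited JSON, the form a BigQuery JSON load ingests) carries the session fields once, while the flattened CSV equivalent repeats them on every activity row. A small sketch with hypothetical field names:

```python
# Sketch of the "no duplication" point: one nested session record versus a
# flattened row per activity. Field names are hypothetical.
import json

session = {
    "session_id": "abc123",
    "user": "u42",
    "activities": [  # nested/repeated field
        {"ts": "2012-10-02T10:00:00Z", "action": "search"},
        {"ts": "2012-10-02T10:01:30Z", "action": "click"},
    ],
}

# Newline-delimited JSON: one record per line.
print(json.dumps(session))

# Flattened CSV repeats session_id and user on every activity row.
for a in session["activities"]:
    print(",".join([session["session_id"], session["user"], a["ts"], a["action"]]))
```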

It had to happen. “Big JSON,” that is.

My question is when “Bigger Data” is going to catch on?

If you got far enough ahead, say six to nine months, you could copyright something like “Biggest Data” and start collecting fees when it comes into common usage.

Graph Drawing talks are online

Filed under: Graphs,Networks,Visualization — Patrick Durusau @ 3:40 pm

Graph Drawing talks are online

From the post:

This year’s graph drawing symposium was located at Microsoft, and thanks to Microsoft the talks are now all online. So if you wanted to go but couldn’t, you can still see what you missed.

Turns out anyone not there missed a lot!

One of the jewels I am watching right now is the talk by Ben Shneiderman and Cody Dunne.

Suggest “graph drawing” should be renamed “graph discovery.”

Paraphrase: “the purpose of graph drawing is not pictures but discovery.”

Torque for mapping temporal data

Filed under: Graphics,HTML5,Mapping,Temporal Data,Visualization — Patrick Durusau @ 2:49 pm

Torque for mapping temporal data by Nathan Yau.

From the post:

Mapping data over time can be challenging, especially when you have a lot of data to load in the beginning. Torque, the new open source project by CartoDB, is a step towards making the process easier.

Torque allows you to create beautiful visualizations with big temporal datasets by bundling HTML5 browser rendering technologies with a generic and efficient temporal data transfer format created using the CartoDB SQL API. Torque visualisations work on desktop and ipads, and work well on temporal datasets with hundreds of thousands or even millions of datapoints.

Isn’t data always mapped over time?

Data always originates at a time, is observed at a time, is recorded at a time (by an observer, mechanical or otherwise), is valid through a time, and so on.

We may omit time for some reason or purpose but that is our choice.
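As for the mechanics, the core trick behind this style of animated map is easy to sketch: bucket the timestamped points into fixed-width time slices so the browser only has to draw one slice per frame. A toy version (made-up points; CartoDB’s actual Torque transfer format is a more compact encoding of the same idea):

```python
# Toy sketch of temporal bucketing: group timestamped points into fixed-width
# time slices so each animation frame draws one slice. Points are made up.
from collections import defaultdict
from datetime import datetime

points = [
    ("2012-10-02T10:00:05", -84.39, 33.75),
    ("2012-10-02T10:00:40", -84.38, 33.76),
    ("2012-10-02T10:02:10", -84.40, 33.74),
]

BUCKET_SECONDS = 60
frames = defaultdict(list)
for ts, lon, lat in points:
    t = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S")
    bucket = int(t.timestamp()) // BUCKET_SECONDS
    frames[bucket].append((lon, lat))

for bucket in sorted(frames):
    print(bucket, frames[bucket])  # one frame's worth of points
```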

Solr vs ElasticSearch: Part 3 – Searching

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 9:21 am

Solr vs ElasticSearch: Part 3 – Searching by Rafał Kuć.

From the post:

In the last two parts of the series we looked at the general architecture and how data can be handled in both Apache Solr 4 (aka SolrCloud) and ElasticSearch and what the language handling capabilities of both enterprise search engines are like. In today’s post we will discuss one of the key parts of any search engine – the ability to match queries to documents and retrieve them.

  • Solr vs. ElasticSearch: Part 1 – Overview
  • Solr vs. ElasticSearch: Part 2 – Indexing and Language Handling
  • Solr vs. ElasticSearch: Part 3 – Searching
  • Solr vs. ElasticSearch: Part 4 – Faceting
  • Solr vs. ElasticSearch: Part 5 – API Usage Possibilities

Definitely a series to follow.
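For a feel of the “matching queries to documents” side on the ElasticSearch end, here is a minimal sketch of a full-text match query against the HTTP search API. The index name, field name, and local node address are assumptions for the example.

```python
# Minimal sketch: run a full-text "match" query against ElasticSearch's HTTP
# search API. Index name, field name and node address are assumptions.
import json
import urllib.request

query = {"query": {"match": {"title": "enterprise search"}}}
req = urllib.request.Request(
    "http://localhost:9200/articles/_search",
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]

for h in hits:
    print(h["_score"], h["_source"].get("title"))
```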
