Another Word For It – Patrick Durusau on Topic Maps and Semantic Diversity

September 13, 2012

Hafslund SESAM – Semantic integration in practice

Filed under: Integration,Semantics — Patrick Durusau @ 10:57 am

Hafslund SESAM – Semantic integration in practice by Lars Marius Garshol.

Lars has posted his slides on a practical implementation of semantic integration and what he saw along the way.

I particularly liked the line:

Generally, archive systems are glorified trash cans – putting it in the archive effectively means hiding it

BTW, Lars mentions he has a paper on this project. If you are looking for publishable semantic integration content, you might want to ping him.

larsga@bouvet.no
http://twitter.com/larsga

4th Workshop on Complex Networks, 2013

Filed under: Conferences,Graphs,Networks — Patrick Durusau @ 10:42 am

4th Workshop on Complex Networks, 2013

From the call for papers:

The 4th international workshop on complex networks (CompleNet 2013) aims at bringing together researchers and practitioners working on areas related to complex networks. In the past two decades we have been witnessing an exponential increase on the number of publications in this field. From biological systems to computer science, from economic to social systems, complex networks are becoming pervasive in many fields of science. It is this interdisciplinary nature of complex networks that this workshop aims at addressing.

Authors are encouraged to submit previously unpublished papers/abstracts on their research in complex networks. Both theoretical and applied papers are of interest. Specific topics of interest are (but not limited to):

  • Applications of Network Science
  • Behavioral & Social Influence
  • Community Structure in Networks
  • Complex Network in Technology
  • Complex Networks and Epidemics
  • Complex Networks and Mobility
  • Complex Networks in Biological Systems
  • Emergence in Complex Networks
  • Geometry in Complex Networks
  • Information Spreading in Social Media
  • Link Analysis and Ranking
  • Modeling Human Behavior in Complex Networks
  • Models of Complex Networks
  • Network Evolution
  • Networks as Frameworks
  • Rumor Spreading
  • Search in Complex Networks
  • Shocks and Bursts
  • Social Networks
  • Structural Network Properties and Analysis
  • Synchronization in Networks

Prior programs as a guide to what you will encounter:

CompleNet 2012

CompleNet 2010 (PDF file)

Prison Polling [If You Don’t Ask, You Won’t Know]

Filed under: Data,Design,Statistics — Patrick Durusau @ 9:44 am

Prison Polling by Carl Bialik.

From the post:

My print column examines the argument of a book out this week that major federal surveys are missing an important part of the population by not polling prisoners.

“We’re missing 1% of the population,” said Becky Pettit, a University of Washington sociologist and author of the book, “Invisible Men.” “People might say, ‘That’s not a big deal.’” But it is for some groups, she writes — particularly young black men. And for young black men, especially those without a high-school diploma, official statistics paint a rosier picture than reality on factors such as employment and voter turnout.

“Because many surveys skip institutionalized populations, and because we incarcerate lots of people, especially young black men with low levels of education, certain statistics can look rosier than if we included” prisoners in surveys, said Jason Schnittker, a sociologist at the University of Pennsylvania. “Whether you regard the impact as ‘massive’ depends on your perspective. The problem of incarceration tends to get swept under the rug in lots of different ways, rendering the issue invisible.”

A reminder that assumptions are cooked into data long before it reaches us for analysis.

If we don’t ask questions about data collection, we may be passing on results that don’t serve the best interests of our clients.

So for population data, ask (among other things):

  • Who was included/excluded?
  • How were the included selected?
  • On what basis were people excluded?
  • Where are the survey questions?
  • By what means were the questions asked? (phone, web, in person)
  • Time of day of survey?

and I am sure there are others.

Don’t be impressed by protests that your questions are irrelevant or the source has already “accounted” for that issue.

Right.

When someone protests you don’t need to know, you know where to push. Trust me on that one.

GIMP Magazine

Filed under: GIMP,Graphics,Visualization — Patrick Durusau @ 8:56 am

GIMP Magazine

I guess you know software has “arrived” when it graduates to having a zine! 😉

As if you needed more validation than 6.8 million downloads in two months (GIMP 2.0).

A very high end graphics tool, mastery of which will serve you well.

Or so I am told. I can open/close and dabble a bit.

Time to change that and this may be the right encouragement at the right time to make that happen.

Enjoy!

PS: Illustrations/images of topic maps welcome!

September 12, 2012

Wikipedia is dominated by male editors

Filed under: Graphics,Statistics,Visualization — Patrick Durusau @ 7:27 pm

Wikipedia is dominated by male editors by Nathan Yau.

From the post:

After he saw a New York Times article on the gender gap among Wikipedia contributors (The contributor base is only 13 percent women), Santiago Ortiz plotted articles by number of men versus number of women who edited. It’s interactive, so you can mouse over dots to see what article each represents, and you can zoom in for a closer look in the bottom left.

This graphic merits wide circulation.

There isn’t a recipe for how to make such an effective graphic, other than perhaps to have studied equally effective graphics.

I will try to hunt down an example I saw many years ago that plotted population versus representation at the United Nations. If I can find it, you can draw your own conclusions about it.

In the meantime, if you spot graphics/visualizations that are clearly a cut above others, please share.

PostgreSQL 9.2 released

Filed under: Database,PostgreSQL — Patrick Durusau @ 7:12 pm

PostgreSQL 9.2 released

From the announcement:

The PostgreSQL Global Development Group announces PostgreSQL 9.2, the latest release of the leader in open source databases. Since the beta release was announced in May, developers and vendors have praised it as a leap forward in performance, scalability and flexibility. Users are expected to switch to this version in record numbers.

“PostgreSQL 9.2 will ship with native JSON support, covering indexes, replication and performance improvements, and many more features. We are eagerly awaiting this release and will make it available in Early Access as soon as it’s released by the PostgreSQL community,” said Ines Sombra, Lead Data Engineer, Engine Yard.

Links

Downloads, including packages and installers
Release Notes
Documentation
What’s New in 9.2
Press Kit

New features like range types:

Range types are used to store a range of data of a given type. There are a few pre-defined types. They are integer (int4range), bigint (int8range), numeric (numrange), timestamp without timezone (tsrange), timestamp with timezone (tstzrange), and date (daterange).

Ranges can be made of continuous (numeric, timestamp…) or discrete (integer, date…) data types. They can be open (the bound isn’t part of the range) or closed (the bound is part of the range). A bound can also be infinite.

Without these datatypes, most people solve the range problems by using two columns in a table. These range types are much more powerful, as you can use many operators on them.

have captured my attention.
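Range types invite a quick experiment. Here is a minimal sketch using psycopg2 against a local 9.2 instance (the connection string, table and data are my own invention, not from the announcement):

```python
# Sketch: PostgreSQL 9.2 range types via psycopg2 (assumes a local database
# named "rangedemo" exists and psycopg2 is installed -- both are assumptions).
import psycopg2

conn = psycopg2.connect("dbname=rangedemo host=localhost")
cur = conn.cursor()

# One daterange column instead of the usual start_date/end_date pair.
cur.execute("""
    CREATE TABLE IF NOT EXISTS reservation (
        room   int,
        during daterange
    )
""")
cur.execute("INSERT INTO reservation VALUES (101, daterange('2012-09-15', '2012-09-18'))")

# Range operators replace pairs of column comparisons:
# && tests overlap, @> tests containment of an element.
cur.execute("SELECT room FROM reservation WHERE during && daterange('2012-09-17', '2012-09-20')")
print(cur.fetchall())

cur.execute("SELECT room FROM reservation WHERE during @> DATE '2012-09-16'")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```

The overlap (&&) and containment (@>) operators do in one step what otherwise takes careful comparisons over two columns.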

Now to look at other new features: Index-only scans, Replication improvements and JSON datatype.

Pushing Parallel Barriers Skyward (Subject Identity at 1EB/year)

Filed under: Astroinformatics,BigData,Subject Identity — Patrick Durusau @ 5:50 pm

Pushing Parallel Barriers Skyward by Ian Armas Foster

From the post:

As much data as there exists on the planet Earth, the stars and the planets that surround them contain astronomically more. As we discussed earlier, Peter Nugent and the Palomar Transient Factory are using a form of parallel processing to identify astronomical phenomena.

Some researchers believe that parallel processing will not be enough to meet the huge data requirements of future massive-scale astronomical surveys. Specifically, several researchers from the Korea Institute of Science and Technology Information including Jaegyoon Hahm along with Yonsei University’s Yong-Ik Byun and the University of Michigan’s Min-Su Shin wrote a paper indicating that the future of astronomical big data research is brighter with cloud computing than parallel processing.

Parallel processing is holding its own at the moment. However, when these sky-mapping and phenomena-chasing projects grow significantly more ambitious by the year 2020, parallel processing will have no hope.

How ambitious are these future projects? According to the paper, the Large Synoptic Survey Telescope (LSST) will generate 75 petabytes of raw plus catalogued data for its ten years of operation, or about 20 terabytes a night. That pales in comparison to the Square Kilometer Array, which is projected to archive in one year 250 times the amount of information that exists on the planet today.

“The total data volume after processing (the LSST) will be several hundred PB, processed using 150 TFlops of computing power. Square Kilometer Array (SKA), which will be the largest in the world radio telescope in 2020, is projected to generate 10-100PB raw data per hour and archive data up to 1EB every year.”

Beyond storage/processing requirements, how do you deal with subject identity at 1EB/year?

Changing subject identity that is.

People are as inconstant with subject identity as they are with marital fidelity. If they do that well.

Now spread that over decades or centuries of research.

Does anyone see a problem here?

Cloudera Enterprise in Less Than Two Minutes

Filed under: Cloud Computing,Cloudera,Hadoop,MapReduce — Patrick Durusau @ 4:10 pm

Cloudera Enterprise in Less Than Two Minutes by Justin Kestelyn.

I had to pause “Born Under A Bad Sign” by Cream to watch the video but it was worth it!

Good example of selling technique too!

Focused on common use cases and user concerns. Promises a solution without all the troublesome details.

Time enough for that after a prospect is interested. And even then, ease them into the details.

Coding to the Twitter API

Filed under: CS Lectures,Tweets — Patrick Durusau @ 3:50 pm

Coding to the Twitter API by Marti Hearst.

From the post:

Today Rion Snow saved us huge amounts of time by giving us a primo introduction to the Twitter API. We learned about both the RESTful API and the streaming API for both Java and Python.

A very cool set of slides!

Just the right amount of detail and amusement. Clearly an experienced presenter!
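If you want to try the two API styles the slides cover before working through them, here is a minimal sketch using the tweepy library (my choice of client, not necessarily what the lecture used; the credentials are placeholders):

```python
# Sketch: Twitter's RESTful and streaming APIs via tweepy (assumed client
# library; replace the placeholder credentials with your own).
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth)

# RESTful API: pull a user's recent tweets on demand.
for status in api.user_timeline(screen_name="twitterapi", count=5):
    print(status.text)

# Streaming API: receive matching tweets as they are posted.
class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)

stream = tweepy.Stream(auth, PrintListener())
stream.filter(track=["hadoop"])
```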

Fast integer compression: decoding billions of integers per second

Filed under: Algorithms,Integers,Search Engines — Patrick Durusau @ 3:14 pm

Fast integer compression: decoding billions of integers per second by Daniel Lemire.

At > 2 billion integers per second, you may find there is plenty of action left in your desktop processor!

From the post:

Databases and search engines often store arrays of integers. In search engines, we have inverted indexes that map a query term to a list of document identifiers. This list of document identifiers can be seen as a sorted array of integers. In databases, indexes often work similarly: they map a column value to row identifiers. You also get arrays of integers in databases through dictionary coding: you map all column values to an integer in a one-to-one manner.

Our modern processors are good at processing integers. However, you also want to keep much of the data close to the CPU for better speed. Hence, computer scientists have worked on fast integer compression techniques for the last 4 decades. One of the earliest clever techniques is Elias coding. Over the years, many new techniques have been developed: Golomb and Rice coding, Frame-of-Reference and PFOR-Delta, the Simple family, and so on.

The general story is that while people initially used bit-level codes (e.g., gamma codes), simpler byte-level codes like Google’s group varint are more practical. Byte-level codes like what Google uses do not compress as well, and there is less opportunity for fancy information theoretical mathematics. However, they can be much faster.

Yet we noticed that there was no trace in the literature of a sensible integer compression scheme running on a desktop processor able to decompress data at a rate of billions of integers per second. The best schemes, such as Stepanov et al.’s varint-G8IU, report top speeds of 1.5 billion integers per second.

As you may expect, we eventually found out that it was entirely feasible to decode billions of integers per second. We designed a new scheme that typically compresses better than Stepanov et al.’s varint-G8IU or Zukowski et al.’s PFOR-Delta, sometimes quite a bit better, while being twice as fast over real data residing in RAM (we call it SIMD-BP128). That is, we cleanly exceed a speed of 2 billion integers per second on a regular desktop processor.

We posted our paper online together with our software. Note that our scheme is not patented whereas many other schemes are.

So, how did we do it? Some insights:
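The post goes on to list them. For a sense of the byte-level baseline these schemes race past, here is a toy variable-byte coder over delta-coded document identifiers in plain Python (nothing like SIMD-BP128, just the idea behind byte-level codes):

```python
# Sketch: delta coding of a sorted posting list plus a simple variable-byte
# (varint) coder -- a toy baseline, not the paper's SIMD-BP128 scheme.
def vbyte_encode(sorted_ints):
    """Delta-code sorted ints, emit 7 bits per byte, high bit marks the last byte."""
    out = bytearray()
    prev = 0
    for n in sorted_ints:
        delta = n - prev
        prev = n
        while delta >= 128:
            out.append(delta & 0x7F)
            delta >>= 7
        out.append(delta | 0x80)   # high bit set: final byte of this value
    return bytes(out)

def vbyte_decode(data):
    numbers, current, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:            # final byte: finish this delta
            current |= (byte & 0x7F) << shift
            prev += current
            numbers.append(prev)
            current, shift = 0, 0
        else:
            current |= byte << shift
            shift += 7
    return numbers

doc_ids = [3, 7, 21, 150, 151, 4000]
assert vbyte_decode(vbyte_encode(doc_ids)) == doc_ids
```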

Welcome Hortonworks Data Platform 1.1

Filed under: Flume,Hadoop,Hortonworks — Patrick Durusau @ 10:30 am

Welcome Hortonworks Data Platform 1.1 by Jim Walker.

From the post:

Hortonworks Data Platform 1.1 Brings Expanded High Availability and Streaming Data Capture, Easier Integration with Existing Tools to Improve Enterprise Reliability and Performance of Apache Hadoop

It is exactly three months to the day that Hortonworks Data Platform version 1.0 was announced. A lot has happened since that day…

  • Our distribution has been downloaded by thousands and is delivering big value to organizations throughout the world,
  • Hadoop Summit gathered over 2200 Hadoop enthusiasts into the San Jose Convention Center,
  • And, our Hortonworks team grew by leaps and bounds!

In these same three months our growing team of committers, engineers, testers and writers have been busy knocking out our next release, Hortonworks Data Platform 1.1. We are delighted to announce availability of HDP 1.1 today! With this release, we expand our high availability options with the addition of Red Hat based HA, add streaming capability with Flume, expand monitoring API enhancements and have made significant performance improvements to the core platform.

New features include high availability, capturing data streams (Flume), improved operations management and performance increases.

For the details, see the post, documentation or even download Hortonworks Data Platform 1.1 for a spin.

Unlike Odo’s Klingon days, a day with several items from Hortonworks is a good day. Enjoy!

How To Take Big Data to the Cloud [Webinar – 13 Sept 2012 – 10 AM PDT]

Filed under: BigData,Cloud Computing,Hortonworks — Patrick Durusau @ 10:17 am

How To Take Big Data to the Cloud by Lisa Sensmeier.

From the post:

Hortonworks boasts a rich and vibrant ecosystem of partners representing a huge array of solutions that leverage Hadoop, and specifically Hortonworks Data Platform, to provide big data insights for customers. The goal of our Partner Webinar Series is to help communicate the value and benefit of our partners’ solutions and how they connect and use Hortonworks Data Platform.

Look to the Clouds

Setting up a big data cluster can be difficult, especially considering the assembly of all the equipment, power, and space to make it happen. One option to consider is using the cloud as a practical and economical way to go. The cloud is also used to provide extra capacity for an existing cluster or to test your Hadoop applications.

Join our webinar and we will show how you can build a flexible and reliable Hadoop cluster in the cloud using Amazon EC2 cloud infrastructure, StackIQ Apache Hadoop Amazon Machine Image (AMI) and Hortonworks Data Platform. The panel of speakers includes Matt Tavis, Solutions Architect for Amazon Web Services, Mason Katz, CTO and co-founder of StackIQ, and Rohit Bakhshi, Product Manager at Hortonworks.

OK, it is a vendor/partner presentation but most of us work for vendors and use vendor created tools.

Yes?

The real question is whether tool X does what is necessary at a cost project Y can afford.

Whether vendor sponsored tool, service, home grown or otherwise.

Yes?

Looking forward to it!

Apache Hadoop YARN – NodeManager

Filed under: Hadoop YARN,Hortonworks — Patrick Durusau @ 10:06 am

Apache Hadoop YARN – NodeManager by Vinod Kumar Vavilapalli

From the post:

In the previous post, we briefly covered the internals of Apache Hadoop YARN’s ResourceManager. In this post, which is the fourth in the multi-part YARN blog series, we are going to dig deeper into the NodeManager internals and some of the key-features that NodeManager exposes. Part one, two and three are available.

Introduction

The NodeManager (NM) is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up-to date with the ResourceManager (RM), overseeing containers’ life-cycle management; monitoring resource usage (memory, CPU) of individual containers, tracking node-health, log’s management and auxiliary services which may be exploited by different YARN applications.

Administration isn’t high on the “exciting” list, although without good administration, things can get very “exciting.”

NodeManager gives you the monitoring tools to help avoid the latter form of excitement.

A Raspberry Pi Supercomputer

Filed under: Computer Science,Parallel Programming,Supercomputing — Patrick Durusau @ 9:55 am

A Raspberry Pi Supercomputer

If you need a supercomputer for processing your topic maps, an affordable one is at hand.

Some assembly required. With Legos no less.

From the ScienceDigest post:

Computational Engineers at the University of Southampton have built a supercomputer from 64 Raspberry Pi computers and Lego.

The team, led by Professor Simon Cox, consisted of Richard Boardman, Andy Everett, Steven Johnston, Gereon Kaiping, Neil O’Brien, Mark Scott and Oz Parchment, along with Professor Cox’s son James Cox (aged 6) who provided specialist support on Lego and system testing.

Professor Cox comments: “As soon as we were able to source sufficient Raspberry Pi computers we wanted to see if it was possible to link them together into a supercomputer. We installed and built all of the necessary software on the Pi starting from a standard Debian Wheezy system image and we have published a guide so you can build your own supercomputer.”

The racking was built using Lego with a design developed by Simon and James, who has also been testing the Raspberry Pi by programming it using free computer programming software Python and Scratch over the summer. The machine, named “Iridis-Pi” after the University’s Iridis supercomputer, runs off a single 13 Amp mains socket and uses MPI (Message Passing Interface) to communicate between nodes using Ethernet. The whole system cost under £2,500 (excluding switches) and has a total of 64 processors and 1Tb of memory (16Gb SD cards for each Raspberry Pi). Professor Cox uses the free plug-in ‘Python Tools for Visual Studio’ to develop code for the Raspberry Pi.
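The MPI part is the same programming model you would use on any cluster. A minimal sketch with the mpi4py bindings (my assumption; the Southampton guide may use different bindings or plain C MPI):

```python
# Sketch: split a sum across MPI ranks and reduce to rank 0.
# Run with something like:  mpiexec -n 64 python pi_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id, 0..size-1
size = comm.Get_size()   # number of processes in the job

# Each rank sums its own slice of 1..1,000,000.
n = 1000000
local = sum(range(rank + 1, n + 1, size))
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print("sum over %d ranks: %d" % (size, total))
```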

You may also want to visit the Raspberry Pi Foundation, which has the slogan: “An ARM GNU/Linux box for $25. Take a byte!”

In an age with ready access to cloud computing resources, to say nothing of weapon quality toys (Playstation 3’s), for design simulations, there is still a place for inexpensive experimentation.

What hardware configurations will you test out on your Raspberry Pi Supercomputer?

Are there specialized configurations that work better for some subject identity tests than others?

How do hardware constraints influence our approaches to computational problems?

Are we missing solutions because they don’t fit current architectures and therefore aren’t considered? (Not rejected, just don’t come up at all.)

Do You Just Talk About The Weather?

Filed under: Dataset,Machine Learning,Mahout,Weather Data — Patrick Durusau @ 9:24 am

After reading this post by Alex you will still just be talking about the weather, but you may have something interesting to say. 😉

Locating Mountains and More with Mahout and Public Weather Dataset by Alex Baranau

From the post:

Recently I was playing with Mahout and public weather dataset. In this post I will describe how I used Mahout library and weather statistics to fill missing gaps in weather measurements and how I managed to locate steep mountains in US with a little Machine Learning (n.b. we are looking for people with Machine Learning or Data Mining backgrounds – see our jobs).

The idea was to just play and learn something, so the effort I did and the decisions chosen along with the approaches should not be considered as a research or serious thoughts by any means. In fact, things done during this effort may appear too simple and straightforward to some. Read on if you want to learn about the fun stuff you can do with Mahout!

Tools & Data

The data and tools used during this effort are: Apache Mahout project and public weather statistics dataset. Mahout is a machine learning library which provides a handful of machine learning tools. During this effort I used just a small piece of this big pie. The public weather dataset is a collection of daily weather measurements (temperature, wind speed, humidity, pressure, &c.) from 9000+ weather stations around the world.

What other questions could you explore with the weather data set?

The real power of “big data” access and tools may be that we no longer have to rely on the summaries of others.

Summaries still have a value-add, perhaps even more so when the original data is available for verification.

September 11, 2012

XML-Print 1.0

Filed under: Mapping,Visualization,XML — Patrick Durusau @ 2:46 pm

Prof. Marc W. Küster announced XML-Print 1.0 this week, “…an open source XML formatter designed especially for the needs of the Digital Humanities.”

Mapping from “…semantic structures to typesetting styles….” (from below)

We have always mapped from semantic structures to typesetting styles, but this time it will be explicit.

Consider whether you need a “transformation” (which implies a static file output) or merely a “view” for some purpose, such as printing.

Both require mappings, but the latter keeps your options open, as it were.

Enjoy!

XML-Print allows the end user to directly interact with semantically annotated data. It consists of two independent, but well-integrated components, an Eclipse-based front-end that enables the user to map their semantic structures to typesetting styles, and the typesetting engine proper that produces the PDF based on this mapping. Both components build as much as possible on existing standards such as XML, XSL-T and XSL-FO and extend those only where absolutely necessary, e.g. for the handling of critical apparatuses.

XML-Print is a DFG-supported joint project of the FH Worms (Prof. Marc W. Küster) and the University of Trier (Prof. Claudine Moulin) in collaboration with the TU Darmstadt (Prof. Andrea Rapp). It is released under the Eclipse Public Licence (EPL) for the front-end and the Affero General Public Licence (AGPL) for the typesetting engine. The project is currently roughly half-way through its intended duration. In its final incarnation the PDF that is produced will satisfy the full set of requirements for the typesetting of (amongst others) critical editions including critical apparatuses, multicolumn synoptic texts and bidirectional text. At this stage it can already handle basic formatting as well as multiple apparatuses, albeit still with some restrictions and rough edges. It is work in progress with new releases coming out regularly.

If you have questions, please do not hesitate to contact us via our website http://www.xmlprint.eu or directly to print@uni-trier.de. Any and all feedback is welcome. Moreover, if you know some people you think could benefit from XML-Print, please feel free to spread the news amongst your peers.

Project homepage: http://www.xmlprint.eu
Source code: http://sourceforge.net/projects/xml-print/
Installers for Windows, Mac and Linux:
http://sourceforge.net/projects/xml-print/files/

Linked Data in Libraries, Archives, and Museums

Filed under: Archives,Library,Linked Data,Museums — Patrick Durusau @ 2:23 pm

Linked Data in Libraries, Archives, and Museums. Information Standards Quarterly (ISQ), Spring/Summer 2012, Volume 24, no. 2/3. http://dx.doi.org/10.3789/isqv24n2-3.2012.

Interesting reading on linked data.

I have some comments on the “discovery” of the need to manage “diverse, heterogeneous metadata” but will save them for another post.

From the “flyer” that landed in my inbox:

The National Information Standards Organization (NISO) announces the publication of a special themed issue of the Information Standards Quarterly (ISQ) magazine on Linked Data for Libraries, Archives, and Museums. ISQ Guest Content Editor, Corey Harper, Metadata Services Librarian, New York University has pulled together a broad range of perspectives on what is happening today with linked data in cultural institutions. He states in his introductory letter, “As the Linked Data Web continues to expand, significant challenges remain around integrating such diverse data sources. As the variance of the data becomes increasingly clear, there is an emerging need for an infrastructure to manage the diverse vocabularies used throughout the Web-wide network of distributed metadata. Development and change in this area has been rapidly increasing; this is particularly exciting, as it gives a broad overview on the scope and breadth of developments happening in the world of Linked Open Data for Libraries, Archives, and Museums.”

The feature article by Gordon Dunsire, Corey Harper, Diane Hillmann, and Jon Phipps on Linked Data Vocabulary Management describes the shift in popular approaches to large-scale metadata management and interoperability to the increasing use of the Resource Description Framework to link bibliographic data into the larger web community. The authors also identify areas where best practices and standards are needed to ensure a common and effective linked data vocabulary infrastructure.

Four “in practice” articles illustrate the growth in the implementation of linked data in the cultural sector. Jane Stevenson in Linking Lives describes the work to enable structured and linked data from the Archives Hub in the UK. In Joining the Linked Data Cloud in a Cost-Effective Manner, Seth van Hooland, Ruben Verborgh, and Rik Van de Walle show how general purpose Interactive Data Transformation tools, such as Google Refine, can be used to efficiently perform the necessary task of data cleaning and reconciliation that precedes the opening up of linked data. Ted Fons, Jeff Penka, and Richard Wallis discuss OCLC’s Linked Data Initiative and the use of Schema.org in WorldCat to make library data relevant on the web. In Europeana: Moving to Linked Open Data, Antoine Isaac, Robina Clayphan, and Bernhard Haslhofer explain how the metadata for over 23 million objects are being converted to an RDF-based linked data model in the European Union’s flagship digital cultural heritage initiative.

Jon Voss provides a status on Linked Open Data for Libraries, Archives, and Museums (LODLAM) State of Affairs and the annual summit to advance this work. Thomas Elliott, Sebastian Heath, John Muccigrosso Report on the Linked Ancient World Data Institute, a workshop to further the availability of linked open data to create reusable digital resources with the classical studies disciplines.

Kevin Ford wraps up the contributed articles with a standard spotlight article on LC’s Bibliographic Framework Initiative and the Attractiveness of Linked Data. This Library of Congress-led community effort aims to transition from MARC 21 to a linked data model. “The move to a linked data model in libraries and other cultural institutions represents one of the most profound changes that our community is confronting,” stated Todd Carpenter, NISO Executive Director. “While it completely alters the way we have always described and cataloged bibliographic information, it offers tremendous opportunities for making this data accessible and usable in the larger, global web community. This special issue of ISQ demonstrates the great strides that libraries, archives, and museums have already made in this arena and illustrates the future world that awaits us.”

GraphConnect – Agenda – 2012 [San Francisco]

Filed under: Conferences,Graphs,Neo4j — Patrick Durusau @ 10:44 am

GraphConnect – Agenda – 2012


  • Monday, November 5 – Full day Neo4j Tutorial and Community Meetup
  • Tuesday, November 6 – Presentations

In case you have been wavering over registration and plane tickets:

Presentations range from campaign data, commerce, medical, drugs (legal drugs), to lessons for startups (people who want to be enterprises) or enterprises (people who want to be sovereign nations).

Do remember to vote absentee for the US Presidential election on November 6, 2012. That’s not an excuse for missing the conference!

Item based similarity with GraphChi

Filed under: GraphChi,GraphLab,Similarity — Patrick Durusau @ 10:15 am

Item based similarity with GraphChi by Danny Bickson.

From the post:

Item based collaborative filtering is one of the most popular collaborative filtering methods used in more than 70% of the companies I am talking to. Following my mega collaborator Justin Yan‘s advice, I have started to implement some item based similarity methods in GraphChi.

Item based methods compare all pairs of items together, for computing similarity metric between each pair of items. This task becomes very quickly computation intensive. For example, Netflix data has around 17K movies. If we want to compute all pairs of movies to find the most similar ones, we need to compare around 290M pairs!

If we use a symmetric similarity measure, the distance between movie A and B is similar to the distance between movie B and A. Thus for the Netflix example we have around 145M pairs to compute. To reduce the work furthermore, we only compare movies which were watched together by at least X users, for example X=5. Otherwise, those movies are not considered similar.

When the dataset is big, it is not possible to load it fully into memory at a single machine. That is where GraphChi comes in. My preliminary implementation of the item similarity computes similarity between all pairs without fully reading the dataset into memory. The idea is to load a chunk of the items (called pivots) into memory, and then stream over the rest of the items by comparing the pivots to the rest of the items.
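The pivot trick is easy to mock up at toy scale. A dense NumPy sketch of the access pattern (GraphChi streams sharded, sparse data from disk, so treat this purely as an illustration):

```python
# Sketch: hold one chunk of item vectors ("pivots") in memory and stream the
# rest past it, rather than materialising the full item-item matrix at once.
import numpy as np

ratings = np.random.rand(1000, 50)            # users x items (toy data)
items = ratings.T                              # one row per item
items = items / (np.linalg.norm(items, axis=1)[:, None] + 1e-9)  # for cosine

chunk = 10                                     # pivot items held in memory
top_pairs = []
for start in range(0, items.shape[0], chunk):
    pivots = items[start:start + chunk]        # in-memory pivot block
    sims = pivots @ items.T                    # cosine sims: pivots vs all items
    for i, row in enumerate(sims):
        j = int(np.argsort(row)[-2])           # best match, skipping the item itself
        top_pairs.append((start + i, j, float(row[j])))

print(top_pairs[:5])
```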

I need to check on Danny’s blog and the GraphChi/GraphLab webpages more often!

GraphChi parsers toolkit

Filed under: GraphChi,GraphLab,Graphs,Latent Dirichlet Allocation (LDA),Parsers,Tweets — Patrick Durusau @ 9:53 am

GraphChi parsers toolkit by Danny Bickson.

From the post:

At the request of Baoqiang Cao I have started a parsers toolkit in GraphChi to be used for preparing data to be used in GraphLab/GraphChi. The parsers should be used as templates which can be easily customized to user-specific needs.

Danny starts us off with an LDA parser (with worked example of its use) and then adds a Twitter parser that creates a graph of retweets.

Enjoy!

Energy Models for Graph Clustering

Filed under: Clustering,Graphs — Patrick Durusau @ 9:45 am

Interesting paper: energy models for graph clustering

Danny Bickson writes:

I got today a question by Timmy Wilson, our man in Ohio, about the paper: Energy Models for Graph Clustering by Andreas Noack. This paper has a nice treatment for power law graph visualization (like social networks). In traditional layouts, the popular nodes which have a lot of links have larger importance and thus are visualized in the center, so we get a messy layout:

The abstract from the paper:

The cluster structure of many real-world graphs is of great interest, as the clusters may correspond e.g. to communities in social networks or to cohesive modules in software systems. Layouts can naturally represent the cluster structure of graphs by grouping densely connected nodes and separating sparsely connected nodes. This article introduces two energy models whose minimum energy layouts represent the cluster structure, one based on repulsion between nodes (like most existing energy models) and one based on repulsion between edges. The latter model is not biased towards grouping nodes with high degrees, and is thus more appropriate for the many real-world graphs with right-skewed degree distributions. The two energy models are shown to be closely related to widely used quality criteria for graph clusterings – namely the density of the cut, Shi and Malik’s normalized cut, and Newman’s modularity – and to objective functions optimized by eigenvector-based graph drawing methods.

The illustrations make a compelling case for edge-based repulsion versus node-based repulsion.

Take note of Danny’s comments on problems with this approach and GraphLab.
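To see the two models side by side, here is a naive gradient-descent sketch of a LinLog-style energy, with a flag that weights repulsion by node degree (my reading of the edge-repulsion idea; the paper is the authority on the exact formulation):

```python
# Sketch: LinLog-style layout. Attraction energy = distance per edge,
# repulsion energy = -log(distance) per node pair, optionally weighted by
# node degrees. O(n^2) per step -- purely illustrative.
import numpy as np

def linlog_layout(edges, n, steps=200, lr=0.01, degree_weighted_repulsion=False):
    pos = np.random.rand(n, 2)
    deg = np.bincount(np.array(edges).ravel(), minlength=n).astype(float)
    w = deg if degree_weighted_repulsion else np.ones(n)
    for _ in range(steps):
        force = np.zeros_like(pos)
        for u, v in edges:                      # attraction along edges
            d = pos[v] - pos[u]
            dist = np.linalg.norm(d) + 1e-9
            force[u] += d / dist
            force[v] -= d / dist
        for u in range(n):                      # repulsion between node pairs
            for v in range(u + 1, n):
                d = pos[u] - pos[v]
                f = w[u] * w[v] * d / (d @ d + 1e-9)
                force[u] += f
                force[v] -= f
        pos += lr * force
    return pos

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(linlog_layout(edges, 6)[:3])
```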

Web Data Extraction, Applications and Techniques: A Survey

Filed under: Data Mining,Extraction,Machine Learning,Text Extraction,Text Mining — Patrick Durusau @ 5:05 am

Web Data Extraction, Applications and Techniques: A Survey by Emilio Ferrara, Pasquale De Meo, Giacomo Fiumara, Robert Baumgartner.

Abstract:

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc application domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction.

This survey aims at providing a structured and comprehensive overview of the research efforts made in the field of Web Data Extraction. The fil rouge of our work is to provide a classification of existing approaches in terms of the applications for which they have been employed. This differentiates our work from other surveys devoted to classify existing approaches on the basis of the algorithms, techniques and tools they use.

We classified Web Data Extraction approaches into categories and, for each category, we illustrated the basic techniques along with their main variants.

We grouped existing applications in two main areas: applications at the Enterprise level and at the Social Web level. Such a classification relies on a twofold reason: on one hand, Web Data Extraction techniques emerged as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. On the other hand, Web Data Extraction techniques allow for gathering a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities of analyzing human behaviors on a large scale.

We discussed also about the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

Comprehensive (> 50 pages) survey of web data extraction. It supplements and updates existing work through its focus on classifying web data extraction approaches by field of use.

Very likely to lead to adaptation of techniques from one field to another.

Measuring Similarity in Large-scale Folksonomies [Users vs. Authorities]

Measuring Similarity in Large-scale Folksonomies by Giovanni Quattrone, Emilio Ferrara, Pasquale De Meo, and Licia Capra.

Abstract:

Social (or folksonomic) tagging has become a very popular way to describe content within Web 2.0 websites. Unlike taxonomies, which overimpose a hierarchical categorisation of content, folksonomies enable end-users to freely create and choose the categories (in this case, tags) that best describe some content. However, as tags are informally defined, continually changing, and ungoverned, social tagging has often been criticised for lowering, rather than increasing, the efficiency of searching, due to the number of synonyms, homonyms, polysemy, as well as the heterogeneity of users and the noise they introduce. To address this issue, a variety of approaches have been proposed that recommend users what tags to use, both when labelling and when looking for resources.

As we illustrate in this paper, real world folksonomies are characterized by power law distributions of tags, over which commonly used similarity metrics, including the Jaccard coefficient and the cosine similarity, fail to compute. We thus propose a novel metric, specifically developed to capture similarity in large-scale folksonomies, that is based on a mutual reinforcement principle: that is, two tags are deemed similar if they have been associated to similar resources, and vice-versa two resources are deemed similar if they have been labelled by similar tags. We offer an efficient realisation of this similarity metric, and assess its quality experimentally, by comparing it against cosine similarity, on three large-scale datasets, namely Bibsonomy, MovieLens and CiteULike.
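The mutual reinforcement principle is easy to prototype, even if the paper's actual metric and its efficient realisation are more involved. A naive NumPy sketch:

```python
# Sketch (not the paper's metric): refresh tag similarity from resource
# similarity and vice versa, starting from the raw tag-resource matrix.
import numpy as np

def mutual_reinforcement(M, iterations=10):
    """M is a tags x resources binary assignment matrix."""
    res_sim = np.eye(M.shape[1])
    for _ in range(iterations):
        tag_sim = M @ res_sim @ M.T            # tags similar if used on similar resources
        tag_sim /= max(tag_sim.max(), 1e-12)   # crude normalisation to keep values bounded
        res_sim = M.T @ tag_sim @ M            # resources similar if labelled by similar tags
        res_sim /= max(res_sim.max(), 1e-12)
    return tag_sim, res_sim

M = np.array([[1, 1, 0, 0],    # tag "jazz"
              [1, 0, 1, 0],    # tag "music"
              [0, 0, 1, 1]])   # tag "video"
tag_sim, _ = mutual_reinforcement(M)
print(np.round(tag_sim, 2))
```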

Studying language (tags) as used tells you about users.

Studying language as proscribed by an authority, tells you about that authority.

Which one is of interest to you?

Context-Aware Recommender Systems 2012 [Identity and Context?]

Filed under: Context,Context-aware,Identity,Recommendation — Patrick Durusau @ 4:33 am

Context-Aware Recommender Systems 2012 (In conjunction with the 6th ACM Conference on Recommender Systems (RecSys 2012))

I usually think of recommender systems as attempts to deliver content based on clues about my interests or context. If I dial 911, the location of the nearest pizza vendor probably isn’t high on my list of interests, etc.

As I looked over these proceedings, it occurred to me that subject identity, for merging purposes, isn’t limited to the context of the subject in question.

That is, some merging tests could depend upon my context as a user.

Take my 911 call for instance. For many purposes, a police substation, fire station, 24 hour medical clinic and a hospital are different subjects.

In a medical emergency situation, for which a 911 call might be a clue, all of those could be treated as a single subject – places for immediate medical attention.

What other subjects do you think might merge (or not) depending upon your context?
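To make the 911 example concrete, here is how a context-aware merging test might look (the subject names and properties are invented for illustration):

```python
# Sketch: whether two subjects merge depends on the user's context,
# not only on properties of the subjects themselves.
SUBJECTS = {
    "police-substation": {"immediate_medical_care": False},
    "fire-station":      {"immediate_medical_care": True},
    "24hr-clinic":       {"immediate_medical_care": True},
    "hospital":          {"immediate_medical_care": True},
}

def same_subject(a, b, context):
    """Context-dependent merging test."""
    if context == "medical-emergency":
        # In a 911 context, anything offering immediate care collapses into
        # a single subject: "place for immediate medical attention."
        return (SUBJECTS[a]["immediate_medical_care"]
                and SUBJECTS[b]["immediate_medical_care"])
    return a == b   # outside that context, they remain distinct subjects

print(same_subject("fire-station", "hospital", "medical-emergency"))  # True
print(same_subject("fire-station", "hospital", "city-planning"))      # False
```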

Table of Contents

  1. Optimal Feature Selection for Context-Aware Recommendation Using Differential Relaxation
    Yong Zheng, Robin Burke, Bamshad Mobasher.
  2. Relevant Context in a Movie Recommender System: Users’ Opinion vs. Statistical Detection
    Ante Odic, Marko Tkalcic, Jurij Franc Tasic, Andrej Kosir.
  3. Improving Novelty in Streaming Recommendation Using a Context Model
    Doina Alexandra Dumitrescu, Simone Santini.
  4. Towards a Context-Aware Photo Recommender System
    Fabricio Lemos, Rafael Carmo, Windson Viana, Rossana Andrade.
  5. Context and Intention-Awareness in POIs Recommender Systems
    Hernani Costa, Barbara Furtado, Durval Pires, Luis Macedo, F. Amilcar Cardoso.
  6. Evaluation and User Acceptance Issues of a Bayesian-Classifier-Based TV Recommendation System
    Benedikt Engelbert, Karsten Morisse, Kai-Christoph Hamborg.
  7. From Online Browsing to Offline Purchases: Analyzing Contextual Information in the Retail Business
    Simon Chan, Licia Capra.

Analyzing Big Data with Twitter

Filed under: BigData,CS Lectures,MapReduce,Pig — Patrick Durusau @ 3:39 am

Analyzing Big Data with Twitter

Not really with Twitter but with tools sponsored/developed/used by Twitter. Lecture series at the UC Berkeley School of Information.

Videos of lectures are posted online.

Check out the syllabus for assignments and current content.

Four (4) lectures so far!

  • Big Data Analytics with Twitter – Marti Hearst & Gilad Mishne. Introduction to Twitter in general.
  • Twitter Philosophy and Software Architecture – Othman Laraki & Raffi Krikorian.
  • Introduction to Hadoop – Bill Graham.
  • Apache Pig – Jon Coveney
  • … more to follow.

…1 Million TPS on $5K Hardware

Filed under: Alchemy Database,Systems Research — Patrick Durusau @ 2:47 am

Russ’ 10 Ingredient Recipe for Making 1 Million TPS on $5K Hardware

Got your attention? Good. Read on:

My name is Russell Sullivan, I am the author of AlchemyDB: a highly flexible NoSQL/SQL/DocumentStore/GraphDB-datastore built on top of redis. I have spent the last several years trying to find a way to sanely house multiple datastore-genres under one roof while (almost paradoxically) pushing performance to its limits.

I recently joined the NoSQL company Aerospike (formerly Citrusleaf) with the goal of incrementally grafting AlchemyDB’s flexible data-modeling capabilities onto Aerospike’s high-velocity horizontally-scalable key-value data-fabric. We recently completed a peak-performance TPS optimization project: starting at 200K TPS, pushing to the recent community edition launch at 500K TPS, and finally arriving at our 2012 goal: 1M TPS on $5K hardware.

Getting to one million over-the-wire client-server database-requests per-second on a single machine costing $5K is a balance between trimming overhead on many axes and using a shared nothing architecture to isolate the paths taken by unique requests.

Even if you aren’t building a database server the techniques described in this post might be interesting as they are not database server specific. They could be applied to a ftp server, a static web server, and even to a dynamic web server.

My blog falls short of needing that level of TPS, but your experience may be different. 😉

It is a good read in any case.

September 10, 2012

Sunlight Academy (Finding US Government Data)

Filed under: Government,Government Data,Law,Law - Sources — Patrick Durusau @ 4:05 pm

Sunlight Academy

From the website:

Welcome to Sunlight Academy, a collection of interactive tutorials for journalists, activists, researchers and students to learn about tools by the Sunlight Foundation and others to unlock government data.

Be sure to create a profile to access our curriculum, track your progress, watch videos, complete training activities and get updates on new tutorials and tools.

Whether you are an investigative journalist trying to get insight on a complex data set, an activist uncovering the hidden influence behind your issue, or a congressional staffer in need of mastering legislative data, Sunlight Academy guides you through how to make our tools work for you. Let’s get started!

The Sunlight Foundation has created tools to make government data more accessible.

Unlike some governments and software projects, the Sunlight Foundation business model isn’t based on poor or non-existent documentation.

Modules (as of 2012 September 10):

  • Tracking Government
    • Scout – Scout is a legislative and governmental tracking tool from the Sunlight Foundation that alerts you when Congress or your state capitol talks about or takes action on issues you care about. Learn how to search and create alerts on federal and state legislation, regulations and the Congressional Record.
    • Scout (Webinar) – Recorded webinar and demo of Scout from July 26, 2012. The session covered basic skills such as search terms and bill queries, as well as advanced functions such as tagging, merging outside RSS feeds and creating curated search collections.
  • Unlocking Data
    • Political Ad Sleuth – Frustrated by political ads inundating your TV? Learn how you can discover who is funding these ads from the public files at your local television station through this tutorial.
    • Unlocking APIs – What are APIs and how do they deliver government data? This tutorial provides an introduction to using APIs and highlights what Sunlight’s APIs have to offer on legislative and congressional data.
  • Lobbying
    • Lobbying Contribution Reports – These reports highlight the millions of dollars that lobbying entities spend every year giving to charities in honor of lawmakers and executive branch officials, technically referred to as “honorary fees.” Find out how to investigate lobbying contribution reports, understand the rules behind them and see what you can do with the findings.
    • Lobbying Registration Tracker – Learn about the Lobbying Registration Tracker, a Sunlight Foundation tool that allows you to track new registrations for federal lobbyists and lobbying firms. This database allows users to view registrations as they’re submitted, browse by issue, registrant or client, and see the trends in issues and registrations over the last 12 months.
    • Lobbying Report Form – Four times a year, groups that lobby Congress and the federal government file reports on their activities. Unlock the important information contained in the quarterly lobbying reports to keep track of who’s influencing whom in Washington. Learn tips on how to read the reports and how they can inform your reporting.
  • Data Analysis
    • Data Visualizations in Google Docs – While Google is often used for internet searches and maps, it can also help with data visualizations via Google Charts. Learn how to use Google Docs to generate interactive charts in this training.
    • Mapping Campaign Finance Data – Campaign finance data can be complex and confusing — for reporters and for readers. But it doesn’t have to be. One way to make sense of it all is through mapping. Learn how to turn campaign finance information into beautiful maps, all through free tools.
    • Pivot Tables – Pivot tables are powerful tools, but it’s not always obvious how to use them. Learn how to create and use pivot tables in Excel to aggregate and summarize data that otherwise would require a database.
  • Research Tools
    • Advanced Google Searches – Google has made search easy and effective, but that doesn’t mean it can’t be better. Learn how to effectively use Google’s Advanced Search operators so you can get what you’re looking for without wasting time on irrelevant results.
    • Follow the Unlimited Money (webinar) – Recorded webinar from August 8, 2012. This webinar covered tools to follow the millions of dollars being spent this election year by super PACs and other outside groups.
    • Learning about Data.gov – Data.gov seeks to organize all of the U.S. government’s data, a daunting and unfinished task. In this module, learn about the powers and limitations of Data.gov, and what other resources to use to fill in Data.gov’s gaps.

Researching Current Federal Legislation and Regulations:…

Filed under: Government,Government Data,Law,Law - Sources,Legal Informatics — Patrick Durusau @ 3:30 pm

Researching Current Federal Legislation and Regulations: A Guide to Resources for Congressional Staff

Description quoted at Full Text Reports:

This report is designed to introduce congressional staff to selected governmental and nongovernmental sources that are useful in tracking and obtaining information on federal legislation and regulations. It includes governmental sources such as the Legislative Information System (LIS), THOMAS, the Government Printing Office’s Federal Digital System (FDsys), and U.S. Senate and House websites. Nongovernmental or commercial sources include resources such as HeinOnline and the Congressional Quarterly (CQ) websites. It also highlights classes offered by the Congressional Research Service (CRS) and the Library of Congress Law Library.

This report will be updated as new information is available.

Direct link to PDF: Researching Current Federal Legislation and Regulations: A Guide to Resources for Congressional Staff

A very useful starting point for research on U.S. federal legislation and regulations, but only a starting point.

Each listed resource merits a user’s guide. And no two of them are exactly the same.

Suggestions for research/topic map exercises based on this listing of resources?

Automating Your Cluster with Cloudera Manager API

Filed under: Clustering (servers),HDFS,MapReduce — Patrick Durusau @ 3:01 pm

Automating Your Cluster with Cloudera Manager API

From the post:

API access was a new feature introduced in Cloudera Manager 4.0 (download free edition here.). Although not visible in the UI, this feature is very powerful, providing programmatic access to cluster operations (such as configuration and restart) and monitoring information (such as health and metrics). This article walks through an example of setting up a 4-node HDFS and MapReduce cluster via the Cloudera Manager (CM) API.

Cloudera Manager API Basics

The CM API is an HTTP REST API, using JSON serialization. The API is served on the same host and port as the CM web UI, and does not require an extra process or extra configuration. The API supports HTTP Basic Authentication, accepting the same users and credentials as the Web UI. API users have the same privileges as they do in the web UI world.

You can read the full API documentation here.
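As a taste of what that programmatic access looks like, a minimal sketch with the requests library (the host, credentials and the /api/v1/clusters path are my assumptions for illustration; the linked documentation is the authority):

```python
# Sketch: list clusters via the Cloudera Manager REST API with Basic Auth.
import requests

CM_HOST = "http://cm-host.example.com:7180"   # same host/port as the CM web UI
AUTH = ("admin", "admin")                      # same users/credentials as the UI

resp = requests.get(CM_HOST + "/api/v1/clusters", auth=AUTH)
resp.raise_for_status()
for cluster in resp.json().get("items", []):
    print(cluster.get("name"), cluster.get("version"))
```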

We are nearing mid-September so the holiday season will be here before long. It isn’t too early to start planning on price/hardware break points.

This will help configure a HDFS and MapReduce cluster on your holiday hardware.

Mapping solution to heterogeneous data sources

Filed under: Bioinformatics,Biomedical,Genome,Heterogeneous Data,Mapping — Patrick Durusau @ 2:21 pm

dbSNO: a database of cysteine S-nitrosylation by Tzong-Yi Lee, Yi-Ju Chen, Cheng-Tsung Lu, Wei-Chieh Ching, Yu-Chuan Teng, Hsien-Da Huang and Yu-Ju Chen. (Bioinformatics (2012) 28 (17): 2293-2295. doi: 10.1093/bioinformatics/bts436)

OK, the title doesn’t jump out and say “mapping solution here!” 😉

Reading a bit further, you discover that text mining is used to locate sequences and that data is then mapped to “UniProtKB protein entries.”

The data set provides access to:

  • UniProt ID
  • Organism
  • Position
  • PubMed Id
  • Sequence

My concern is what happens, when X is mapped to a UniProtKB protein entry, to:

  • The prior identifier for X (in the article or source), and
  • The mapping from X to the UniProtKB protein entry?

If both of those are captured, then prior literature can be annotated upon rendering to point to later aggregation of information on a subject.

If the prior identifier, place of usage, the mapping, etc., are not captured, then prior literature, when we encounter it, remains frozen in time.

Mapping solutions work, but repay the effort several times over if the prior identifier and its mapping to the “new” identifier are captured as part of the process.
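In code terms, the record I would want such a pipeline to keep looks something like this (field names and values are invented for illustration):

```python
# Sketch: keep the identifier as reported in the source article, the mapped
# UniProtKB entry, and the basis of the mapping itself.
mapping_record = {
    "reported_identifier": "peptide name as it appears in the article",
    "reported_in":         {"pubmed_id": "placeholder", "position": 123},
    "mapped_to":           {"uniprot_id": "placeholder", "organism": "Homo sapiens"},
    "mapping_basis":       "sequence identity",
    "mapped_on":           "2012-04-15",
}
```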

Abstract

Summary: S-nitrosylation (SNO), a selective and reversible protein post-translational modification that involves the covalent attachment of nitric oxide (NO) to the sulfur atom of cysteine, critically regulates protein activity, localization and stability. Due to its importance in regulating protein functions and cell signaling, a mass spectrometry-based proteomics method rapidly evolved to increase the dataset of experimentally determined SNO sites. However, there is currently no database dedicated to the integration of all experimentally verified S-nitrosylation sites with their structural or functional information. Thus, the dbSNO database is created to integrate all available datasets and to provide their structural analysis. Up to April 15, 2012, the dbSNO has manually accumulated >3000 experimentally verified S-nitrosylated peptides from 219 research articles using a text mining approach. To solve the heterogeneity among the data collected from different sources, the sequence identity of these reported S-nitrosylated peptides are mapped to the UniProtKB protein entries. To delineate the structural correlation and consensus motif of these SNO sites, the dbSNO database also provides structural and functional analyses, including the motifs of substrate sites, solvent accessibility, protein secondary and tertiary structures, protein domains and gene ontology.

Availability: The dbSNO is now freely accessible via http://dbSNO.mbc.nctu.edu.tw. The database content is regularly updated upon collecting new data obtained from continuously surveying research articles.

Contacts: francis@saturn.yu.edu.tw or yujuchen@gate.sinica.edu.tw.
