Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

June 7, 2012

Data Prospecting

Filed under: Contest,Data Analysis — Patrick Durusau @ 2:17 pm

Derrick Harris, in Kaggle is now crowdsourcing big data creativity, writes about a new product from Kaggle, Kaggle Prospect:

The Kaggle Prospect homepage says:

Kaggle Prospect is an open data exploration and problem identification platform that lets organizations with large datasets solicit proposals from the best minds in our 40,000 strong community of predictive modeling and machine learning experts. The experts will peer-review each other’s ideas and we’ll present you with the short list of what problems your data could answer.

If you are sitting on a gold mine of data, but aren’t sure where to start digging, Kaggle Prospect is the place to start.

Kaggle Prospect has a great deal of promise. Assuming enough users can pry data out of data silos for submission. 😉

If you are not familiar with Kaggle contests, see: Kaggle.

PS: I like the Kaggle headline:

We’re making data science a sport.™

May 8, 2012

Reading Other People’s Mail For Fun and Profit

Filed under: Analytics,Data Analysis,Intelligence — Patrick Durusau @ 6:16 pm

Bob Gourley writes much better content than he does titles: Osama Bin Laden Letters Analyzed: A rapid assessment using Recorded Future’s temporal analytic technologies and intelligence analysis tools. (Sorry Bob.)

Bob writes:

The Analysis Intelligence site provides open source analysis and information on a variety of topics based on the temporal analytic technology and intelligence analysis tools of Recorded Future. Shortly after the release of 175 pages of documents from the Combating Terrorism Center (CTC) a very interesting assessment was posted on the site. This assessment sheds light on the nature of these documents and also highlights some of the important context that the powerful capabilities of Recorded Future can provide.

The analysis by Recorded Future is succinct and well done so I cite most of it below. I’ll conclude with some of my own thoughts as an experienced intelligence professional and technologist on some of the “So What” of this assessment.

If you are interested in analytics, particularly visual analytics, you will really appreciate this piece.

Recorded Future has a post on the US Presidential Election. Just to be on the safe side, I would “fuzz” the data when it got close to the election. 😉

Intent vs. Inference

Filed under: Data,Data Analysis,Inference,Intent — Patrick Durusau @ 3:03 pm

Intent vs. Inference by David Loshin.

David writes:

I think that the biggest issue with integrating external data into the organization (especially for business intelligence purposes) is related to the question of data repurposing. It is one thing to consider data sharing for cross-organization business processes (such as brokering transactions between two different trading partners) because those data exchanges are governed by well-defined standards. It is another when your organization is tapping into a data stream created for one purpose to use the data for another purpose, because there are no negotiated standards.

In the best of cases, you are working with some published metadata. In my previous post I referred to the public data at www.data.gov, and those data sets are sometimes accompanied by their data layouts or metadata. In the worst case, you are integrating a data stream with no provided metadata. In both cases, you, as the data consumer, must make some subjective judgments about how that data can be used.

A caution about “intent” or as I knew it, the intentional fallacy in literary criticism. It is popular in some legal circles in the United States as well.

One problem is that there is no common basis for determining authorial intent.

Another problem is that “intent” is often used to privilege one view over others as representing the “intent” of the author. The “original” view is beyond questioning or criticism because it is the “intent” of the original author.

It should come as no surprise that for law (Scalia and the constitution) and the Bible (you pick’em), “original intent” means agreement with the speaker.

It isn’t entirely clear where David is going with this thread but I would simply drop the question of intent and ask two questions:

  1. What is the purpose of this data?
  2. Is the data suited to that purpose?

Where #1 may include what inferences we want to make, etc.

Cuts to the chase as it were.

May 4, 2012

Bridging the Data Science Gap (DataKind)

Filed under: Data,Data Analysis,Data Science,Data Without Borders,DataKind — Patrick Durusau @ 3:43 pm

Bridging the Data Science Gap

From the post:

Data Without Borders connects data scientists with social organizations to maximize their impact.

Data scientists want to contribute to the public good. Social organizations often boast large caches of data but neither the resources nor the skills to glean insights from them. In the worst case scenario, the information becomes data exhaust, lost to neglect, lack of space, or outdated formats. Jake Porway, Data Without Borders [DataKind] founder and The New York Times data scientist, explored how to bridge this gap during the second Big Data for the Public Good seminar, hosted by Code for America and sponsored by Greenplum, a division of EMC.

Code for America founder Jennifer Pahlka opened the seminar with an appeal to the data practitioners in the room to volunteer for social organizations and civic coding projects. She pointed to hackathons such as the ones organized during the nationwide event Code Across America as being examples of the emergence of a new kind of “third place”, referencing sociologist Ray Oldenburg’s theory that the health of a civic society depends upon shared public spaces that are neither home nor work. Hackathons, civic action networks like the recently announced Code for America Brigade, and social organizations are all tangible third spaces where data scientists can connect with community while contributing to the public good.

These principles are core to the Data Without Borders [DataKind] mission. “Anytime there’s a process, there’s data,” Porway emphasized to the audience. Yet much of what is generated is lost, particularly in the third world, where a great amount of information goes unrecorded. In some cases, the social organizations that often operate on shoestring budgets may not even appreciate the value of what they’re losing. Meanwhile, many data scientists working in the private sector want to contribute their skills for the social good in their off-time. “On the one hand, we have a group of people who are really good at looking at data, really good at analyzing things, but don’t have a lot of social outputs for it,” Porway said. “On the other hand, we have social organizations that are surrounded by data and are trying to do really good things for the world but don’t have anybody to look at it.”

The surplus of free work to be done is endless, but I thought you might find this interesting.

Data Without Borders – name change -> DataKind, Facebook page, @datakind on Twitter.

Good opportunity to show off your topic mapping skills!

April 28, 2012

Workflow for statistical data analysis

Filed under: Data Analysis,R,Statistics — Patrick Durusau @ 6:06 pm

Workflow for statistical data analysis by Christophe Lalanne.

A short summary of Oliver Kirchkamp’s Workflow of statistical data analysis, which takes the reader from data to paper.

Christophe says a more detailed review is likely to follow but at eighty-six (86) pages, you could read it yourself and make detailed comments as well.

April 24, 2012

Data Virtualization

Filed under: BigData,Data,Data Analysis,Data Virtualization — Patrick Durusau @ 7:17 pm

David Loshin has a series of excellent posts on data virtualization:

Fundamental Challenges in Data Reusability and Repurposing (Part 1 of 3)

Simplistic Approaches to Data Federation Solve (Only) Part of the Puzzle – We Need Data Virtualization (Part 2 of 3)

Key Characteristics of a Data Virtualization Solution (Part 3 of 3)

In part 3, David concludes:

In other words, to truly provision high quality and consistent data with minimized latency from a heterogeneous set of sources, a data virtualization framework must provide at least these capabilities:

  • Access methods for a broad set of data sources, both persistent and streaming
  • Early involvement of the business user to create virtual views without help from IT
  • Software caching to enable rapid access in real time
  • Consistent views into the underlying sources
  • Query optimizations to retain high performance
  • Visibility into the enterprise metadata and data architectures
  • Views into shared reference data
  • Accessibility of shared business rules associated with data quality
  • Integrated data profiling for data validation
  • Integrated application of advanced data transformation rules that ensure consistency and accuracy

What differentiates a comprehensive data virtualization framework from simplistic layering of access and caching services via data federation is that the comprehensive data virtualization solution goes beyond just data federation. It is not only about heterogeneity and latency, but must incorporate the methodologies that are standardized within the business processes to ensure semantic consistency for the business. If you truly want to exploit the data virtualization layer for performance and quality, you need to have aspects of the meaning and differentiation between use of the data engineered directly into the implementation. And most importantly, also make sure the business user signs-off on the data that is being virtualized for consumption. (emphasis added)

David makes explicit a number of issues, such as integration architectures needing to peer into enterprise metadata and data structures, making it plain that not only the data but also the ways we contain and store data have semantics.

I would add: Consistency and accuracy should be checked on a regular basis with specified parameters for acceptable correctness.

The heterogeneous data sources that David speaks of are ever changing, both in form and semantics. If you need proof of that, consider the history of ETL at your company. If either form or semantics were stable, ETL would be a once-or-twice-in-a-career event. I think we all know that is not the case.

Topic maps can disclose the data and rules for the virtualization decisions that David enumerates. Which has the potential to make those decisions themselves auditable and reusable.

Reuse being an advantage in a constantly changing and heterogeneous semantic environment. Semantics seen once, are very likely to be seen again. (Patterns anyone?)
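To make the “auditable and reusable” point a little more concrete, here is a toy sketch (in Python, with invented source and field names) of the kind of mapping a virtualization layer has to carry around anyway:

  # A minimal sketch of a "consistent view over heterogeneous sources":
  # two invented source schemas mapped onto one canonical record layout.
  # Field names and sources are hypothetical, for illustration only.

  CANONICAL_FIELDS = ["customer_id", "name", "postal_code"]

  # Per-source mapping rules: canonical field -> source field.
  SOURCE_MAPPINGS = {
      "crm_extract":  {"customer_id": "CustID", "name": "FullName", "postal_code": "Zip"},
      "billing_feed": {"customer_id": "acct_no", "name": "acct_name", "postal_code": "post_cd"},
  }

  def to_canonical(source, record):
      """Re-key one source record into the canonical layout."""
      mapping = SOURCE_MAPPINGS[source]
      return {field: record.get(mapping[field]) for field in CANONICAL_FIELDS}

  crm_row = {"CustID": "C-17", "FullName": "Ada Lovelace", "Zip": "30308"}
  bill_row = {"acct_no": "C-17", "acct_name": "A. Lovelace", "post_cd": "30308"}

  print(to_canonical("crm_extract", crm_row))
  print(to_canonical("billing_feed", bill_row))

From a topic map perspective, the interesting part is the mapping table itself. That is the decision worth disclosing, auditing and reusing, rather than burying it in ETL code.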

April 19, 2012

Knoema Launches the World’s First Knowledge Platform Leveraging Data

Filed under: Data,Data Analysis,Data as Service (DaaS),Data Mining,Knoema,Statistics — Patrick Durusau @ 7:13 pm

Knoema Launches the World’s First Knowledge Platform Leveraging Data

From the post:

DEMO Spring 2012 conference — Today at DEMO Spring 2012, Knoema launched publicly the world’s first knowledge platform that leverages data and offers tools to its users to harness the knowledge hidden within the data. Search and exploration of public data, its visualization and analysis have never been easier. With more than 500 datasets on various topics, gallery of interactive, ready to use dashboards and its user friendly analysis and visualization tools, Knoema does for data what YouTube did to videos.

Millions of users interested in data, like analysts, students, researchers and journalists, struggle to satisfy their data needs. At the same time there are many organizations, companies and government agencies around the world collecting and publishing data on various topics. But still getting access to relevant data for analysis or research can take hours with final outcomes in many formats and standards that can take even longer to get it to a shape where it can be used. This is one of the issues that the search engines like Google or Bing face even after indexing the entire Internet due to the nature of statistical data and diversity and complexity of sources.

One-stop shop for data. Knoema, with its state of the art search engine, makes it a matter of minutes if not seconds to find statistical data on almost any topic in easy to ingest formats. Knoema’s search instantly provides highly relevant results with chart previews and actual numbers. Search results can be further explored with Dataset Browser tool. In Dataset Browser tool, users can get full access to the entire public data collection, explore it, visualize data on tables/charts and download it as Excel/CSV files.

Numbers made easier to understand and use. Knoema enables end-to-end experience for data users, allowing creation of highly visual, interactive dashboards with a combination of text, tables, charts and maps. Dashboards built by users can be shared to other people or on social media, exported to Excel or PowerPoint and embedded to blogs or any other web site. All public dashboards made by users are available in dashboard gallery on home page. People can collaborate on data related issues participating in discussions, exchanging data and content.

Excellent!!!

When “other” data becomes available, users will want to integrate it with their data.

But “other” data will have different or incompatible semantics.
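Even a toy example shows the problem. The two series below are made up, but the mismatched column names and units are the everyday reality:

  # Two made-up "public" series about the same thing, with different
  # column names and different units -- a tiny version of the problem.
  import pandas as pd

  gdp_a = pd.DataFrame({"country": ["DE", "FR"], "gdp_usd_bn": [3600, 2700]})
  gdp_b = pd.DataFrame({"iso2": ["DE", "FR"], "gdp_eur_mn": [2800000, 2100000]})

  # The mechanical merge is easy once you have decided that "country" and
  # "iso2" identify the same subject and agreed on a common unit.
  eur_to_usd = 1.3  # assumed rate, for illustration only
  gdp_b = gdp_b.rename(columns={"iso2": "country"})
  gdp_b["gdp_usd_bn_converted"] = gdp_b["gdp_eur_mn"] * eur_to_usd / 1000.0

  merged = gdp_a.merge(gdp_b[["country", "gdp_usd_bn_converted"]], on="country")
  print(merged)

The merge is the easy part. Deciding that “country” and “iso2” name the same subject, and agreeing on a unit, is the semantic work.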

So much for attempts to wrestle semantics to the ground (W3C) or build semantic prisons (unnamed vendors).

What semantics are useful to you today? (patrick@durusau.net)

April 15, 2012

“Verdict First, Then The Trial”

Filed under: Data Analysis,Exploratory Data Analysis — Patrick Durusau @ 7:13 pm

No, not the Trayvon Martin case but rather the lack of “exploratory data analysis” in business environments.

From Business Intelligence Ain’t Over Until Exploratory Data Analysis Sings, where Wayne Kernochan reviews the rise of statistical analysis in businesses and then says:

And yet there is a glaring gap in this picture – or at least a gap that should be glaring. This gap might be summed up as Alice in Wonderland’s “verdict first, then the trial.” Both the business and the researcher start with their own narrow picture of what the customer or research subject should look like, and the analytics and statistics that accompany such hypotheses are designed to narrow in on a solution rather than expand due to unexpected data. Thus, the business/researcher is likely to miss key customer insights, psychological and otherwise.

Pile on top of this the “not invented here” syndrome characteristic of most enterprises, and the “confirmation bias” that recent research has shown to be prevalent among individuals and organizations, and you have a real analytical problem on your hands. (emphasis added)

I don’t know if I would call it “a real analytical problem” so much as I would call it “business as usual.”

There may be a real coming shortage of people who can turn the crank to make the usual analysis come out the other end.

Can you imagine the shortage of people who possess the analytical skills and initiative to do more than the usual analysis?

The ability to recognize when two or more departments have different vocabularies for the same things is one indicator of possible analytical talent.

What are some others? (Thinking you can also use these to find topic map authors for your business/organization.)
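On the vocabulary point, a crude string-similarity pass over two department glossaries (the terms below are invented) shows both why tooling helps and why it is not enough:

  # A crude first pass at spotting "same thing, different vocabulary"
  # across two department glossaries. Terms are invented for the example;
  # real matching needs human review, not just string similarity.
  import difflib

  sales_terms = ["client", "client id", "order total", "ship-to address"]
  finance_terms = ["customer", "customer identifier", "invoice amount", "shipping address"]

  for term in sales_terms:
      candidates = difflib.get_close_matches(term, finance_terms, n=1, cutoff=0.5)
      print(term, "->", candidates)

String similarity catches “ship-to address” versus “shipping address” and misses “client” versus “customer”, which is exactly where human judgment, and a place to record it, earns its keep.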

April 10, 2012

Metablogging MADlib

Filed under: Data Analysis,SQL — Patrick Durusau @ 6:44 pm

Metablogging MADlib

Joseph M. Hellerstein writes:

When the folks at ACM SIGMOD asked me to be a guest blogger this month, I figured I should highlight the most community-facing work I’m involved with. So I wrote up a discussion of MADlib, and the fact that this open-source in-database analytics library is now open to community contributions. (A bunch of us recently wrote a paper on the design and use of MADlib, which made my writing job a bit easier.) I’m optimistic about MADlib closing a gap between algorithm researchers and working data scientists, using familiar SQL as a vector for adoption on both fronts.

I kicked off MADlib as a part-time consulting project for Greenplum during my sabbatical in 2010-2011. As I built out the first two methods (FM and CountMin sketches) and an installer, Greenplum started assembling a team of their own engineers and data scientists to overlap with and eventually replace me when I returned to campus. They also developed a roadmap of additional methods that their customers wanted in the field. Eighteen months later, Greenplum now contributes the bulk of the labor, management and expertise for the project, and has built bridges to leading academics as well.

Like they said at Woodstock, “if you don’t think SQL is all that weird….” you might want to stop by the MADlib project. (I will have to go listen to the soundtrack. That may not be an exact quote.)

This is an important project for database analytics in an SQL context.
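For readers who have not seen in-database analytics before, the general shape from a client script looks roughly like this. The connection string, table and column names are invented, and the MADlib call is indicative only; check the MADlib documentation for the exact function names and signatures in your version:

  # Sketch of pushing analytics into the database rather than pulling rows out.
  # Connection details, table and column names are invented for the example.
  import psycopg2

  conn = psycopg2.connect("dbname=analytics user=me")
  cur = conn.cursor()

  # Ordinary SQL aggregation runs next to the data...
  cur.execute("SELECT region, avg(order_total) FROM orders GROUP BY region")
  print(cur.fetchall())

  # ...and MADlib exposes statistical methods the same way, as SQL calls.
  # The exact signature varies by MADlib version; this call is indicative only.
  cur.execute("""
      SELECT madlib.linregr_train('orders', 'orders_model',
                                  'order_total', 'ARRAY[1, ad_spend]')
  """)
  conn.commit()
  cur.close()
  conn.close()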

March 1, 2012

Target, Pregnancy and Predictive Analytics (parts 1 and 2)

Filed under: Data Analysis,Machine Learning,Predictive Analytics — Patrick Durusau @ 9:02 pm

Dean Abbott wrote a pair of posts on a New York Times article about Target predicting if customers are pregnant.

Target, Pregnancy and Predictive Analytics (part 1)

Target, Pregnancy and Predictive Analytics (part 2)

Read both. I truly liked his conclusion that models give us the patterns in data, but it is up to us to “recognize” the patterns as significant.

BTW, I do wonder what the difference is between the New York Times snooping for secrets to sell newspapers and Target doing the same to sell products. If you know, please give a shout!

February 10, 2012

Wolfram Alpha Pro democratizes data analysis:…

Filed under: Data Analysis — Patrick Durusau @ 4:12 pm

Wolfram Alpha Pro democratizes data analysis: an in-depth look at the $4.99 a month service by Dieter Bohn.

From the post:

On Wednesday, February 8th, Wolfram Alpha will be adding a new, “Pro” option to its already existing services. Priced at a very reasonable $4.99 a month ($2.99 for students), the new service includes the ability to use images, files, and even your own data as inputs instead of simple text entry. The “reports” that Wolfram Alpha kicks out as a result of these (or any) query are also beefed up for Pro users, some will actually become interactive charts and all of them can be more easily exported in a variety of formats. We sat down with Stephen Wolfram himself to get a tour of the new features and to discuss what they mean for his goal of “making the world’s knowledge computable.”

Computers have certainly played a leading role in the hard sciences over the last seventy or so years but I remain sceptical about their role in the difficult sciences. It is true that computers can assist in quickly locating all the uses of a particular string in Greek, Hebrew or Ugaritic. But determining the semantics of such a string requires more than the ability to count quickly.

Still, Wolfram created a significant tool for mathematical research (Mathematica) so his work on broader areas of human knowledge merits a close look.

February 1, 2012

GraphInsight

Filed under: Data Analysis,Data Structures,Graphs,Visualization — Patrick Durusau @ 4:38 pm

GraphInsight

From the webpage:

Interactive graph exploration

GraphInsight is a visualization software that lets you explore graph data through high quality interactive representations.

(video omitted)

Data exploration and knowledge extraction from graphs is of great interest nowadays: Knowledge is disseminated in social networks, and services are powered by cloud computing platforms. Data miners deal with graphs every day.

Humans are extremely good in identifying patterns and outliers. We believe that interacting visually with your data can give you a better intuition, and higher confidence on what you are looking for.

The video is just a little over one (1) minute long and is worth seeing.

Won’t tell you how to best display your data but does illustrate some of the capabilities of the software.

There are a number of graph rendering packages already but interactive ones are less common.

Now if we could have interactive graph software that hides/displays the graph underlying a text, with all of the sub-graphs related to its content, so that it starts to mimic regular reading practice that goes off on tangents and finds support for ideas in unlikely places, that would be something really different.
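You can get a feel for the look-first style of graph exploration with any graph library. A throwaway sketch (the graph below is made up), nothing like GraphInsight’s interactivity, but enough to see why eyes on the graph beat tables of edges:

  # Throwaway sketch: build a small made-up graph and eyeball it for
  # outliers -- the kind of thing interactive tools make much faster.
  import networkx as nx
  import matplotlib.pyplot as plt

  g = nx.Graph()
  g.add_edges_from([
      ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
      ("carol", "dave"), ("dave", "eve"), ("mallory", "eve"),
  ])

  # Degree is a cheap first cut at "who sticks out".
  print(sorted(dict(g.degree()).items(), key=lambda pair: -pair[1]))

  nx.draw(g, with_labels=True, node_color="lightgray")
  plt.show()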

January 30, 2012

Google Analytics Tutorial: 8 Valuable Tips To Hustle With Data!

Filed under: Dashboard,Data Analysis,Google Analytics,Marketing — Patrick Durusau @ 8:00 pm

Google Analytics Tutorial: 8 Valuable Tips To Hustle With Data! by Avinash Kaushik.

This is simply awesome! For several reasons.

I started to say because it’s an excellent guide to Google Analytics!

I started to say because it has so many useful outlinks to other resources and software.

And all that is very true, but not my “take away” from the post.

My “take away” from the post is that to succeed, “Analysis Ninjas” need to deliver useful results to users.

That means information users are interested in seeing, delivered in a way that works for them.

The corollary is that data of no interest to users, or data delivered in ways users can’t understand or easily use, is a losing strategy.

That means you don’t create web interfaces that mimic interfaces that failed for applications.

That means given the choice of doing a demo with Sumerian (something I would like to see) or something with the interest level of American Idol, you choose the American Idol type project.

Avinash has outlined some of the tools for data analysis. What you make of them is limited only by your imagination.

January 27, 2012

Analytics with MongoDB (commercial opportunity here)

Filed under: Analytics,Data,Data Analysis,MongoDB — Patrick Durusau @ 4:35 pm

Analytics with MongoDB

Interesting enough slide deck on analytics with MongoDB.

Relies on custom programming and then closes with this punchline (along with others, slide #41):

  • If you’re a business analyst you have a problem
    • better be BFF with some engineer 🙂

I remember when word processing required a lot of “dot” commands and editing markup languages with little or no editor support. Twenty years (has it been that long?) later and business analysts are doing word processing, markup and damned near print shop presentation without working close to the metal.

Can anyone name any products that have made large sums of money making it possible for business analysts and others to perform those tasks?

If so, ask yourself if you would like to have a piece of the action that frees business analysts from script kiddie engineers?

Even if a general application is out of reach at present, imagine writing access routines for common public data sites.

Create a market for the means to import and access particular data sets.
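A sketch of what one such access routine might look like. The URL, field names and database are invented; the point is only the shape of it, fetch once, load, and let the analyst ask questions without fresh plumbing each time:

  # Sketch of an "access routine" for a public dataset: fetch a CSV,
  # load it into MongoDB, and answer a question without custom plumbing
  # each time. URL, field names and database name are invented.
  import csv
  import io
  import urllib.request

  from pymongo import MongoClient

  CSV_URL = "http://example.org/open-data/permits.csv"  # hypothetical

  def load_permits(db):
      raw = urllib.request.urlopen(CSV_URL).read().decode("utf-8")
      rows = list(csv.DictReader(io.StringIO(raw)))
      db.permits.delete_many({})
      db.permits.insert_many(rows)
      return len(rows)

  def permits_by_neighborhood(db):
      return list(db.permits.aggregate([
          {"$group": {"_id": "$neighborhood", "count": {"$sum": 1}}},
          {"$sort": {"count": -1}},
      ]))

  if __name__ == "__main__":
      client = MongoClient()
      db = client.open_data
      print(load_permits(db), "rows loaded")
      for row in permits_by_neighborhood(db)[:5]:
          print(row)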

January 5, 2012

Baltimore gun offenders and where academics don’t live

Filed under: Data Analysis,Geographic Data,Statistics — Patrick Durusau @ 4:06 pm

Baltimore gun offenders and where academics don’t live

An interesting plotting of the residential addresses (not crime locations) of gun offenders. You need to see the post to observe how stark the “island” of academics appears on the map.

Illustration of non-causation, unless you want to contend that the presence of academics in a neighborhood drives out gun offenders. Which would argue in favor of more employment and wider residential patterns for academics. I would favor that but suspect that is personal bias.

A cross between this map and a map of gun offenses would be a good guide for housing prospects in Baltimore.

What other data would be useful for such a map? Education, libraries, fire protection, other crime rates…. Easy enough, since geographic boundaries serve as the binding points, but “summing up” information as you zoom out might be interesting.

That is, say crime statistics are kept on a police district basis and, as you zoom out, you want information from multiple districts merged and resorted. Or you have overlapping districts for water, electricity, police, fire, etc. A geographic grid becomes your starting place, but only a starting place.
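The “summing up” part is mostly a grouping problem once every record carries its district and the parent region. A rough sketch with invented districts and numbers:

  # Rough sketch of "summing up" district-level statistics as you zoom out.
  # Districts, regions and counts are invented for the example.
  import pandas as pd

  stats = pd.DataFrame({
      "district": ["NE", "NW", "SE", "SW"],
      "region":   ["North", "North", "South", "South"],
      "gun_offenses": [42, 35, 58, 61],
      "library_visits": [1200, 900, 700, 650],
  })

  # Zoomed in: one row per police district.
  print(stats)

  # Zoomed out: the same data rolled up to the parent region.
  print(stats.groupby("region")[["gun_offenses", "library_visits"]].sum())

The overlapping-districts case is harder, because the same record has to roll up along several different boundaries at once.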

January 4, 2012

Algorithm estimates who’s in control

Filed under: Data Analysis,Discourse,Linguistics,Social Graphs,Social Networks — Patrick Durusau @ 10:43 am

Algorithm estimates who’s in control

Jon Kleinberg, whose work influenced Google’s PageRank, is working on ranking something else. Kleinberg et al. developed an algorithm that ranks people based on how they speak to each other.

This, coming on the heels of the Big Brother’s Name is… post, has to have you wondering if you even want Internet access at all. 😉

Just imagine, power (who has, who doesn’t) analysis of email discussion lists, wiki edits, email archives, transcripts.

This has the potential (along with other clever analysis) to identify and populate topic maps with some very interesting subjects.

I first saw this at FlowingData.

January 3, 2012

Mining Massive Data Sets – Update

Filed under: BigData,Data Analysis,Data Mining,Dataset — Patrick Durusau @ 5:03 pm

Mining Massive Data Sets by Anand Rajaraman and Jeff Ullman.

Update of Mining of Massive Datasets – eBook.

The hard copy has been published by Cambridge University Press.

The electronic version remains available for download. (Hint: I suggest all of us who can should buy a hard copy, to encourage this sort of publisher behavior.)

Homework system for both instructors and self-guided study is available at this page.

While I wait for a hard copy to arrive, I have downloaded the PDF version.

November 15, 2011

Hadoop and Data Quality, Data Integration, Data Analysis

Filed under: Data Analysis,Data Integration,Hadoop — Patrick Durusau @ 7:58 pm

Hadoop and Data Quality, Data Integration, Data Analysis by David Loshin.

From the post:

If you have been following my recent thread, you will of course be anticipating this note, in which we examine the degree to which our favorite data-oriented activities are suited to the elastic yet scalable massive parallelism promised by Hadoop. Let me first summarize the characteristics of problems or tasks that are amenable to the programming model:

  1. Two-Phased (2-φ) – one or more iterations of “computation” followed by “reduction.”
  2. Big data – massive data volumes preclude using traditional platforms
  3. Data parallel (Data-||) – little or no data dependence
  4. Task parallel (Task-||) – task dependence collapsible within phase-switch from Map to Reduce
  5. Unstructured data – No limit on requiring data to be structured
  6. Communication “light” – requires limited or no inter-process communication except what is required for phase-switch from Map to Reduce

OK, so I happen to agree with David’s conclusions (see his post for the table), but that isn’t the only reason I posted this note.

Rather I think this sort of careful analysis lends itself to test cases, which we can post and share with specification of the tasks performed.

Much cleaner and more enjoyable than the debates measured by who can sink the lowest fastest.

Test cases to suggest anyone?
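If the “two-phased” characterization is unfamiliar, the whole programming model fits in a few lines of plain Python. Word counting is the canonical toy example:

  # The two-phase shape of MapReduce in miniature: a map step that emits
  # (key, value) pairs, a shuffle that groups by key, and a reduce step.
  from collections import defaultdict

  documents = ["big data big plans", "small data small plans"]

  # Phase 1: map -- emit (word, 1) for every word in every document.
  mapped = [(word, 1) for doc in documents for word in doc.split()]

  # Shuffle: group the emitted pairs by key.
  grouped = defaultdict(list)
  for word, count in mapped:
      grouped[word].append(count)

  # Phase 2: reduce -- collapse each group to a single value.
  counts = {word: sum(values) for word, values in grouped.items()}
  print(counts)

Tasks that fit this shape, with little communication across keys outside the phase switch, are the ones that score well in David’s table.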

November 2, 2011

pandas: a Foundational Python Library for Data Analysis

Filed under: Data Analysis,Python — Patrick Durusau @ 6:25 pm

pandas: a Foundational Python Library for Data Analysis and Statistics by Wes McKinney

From the abstract:

In this paper we will discuss pandas, a Python library of rich data structures and tools for working with structured data sets common to statistics, finance, social sciences, and many other fields. The library provides integrated, intuitive routines for performing common data manipulations and analysis on such data sets. It aims to be the foundational layer for the future of statistical computing in Python. It serves as a strong complement to the existing scientific Python stack while implementing and improving upon the kinds of data manipulation tools found in other statistical programming languages such as R. In addition to detailing its design and features of pandas, we will discuss future avenues of work and growth opportunities for statistics and data analysis applications in the Python language.

A quick listing of things pandas does well (from pandas.sourceforge.net)

  • Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
  • Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Intuitive merging and joining data sets
  • Flexible reshaping and pivoting of data sets
  • Hierarchical labeling of axes (possible to have multiple labels per tick)
  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDF5 format
  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.

Another data analysis library for your topic maps toolkit.
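If you have not tried it, a few lines on a made-up table are enough to see the flavor of the list above:

  # A few of the pandas conveniences listed above, on a made-up table:
  # missing data, split-apply-combine and reshaping in a few lines.
  import numpy as np
  import pandas as pd

  df = pd.DataFrame({
      "city":  ["Atlanta", "Atlanta", "Boston", "Boston"],
      "year":  [2010, 2011, 2010, 2011],
      "sales": [1.2, np.nan, 0.8, 1.1],
  })

  print(df["sales"].fillna(df["sales"].mean()))                  # missing data handling
  print(df.groupby("city")["sales"].mean())                      # split-apply-combine
  print(df.pivot(index="year", columns="city", values="sales"))  # reshaping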

October 31, 2011

Using StackOverflow’s API to Find the Top Web Frameworks

Filed under: Data,Data Analysis,Searching,Visualization — Patrick Durusau @ 7:32 pm

Using StackOverflow’s API to Find the Top Web Frameworks by Bryce Boe.

From the post:

Adam and I are currently in the process of working on our research about the Execution After Redirect, or EAR, Vulnerability which I previously discussed in my blog post about the 2010 iCTF. While Adam is working on a static analyzer to detect EARs in ruby on rails projects, I am testing how simple it is for a developer to introduce an EAR vulnerability in several popular web frameworks. In order to do that, I first needed to come up with a mostly unbiased list of popular web frameworks.

My first thought was to perform a search on the top web frameworks hoping that the information I seek may already be available. This search provided a few interesting results, such as the site, Best Web-Frameworks as well as the page Framework Usage Statistics by the group BuiltWith. The Best Web-Frameworks page lists and compares various web frameworks by language, however it offers no means to compare the adoption of each. The Framework Usage Statistics page caught my eye as its usage statistics are generated by crawling and fingerprinting various websites in order to determine what frameworks are in use. Their fingerprinting technique, however, is too generic in some cases thus resulting in the labeling of languages like php and perl as frameworks. While these results were a step in the right direction, what I was really hoping to find was a list of top web frameworks that follow the model, view, controller, or MVC, architecture.

After a bit more consideration I realized it wouldn’t be very simple to get a list of frameworks by usage, thus I had to consider alternative metrics. I thought how I could measure the popularity of the framework by either the number of developers using or at least interested in the framework. It was this train of thought that led me to both Google Trends and StackOverflow. Google Trends allows one to perform a direct comparison of various search queries over time, such as ruby on rails compared to python. The problem, as evidenced by the former link, is that some of the search queries don’t directly apply to the web framework; in this case not all the people searching for django are looking for the web framework. Because of this problem, I decided a more direct approach was needed.

StackOverflow is a website geared towards developers where they can go to ask questions about various programing languages, development environments, algorithms, and, yes, even web frameworks. When someone asks a question, they can add tags to the question to help guide it to the right community. Thus if I had a question about redirects in ruby on rails, I might add the tag ruby-on-rails. Furthermore if I was interested in questions other people had about ruby on rails I might follow the ruby-on-rails tag.

Bryce’s use of the StackOverflow API is likely to interest anyone creating topic maps on CS topics. Not to mention that his use of graphs for visualization is interesting as well.
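The basic idea is easy to reproduce against the current StackExchange API. The endpoint below is the present-day one rather than what Bryce used, and the tag list is my own guess at what is worth comparing:

  # Rough reproduction of the idea: rank web frameworks by StackOverflow
  # tag counts. Endpoint is the current StackExchange API; the tag list
  # is my own guess at what is worth comparing.
  import requests

  TAGS = ["ruby-on-rails", "django", "asp.net-mvc", "spring-mvc", "flask"]

  url = "https://api.stackexchange.com/2.3/tags/" + ";".join(TAGS) + "/info"
  data = requests.get(url, params={"site": "stackoverflow"}).json()

  for tag in sorted(data["items"], key=lambda t: -t["count"]):
      print(tag["name"], tag["count"])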

October 27, 2011

AnalyticBridge

Filed under: Analytics,Bibliography,Data Analysis — Patrick Durusau @ 4:45 pm

AnalyticBridge: A Social Network for Analytic Professionals

Some interesting resources, possibly useful groups.

Anyone with experience with this site?

October 19, 2011

The Kepler Project

Filed under: Bioinformatics,Data Analysis,ELN Integration,Information Flow,Workflow — Patrick Durusau @ 3:16 pm

The Kepler Project

From the website:

The Kepler Project is dedicated to furthering and supporting the capabilities, use, and awareness of the free and open source, scientific workflow application, Kepler. Kepler is designed to help scientists, analysts, and computer programmers create, execute, and share models and analyses across a broad range of scientific and engineering disciplines. Kepler can operate on data stored in a variety of formats, locally and over the internet, and is an effective environment for integrating disparate software components, such as merging “R” scripts with compiled “C” code, or facilitating remote, distributed execution of models. Using Kepler’s graphical user interface, users simply select and then connect pertinent analytical components and data sources to create a “scientific workflow”—an executable representation of the steps required to generate results. The Kepler software helps users share and reuse data, workflows, and components developed by the scientific community to address common needs.

The Kepler software is developed and maintained by the cross-project Kepler collaboration, which is led by a team consisting of several of the key institutions that originated the project: UC Davis, UC Santa Barbara, and UC San Diego. Primary responsibility for achieving the goals of the Kepler Project reside with the Leadership Team, which works to assure the long-term technical and financial viability of Kepler by making strategic decisions on behalf of the Kepler user community, as well as providing an official and durable point-of-contact to articulate and represent the interests of the Kepler Project and the Kepler software application. Details about how to get more involved with the Kepler Project can be found in the developer section of this website.

Kepler is a java-based application that is maintained for the Windows, OSX, and Linux operating systems. The Kepler Project supports the official code-base for Kepler development, as well as provides materials and mechanisms for learning how to use Kepler, sharing experiences with other workflow developers, reporting bugs, suggesting enhancements, etc.

I found this from an announcement of an NSF grant for a bioKepler project.

Questions:

  1. Review the Kepler project and prepare a short summary of it. (3 – 5 pages)
  2. Workflow by its very nature involves subjects moving from one process or user to another. How is that handled by Kepler in general?
  3. Can you intersect the workflow of Kepler with other workflow management software? If not, why not? (research project)

October 5, 2011

Datawrangler

Filed under: Data Analysis,Data Mining — Patrick Durusau @ 6:50 pm

Datawrangler

From the post:

Formatting data is a necessary pain, so anything that makes formatting easier is always welcome. Data Wrangler, from the Stanford Visualization Group, is the latest in the growing set of tools to get your data the way you need it (so that you can get to the fun part already). It’s similar to Google Refine in that they’re both browser-based, but my first impression is that Data Wrangler is more lightweight and it feels more responsive.

Data Wrangler also seems to do more guesswork, so you can set less specific parameters. Just roll over stuff, and it’ll show a preview of possible changes or formatting. Keep the change or easily undo it.

The video below describes what all the tool can do, but it’s better to just try it out. Copy and paste your own mangled data or give Data Wrangler a whirl with the sample provided.

From our friends at FlowingData. Perhaps we should ask: Does data exist if it isn’t visualized?

October 3, 2011

DataCleaner

Filed under: Data Analysis,Data Governance,Data Management,DataCleaner,Software — Patrick Durusau @ 7:08 pm

DataCleaner

From the website:

DataCleaner is an Open Source application for analyzing, profiling, transforming and cleansing data. These activities help you administer and monitor your data quality. High quality data is key to making data useful and applicable to any modern business.

DataCleaner is the free alternative to software for master data management (MDM) methodologies, data warehousing (DW) projects, statistical research, preparation for extract-transform-load (ETL) activities and more.

Err, “…cleansing data.”? Did someone just call topic maps name? 😉

If it is important to eliminate duplicate data, then everyone using the duplicated data needs updates and relationships to the surviving copy. Unless, that is, the duplication was simply the result of poor design or wasted drive space.

This looks like an interesting project, and certainly one where topic maps are clearly relevant as one possible output.
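For the narrow case of duplicate records, a couple of lines (with invented records) show why “cleansing” quickly turns into a subject identity question:

  # Exact duplicates are easy; the interesting cases are records that are
  # "the same subject" without being the same bytes. Records are invented.
  import pandas as pd

  people = pd.DataFrame({
      "name":  ["J. Smith", "J. Smith", "John Smith", "Jane Doe"],
      "email": ["js@example.org", "js@example.org", "js@example.org", "jd@example.org"],
  })

  print(people.drop_duplicates())        # drops only the byte-identical row
  print(people.groupby("email").size())  # email as a stand-in identity key

Whether “J. Smith” and “John Smith” are one subject or two is exactly the decision a topic map lets you record, instead of making it silently in a cleansing pass.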

September 24, 2011

How do I become a data scientist?

Filed under: Computer Science,Data Analysis,Data Science — Patrick Durusau @ 6:58 pm

How do I become a data scientist?

Whether you call yourself a “data scientist” or not is up to you.

Acquiring the skills relevant to your area of interest is the first step towards success with topic maps.

September 16, 2011

Building Data Science Teams

Filed under: Data,Data Analysis — Patrick Durusau @ 6:38 pm

Building Data Science Teams: The Skills, Tools, and Perspectives Behind Great Data Science Groups by DJ Patil.

From page 1:

Given how important data science has grown, it’s important to think about what data scientists add to an organization, how they fit in, and how to hire and build effective data science teams.

Nothing you probably haven’t heard before but a reminder isn’t a bad thing.

The tools to manipulate data are becoming commonplace. What remains and will remain elusive, will be the skills to use those tools well.

September 14, 2011

Don’t trust your instincts

Filed under: Data Analysis,Language,Recognition,Research Methods — Patrick Durusau @ 7:04 pm

I stumbled upon a review of: “The Secret Life of Pronouns: What Our Words Say About Us” by James W. Pennebaker in the New York Times Book Review, 28 August 2011.

Pennebaker is a word counter whose first rule is: “Don’t trust your instincts.”

Why? In part because our expectations shape our view of the data. (sound familiar?)

The review quotes the Drudge Report as posting a headline about President Obama that reads: “I ME MINE: Obama praises C.I.A. for bin Laden raid – while saying ‘I’ 35 Times.”

If the listener thinks President Obama is self-centered, the “I’s” have it as it were.

But Pennebaker has used his programs to mindlessly count word usage in press conferences going back to Truman. Obama is the lowest I-word user of the modern presidents.

That is only one illustration of how badly we can “look” at text or data and get it seriously wrong.

The Secret Life of Pronouns website has exercises to demonstrate how badly we get things wrong. (The videos are very entertaining.)

What does that mean for topic maps and authoring topic maps?

  1. Don’t trust your instincts. (courtesy of Pennebaker)
  2. View your data in different ways, ask unexpected questions.
  3. Ask people unfamiliar with your data how they view it.
  4. Read books on subjects you know nothing about. (Just general good advice.)
  5. Ask known unconventional people to question your data/subjects. (Like me! Sorry, consulting plug.)
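Counting of the Pennebaker kind is almost embarrassingly easy to reproduce on your own text, which makes the “don’t trust your instincts” result all the more uncomfortable. A minimal sketch (the sample sentence is mine):

  # Pennebaker-style counting in miniature: how often do first-person
  # singular pronouns show up in a piece of text? Sample text is mine.
  import re
  from collections import Counter

  text = "I said we would review the data before I draft my summary."

  words = re.findall(r"[a-z']+", text.lower())
  counts = Counter(words)

  i_words = {"i", "me", "my", "mine", "i'm", "i've", "i'll"}
  total = sum(counts[w] for w in i_words)
  print(total, "first-person singular words out of", len(words))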

September 13, 2011

Discovering, Summarizing and Using Multiple Clusterings

Filed under: Clustering,Data Analysis,Data Mining — Patrick Durusau @ 7:16 pm

Proceedings of the 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings
Athens, Greece, September 5, 2011.

This collection of papers reflects what I think is rapidly becoming the consensus view: There is no one/right way to look at data.

That is important because by applying multiple techniques, in these papers clustering techniques, you may make unanticipated discoveries about your data. Recording the trail you followed, as all explorers should, will help others duplicate your steps, to test them or to go further. In topic map terms, I would say you would be discovering and identifying subjects.
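Seeing the “no one right way” point does not require a workshop paper. Two standard algorithms on the same synthetic data will happily disagree:

  # Two standard clustering algorithms on the same synthetic data: they
  # need not agree, and the disagreement is itself informative.
  from sklearn.cluster import DBSCAN, KMeans
  from sklearn.datasets import make_moons
  from sklearn.metrics import adjusted_rand_score

  X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

  kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
  dbscan_labels = DBSCAN(eps=0.3).fit_predict(X)

  print("agreement (adjusted Rand):",
        adjusted_rand_score(kmeans_labels, dbscan_labels))

A low agreement score is not an error; it is two defensible views of the same data, which is the premise of the workshop.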

Edited by

Emmanuel Müller *
Stephan Günnemann **
Ira Assent ***
Thomas Seidl **

* Karlsruhe Institute of Technology, Germany
** RWTH Aachen University, Germany
*** Aarhus University, Denmark


Complete workshop proceedings as one file (~16 MB).

Table of Contents

    Invited Talks

  1. Combinatorial Approaches to Clustering and Feature Selection
    Michael E. Houle
  2. Cartification: Turning Similarities into Itemset Frequencies
    Bart Goethals
    Research Papers

  1. When Pattern Met Subspace Cluster
    Jilles Vreeken, Arthur Zimek
  2. Fast Multidimensional Clustering of Categorical Data
    Tengfei Liu, Nevin L. Zhang, Kin Man Poon, Yi Wang, Hua Liu
  3. Factorial Clustering with an Application to Plant Distribution Data
    Manfred Jaeger, Simon Lyager, Michael Vandborg, Thomas Wohlgemuth
  4. Subjectively Interesting Alternative Clusters
    Tijl De Bie
  5. Evaluation of Multiple Clustering Solutions
    Hans-Peter Kriegel, Erich Schubert, Arthur Zimek
  6. Browsing Robust Clustering-Alternatives
    Martin Hahmann, Dirk Habich, Wolfgang Lehner
  7. Generating a Diverse Set of High-Quality Clusterings
    Jeff M. Phillips, Parasaran Raman, Suresh Venkatasubramanian

September 12, 2011

Apache Camel

Filed under: Data Analysis,Data Engine,Data Integration — Patrick Durusau @ 8:25 pm

Apache Camel

New release as of 25 July 2011.

The Apache Camel site describes itself as:

Apache Camel is a powerful open source integration framework based on known Enterprise Integration Patterns with powerful Bean Integration.

Camel lets you create the Enterprise Integration Patterns to implement routing and mediation rules in either a Java based Domain Specific Language (or Fluent API), via Spring based Xml Configuration files or via the Scala DSL. This means you get smart completion of routing rules in your IDE whether in your Java, Scala or XML editor.

Apache Camel uses URIs so that it can easily work directly with any kind of Transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF Bus API together with working with pluggable Data Format options. Apache Camel is a small library which has minimal dependencies for easy embedding in any Java application. Apache Camel lets you work with the same API regardless which kind of Transport used, so learn the API once and you will be able to interact with all the Components that is provided out-of-the-box.

Apache Camel has powerful Bean Binding and integrated seamless with popular frameworks such as Spring and Guice.

Apache Camel has extensive Testing support allowing you to easily unit test your routes.


….

So don’t get the hump, try Camel today! 🙂

Comments/suggestions?

I am going to be working through some of the tutorials and other documentation. Anything I should be looking for?

September 3, 2011

DiscoverText

Filed under: Data Analysis,Data Mining,DiscoverText — Patrick Durusau @ 6:46 pm

DiscoverText

From the website:

DiscoverText helps you gain valuable insight about customers, products, employees, citizens, research data, and more through powerful text analytic methods. DiscoverText combines search, human judgments and inferences with automated software algorithms to create an active machine-learning loop.

DiscoverText is currently used for text analytics, market research, eDiscovery, FOIA processing, employee engagement analytics, health informatics, processing public comments by government agencies and university basic research.

Interesting tool set, based in the cloud.
