Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

September 16, 2012

Machine Learning: Genetic Algorithms in Javascript Part 2

Filed under: Genetic Algorithms,Machine Learning — Patrick Durusau @ 1:05 pm

Machine Learning: Genetic Algorithms in Javascript Part 2 by Burak Kanber.

From the post:

Today we’re going to revisit the genetic algorithm. If you haven’t read Genetic Algorithms Part 1 yet, I strongly recommend reading that now. This article will skip over the fundamental concepts covered in part 1 — so if you’re new to genetic algorithms you’ll definitely want to start there.

Just looking for the example?

The Problem

You’re a scientist that has recently been framed for murder by an evil company. Before you flee the lab you have an opportunity to steal 1,000 pounds (or kilograms!) of pure elements from the chemical warehouse; your plan is to later sell them and survive off of the earnings.

Given the weight and value of each element, which combination should you take to maximize the total value without exceeding the weight limit?

This is called the knapsack problem. The one above is a one-dimensional problem, meaning the only constraint is weight. We could complicate matters by also considering volume, but we need to start somewhere. Note that in our version of the problem only one piece of each element is available, and each piece has a fixed weight. There are some knapsack problems where you can take unlimited platinum or up to 3 pieces of gold or something like that, but here we only have one of each available to us.

Why is this problem tough to solve? We’ll be using 118 elements. The brute-force approach would require that we test 2^118 or 3.3 * 10^35 different combinations of elements.
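To make the encoding concrete, here is a minimal sketch of a knapsack chromosome and fitness function in JavaScript (not Kanber’s code; the three elements, their weights and values, and the limit are invented stand-ins for the full table of 118):

```javascript
// Toy knapsack GA pieces (not Kanber's code). A chromosome is one 0/1 flag per
// element: take it or leave it. The elements and limit below are invented.
const elements = [
  { name: "gold",     weight: 350, value: 56000 },
  { name: "platinum", weight: 300, value: 48000 },
  { name: "copper",   weight: 900, value: 7000 }
];
const WEIGHT_LIMIT = 1000;

function randomChromosome() {
  return elements.map(() => (Math.random() < 0.5 ? 1 : 0));
}

// Fitness: total value of the chosen elements, zero if the weight limit is broken.
function fitness(chromosome) {
  let weight = 0;
  let value = 0;
  chromosome.forEach((taken, i) => {
    if (taken) {
      weight += elements[i].weight;
      value += elements[i].value;
    }
  });
  return weight > WEIGHT_LIMIT ? 0 : value;
}

console.log(fitness(randomChromosome())); // e.g. 104000 for [1, 1, 0]
```

The GA then breeds these bit strings, favoring higher fitness; the weight limit is enforced simply by zeroing out overweight candidates.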

What if you have subject identity criteria of varying reliability? What is the best combination for the highest reliability?

To sharpen the problem: Your commanding officer has requested a declaration of sufficient identity for a drone strike target.

Machine Learning: Genetic Algorithms Part 1 (Javascript)

Filed under: Genetic Algorithms,Javascript,Machine Learning — Patrick Durusau @ 12:37 pm

Machine Learning: Genetic Algorithms Part 1 (Javascript) by Burak Kanber.

From the post:

I like starting my machine learning classes with genetic algorithms (which we’ll abbreviate “GA” sometimes). Genetic algorithms are probably the least practical of the ML algorithms I cover, but I love starting with them because they’re fascinating and they do a good job of introducing the “cost function” or “error function”, and the idea of local and global optima — concepts both important and common to most other ML algorithms.

Genetic algorithms are inspired by nature and evolution, which is seriously cool to me. It’s no surprise, either, that artificial neural networks (“NN”) are also modeled from biology: evolution is the best general-purpose learning algorithm we’ve experienced, and the brain is the best general-purpose problem solver we know. These are two very important pieces of our biological existence, and also two rapidly growing fields of artificial intelligence and machine learning study. While I’m tempted to talk more about the distinction I make between the GA’s “learning algorithm” and the NN’s “problem solver” terminology, we’ll drop the topic of NNs altogether and concentrate on GAs… for now.

One phrase I used above is profoundly important: “general-purpose”. For almost any specific computational problem, you can probably find an algorithm that solves it more efficiently than a GA. But that’s not the point of this exercise, and it’s also not the point of GAs. You use the GA not when you have a complex problem, but when you have a complex problem of problems. Or you may use it when you have a complicated set of disparate parameters.

Off to a great start!
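If you want the vocabulary pinned down before you read, here is a bare-bones GA in JavaScript (not Kanber’s code; the toy target string, population size, and mutation rate are all invented) showing the cost function, selection, crossover, and mutation working together:

```javascript
// Bare-bones GA sketch: evolve a population of strings toward a toy target.
// The cost ("error") function counts character mismatches; zero is the global optimum.
const TARGET = "topic maps";
const CHARS = "abcdefghijklmnopqrstuvwxyz ";
const randChar = () => CHARS[Math.floor(Math.random() * CHARS.length)];
const randomGene = () => Array.from(TARGET, randChar).join("");

const cost = g => [...g].reduce((c, ch, i) => c + (ch === TARGET[i] ? 0 : 1), 0);

// Crossover splices two parents; mutation occasionally swaps in a random character.
const crossover = (a, b) => a.slice(0, Math.floor(a.length / 2)) + b.slice(Math.floor(a.length / 2));
const mutate = g => [...g].map(ch => (Math.random() < 0.05 ? randChar() : ch)).join("");

let population = Array.from({ length: 100 }, randomGene);
for (let gen = 0; gen < 500; gen++) {
  population.sort((a, b) => cost(a) - cost(b));   // fittest (lowest cost) first
  if (cost(population[0]) === 0) break;           // reached the global optimum
  const parents = population.slice(0, 20);        // keep the best, breed the rest
  population = parents.concat(
    Array.from({ length: 80 }, () => {
      const p1 = parents[Math.floor(Math.random() * parents.length)];
      const p2 = parents[Math.floor(Math.random() * parents.length)];
      return mutate(crossover(p1, p2));
    })
  );
}
console.log(population[0], cost(population[0]));
```

The cost function is the “error function” Kanber mentions; sorting by it and breeding from the low-cost end is what pulls the population toward a global optimum (or traps it in a local one).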

In Defense of the Power of Paper [Geography of Arguments/Information]

Filed under: Geography,Mapping,Maps,Marketing — Patrick Durusau @ 10:33 am

In her recent editorial, In Defense of the Power of Paper, Phyllis Korkki quotes Richard H. R. Harper saying:

Reading a long document on paper rather than on a computer screen helps people “better understand the geography of the argument contained within,” said Richard H. R. Harper, a principal researcher for Microsoft in Cambridge, England, and co-author with Abigail J. Sellen of “The Myth of the Paperless Office,” published in 2001.

Today’s workers are often navigating through multiple objects in complex ways and creating new documents as well, Mr. Harper said. Using more than one computer screen can be helpful for all this cognitive juggling. But when workers are going back and forth between points in a longer document, it can be more efficient to read on paper, he said. (emphasis added)

To “…understand the geography of the argument….”

I rather like that.

For all the debates about pointing, response codes, locators, identifiers, etc., on the web, all that was ever at stake was document as blob.

Our “document as blob” schemes missed:

  • Complex relationships between documents
  • Tracking influences on both authors and readers
  • Their continuing but changing roles in the social life of information, and
  • The geography of arguments they contain (with at least as much complexity as documents as blobs).

Others may not be interested in the geography of arguments/information in your documents.

What about you?

Topic maps can help you break the “document as blob” barrier.

With topic maps you can plot the geography of/in your documents.

Interested?

Insufficiency and illusions

Filed under: Graphics,Visualization — Patrick Durusau @ 5:49 am

Insufficiency and illusions by Kaiser Fung.

From the post:

This WSJ graphic gives me a reason to talk about the self-sufficiency test: go ahead, and block out the data labels on the chart, you are left with concentric circles but no way to learn anything from the chart, not the absolute dollar values, nor the relative dollar values. In other words, the only way to read this chart is to look at the data labels.

Kaiser illustrates visuals that don’t really work as visuals.

Follow the “self-sufficiency test” link for a collection of poorly designed (not evil) graphics.

Spanner : Google’s globally distributed database

Filed under: Database,Distributed Systems — Patrick Durusau @ 5:39 am

Spanner : Google’s globally distributed database

From the post:

This paper, whose co-authors include Jeff Dean and Sanjay Ghemawat of MapReduce fame, describes Spanner. Spanner is Google’s scalable, multi-version, globally distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. Finally the paper comes out! Really exciting stuff!

Abstract from the paper:

Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. This paper describes how Spanner is structured, its feature set, the rationale underlying various design decisions, and a novel time API that exposes clock uncertainty. This API and its implementation are critical to supporting external consistency and a variety of powerful features: non-blocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner.
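The time API is the part worth pausing over. Here is a toy model of the commit-wait idea (this is not Google’s API; the uncertainty bound and the interval-returning clock are simulated assumptions):

```javascript
// Toy model of TrueTime-style commit wait (invented uncertainty bound, not Google's API).
// TT.now() returns an interval [earliest, latest] assumed to contain real time.
const EPSILON_MS = 5; // pretend clock uncertainty
const TT = {
  now() {
    const t = Date.now();
    return { earliest: t - EPSILON_MS, latest: t + EPSILON_MS };
  }
};

// Commit-wait rule: timestamp the transaction with TT.now().latest, then delay
// making it visible until TT.now().earliest has passed that timestamp.
async function commit(write) {
  const s = TT.now().latest;             // assigned commit timestamp
  while (TT.now().earliest <= s) {       // wait out the uncertainty
    await new Promise(r => setTimeout(r, 1));
  }
  console.log(`"${write}" visible at timestamp ${s}`);
}

commit("row update");
```

The rule is simple: stamp the transaction with the latest plausible time, then hold the result back until even the most pessimistic clock agrees that moment has passed. That short wait is what buys external consistency.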

Spanner: Google’s Globally Distributed Database (PDF File)

Facing user requirements, Google did not say: Suck it up and use tools already provided.

Google engineered new tools to meet their requirements.

Is there a lesson there for other software projects?

September 15, 2012

Working to change the world

Filed under: Philosophy — Patrick Durusau @ 7:42 pm

Working to change the world by John D. Cook.

From the post:

I recently read that Google co-founder Sergey Brin asked an audience whether they are working to change the world. He said that for 99.9999% of humanity, the answer is no.

I really dislike that question. It invites arrogance. Say yes and you’re one in a million. You’re a better person than the vast majority of humanity.

Focusing on doing enormous good can make us feel justified in neglecting small acts of goodness. Many have professed a love for Humanity and shown contempt for individual humans. “I’m trying to end poverty, cure cancer, and make the world safe for democracy; I shouldn’t be held to the same petty standards as those who are wasting their lives.”

I don’t disagree with John’s post but I would emphasize the unknowability of the outcome of our actions.

Relieves me of worrying about tomorrow and its judgement in favor of today and its tasks.

Using Dashboards For Good or Evil:…

Filed under: Graphics,Visualization — Patrick Durusau @ 7:30 pm

Using Dashboards For Good or Evil: The Misrepresentation of Data

From the post:

“Making an evidence presentation is a moral act as well as an intellectual activity. To maintain standards of quality, relevance, and integrity for evidence, consumers of presentations should insist that presenters be held intellectually and ethically responsible for what they show and tell. Thus consuming a presentation is also an intellectual and a moral activity.” – Edward Tufte, Beautiful Evidence

You may think to yourself, how on Earth could dashboards be used for good or evil? Or you already know and you’re simply humoring us by reading this. Aside from the obvious, using it to measure how many evil things you’ve done (evil success vs. evil failures), there is the less obvious way in which you misrepresent the data you are displaying. The power that data visualization has to augment cognition can also be used, unfortunately, to distort reality.

You really need to read this post with the next one which concerns a post by John D. Cook.

The arrogance that I find in this post is the assumption that “good deed doers” can discern “reality” for all of us and spot “distortions” of that reality.

No doubt, data has representations and you should ask questions when data is presented to you. To say nothing of when you are tailoring a data presentation for a client.

But to assume the mantle of moral censor of data presentations seems to go a bit far. Does that qualify as arrogance?

Introductory FP Course Materials

Filed under: CS Lectures,Functional Programming,Parallel Programming,Programming — Patrick Durusau @ 7:20 pm

Introductory FP Course Materials by Robert Harper.

First semester introductory programming course.

Second semester data structures and algorithms course.

Deeply awesome body of material.

Enjoy!

Blame Google? Different Strategy: Let’s Blame Users! (Not!)

Let me quote from A Simple Guide To Understanding The Searcher Experience by Shari Thurow to start this post:

Web searchers have a responsibility to communicate what they want to find. As a website usability professional, I have the opportunity to observe Web searchers in their natural environments. What I find quite interesting is the “Blame Google” mentality.

I remember a question posed to me during World IA Day this past year. An attendee said that Google constantly gets search results wrong. He used a celebrity’s name as an example.

“I wanted to go to this person’s official website,” he said, “but I never got it in the first page of search results. According to you, it was an informational query. I wanted information about this celebrity.”

I paused. “Well,” I said, “why are you blaming Google when it is clear that you did not communicate what you really wanted?”

“What do you mean?” he said, surprised.

“You just said that you wanted information about this celebrity,” I explained. “You can get that information from a variety of websites. But you also said that you wanted to go to X’s official website. Your intent was clearly navigational. Why didn’t you type in [celebrity name] official website? Then you might have seen your desired website at the top of search results.”

The stunned silence at my response was almost deafening. I broke that silence.

“Don’t blame Google or Yahoo or Bing for your insufficient query formulation,” I said to the audience. “Look in the mirror. Maybe the reason for the poor searcher experience is the person in the mirror…not the search engine.”

People need to learn how to search. Search experts need to teach people how to search. Enough said.

What a novel concept! If the search engine/software doesn’t work, must be the user’s fault!

I can save you a trip down the hall to the marketing department. They are going to tell you that is an insane sales strategy. Satisfying to the geeks in your life but otherwise untenable, from a business perspective.

Remember the stats on using Library of Congress subject headings I posted under Subject Headings and the Semantic Web:

Overall percentages of correct meanings for subject headings in the original order of subdivisions were as follows: children, 32%, adults, 40%, reference, 53%, and technical services librarians, 56%.

?

That is with decades of teaching people to search both manual and automated systems using Library of Congress classification.

Test Question: I have a product to sell. 60% of all my buyers can’t find it with a search engine. Do I:

  • Teach all users everywhere better search techniques?
  • Develop better search engines/interfaces to compensate for potential buyers’ poor searching?

I suspect the “stunned silence” was an audience with greater marketing skills than the speaker.

Linux cheat sheets [Unix Sets Anyone?]

Filed under: Linux OS,Set Intersection,Sets — Patrick Durusau @ 3:07 pm

Linux cheat sheets

John D. Cook points to three new Linux cheat sheets from Peteris Krumins.

While investigating, I ran across:

Set Operations in the Unix Shell Simplified

From that post:

Remember my article on Set Operations in the Unix Shell? I implemented 14 various set operations by using common Unix utilities such as diff, comm, head, tail, grep, wc and others. I decided to create a simpler version of that post that just lists the operations. I also created a .txt cheat-sheet version of it and to make things more interesting I added an Awk implementation of each set op. If you want a detailed explanation of each operation, go to the original article.

MapReduce is Good Enough?… [How to Philosophize with a Hammer?]

Filed under: BigData,Hadoop,MapReduce — Patrick Durusau @ 2:50 pm

MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That’s Not a Nail! by Jimmy Lin.

Abstract:

Hadoop is currently the large-scale data analysis “hammer” of choice, but there exist classes of algorithms that aren’t “nails”, in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is “good enough”, and that instead of trying to invent screwdrivers, we should simply get rid of everything that’s not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a “real-world” production analytics environment. From this combined perspective I reflect on the current state and future of “big data” research.

Following the abstract:

Author’s note: I wrote this essay specifically to be controversial. The views expressed herein are more extreme than what I believe personally, written primarily for the purposes of provoking discussion. If after reading this essay you have a strong reaction, then I’ve accomplished my goal 🙂

The author needs to work on being “controversial.” He gives away the pose “throw away everything not a nail” far too early and easily.

Without the warnings, flashing lights, etc., the hyperbole might be missed, but not by anyone who would benefit from the substance of the paper.

The paper reflects careful thought on MapReduce and its limitations. Merits a careful and close reading.

I first saw this mentioned by John D. Cook.

Nonparametric Techniques – Webinar [Think Movie Ratings]

Filed under: Nonparametric,Recommendation,Statistics — Patrick Durusau @ 2:30 pm

Overview of Nonparametric Techniques with Elaine Eisenbeisz.

Date: October 3, 2012

Time: 3pm Eastern Time UTC -4 (2pm Central, 1pm Mountain, 12pm Pacific)

From the description:

A distribution of data which is not normal does not mean it is abnormal. There are many data analysis techniques which do not require the assumption of normality.

This webinar will provide information on when it is best to use nonparametric alternatives and provides information on suggested tests to use in lieu of:

  • Independent samples and paired t-tests
  • Analysis of variance techniques
  • Pearson’s Product Moment Correlation
  • Repeated measures designs

A description of nonparametric techniques for use with count data and contingency tables will also be provided.

Movie ratings, a ranked population, are appropriate for nonparametric methods.

You just thought you didn’t know anything about nonparametric methods. 😉

Applicable to all ranked populations (can you say recommendation?).

While you wait for the webinar, try some of the references from Wikipedia: Nonparametric Statistics.

Factor Analysis: A Short Introduction, Part 1 [Reducing Dimensionality]

Filed under: Dimension Reduction,Factor Analysis — Patrick Durusau @ 2:01 pm

Factor Analysis: A Short Introduction, Part 1 by Maike Rahn.

From the post:

Why use factor analysis?

Factor analysis is a useful tool for investigating variable relationships for complex concepts such as socioeconomic status, dietary patterns, or psychological scales.

It allows researchers to investigate concepts that are not easily measured directly by collapsing a large number of variables into a few interpretable underlying factors.

What is a factor?

The key concept of factor analysis is that multiple observed variables have similar patterns of responses because of their association with an underlying latent variable, the factor, which cannot easily be measured.

For example, people may respond similarly to questions about income, education, and occupation, which are all associated with the latent variable socioeconomic status.

I mention factor analysis as an example of

  • reducing dimensionality
  • exchanging a not easily measured latent variable for measurable ones
  • attributing a relationship between a not easily measured latent variable and measurable ones

Factor analysis has been successfully used in a number of fields.

However, to reliably integrate information based on factor analysis you will need to probe the (often) unstated assumptions of such analysis.

PS: You may find the pointers in Wikipedia useful: Factor Analysis.

Visualize complex data with subplots

Filed under: Graphics,R,Visualization — Patrick Durusau @ 10:52 am

Visualize complex data with subplots by Garrett Grolemund.

From the post:

I think of graphs as a type of visual summary for data. Yet I rarely see graphs used this way within visualizations. Consider tile plots. They group data into 2d bins and then summarize each group with a number. This approach is a go-to tool for understanding overplotted data, but it discards a lot of information. Since we’re already using graphs, why not summarize the data in each bin visually? In the same space that we devote to a single colored tile, we can draw a subplot that retains enough information to display interesting patterns. Take, for example, this visualization of the WikiLeaks Afghanistan War Diary. It replaces each tile with a bar graph that shows the number of casualties by type for the specified region. We still get a sense of where the highest frequencies of casualties occur, but we can also see trends. For example, civilian casualties outnumber combatant casualties in the capital city of Kabul.

Garrett also uses subplots to visualize temperature data (1995-2001) that shows:

Surprisingly, the hottest places in the western hemisphere are not those near the equator.

Not useful for every case but subplots are a graphic technique you should keep in mind.

Wrapping Up TimesOpen: Sockets and Streams

Filed under: Data Analysis,Data Streams,node-js,Stream Analytics — Patrick Durusau @ 10:41 am

Wrapping Up TimesOpen: Sockets and Streams by Joe Fiore.

From the post:

This past Wednesday night, more than 80 developers came to the Times building for the second TimesOpen event of 2012, “Sockets and Streams.”

If you were one of the 80 developers, good for you! The rest of us will have to wait for the videos.

Links to the slides are given but a little larger helping of explanation would be useful.

Data streams have semantic diversity, just like static data, only less time to deal with it.

Ups the semantic integration bar.

Are you ready?

September 14, 2012

Should We Focus on User Experience?

Should We Focus on User Experience? by Koen Claes.

From the post:

In the next seven minutes or so, this article hopes to convince you that our current notion of UX design mistakenly focuses on experience, and that we should go one step further and focus on the memory of an experience instead.

Studies of behavioral economics have changed my entire perspective on UX design, causing me to question basic tenets. This has led to ponderings like: “Is it possible that trying to create ‘great experiences’ is pointless?” Nobel Prize-winning research seems to hint that it is.

Via concrete examples, additional research sources, and some initial how-to tips, I aim to illustrate why and how we should recalibrate our UX design processes.

You will also like the narrative (with additional resources) from Koen’s presentation at IA Summit 2011, On Why We Should NOT Focus on UX.

The more I learn about the myriad aspects of communication, the more I am amazed that we communicate at all. 😉

RecSys 2012: Beyond Five Stars

Filed under: Conferences,Recommendation — Patrick Durusau @ 2:48 pm

RecSys 2012: Beyond Five Stars by Daniel Tunkelang.

From the post:

I spent the past week in Dublin attending the 6th ACM International Conference on Recommender Systems (RecSys 2012). This young conference has become the premier global forum for discussing the state of the art in recommender systems, and I’m thrilled to have had the opportunity to participate.

Daniel’s review of RecSys 2012 with lots of links and pointers!

It will take you some time to work through all the hyperlinks so it is a good thing the weekend is upon us!

Enjoy!

Who’s Really Using Big Data [Topic Maps As Silo Bungholes]

Filed under: BigData,Marketing,Topic Maps — Patrick Durusau @ 2:37 pm

Who’s Really Using Big Data by Paul Barth and Randy Bean. (Harvard Business Review)

From the post:

We recently surveyed executives at Fortune 1000 companies and large government agencies about where they stand on Big Data: what initiatives they have planned, who’s leading the charge, and how well equipped they are to exploit the opportunities Big Data presents. We’re still digging through the data — but we did come away with three high-level takeaways.

  • First, the people we surveyed have high hopes for what they can get out of advanced analytics.
  • Second, it’s early days for most of them. They don’t yet have the capabilities they need to exploit Big Data.
  • Third, there are disconnects in the survey results — hints that the people inside individual organizations aren’t aligned on some key issues.

The third point, disconnects, is addressed when the authors say:

Recall that 80% of respondents agreed that Big Data initiatives would reach across multiple lines of business. That reality bumps right up against the biggest data challenge respondents identified: “integrating a wider variety of data.” This challenge appears to be more apparent to IT than to business executives. We’d guess that they’re more aware of how silo’d their companies really are, and that this is another reason that they judge the company’s capacity to transform itself using Big Data more harshly.

I don’t know that “harshly” is the term I would use. Realistically is more accurate.

The eleventh anniversary of the 9/11 attacks in the U.S. just passed and improved intelligence sharing between U.S. intelligence agencies is still years away, if it remains on schedule. (Read’em and Weep)

Fact: Threat of death and destruction raining out of the sky is insufficient to promote information sharing beyond intelligence silos.

Question: What motivation are you going to use to promote information sharing beyond your silos?

De-siloing of information means:

  1. Loss of power – X doesn’t have to ask for my report
  2. Loss of control – Y might do something with my data that makes me look bad
  3. Loss of job security – I am the only person who knows how to obtain the data

Not to mention fear of change and a host of other nasty reactions. The ones who aren’t afraid are panting with lust for the data of others to strengthen their positions.

Which means nearly everyone in your organization is going to start with a minimum of passive resistance to de-siloing and escalate from there.

There are alternatives.

Why not let people keep their silos and breach them one by one with topic map bungholes?

What is the purpose of de-siloing of information? So we can use it with other information? Yes?

Which means we know what information we need for some particular purpose with a defined benefit. Yes?

In other words, making all your silos transparent is likely a waste of time, even if it could succeed.

Breaching a data silo with a topic map bunghole means specific information for some specified benefit. Amenable to cost/benefit analysis.

Which works better in your organization: High value, specific returns or “it could be valuable someday, we just don’t know,” diffuse returns?

Topic maps are the first option, transparent data silos are the second. Your call.

ESWC 2013 : 10th Extended Semantic Web Conference

Filed under: BigData,Linked Data,Semantic Web,Semantics — Patrick Durusau @ 1:24 pm

ESWC 2013 : 10th Extended Semantic Web Conference

Important Dates:

Abstract submission: December 5th, 2012

Full paper submission: December 12th, 2012

Authors’ rebuttals: February 11th-12th, 2013

Acceptance Notification: February 22nd, 2013

Camera ready: March 9th, 2013

Conference: May 26th-30th, 2013

From the call for papers:

ESWC is the premier European-based annual conference for researchers and practitioners in the field of semantic technologies. ESWC is the ideal venue for the discussion of the latest scientific insights and novel applications of semantic technologies.

The leading motto of the 10th edition of ESWC will be “Semantics and Big Data”. A crucial challenge that will guide the efforts of many scientific communities in the years to come is the one of making sense of large volumes of heterogeneous and complex data. Application-relevant data often has to be processed in real time and originates from diverse sources such as Linked Data, text and speech, images, videos and sensors, communities and social networks, etc. ESWC, with its focus on semantics, can offer an important contribution to this global challenge.

ESWC 2013 will feature nine thematic research tracks (see below) as well as an in-use and industrial track. In line with the motto “Semantics and Big Data”, the conference will feature a special track on “Semantic Technologies for Big Data Analytics in Real Time”. In order to foster the interaction with other disciplines, this year’s edition will also feature a special track on “Cognition and Semantic Web”.

For the research and special tracks, we welcome the submission of papers describing theoretical, analytical, methodological, empirical, and application research on semantic technologies. For the In-Use and Industrial track we solicit the submission of papers describing the practical exploitation of semantic technologies in different domains and sectors. Submitted papers should describe original work, present significant results, and provide rigorous, principled, and repeatable evaluation. We strongly encourage and appreciate the submission of papers including links to data sets and other material used for the evaluation as well as to live demos or source code for tool implementations.

Submitted papers will be judged based on originality, awareness of related work, potential impact on the Semantic Web field, technical soundness of the proposed methods, and readability. Each paper will be reviewed by at least three program committee members in addition to one track chair. This year a rebuttal phase has been introduced in order to give authors the opportunity to provide feedback to reviewers’ questions. The authors’ answers will support reviewers and track chairs in their discussion and in taking final decisions regarding acceptance.

I would call your attention to:

A crucial challenge that will guide the efforts of many scientific communities in the years to come is the one of making sense of large volumes of heterogeneous and complex data.

Sounds like they are playing the topic map song!

Ping me if you are able to attend and would like to collaborate on a paper.

JMyETL

Filed under: CUBRID,ETL,MySQL,Oracle,PostgreSQL,SQL Server — Patrick Durusau @ 1:15 pm

JMyETL, an easy to use ETL tool that supports 10 different RDBMS by Esen Sagynov.

From the post:

JMyETL is a very useful and simple Java based application for Windows OS which allows users to import and export data from/to various database systems. For example:

  • CUBRID –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, Access, SQLite
  • MySQL –> Sybase ASE/ASA, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Sybase ASE –> Sybase ASA, MySQL, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Sybase ASA –> Sybase ASE, MySQL, Oracle, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Oracle –> Sybase ASA, Sybase ASE, MySQL, Access, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • Access –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, SQLite, CUBRID
  • PostgreSQL –> Sybase ASE, Sybase ASA, MySQL, Oracle, Access, SQL Server, DB2, SQLite, CUBRID
  • SQL Server –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, Access, DB2, SQLite, CUBRID
  • DB2 –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, Access, SQLite, CUBRID
  • SQLite –> Sybase ASE, Sybase ASA, MySQL, Oracle, PostgreSQL, SQL Server, DB2, Access, CUBRID

Just in case you need a database-to-database ETL utility.

I first saw this at DZone.

First Party Fraud (In Four Parts)

Filed under: Business Intelligence,Graphs,Networks,Social Graphs,Social Networks — Patrick Durusau @ 1:00 pm

Mike Betron has written a four-part series on first party fraud that merits your attention:

First Party Fraud [Part 1]

What is First Party Fraud?

First-party fraud (FPF) is defined as when somebody enters into a relationship with a bank using either their own identity or a fictitious identity with the intent to defraud. First-party fraud is different from third-party fraud (also known as “identity fraud”) because in third-party fraud, the perpetrator uses another person’s identifying information (such as a social security number, address, phone number, etc.). FPF is often referred to as a “victimless” crime, because no consumers or individuals are directly affected. The real victim in FPF is the bank, which has to eat all of the financial losses.

First-Party Fraud: How Do We Assess and Stop the Damage? [Part 2]

Mike covers the cost of first party fraud and then why it is so hard to combat.

Why is it so hard to detect FPF?

Given the amount of financial pain incurred by bust-out fraud, you might wonder why banks haven’t developed a solution and process for detecting and stopping it.

There are three primary reasons why first-party fraud is so hard to identify and block:

1) The fraudsters look like normal customers

2) The crime festers in multiple departments

3) The speed of execution is very fast

Fighting First Party Fraud With Social Link Analysis (3 of 4)

And you know, those pesky criminals won’t use their universally assigned identifiers for financial transactions. (Any security system that relies on good faith isn’t a security system, it’s an opportunity.)

A Trail of Clues Left by Criminals

Although organized fraudsters are sophisticated, they often leave behind evidence that can be used to uncover networks of organized crime. Fraudsters know that due to Know Your Customer (KYC) and Customer Due Diligence (CDD) regulations, their identification will be verified when they open an account with a financial institution. To pass these checks, the criminals will either modify their own identity slightly or else create a synthetic identity, which consists of combining real identity information (e.g., a social security number) with fake identity information (names, addresses, phone numbers, etc.).

Fortunately for banks, false identity information can be expensive and inconvenient to acquire and maintain. For example, apartments must be rented out to maintain a valid address. Additionally, there are only so many cell phones a person can carry at one time and only so many aliases that can be remembered. Because of this, fraudsters recycle bits and pieces of these valuable assets.

The reuse of identity information has inspired Infoglide to begin to create new technology on top of its IRE platform called Social Link Analysis (SLA). SLA works by examining the “linkages” between the recycled identities, therefore identifying potential fraud networks. Once the networks are detected, Infoglide SLA applies advanced analytics to determine the risk level for both the network and for every individual associated with that network.

First Party Fraud (post 4 of 4) – A Use Case

As discussed in our previous blog in this series, Social Link Analysis works by identifying linkages between individuals to create a social network. Social Link Analysis can then analyze the network to identify organized crime, such as bust-out fraud and internal collusion.

During the Social Link Analysis process, every individual is connected to a single network. An analysis at a large tier 1 bank will turn up millions of networks, but the majority of individuals only belong to very small networks (such as a husband and wife, and possibly a child). However, the social linking process will certainly turn up a small percentage of larger networks of interconnected individuals. It is in these larger networks where participants of bust-out fraud are hiding.

Due to the massive number of networks within a system, the analysis is performed mathematically (e.g. without user interface) and scores and alerts are generated. However, any network can be “visualized” using the software to create a graphic display of information and connections. In this example, we’ll look at a visualization of a small network that the social link analysis tool has alerted as a possible fraud ring.
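The linkage step itself takes surprisingly little machinery. A toy sketch in JavaScript (every name, number, and field below is invented; a production system adds fuzzy matching, risk scoring, and scale):

```javascript
// Toy social link analysis: group account applications into networks whenever
// they share an identifier (phone, address, SSN). All data here is invented.
const applications = [
  { id: "A1", phone: "555-0101", address: "12 Oak St", ssn: "111-11-1111" },
  { id: "A2", phone: "555-0101", address: "98 Elm Ave", ssn: "222-22-2222" },
  { id: "A3", phone: "555-0199", address: "98 Elm Ave", ssn: "333-33-3333" },
  { id: "A4", phone: "555-0400", address: "7 Pine Rd",  ssn: "444-44-4444" }
];

// Union-find over application ids.
const parent = new Map(applications.map(a => [a.id, a.id]));
const find = x => (parent.get(x) === x ? x : (parent.set(x, find(parent.get(x))), parent.get(x)));
const union = (a, b) => parent.set(find(a), find(b));

// Link any two applications that reuse the same identifier value.
for (const field of ["phone", "address", "ssn"]) {
  const seen = new Map();
  for (const app of applications) {
    const value = app[field];
    if (seen.has(value)) union(app.id, seen.get(value));
    else seen.set(value, app.id);
  }
}

// Collect networks; the larger ones are the candidates for closer (human) review.
const networks = new Map();
for (const app of applications) {
  const root = find(app.id);
  networks.set(root, [...(networks.get(root) || []), app.id]);
}
console.log([...networks.values()]); // [["A1","A2","A3"], ["A4"]]
```

A shared phone number and a shared address collapse three of the four invented applications into one network. That network is the “social link” being analyzed and scored.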

A word of caution.

To leap from the example individuals being related to each other to:

As a result, Social Link Analysis has detected four members of a network, each with various amounts of charged-off fraud.

Is quite a leap.

Having charged off loans, with re-use of telephone numbers and a mobile population, doesn’t necessarily mean anyone is guilty of “charged-off fraud.”

Could be, but you should tread carefully and with legal advice before jumping to conclusions of fraud.

For good customer relations, if not avoiding bad PR and legal liability.

PS: Topic maps can help with this type of data. Including mapping in the bank locations or even personnel who accepted particular loans.

Looking for MongoDB users to test Fractal Tree Indexing

Filed under: Fractal Trees,Indexing,MongoDB,Tokutek — Patrick Durusau @ 10:03 am

Looking for MongoDB users to test Fractal Tree Indexing by Tim Callaghan.

In my three previous blogs I wrote about our implementation of Fractal Tree Indexes on MongoDB, showing a 10x insertion performance increase, a 268x query performance increase, and a comparison of covered indexes and clustered indexes. The benchmarks show the difference that rich and efficient indexing can make to your MongoDB workload.

It’s one thing for us to benchmark MongoDB + TokuDB and another to measure real world performance. If you are looking for a way to improve the performance or scalability of your MongoDB deployment, we can help and we’d like to hear from you. We have a preview build available for MongoDB v2.2 that you can run with your existing data folder, drop/add Fractal Tree Indexes, and measure the performance differences. Please email me at tim@tokutek.com if interested.

Here is your chance to try these speed improvements out on your data!

Tweet Feeds For Topic Maps?

Filed under: Topic Map Software,Topic Maps,Tweets — Patrick Durusau @ 9:42 am

The Twitter Trend lecture will leave you with a number of ideas about tracking tweets.

It occurred to me watching the video that a Twitter stream could be used as a feed into a topic map.

Not the same as converting a tweet feed into a topic map, where you accept all tweets on some specified condition.

No, more along the lines that the topic map application watches for tweets from particular users or from particular users with specified hash tags, and when observed, adds information to a topic map.

Thinking such a feed mechanism could have templates that are invoked based upon hash tags for the treatment of tweet content or to marshal other information to be included in the map.

For example, I tweet: doi:10.3789/isqv24n2-3.2012 #tmbib .

A TM application recognizes the #tmbib, invokes a topic map bibliography template, uses the DOI to harvest the title, author, and abstract, and creates appropriate topics. (Or whatever your template is designed to do.)
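A rough sketch of that dispatch step (the template registry, metadata lookup, topic map object, and trusted handle are all hypothetical placeholders):

```javascript
// Hypothetical sketch: route incoming tweets to topic map "templates" by hashtag.
// The metadata lookup and topic map object below are stubs, not a real API.
async function fetchDoiMetadata(doi) {
  return { title: "(title for " + doi + ")", authors: ["(authors)"] };
}
const topicMap = { addTopic: t => console.log("new topic:", t) };

const templates = {
  // #tmbib: treat the tweet body as a DOI and build a bibliography topic from it.
  tmbib: async (tweet, map) => {
    const doi = (tweet.text.match(/\bdoi:(\S+)/i) || [])[1];
    if (!doi) return;
    const meta = await fetchDoiMetadata(doi);
    map.addTopic({ doi, title: meta.title, authors: meta.authors });
  }
};

// Only accept tweets from trusted accounts, then run every matching template.
const trustedUsers = new Set(["someTrustedHandle"]);
async function handleTweet(tweet, map) {
  if (!trustedUsers.has(tweet.user)) return;
  for (const tag of tweet.text.match(/#(\w+)/g) || []) {
    const template = templates[tag.slice(1).toLowerCase()];
    if (template) await template(tweet, map);
  }
}

handleTweet({ user: "someTrustedHandle", text: "doi:10.3789/isqv24n2-3.2012 #tmbib" }, topicMap);
```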

Advantage: I don’t have to create and evangelize a new protocol for communication with my topic maps.

Advantage: Someone else is maintaining the pipe. (Not to be underestimated.)

Advantage: Tweet software is nearly ubiquitous.

Do you see a downside to this approach?

Kostas T. on How To Detect Twitter Trends

Filed under: Machine Learning,Tweets — Patrick Durusau @ 9:14 am

Kostas T. on How To Detect Twitter Trends by Marti Hearst.

From the post:

Have you ever wondered how Twitter computes its Trending Topics? Kostas T. is one of the wizards behind that, and today he shared some of the secrets with our class:

Be prepared to watch this more than once!

Sparks a number of ideas about how to track and analyze tweets.

Learning C with gdb

Filed under: Programming — Patrick Durusau @ 5:25 am

Learning C with gdb by Alan O’Donnell.

From the post:

Coming from a background in higher-level languages like Ruby, Scheme, or Haskell, learning C can be challenging. In addition to having to wrestle with C’s lower-level features like manual memory management and pointers, you have to make do without a REPL. Once you get used to exploratory programming in a REPL, having to deal with the write-compile-run loop is a bit of a bummer.

It occurred to me recently that I could use gdb as a pseudo-REPL for C. I’ve been experimenting with using gdb as a tool for learning C, rather than merely debugging C, and it’s a lot of fun.

My goal in this post is to show you that gdb is a great tool for learning C. I’ll introduce you to a few of my favorite gdb commands, and then I’ll demonstrate how you can use gdb to understand a notoriously tricky part of C: the difference between arrays and pointers.

The other day I read about a root kit that you can rent by the day.

If that doesn’t make you nervous about the code on your computer, it should.

Learning C won’t protect you from root kit renters, but it may make you a better programmer.

And poking around in memory locations can make you more aware of tradeoffs made by applications.

Which may or may not fit your data/needs.

September 13, 2012

…Milton Friedman’s thermostat [Perils of Observation]

Filed under: Semantics — Patrick Durusau @ 4:42 pm

Why are (almost all) economists unaware of Milton Friedman’s thermostat?

Skipping past a long introduction, here’s the beef:

Everybody knows that if you press down on the gas pedal the car goes faster, other things equal, right? And everybody knows that if a car is going uphill the car goes slower, other things equal, right?

But suppose you were someone who didn’t know those two things. And you were a passenger in a car watching the driver trying to keep a constant speed on a hilly road. You would see the gas pedal going up and down. You would see the car going downhill and uphill. But if the driver were skilled, and the car powerful enough, you would see the speed stay constant.

So, if you were simply looking at this particular “data generating process”, you could easily conclude: “Look! The position of the gas pedal has no effect on the speed!”; and “Look! Whether the car is going uphill or downhill has no effect on the speed!”; and “All you guys who think that gas pedals and hills affect speed are wrong!”

And no, you can not get around this problem by doing a multivariate regression of speed on gas pedal and hill. That’s because gas pedal and hill will be perfectly colinear. And no, you do not get around this problem simply by observing an unskilled driver who is unable to keep the speed perfectly constant. That’s because what you are really estimating is the driver’s forecast errors of the relationship between speed, gas, and hill, and not the true structural relationship between speed, gas, and hill. And it really bugs me that people who know a lot more econometrics than I do think that you can get around the problem this way, when you can’t. And it bugs me even more that econometricians spend their time doing loads of really fancy stuff that I can’t understand when so many of them don’t seem to understand Milton Friedman’s thermostat. Which they really need to understand.

If the driver is doing his job right, and correctly adjusting the gas pedal to the hills, you should find zero correlation between gas pedal and speed, and zero correlation between hills and speed. Any fluctuations in speed should be uncorrelated with anything the driver can see. They are the driver’s forecast errors, because he can’t see gusts of headwinds coming. And if you do find a correlation between gas pedal and speed, that correlation could go either way. A driver who over-estimates the power of his engine, or who under-estimates the effects of hills, will create a correlation between gas pedal and speed with the “wrong” sign. He presses the gas pedal down going uphill, but not enough, and the speed drops.
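The point survives a quick simulation (all coefficients invented; the driver compensates perfectly and only an unseen headwind adds noise):

```javascript
// Simulated hilly drive (invented coefficients). True structure:
// speed = 60 + 10*gas - 10*hill - 20*headwind. The driver sees the hills and
// compensates perfectly; only the headwind is unseen.
const N = 10000;
const hills = Array.from({ length: N }, () => Math.random() * 2 - 1);
const wind  = Array.from({ length: N }, () => Math.random() * 0.2 - 0.1);
const gas   = hills.map(h => 0.5 + h);   // driver presses harder going uphill
const speed = hills.map((h, i) => 60 + 10 * gas[i] - 10 * h - 20 * wind[i]);

// Pearson correlation
function corr(x, y) {
  const mean = a => a.reduce((s, v) => s + v, 0) / a.length;
  const mx = mean(x), my = mean(y);
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < N; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

console.log("corr(gas, speed):", corr(gas, speed).toFixed(3));
console.log("corr(hill, speed):", corr(hills, speed).toFixed(3));
```

Both correlations come out near zero, even though gas and hills drive speed by construction.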

What you “observe” is dependent upon information you have learned outside the immediate situation.

And therefore changes the semantics of statements that you make to others about your observations. Their interpretations of your statements are also dependent upon other information.

There is no cure-all for this type of issue, but being aware of it improves our chances of avoiding it. Maybe. 😉

The US poverty map in 2011 [Who Defines Poverty?]

Filed under: Data,Semantics — Patrick Durusau @ 4:18 pm

The US poverty map in 2011 by Simon Rogers.

From the post:

New figures from the US census show that 46.2 million Americans live in poverty and another 48.6m have no health insurance. In Maryland, the median income is $68,876, in Kentucky it is $39,856, some $10,054 below the US average. Click on each state below to see the data – or use the dropdown to see the map change

As always an interesting presentation of data (along with access to the raw data).

But what about “poverty” in the United States versus “poverty” in other places?

The World Bank’s “Poverty” page reports in part:

  • Poverty headcount ratio at $1.25 a day (PPP) (% of population)
    • East Asia & Pacific
    • Europe & Central Asia
    • Latin America & Caribbean
    • Middle East & North Africa
    • South Asia
    • Sub-Saharan Africa
  • Poverty headcount ratio at $2 a day (PPP) (% of population)
    • East Asia & Pacific
    • Europe & Central Asia
    • Latin America & Caribbean
    • Middle East & North Africa
    • South Asia
    • Sub-Saharan Africa

What area is missing from this list?

Can you say: “North America?”

The poverty rate per day for North America is an important comparison point in discussions of global trade, environment and similar issues.

Can you point me towards more comprehensive comparison data?


PS: $2 per day is $730 annual. $1.25 per day is $456.25 annual.

Mule ESB 3.3.1

Filed under: Data Integration,Data Management,Mule — Patrick Durusau @ 3:34 pm

Mule ESB 3.3.1 by Ramiro Rinaudo.

I got the “memo” on 4 September 2012 but it got lost in my inbox. Sorry.

From the post:

Mule ESB 3.3.1 represents a significant amount of effort on the back of Mule ESB 3.3 and our happiness with the result is multiplied by the number of products that are part of this release. We are releasing new versions with multiple enhancements and bug fixes to all of the major stack components in our Enterprise Edition. This includes:

Europeana opens up data on 20 million cultural items

Filed under: Archives,Data,Dataset,Europeana,Library,Museums — Patrick Durusau @ 3:25 pm

Europeana opens up data on 20 million cultural items by Jonathan Gray (Open Knowledge Foundation):

From the post:

Europe‘s digital library Europeana has been described as the ‘jewel in the crown’ of the sprawling web estate of EU institutions.

It aggregates digitised books, paintings, photographs, recordings and films from over 2,200 contributing cultural heritage organisations across Europe – including major national bodies such as the British Library, the Louvre and the Rijksmuseum.

Today [Wednesday, 12 September 2012] Europeana is opening up data about all 20 million of the items it holds under the CC0 rights waiver. This means that anyone can reuse the data for any purpose – whether using it to build applications to bring cultural content to new audiences in new ways, or analysing it to improve our understanding of Europe’s cultural and intellectual history.

This is a coup d’etat for advocates of open cultural data. The data is being released after a grueling and unenviable internal negotiation process that has lasted over a year – involving countless meetings, workshops, and white papers presenting arguments and evidence for the benefits of openness.

That is good news!

A familiar issue that it overcomes:

To complicate things even further, many public institutions actively prohibit the redistribution of information in their catalogues (as they sell it to – or are locked into restrictive agreements with – third party companies). This means it is not easy to join the dots to see which items live where across multiple online and offline collections.

Oh, yeah! That was one of Google’s reasons for pulling the plug on the Open Knowledge Graph. Google had restrictive agreements so you can only connect the dots with Google products. (I think there is a name for that, let me think about it. Maybe an EU prosecutor might know it. You could always ask.)

What are you going to be mapping from this collection?

Integrate data and reporting on the Web with knitr

Filed under: R,TeX/LaTeX — Patrick Durusau @ 2:50 pm

Integrate data and reporting on the Web with knitr by Yihui Xie.

From the post:

Hi, this is Yihui Xie, and I’m guest posting on the Revolutions blog to talk about one aspect of the knitr package: how we can integrate data analysis and reporting in R with the Web. This post includes both the work that has been done and the ongoing work. For those who have no idea of knitr, it is an R package to generate reports dynamically from the mixture of computer code and narratives. It is available on CRAN and Github.

Good set of resources on knitr, an R package for dynamic report generation.

You may find yourself using R for exploration as well as delivery of content.

Exploration of data involves delivery of content too. Just a different audience.
