Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 13, 2014

Map projections illustrated with a face

Filed under: Graphs,Visualization — Patrick Durusau @ 7:35 pm

Map projections illustrated with a face by Nathan Yau.

Nathan has images from a book published in 1921 that illustrate map projections by using a single face in four (4) separate projections.

What if you were to illustrate a machine learning technique that tops out at, say, 70% accuracy by displaying a random 70% of the customer's face?

For example:

[Image: patrick]

versus:

[Image: patrick-70]

Hmmm, 70% accuracy doesn’t look all that great. 😉

What other data operations would you illustrate with images?

What images would you use?
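
If you want to reproduce the 70% illustration on a photo of your own, here is a minimal sketch (assuming NumPy and Pillow are installed; the file name is just a placeholder) that keeps a random 70% of the pixels and blacks out the rest:

import numpy as np
from PIL import Image

def keep_fraction(path, fraction=0.70, out_path="face_70.png"):
    # Load the image, randomly keep `fraction` of the pixels,
    # and black out the remainder to simulate "70% accuracy".
    img = np.asarray(Image.open(path).convert("RGB")).copy()
    keep = np.random.random(img.shape[:2]) < fraction
    img[~keep] = 0
    Image.fromarray(img).save(out_path)

# keep_fraction("patrick.jpg")  # placeholder file name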

The myth of the aimless data explorer

Filed under: Bias,Data — Patrick Durusau @ 7:14 pm

The myth of the aimless data explorer by Enrico Bertini.

From the post:

There is a sentence I have heard or read multiple times in my journey into (academic) visualization: visualization is a tool people use when they don’t know what question to ask to their data.

I have always taken this sentence as a given and accepted it as it is. Good, I thought, we have a tool to help people come up with questions when they have no idea what to do with their data. Isn’t that great? It sounded right or at least cool.

But as soon as I started working on more applied projects, with real people, real problems, real data they care about, I discovered this all excitement for data exploration is just not there. People working with data are not excited about “playing” with data, they are excited about solving problems. Real problems. And real problems have questions attached, not just curiosity. There’s simply nothing like undirected data exploration in the real world.

I think Enrico misses the reason why people use/like the phrase: visualization is a tool people use when they don’t know what question to ask to their data.

Visualization privileges the “data” as the source of whatever result is displayed by the visualization.

It’s not me! That’s what the data says!

Hardly. Someone collected the data. Not at random, stuffing whatever bits came along in a bag. Someone cleaned the data with some notion of what “clean” meant. Someone chose the data that is now being called upon for a visualization. And those are clumsy steps that collapse many distinct steps into only three.

To put it another way, data never exists without choices being made. And it is the sum of those choices that influence the visualizations that are even possible from some data set.

The short term for what Enrico overlooks is bias.

I would recast his title to read: The myth of the objective data explorer.

Having said that, I don’t mean that all bias is bad.

If I were collecting data on Ancient Near Eastern (ANE) languages, I would of necessity be excluding the language traditions of the entire Western Hemisphere. It could even be that data from the native cultures of the Western Hemisphere will be lost while I am preserving data from the ANE.

So we have bias and, from someone’s point of view, a bad outcome because of that bias. Was that a bad thing? I would argue not.

It isn’t ever possible to collect all the potential data that could be collected. We all make value judgments about the data we choose to collect and the data we choose to ignore.

Rather than pretending that we possess objectivity in any meaningful sense, we are better off stating our biases to the extent we know them. At least others will be forewarned that we are just like them.

Euler’s Seven Bridges X Seven Minus 2

Filed under: Graphs,Maps — Patrick Durusau @ 5:34 pm

We Used New York City’s 47 Bridges To Solve An 18th Century Math Puzzle by Andy Kiersz.

From the post:

The George Washington Bridge isn’t the only way to get from one landmass to another in New York City.

NYC is built on an archipelago, and consequently has a ton of bridges. There are 47 non-rail-only bridges in New York City that appear on Wikipedia’s list of said bridges.

In this exercise, we answer: Is it possible to get around NYC by crossing every bridge just once?

This is more than just a fun math puzzle. The process for answering this question eventually led to modern-day, real-world applications that couldn’t have been imagined when a similar question was first posed nearly 300 years ago.

A highly entertaining examination of how to solve the Seven Bridges of Königsberg for the "47ish Bridges of NYC."

Profusely illustrated with maps to help you follow the narration.

Good introductory material on graphs.

It would need supplementing to strengthen the case for graphs being important. For example, saying that "relationships between people on social networking sites" can be modeled as a graph doesn't really capture the imagination.

Whereas saying that your relationships to other people in high school, college, work and on social networking sites can be represented in a graph might provoke a more visceral reaction.
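
If you want to check the Euler condition yourself, here is a minimal sketch in Python; the edge list below is the classic Königsberg layout (seven bridges, four landmasses), not the NYC data:

from collections import defaultdict

def has_euler_path(edges):
    # A connected undirected multigraph has a walk crossing every edge
    # exactly once iff it has 0 or 2 vertices of odd degree.
    degree = defaultdict(int)
    adjacency = defaultdict(set)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[u].add(v)
        adjacency[v].add(u)

    # connectivity check over vertices that touch at least one edge
    start = next(iter(degree))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    if seen != set(degree):
        return False

    odd = sum(1 for d in degree.values() if d % 2)
    return odd in (0, 2)

# Koenigsberg: landmasses A-D joined by seven bridges -- no such walk exists
bridges = [("A", "B"), ("A", "B"), ("A", "C"), ("A", "C"),
           ("A", "D"), ("B", "D"), ("C", "D")]
print(has_euler_path(bridges))  # False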

BPP: Large Graph Storage for Efficient Disk Based Processing

Filed under: GraphChi,Graphs — Patrick Durusau @ 5:04 pm

BPP: Large Graph Storage for Efficient Disk Based Processing by Kamran Najeebullah, Kifayat Ullah Khan, Waqas Nawaz, and Young-Koo Lee.

Abstract:

Processing very large graphs like social networks, biological and chemical compounds is a challenging task. Distributed graph processing systems process the billion-scale graphs efficiently but incur overheads of efficient partitioning and distribution of the graph over a cluster of nodes. Distributed processing also requires cluster management and fault tolerance. In order to overcome these problems GraphChi was proposed recently. GraphChi significantly outperformed all the representative distributed processing frameworks. Still, we observe that GraphChi incurs some serious degradation in performance due to 1) high number of non-sequential I/Os for processing every chunk of graph; and 2) lack of true parallelism to process the graph. In this paper we propose a simple yet powerful engine BiShard Parallel Processor (BPP) to efficiently process billions-scale graphs on a single PC. We extend the storage structure proposed by GraphChi and introduce a new processing model called BiShard Parallel (BP). BP enables full CPU parallelism for processing the graph and significantly reduces the number of non-sequential I/Os required to process every chunk of the graph. Our experiments on real large graphs show that our solution significantly outperforms GraphChi.

…[B]illion-scale graph on a single PC.

Cool!

Err, but the experimental results in the paper are based on “7 thousand plus vertices and more than 1 hundred thousand edges.”

I’m not sure how I get to a “billion-scale” graph from that.

Interesting results and quite possibly will lead to other breakthroughs in graph processing.

A bit more attention to making the abstract match the results would be appreciated.

Not to mention finding acronyms that don’t conflict with better known ones, like “BP.”

Searching for “BP” isn’t likely to find this paper even in a very long tail of results.

I first saw this in a tweet by Stefano Bertolo.

Embedded System Insurance?

Filed under: NSA,Security — Patrick Durusau @ 11:48 am

Security Risks of Embedded Systems by Bruce Schneier.

From the post:

We’re at a crisis point now with regard to the security of embedded systems, where computing is embedded into the hardware itself — as with the Internet of Things. These embedded computers are riddled with vulnerabilities, and there’s no good way to patch them.

….

If we don’t solve this soon, we’re in for a security disaster as hackers figure out that it’s easier to hack routers than computers. At a recent Def Con, a researcher looked at thirty home routers and broke into half of them — including some of the most popular and common brands.

Bruce does a great job of explaining the embedded systems market and the lack of economic incentives to improve the security of embedded systems.

Where I disagree with Bruce is when he says:

The economic incentives point to large ISPs as the driver for change. Whether they’re to blame or not, the ISPs are the ones who get the service calls for crashes. They often have to send users new hardware because it’s the only way to update a router or modem, and that can easily cost a year’s worth of profit from that customer. This problem is only going to get worse, and more expensive. Paying the cost up front for better embedded systems is much cheaper than paying the costs of the resultant security disasters.

Large ISPs are an easy target, but it would take federal legislation to impose a uniform responsibility for embedded systems and to define what liability an ISP would incur for failure to upgrade. That ignores international issues with regard to ISPs. Not to mention that not all "embedded systems" are routers. Who is responsible for all the other "embedded systems"? Sounds like a sticky wicket that will take longer than the WWW has been around to solve.

A non-starter in other words.

We already have mechanisms in place to create the economic incentives Bruce is looking for, it’s called insurance.

If you have purchased anything at Target recently, you have probably been offered "replacement insurance":

Protect every important purchase with a Target Replacement Plan and we’ll help get your covered breakdown resolved. If your product qualifies for replacement, we will issue you a Target Gift Card for the original purchase price. You can then replace your non-working product with a new one – perhaps even the latest version!*

This plan protects your new product against common failures, and protects you from unexpected repair bills. Coverage is for 2 years, starting from the date of purchase, inclusive of the original manufacturer’s warranty.*

What if the sales of embedded systems were accompanied by an offer of embedded system insurance?

That would be insurance that will pay for either replacement or repair in the event of a security flaw in software or hardware of the embedded system. Where would the economic incentives be then?

Insurers will have an incentive to reduce their economic risk so they will be testing products, visiting manufacturers, funding research, etc., so they can make good decisions on their risk for particular products.

At the same time, government and industry, having the most to lose from security breaches, can refuse to buy embedded systems that are not insurable or that are insurable but have a higher premium. That would have the happy consequence of driving questionable manufacturers from the embedded systems marketplace.

The practical advantage to embedded system insurance is it only takes demand for insurable embedded system products to start the process.

Demand will attract insurers into the marketplace, local security policies will drive the purchase of insured products, and when breaches are found (there is no magic bullet), customers will have no disincentives to upgrading.

It won’t be quite that smooth but it has the advantage of no mandated NSA backdoors in the replacement software/embedded systems.

January 12, 2014

Porn capital of the porn nation

Filed under: Data,Porn,R — Patrick Durusau @ 9:09 pm

Porn capital of the porn nation by Gianluca Baio.

From the post:

The other day I was having a quick look to the newspapers and I stumbled on this article. Apparently, Pornhub (a website whose mission should be pretty clear) have analysed the data on their customers and found out that the town of Ware (Hertfordshire) has more demand for online porn than any other UK town. According to PornHub, a Ware resident will last 10 minutes 37 seconds (637 seconds) on its adult website, compared with the world average time of 8 minutes 56 seconds (just 536 seconds).

Gianluca walks you through data available from the Guardian with R, so you can reach your own conclusions.

I need to install Tableau Public before I can download the data set. Will update this post tomorrow.

Enjoy!

Update:

I installed Tableau Public on a Windows XP VM and then downloaded the data file. Turns out with the public version of Tableau there is no open local file option but if you double-click on the file, it will load and open.

Amusing but limited data set. Top five searches, etc.

The Porn Hub Stats page has other reports from the Porn Hub stats crew.

No data downloads for stats, tags, etc., although I did post a message to them asking about that sort of data.

I have just started playing with it but Tableau appears to be a really nice data visualization tool.

Musopen

Filed under: Data,Music — Patrick Durusau @ 8:53 pm

Musopen

From the webpage:

Musopen (www.musopen.org) is a 501(c)(3) non-profit focused on improving access and exposure to music by creating free resources and educational materials. We provide recordings, sheet music, and textbooks to the public for free, without copyright restrictions. Put simply, our mission is to set music free.

The New Grove Dictionary of Music and Musicians it's not, but losing our musical heritage did not happen overnight.

Nor will winning it back.

Contribute to and support Musopen.

The Road to Summingbird:…

Filed under: Hadoop,MapReduce,Summingbird,Tweets — Patrick Durusau @ 8:37 pm

The Road to Summingbird: Stream Processing at (Every) Scale by Sam Ritchie.

Description:

Twitter’s Summingbird library allows developers and data scientists to build massive streaming MapReduce pipelines without worrying about the usual mess of systems issues that come with realtime systems at scale.

But what if your project is not quite at “scale” yet? Should you ignore scale until it becomes a problem, or swallow the pill ahead of time? Is using Summingbird overkill for small projects? I argue that it’s not. This talk will discuss the ideas and components of Summingbird that you could, and SHOULD, use in your startup’s code from day one. You’ll come away with a new appreciation for monoids and semigroups and a thirst for abstract algebra.

A slide deck that will make you regret missing the presentation.

I wasn’t able to find a video of Sam’s presentation at Data Day Texas 2014, but I did find a collection of his presentations, including some videos, at: http://sritchie.github.io/.

Valuable lessons for startups and others.
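
To make the monoid/semigroup point concrete, here is a minimal Python sketch of my own (not Summingbird code): word counts form a commutative monoid, so partial results can be merged in any order, whether they come from a batch job or a stream:

from collections import Counter

class CounterMonoid:
    # Counts form a monoid: an identity element plus an associative merge.
    # That is all a Summingbird-style pipeline needs -- partial sums can be
    # computed anywhere, in any order, and combined later.
    zero = Counter()

    @staticmethod
    def plus(a, b):
        return a + b  # Counter addition is associative

def word_counts(lines):
    return Counter(word for line in lines for word in line.split())

# the same merge works per-batch (Hadoop) and per-event (Storm)
batch1 = word_counts(["to be or not to be"])
batch2 = word_counts(["be here now"])
total = CounterMonoid.plus(CounterMonoid.plus(CounterMonoid.zero, batch1), batch2)
print(total.most_common(3))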

Coeffects: The next big programming challenge

Filed under: Context,Context-aware,F#,Functional Programming — Patrick Durusau @ 8:20 pm

Coeffects: The next big programming challenge by Tomas Petricek.

From the post:

Context-aware programming matters

The phrase context in which programs are executed sounds quite abstract and generic. What are some concrete examples of such context? For example:

  • When writing a cross-platform application, different platforms (and even different versions of the same platform) provide different contexts – the API functions that are available.
  • When creating a mobile app, the different capabilities that you may (or may not) have access to are context (GPS sensor, accelerometer, battery status).
  • When working with data (be it sensitive database or social network data from Facebook), you have permissions to access only some of the data (depending on your identity) and you may want to track provenance information. This is another example of a context.

These are all fairly standard problems that developers deal with today. As the number of devices where programs need to run increases, dealing with diverse contexts will be becoming more and more important (and I’m not even talking about ubiquitous computing where you need to compile your code to a coffee machine).

We do not perceive the above things as problems (at best, annoyances that we just have to deal with), because we do not realize that there should be a better way. Let me dig into four examples in a bit more detail.

This post is a good introduction to Tomas’ academic work.

A bit further on Tomas explains what he means by “coeffects:”

Coeffects: Towards context-aware languages

The above examples cover a couple of different scenarios, but they share a common theme – they all talk about some context in which an expression is evaluated. The context has essentially two aspects:

  • Flat context represents additional data, resources and meta-data that are available in the execution environment (regardless of where in the program you access them). Examples include resources like GPS sensors or databases, battery status, framework version and similar.
  • Structural context contains additional meta-data related to variables. This can include provenance (source of the variable value), usage information (how often is the value accessed) or security information (does it contain sensitive data).

As a proponent of statically typed functional languages I believe that a context-aware programming language should capture such context information in the type system and make sure that basic errors (like the ones demonstrated in the four examples above) are ruled out at compile time.

This is essentially the idea behind coeffects. Let’s look at an example showing the idea in (a very simplified) practice and then I’ll say a few words about the theory (which is the main topic of my upcoming PhD thesis).

I don’t know that Tomas would agree but I see his “coeffects,” particularly “meta-data related to variables,” as keying off the subject identity of variables.

Think of it this way: What is the meaning of any value with no express or implied context?

My answer would be that a value without context is meaningless.

For example, how would you process the value "1"? Is it a boolean? An integer? A string?

Imbuing data with “meta-data” (or explicit identity as I prefer) is a first step towards transparent data.
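
As a toy illustration of carrying that kind of context along with a value (my own sketch, not Tomas' coeffect calculus), consider a wrapper that records provenance and sensitivity and propagates them through every operation:

from dataclasses import dataclass

@dataclass(frozen=True)
class Tracked:
    # A value plus structural context: where it came from and whether it
    # is sensitive. Operations must propagate that context, not drop it.
    value: object
    provenance: tuple = ()
    sensitive: bool = False

    def map(self, fn, step=""):
        # apply fn to the value, extending the provenance trail
        return Tracked(fn(self.value), self.provenance + (step,), self.sensitive)

raw = Tracked("1", provenance=("form field 7",))
as_int = raw.map(int, step="parsed as integer")
print(as_int.value, as_int.provenance)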

PS: See Petricek and Skeet’s Real-World Functional Programming.

Phoenix: Incubating at Apache!

Filed under: HBase,NoSQL,Phoenix,SQL — Patrick Durusau @ 6:34 pm

Phoenix: Incubating at Apache!

From the webpage:

Phoenix is a SQL skin over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.

Tired of reading already and just want to get started? Take a look at our FAQs, listen to the Phoenix talks from Hadoop Summit 2013 and HBaseCon 2013, and jump over to our quick start guide here.

To see what's supported, go to our language reference. It includes all typical SQL query statement clauses, including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, etc. It also supports a full set of DML commands as well as table creation and versioned incremental alterations through our DDL commands. We try to follow the SQL standards wherever possible.

Incubating at Apache is no guarantee of success but it does mean sane licensing and a merit based organization/process.

If you are interested in non-NSA corrupted software, consider supporting the Apache Software Foundation.

Sense Preview

Filed under: BigData,Cloud Computing,Collaboration,Sense — Patrick Durusau @ 3:00 pm

Sense is in private beta but you can request an invitation.

Even though the presentation is well rehearsed, this is pretty damned impressive!

The bar for cloud based computing continues to rise.

Follow @senseplatform.

I first saw this at Danny Bickson’s Sense: collaborative data science in the cloud.

PS: Learn more about Sense at the 3rd GraphLab Conference.

Get savvy on the latest cloud terminology

Filed under: Cloud Computing,Vocabularies — Patrick Durusau @ 2:38 pm

Get savvy on the latest cloud terminology by Nick Hardiman.

From the post:

As with all technology, some lingo stays popular, while other phrases decline in use. Use this list to find out the newest terminology for all things cloud.

Some cloud terms, like cloudstorming, cloudware and external cloud, are declining in popularity. Other terms are up-and-coming, like vertical cloud.

This list gives all the latest lingo to keep you up-to-date on the most popular terms for all things cloud:

Nick has assembled a list of fifty-one (51) cloud terms.

Could be useful in creating a vocabulary a la schema.org.

As Nick says, the lingo is going to change. Using a microformat and vocabulary can help you maintain access to information.

For example, Nick says:

Cluster

a collection of machines that work together to deliver a customer service. Cloud clusters grow and shrink on-demand. A cloud service provides an API for scaling out a cluster, by adding more machines.

When quoting that, I could say:

<blockquote itemscope itemtype="http://schema.org/Thing"><link itemprop="sameAs" href="http://en.wikipedia.org/wiki/Cluster_%28computing%29"/>

Cluster

a collection of machines that work together to deliver a customer service. Cloud clusters grow and shrink on-demand. A cloud service provides an API for scaling out a cluster, by adding more machines. </blockquote>

Which would distinguish (when searching) a “cluster” of computers from one of the other 38 uses of “cluster” found at: en.wikipedia.org/wiki/Cluster

Rather than using “Thing” from schema.org, I really should find or make an extension to that vocabulary for terms in various areas that are relevant to topic maps.

Transparency and Bank Failures

Filed under: Finance Services,Open Data,Transparency — Patrick Durusau @ 11:40 am

The Relation Between Bank Resolutions and Information Environment: Evidence from the Auctions for Failed Banks by João Granja.

Abstract:

This study examines the impact of disclosure requirements on the resolution costs of failed banks. Consistent with the hypothesis that disclosure requirements mitigate information asymmetries in the auctions for failed banks, I find that, when failed banks are subject to more comprehensive disclosure requirements, regulators incur lower costs of closing a bank and retain a lower portion of the failed bank’s assets, while bidders that are geographically more distant are more likely to participate in the bidding for the failed bank. The paper provides new insights into the relation between disclosure and the reorganization of a banking system when the regulators’ preferred plan of action is to promote the acquisition of undercapitalized banks by healthy ones. The results suggest that disclosure regulation policy influences the cost of resolution of a bank and, as a result, could be an important factor in the definition of the optimal resolution strategy during a banking crisis event.

A reminder that transparency needs to be broader than open data in science and government.

In the case of bank failures, transparency lowers the cost of such failures for the public.

Some interests profit from less transparency in bank failures and other interests (like the public) profit from greater transparency.

If bank failure doesn’t sound like a current problem, consider Map of Banks Failed since 2008. (Select from Failed Banks Map (under Quick Links) to display the maps.) U.S. only. Do you know of a similar map for other countries?

Speaking of transparency, it would be interesting to track the formal, financial and social relationships of those acquiring failed bank assets.

You know, the ones that are selling for less than fair market value due to a lack of transparency.

Everpix-Intelligence [Failed Start-up Data Set]

Filed under: Data,Dataset — Patrick Durusau @ 11:09 am

Everpix-Intelligence

From the webpage:

About Everpix

Everpix was started in 2011 with the goal of solving the Photo Mess, an increasingly real pain point in people’s life photo collections, through ambitious engineering and user experience. Our startup was angel and VC funded with $2.3M raised over its lifetime.

After 2 years of research and product development, and although having a very enthusiastic user base of early adopters combined with strong PR momentum, we didn’t succeed in raising our Series A in the highly competitive VC funding market. Unable to continue operating our business, we had to announce our upcoming shutdown on November 5th, 2013.

High-Level Metrics

At the time of its shutdown announcement, the Everpix platform had 50,000 signed up users (including 7,000 subscribers) with 400 million photos imported, while generating subscription sales of $40,000 / month during the last 3 months (i.e. enough money to cover variable costs, but not the fixed costs of the business).

Complete Dataset

Building a startup is about taking on a challenge and working countless hours on solving it. Most startups do not make it but rarely do they reveal the story behind, leaving their users often frustrated. Because we wanted the Everpix community to understand some of the dynamics in the startup world and why we had to come to such a painful ending, we worked closely with a reporter from The Verge who chronicled our last couple weeks. The resulting article generated extensive coverage and also some healthy discussions around some of our high-level metrics and financials. There was a lot more internal data we wanted to share but it wasn’t the right time or place.

With the Everpix shutdown behind us, we had the chance to put together a significant dataset covering our business from fundraising to metrics. We hope this rare and uncensored inside look at the internals of a startup will benefit the startup community.

Here are some examples of common startup questions this dataset helps answer:

  • What are investment terms for consecutive convertible notes and an equity seed round? What does the end cap table look like? (see here)
  • How does a Silicon Valley startup spend its raised money during 2 years? (see here)
  • What does a VC pitch deck look like? (see here)
  • What kinds of reasons do VCs give when they pass? (see here)
  • What are the open rate and click rate of transactional and marketing emails? (see here)
  • What web traffic do various news websites generate? (see here and here)
  • What are the conversion rate from product landing page to sign up for new visitors? (see here)
  • How fast do people purchase a subscription after signing up to a freemium service? (see here and here)
  • Which countries have higher subscription rates? (see here and here)

The dataset is organized as follows:

Every IT startup, but especially data-oriented startups, should work with this data set before launch.

I thought the comments from VCs were particularly interesting.

I would summarize those comments as:

  1. There is a problem.
  2. You have a great idea to solve the problem.
  3. Will consumers pay you to solve the problem?

What evidence do you have on #3?

Bearing in mind that "should," "ought to," "the value is obvious," etc., are wishes, not evidence.

I first saw this in a tweet by Emil Eifrem.

January 11, 2014

Reality Gap In War on Terrorism

Filed under: NSA,Security — Patrick Durusau @ 9:06 pm

Andy Oram writes in How did we end up with a centralized Internet for the NSA to mine?:

Having lived through the Boston Marathon bombing, I understand what the NSA claims to be fighting, and I am willing to seek some compromise between their needs for spooking and the protections of the Fourth Amendment to the US Constitution.

You may still remember the Boston Marathon bombing: a couple of malcontents who were already known to the authorities, so there was no need for NSA action or any compromise on the Fourth Amendment to the US Constitution.

There is no defense against one or two people committing a criminal act.

Consider bank robberies. Guess where they all occur. Did you say at banks?

Despite knowing where banks are located, the FBI reported 5,014 bank robberies for 2011.

I am sure anyone who was at any of those robberies was terrified. But we don’t get patted down to go into a bank.

Having a crime on TV (like the Boston Marathon bombing) is no reason to start trading constitutional rights for fictional security.

Crimes happen. Comfort the victims, find suspects (if possible) within the bounds of the Constitution, and roll on.

If we treat terrorist acts as crimes, just garden variety crimes, our recovery from the hysteria over terrorism will have begun.

ScaleGraph

Filed under: Graphs,ScaleGraph — Patrick Durusau @ 8:39 pm

ScaleGraph

From the webpage:

Recently large-scale graphs with billions of vertices and edges have emerged in a variety of domains and disciplines especially in the forms of social networks, web link graphs, internet topology graphs, etc. Mining these graphs to discover hidden knowledge requires particular middleware and software libraries that can harness the full potential of large-scale computing infrastructures such as super computers.

ScaleGraph is a graph library based on the highly productive X10 programming language. The goal of ScaleGraph is to provide large-scale graph analysis algorithms and efficient distributed computing framework for graph analysts and for algorithm developers, respectively.

If that seems a little sparse (sorry), try the slides from the first ScaleGraph workshop. I promise it will be time well spent.

Download ScaleGraph 2.1.

Some other resources you may find helpful:

http://x10-lang.org/

X10 Research at IBM

X10 Language Specification Version 2.4

I first saw this at Danny Bickson’s ScaleGraph: a new graph processing system.

Registration for the 3rd GraphLab Conference is open!

Filed under: Conferences,GraphLab,Graphs — Patrick Durusau @ 8:22 pm

Registration for the 3rd GraphLab Conference is open! by Danny Bickson.

From the post:

Join us for a full day of the latest and greatest applied machine learning and big data analytics!

Monday July 21, 2014 at the Nikko Hotel SF. Confirmed speakers (very preliminary list): GraphLab, Google, Trifacta, Datapad, Databricks (Spark), Pandora Internet Radio, Cloudera. Confirmed demos: QuantiFind, bigML, Skytree, YarcData, Saffron Technology, Franz.

Additional information
Registration

Very cool!

The GraphLab conferences have been a great success.

Besides, it’s in July, San Francisco, + graphs. What more could you want? 😉

Why is the NSA grabbing all your private data?

Filed under: Cybersecurity,NSA,Security — Patrick Durusau @ 8:05 pm

Why is the NSA grabbing all your private data? by Daniel Lemire.

From the post:

Snowden revealed to the world that the NSA was systematically spying on all of us. Maybe more critically, we have learned that the NSA is spying on all American citizens. In fact, the NSA is even spying on its own congress. This spying violates the US constitution.

We also know that such spying is ineffective when it comes to stopping terrorists. A cost-benefit analysis shows that the NSA is wasteful.

So why are they doing it?

They are doing it precisely because it is very expensive.
….

To expand Daniel’s point a bit, the war on terrorism isn’t about national security any more than the war on drugs was about reducing drug use.

Both were excuses to spend large amounts of government money with no measurable goals or metrics for success or failure.

Remember the old saying: If you can’t measure it, you can’t manage it.

Spending the funds in secret budget allocations serves to further conceal its lack of value.

But it also points towards a solution to the surveillance/privacy issue.

Congress should pass secret budgets for scientific, medical and humanities research projects that dwarf the war on terror budgets.

Contractors pushing the surveillance agenda will switch over to the larger budgets. Still no measurable results, but the projects won't involve invading the privacy of people worldwide.

With larger budgets in sight, the supporters of surveillance will move to greener pastures.

Winter 2013 Crawl Data Now Available

Filed under: Common Crawl,Data,Nutch — Patrick Durusau @ 7:42 pm

Winter 2013 Crawl Data Now Available by Lisa Green.

From the post:

The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages and is 148TB in size. The new data is located in the aws-publicdatasets at /common-crawl/crawl-data/CC-MAIN-2013-48/

In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko and we are extremely grateful for blekko’s ongoing support!

In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.

Data to play with now and the promise of more to come! Can’t argue with that!

Learning more about Common Crawl’s use of Nutch will be fun as well.
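
If you want to poke at the new crawl, here is a minimal boto3 sketch that lists a few objects under the prefix quoted above (anonymous access; bucket names and prefixes for public datasets move over time, so treat the exact location as an assumption):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous S3 client -- the crawl is a public dataset
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(
    Bucket="aws-publicdatasets",
    Prefix="common-crawl/crawl-data/CC-MAIN-2013-48/",
    MaxKeys=20,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])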

January 10, 2014

Dirty Wind Paths

Filed under: Climate Data,Climate Informatics,Graphics,Visualization,Weather Data — Patrick Durusau @ 6:38 pm

earth wind patterns

Interactive display of wind patterns on the Earth. Turn the globe, zoom in, etc.

Useful the next time a nuclear power plant cooks off.

If you doubt the “next time” part of that comment, review Timeline: Nuclear plant accidents from the BBC.

I count eleven (11) “serious” incidents between 1957 and 2014.

Highly dangerous activities are subject to catastrophic failure. Not every time or even often.

On the other hand, how often is an equivalent of the two U.S. space shuttle failures acceptable for a nuclear power plant?

If I were living nearby or in the wind path from a nuclear accident, I would say never.

You?

According to Dustin Smith at Chart Porn, where I first saw this, the chart updates every three hours.

…Customizable Test Data with Python

Filed under: Data,Python — Patrick Durusau @ 5:15 pm

A Tool to Generate Customizable Test Data with Python by Alec Noller.

From the post:

Sometimes you need a dataset to run some tests – just a bunch of data, anything – and it can be unexpectedly difficult to find something that works. There are some useful and readily-available options out there; for example, Matthew Dubins has worked with the Enron email dataset and a complete list of 9/11 victims.

However, if you have more specific needs, particularly when it comes to format and fitting within the structure of a database, and you want to customize your dataset to test one thing or another in particular, take a look at this Python package called python-testdata used to generate customizable test data. It can be set up to generate names in various forms, companies, addresses, emails, and more. The Github also includes some help to get started, as well as examples for use cases.

I hesitated when I first saw this given the overabundance of free data.

But then with “free” data, if it is large enough, you will have to rely on sampling to gauge the performance of software.

Introducing the hazards and dangers of strange data may not be acceptable in all cases.
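
If you just want the flavor of generating customizable test data, here is a minimal sketch using only the Python standard library (this is not the python-testdata API, just the same idea):

import csv
import random
import string

FIRST = ["Ada", "Grace", "Alan", "Edsger", "Barbara"]
LAST = ["Lovelace", "Hopper", "Turing", "Dijkstra", "Liskov"]

def fake_row():
    # one synthetic record shaped like a database row
    first, last = random.choice(FIRST), random.choice(LAST)
    domain = "".join(random.choices(string.ascii_lowercase, k=6)) + ".example"
    return {
        "name": f"{first} {last}",
        "email": f"{first.lower()}.{last.lower()}@{domain}",
        "age": random.randint(18, 90),
    }

with open("testdata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "email", "age"])
    writer.writeheader()
    writer.writerows(fake_row() for _ in range(100))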

WTFViz, ThumbsUpViz, and HelpMeViz

Filed under: Graphics,Visualization — Patrick Durusau @ 4:48 pm

WTFViz, ThumbsUpViz, and HelpMeViz by Robert Kosara.

Robert gives a quick heads up on three new visualization sites:

WTFViz – Visualizations done poorly and/or just wrong.

ThumbsUpViz – Good visualizations.

HelpMeViz – Where you can try to avoid WTFViz and hope to be seen on ThumbsUpViz.

More resources listed in Robert’s post.

Where are your visualizations destined?

PS: You owe it to your users to avoid what you see at WTFViz. I was stunned.

How a New Type of Astronomy…

Filed under: Astroinformatics,Data Mining — Patrick Durusau @ 4:38 pm

How a New Type of Astronomy Investigates the Most Mysterious Objects in the Universe by Sarah Scoles.

From the post:

In 2007, astronomer Duncan Lorimer was searching for pulsars in nine-year-old data when he found something he didn’t expect and couldn’t explain: a burst of radio waves appearing to come from outside our galaxy, lasting just 5 milliseconds but possessing as much energy as the sun releases in 30 days.

Pulsars, Lorimer’s original objects of affection, are strange enough. They’re as big as cities and as dense as an atom’s nucleus, and each time they spin around (which can be hundreds of times per second), they send a lighthouse-like beam of radio waves in our direction. But the single burst that Lorimer found was even weirder, and for years astronomers couldn’t even decide whether they thought it was real.

Tick, Tock

The burst belongs to a class of phenomena known as “fast radio transients” – objects and events that emit radio waves on ultra-short timescales. They could include stars’ flares, collisions between black holes, lightning on other planets, and RRATs – Rotating RAdio Transients, pulsars that only fire up when they feel like it. More speculatively, some scientists believe extraterrestrial civilizations could be flashing fast radio beacons into space.

Astronomers’ interest in fast radio transients is just beginning, as computers chop data into ever tinier pockets of time. Scientists call this kind of analysis “time domain astronomy.” Rather than focusing just on what wavelengths of light an object emits or how bright it is, time domain astronomy investigates how those properties change as the seconds, or milliseconds, tick by.

In non-time-domain astronomy, astronomers essentially leave the telescope’s shutter open for a while, as you would if you were using a camera at night. With such a long exposure, even if a radio burst is strong, it could easily disappear into the background. But with quick sampling – in essence, snapping picture after picture, like a space stop-motion film – it’s easier to see things that flash on and then disappear.

“The awareness of these short signals has long existed,” said Andrew Siemion, who searches the time domain for signs of extraterrestrial intelligence. “But it’s only the past decade or so that we’ve had the computational capacity to look for them.”

Gathering serious data for radio astronomy remains the task of professionals but the reference to mining old data and discovering transients caught my eye.

Among other places to look for more information: National Radio Astronomy Observatory (NRAO).

Or consider Detecting radioastronomical "Fast Radio Transient Events" via an OODT-based metadata processing by Chris Mattmann, et al. at ApacheCon 2013.

Understandably, professional interest is in real time processing of their data streams but that doesn’t mean treasures aren’t still lurking in historical data.

SIMR – Spark on top of Hadoop

Filed under: Hadoop,Spark — Patrick Durusau @ 3:58 pm

SIMR – Spark on top of Hadoop by Danny Bickson.

From the post:

Just learned from my collaborator Aapo Kyrola that the Spark team have now released a plugin which allows running Spark on top of Hadoop, without installing anything and without administrator privileges. This will probably encourage many more companies to try out Spark, which significantly improves on Hadoop performance.

The tools for data are getting easier to use every day.

Which moves the semantic wall a little closer with each improvement.

Efficiently processing TB of data only to confess it isn’t clear what the data may or may not mean, isn’t going to win IT any friends.

xkcd 1313: Something is Wrong on the Internet!

Filed under: Python,Regex — Patrick Durusau @ 3:41 pm

xkcd 1313: Something is Wrong on the Internet!

Serious geekdom here!

An xkcd comic inspires an algorithm that generates a regex to extract winners from U.S. presidential elections. (Applicable to other lists as well.)

Remembering that some U.S. presidents both won and lost races for the presidency.

A very clever piece of work. At the same time, I must have the winner/loser lists in order to generate the regex.

So it's a good exercise, but I can't apply it beyond the lists I used to generate the regex.

Yes?

BTW, do make a trip by Regex Golf to try your hand at writing regexes against different lists.
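
For reference, here is a minimal sketch of the greedy approach behind that algorithm (with short hypothetical name lists, not the full winner/loser data): collect substrings of winners that match no loser, then keep picking the part that covers the most remaining winners:

import re

def candidate_parts(winners, losers, max_len=4):
    # substrings (up to max_len characters) of winner names that match no loser
    parts = set()
    for w in winners:
        for i in range(len(w)):
            for j in range(i + 1, min(len(w), i + max_len) + 1):
                part = re.escape(w[i:j])
                if not any(re.search(part, loser) for loser in losers):
                    parts.add(part)
    return parts

def find_regex(winners, losers):
    # greedy set cover: repeatedly take the part matching the most
    # still-uncovered winners, then OR the chosen parts together
    pool = candidate_parts(winners, losers)
    uncovered, chosen = set(winners), []
    while uncovered:
        best = max(pool, key=lambda p: sum(bool(re.search(p, w)) for w in uncovered))
        if not any(re.search(best, w) for w in uncovered):
            raise ValueError("a winner has no distinguishing substring")
        chosen.append(best)
        uncovered = {w for w in uncovered if not re.search(best, w)}
    return "|".join(chosen)

winners = ["washington", "lincoln", "obama", "reagan"]  # hypothetical short lists
losers = ["dole", "romney", "mccain", "goldwater"]
print(find_regex(winners, losers))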

Getting Started with OrientDB

Filed under: Graphs,OrientDB — Patrick Durusau @ 3:22 pm

Getting Started with OrientDB by Petter Graff.

From the post:

In my previous blog post I praised the open-source database Orient DB. I’ve received many emails from people since, asking me questions about Orient DB or telling me that they’d love to try Orient DB out but don’t know where to start.

I thought it was time for me to contribute to the Orient DB community and write a short tutorial for exactly this audience. Ideally, I'd like to ensure that if you sit through the videos below, you'd at least have an informed view of whether Orient DB is of interest to you, your project and your organization.

Think of this blog post as the first installment. I may come back with some more advanced tutorials later. This tutorial shows you how to:

  • Download and install Orient DB
  • Play with the Orient DB studio and perform queries on the Grateful Dead database provided with the community edition
  • Create your own database (again using Orient DB studio, perhaps later I’ll provide a tutorial to show how to do this from Node or Java)
  • Create server-side functions
  • Perform graph- and document-based queries over Orient DB

I hope this tutorial gives you a good idea of how Orient DB works and some of its power. I also hope it will help illustrate the reasons for my enthusiasm in the previous post where I argued the case for Orient DB. In this post I’ll focus on how you get started.

My only quibble with this post is:

I would assume only a percentage of the readers are familiar with Grateful Dead, so let me give you a few sentences about them.

Really? I guess it depends on what you mean by "percentage." 1%? 2%? 😉

Excellent introduction! Forward and watch for future installments.

CouchDB + ElasticSearch on Ubuntu 13.10 VPS

Filed under: CouchDB,ElasticSearch — Patrick Durusau @ 2:16 pm

How To Set Up CouchDB with ElasticSearch on an Ubuntu 13.10 VPS by Cooper Thompson.

From the post:

CouchDB is a NoSQL database that stores data as JSON documents. It is extremely helpful in situations where a schema would cause headaches and a flexible data model is required. CouchDB also supports master-master continuous replication, which means data can be continuously replicated between two databases without having to setup a complex system of master and slave databases.

ElasticSearch is a full-text search engine that indexes everything and makes pretty much anything searchable. This works extremely well with CouchDB because one of the limitations of CouchDB is that for all queries you have to either know the document ID or you have to use map/reduce.

This looks like a very useful installation guide if you are just starting with CouchDB and/or ElasticSearch.

I say "looks like" because the article is undated. The only way I know it is relatively recent is that it refers to ElasticSearch 0.90.8 and the latest release of ElasticSearch is 0.90.10.

Dating posts, guides, etc. really isn’t that hard and it helps readers avoid out-dated material.
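
As a quick taste of the two HTTP APIs involved, here is a minimal sketch with the Python requests library; the database/index name is hypothetical, the default local ports are assumed, and it presumes the indexing step from the guide has already wired CouchDB to ElasticSearch:

import requests

COUCH = "http://localhost:5984"
ES = "http://localhost:9200"

# create a CouchDB database and store a JSON document in it
requests.put(f"{COUCH}/articles")
requests.post(f"{COUCH}/articles",
              json={"title": "Hello CouchDB", "body": "flexible schema, JSON docs"})

# once an ElasticSearch index exists for that database (see the guide),
# full-text search runs over the same documents
hits = requests.get(f"{ES}/articles/_search", params={"q": "flexible"}).json()
print(hits.get("hits", {}).get("total"))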

Faunus & Titan 0.4.2 Released

Filed under: Faunus,Graphs,Titan — Patrick Durusau @ 1:54 pm

Faunus & Titan 0.4.2 Released by Dan LaRocque.

From the post:

Aurelius is pleased to announce the release of Titan and Faunus 0.4.2.

This is mainly a bugfix release. Of particular note is a pair of Titan bugs involving deletion of edges with multiple properties and of edges labeled with reverse-ordered sort keys. Titan also gets a few new configuration options and expanded Metrics coverage in this release.

Downloads:

* Titan: https://github.com/thinkaurelius/titan/wiki/Downloads#titan-042
* Faunus: https://github.com/thinkaurelius/faunus/wiki/Downloads

Something for your weekend!

The Cloudera Developer Newsletter: Itā€™s For You!

Filed under: BigData,Cloudera,Hadoop — Patrick Durusau @ 12:01 pm

The Cloudera Developer Newsletter: Itā€™s For You! by Justin Kestelyn.

From the post:

Developers and data scientists, we realize you’re special – as are operators and analysts, in their own particular ways.

For that reason, we are very happy to kick off 2014 with a new free service designed for you and other technical end-users in the Cloudera ecosystem: the Cloudera Developer Newsletter.

This new email-based newsletter contains links to a curated list of new how-to’s, docs, tools, engineer and community interviews, training, projects, conversations, videos, and blog posts to help you get a new Apache Hadoop-based enterprise data hub deployment off the ground, or get the most value out of an existing deployment. Look for a new issue every month!

All you have to do is click the button below, provide your name and email address, tick the “Developer Community” check-box, and submit. Done! (Of course, you can also opt-in to several other communication channels if you wish.)

The first newsletter is due to appear at the end of January, 2014.

Given the quality of other Cloudera resources I look forward to this newsletter with anticipation!
