Another Word For It: Patrick Durusau on Topic Maps and Semantic Diversity

May 31, 2013

“You Know Because I Know”

Filed under: Graphs,Multidimensional,Networks — Patrick Durusau @ 5:06 pm

“You Know Because I Know”: a Multidimensional Network Approach to Human Resources Problem by Michele Coscia, Giulio Rossetti, Diego Pennacchioli, Damiano Ceccarelli, Fosca Giannotti.

Abstract:

Finding talents, often among the people already hired, is an endemic challenge for organizations. The social networking revolution, with online tools like Linkedin, made possible to make explicit and accessible what we perceived, but not used, for thousands of years: the exact position and ranking of a person in a network of professional and personal connections. To search and mine where and how an employee is positioned on a global skill network will enable organizations to find unpredictable sources of knowledge, innovation and know-how. This data richness and hidden knowledge demands for a multidimensional and multiskill approach to the network ranking problem. Multidimensional networks are networks with multiple kinds of relations. To the best of our knowledge, no network-based ranking algorithm is able to handle multidimensional networks and multiple rankings over multiple attributes at the same time. In this paper we propose such an algorithm, whose aim is to address the node multi-ranking problem in multidimensional networks. We test our algorithm over several real world networks, extracted from DBLP and the Enron email corpus, and we show its usefulness in providing less trivial and more flexible rankings than the current state of the art algorithms.

Although framed in a human resources context, it isn’t much of a jump to see this work as applicable to other multidimensional networks.

Including multidimensional networks of properties that define subject identities.
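
To make the term concrete, here is a minimal Python sketch of a multidimensional network with a naive per-dimension degree ranking. It is not the paper’s algorithm (which propagates scores across dimensions) and the nodes and relation types are invented.

from collections import defaultdict

# Toy multidimensional network: one edge list per kind of relation (dimension).
# A real multi-ranking, as in the paper, would combine evidence across dimensions
# rather than rank each one in isolation.
edges = {
    "co-authorship": [("ann", "bob"), ("bob", "cho")],
    "email":         [("ann", "cho"), ("cho", "dia"), ("bob", "dia")],
}

def degree_rankings(edges):
    """Naive baseline: rank nodes by degree separately in each dimension."""
    rankings = {}
    for dimension, pairs in edges.items():
        degree = defaultdict(int)
        for u, v in pairs:
            degree[u] += 1
            degree[v] += 1
        rankings[dimension] = sorted(degree.items(), key=lambda kv: -kv[1])
    return rankings

print(degree_rankings(edges))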

Reidentification as Basic Science

Filed under: Identification,Identity,Reidentification — Patrick Durusau @ 4:41 pm

Reidentification as Basic Science by Arvind Narayanan.

From the post:

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

(…)

A nice introduction to the major contours of reidentification, which the IT Law Wiki defines as:

Data re-identification is the process by which personal data is matched with its true owner.

Although in topic map speak I would usually say that personal data was used to identify its owner.

In a reidentification context, some effort has been made to obscure that relationship, so matching may be the better usage.

Depending on your data sources, something you may encounter when building a topic map.

I first saw this at Pete Warden’s Five short links.

New Milestone Release Neo4j 2.0.0-M03

Filed under: Graphs,Neo4j — Patrick Durusau @ 4:17 pm

New Milestone Release Neo4j 2.0.0-M03 by Michael Hunger.

From the post:

The latest M03 milestone release of Neo4j 2.0 is as you expected all about improvements to Cypher. This blog post also discusses some changes made in the last milestone (M02) which we didn’t fully cover.

MERGE

Cypher now contains a MERGE clause which is pretty big: It will be replacing CREATE UNIQUE as it also takes indexes and labels into account and can even be used for single node creation. MERGE either matches the graph and returns what is there (one or more results) or if it doesn’t find anything it creates the path given. So after the MERGE operation completes, Neo4j guarantees that the declared pattern is there.

We also added additional clauses to the MERGE statement which allow you to create or update properties as a function of whether the node was matched or created. Please note that — as patterns can contain multiple named nodes and relationships — you will have to specify the element for which you want to trigger an update operation upon creation or match.

MERGE (keanu:Person { name:'Keanu Reeves' })
ON CREATE keanu SET keanu.created = timestamp()
ON MATCH  keanu SET keanu.lastSeen = timestamp()
RETURN keanu

We put MERGE out to mainly collect feedback on the syntax and usage, there are still some caveats, like not grabbing locks for unique creation so you might end up with duplicate nodes for now. That will all be fixed by the final release.

Going along with MERGE, MATCH now also supports single node patterns, both with and without labels.

(…)

MERGE is definitely something to investigate in this milestone release.
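
If you want to try MERGE against the milestone, a quick route is the Neo4j 2.0 transactional HTTP endpoint. A minimal Python sketch, assuming a local server at the default port; the ON CREATE / ON MATCH syntax is quoted as it stands in M03:

import json
import requests

# MERGE as quoted in the post (M03 milestone syntax), parameterized on the name.
query = """
MERGE (keanu:Person { name: {name} })
ON CREATE keanu SET keanu.created = timestamp()
ON MATCH  keanu SET keanu.lastSeen = timestamp()
RETURN keanu
"""

payload = {"statements": [{"statement": query,
                           "parameters": {"name": "Keanu Reeves"}}]}
response = requests.post(
    "http://localhost:7474/db/data/transaction/commit",  # 2.0 transactional endpoint
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json())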

You need to also take a close look at labels.

What issues, if any, do you see with the label mechanism?

I see several but will cover them early next week. Work up your list (if any) to see if we reach similar conclusions.

Freeing Information From Its PDF Chains

Filed under: node-js,PDF — Patrick Durusau @ 3:58 pm

Parsing PDFs at Scale with Node.js, PDF.js, and Lunr.js by Gary Sieling.

From the post:

Much information is trapped inside PDFs, and if you want to analyze it you’ll need a tool that extracts the text contents. If you’re processing many PDFs (XX millions), this takes time but parallelizes naturally. I’ve only seen this done on the JVM, and decided to do a proof of concept with new Javascript tools. This runs Node.js as a backend and uses PDF.js, from Mozilla Labs, to parse PDFs. A full-text index is also built, the beginning of a larger ingestion process.

I like the phrase “[m]uch information is trapped inside PDFs….”

Despite window dressing executive orders, information is going to continue to be trapped inside PDFs.

What information do you want to free from its PDF chains?
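
The post’s stack is Node.js, PDF.js and Lunr.js. As a language-neutral sketch of the same shape (extract text in parallel, then build a small full-text index), here is a hedged Python version using pdfminer.six for extraction and a naive in-memory inverted index standing in for Lunr.js:

import re
from collections import defaultdict
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

from pdfminer.high_level import extract_text  # pdfminer.six stands in for PDF.js


def extract(path):
    """Pull plain text out of one PDF; extraction parallelizes naturally per file."""
    return str(path), extract_text(str(path))


def build_index(pdf_dir="pdfs"):
    index = defaultdict(set)  # token -> set of file paths (stand-in for Lunr.js)
    paths = list(Path(pdf_dir).glob("*.pdf"))
    with ProcessPoolExecutor() as pool:
        for path, text in pool.map(extract, paths):
            for token in re.findall(r"\w+", text.lower()):
                index[token].add(path)
    return index


if __name__ == "__main__":
    index = build_index()
    print(sorted(index.get("topic", set())))  # which PDFs mention "topic"?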

I first saw this at DZone.

Creating Effective Slides

Filed under: Communication,Presentation — Patrick Durusau @ 3:35 pm

A lecture given by Jean-luc Doumont on April 4, 2013, at the Clark Center, Stanford University.

Description:

Those of us who frequently attend presentations probably agree that most slides out there are ineffective, often detracting from what presenters are saying instead of enhancing their presentation. Slides have too much text for us to want to read them, or not enough for us to understand the point. They impress us with colors, clip art, and special effects, but not with content. As a sequence of information chunks, they easily create a feeling of tedious linearity. Based on Dr Doumont’s book, Trees, maps, and theorems about “effective communication for rational minds,” this lecture will discuss how to create more effective slides. Building on three simple yet solid principles, it will establish what (not) to include on a slide and why, how to optimize the slide’s layout to get the message across effectively, and how to use slides appropriately when delivering the presentation.

A truly delightful presentation on creating effective slides.

Even has three laws:

  1. Adapt to your audience
  2. Maximize the signal/noise ratio
  3. Use effective redundancy

Should be required viewing for conference presenters, at least annually.

Website with more resources: Principiæ.

I first saw this at Creating effective slides: Design, Construction, and Use in Science by Bruce Berriman.

Visualizing the News with VivaGraphJS

Filed under: AlchemyAPI,DBpedia,Graphs,Neo4j,Visualization — Patrick Durusau @ 2:17 pm

Visualizing the News with Vivagraph.js by Max De Marzi.

From the post:

Today I want to introduce you to VivaGraphJS – a JavaScript Graph Drawing Library made by Andrei Kashcha of Yasiv. It supports rendering graphs using WebGL, SVG or CSS formats and currently supports a force directed layout. The Library provides an API which tracks graph changes and reflect changes on the rendering surface which makes it fantastic for graph exploration.

The post includes AlchemyAPI (entity extraction), DBpedia (additional information), Feedzilla (news feeds), and Neo4j (graphs).

The technology rocks but the content, well, your mileage will vary.

Big Data RDF Store Benchmarking Experiences

Filed under: Benchmarks,BigData,RDF — Patrick Durusau @ 1:48 pm

Big Data RDF Store Benchmarking Experiences by Peter Boncz.

From the post:

Recently we were able to present new BSBM results, testing the RDF triple stores Jena TDB, BigData, BIGOWLIM and Virtuoso on various data sizes. These results extend the state-of-the-art in various dimensions:

  • scale: this is the first time that RDF store benchmark results on such a large size have been published. The previously published BSBM results were on 200M triples; the 150B experiments thus mark a 750x increase in scale.
  • workload: this is the first time that results on the Business Intelligence (BI) workload are published. In contrast to the Explore workload, which features short-running “transactional” queries, the BI workload consists of queries that go through possibly billions of triples, grouping and aggregating them (using the respective functionality, new in SPARQL1.1).
  • architecture: this is the first time that RDF store technology with cluster functionality has been publicly benchmarked.

Clusters are great but also difficult to use.

Peter’s post is one of those rare ones that exposes the second half of that statement.

Impressive hardware and results.

Given the hardware and effort required, are we pursuing “big data” for the sake of “big data?”

Not just where RDF is concerned but in general?

Shouldn’t the first question always be: What is the relevant data?

If you can’t articulate the relevant data, isn’t that a commentary on your understanding of the problem?

Scala 2013 Overview

Filed under: Functional Programming,Programming,Scala — Patrick Durusau @ 10:09 am

Scala 2013 Overview by Sagie Davidovich.

An impressive set of slides on Scala.

Work through all of them and you won’t be a Scala expert but well on your way.

I first saw this at Nice Scala Tutorial by Danny Bickson.

Going Bright… [Hack Shopping Mall?]

Filed under: Cybersecurity,Marketing,Security — Patrick Durusau @ 9:40 am

Going Bright: Wiretapping without Weakening Communications Infrastructure by Steven M. Bellovin, Matt Blaze, Sandy Clark, and Susan Landau (unofficial version). (Steven M. Bellovin, Matt Blaze, Sandy Clark, Susan Landau, “Going Bright: Wiretapping without Weakening Communications Infrastructure,” IEEE Security & Privacy, vol. 11, no. 1, pp. 62-72, Jan.-Feb. 2013, doi:10.1109/MSP.2012.138)

Abstract:

Mobile IP-based communications and changes in technologies, including wider use of peer-to-peer communication methods and increased deployment of encryption, has made wiretapping more difficult for law enforcement, which has been seeking to extend wiretap design requirements for digital voice networks to IP network infrastructure and applications. Such an extension to emerging Internet-based services would create considerable security risks as well as cause serious harm to innovation. In this article, the authors show that the exploitation of naturally occurring weaknesses in the software platforms being used by law enforcement’s targets is a solution to the law enforcement problem. The authors analyze the efficacy of this approach, concluding that such law enforcement use of passive interception and targeted vulnerability exploitation tools creates fewer security risks for non-targets and critical infrastructure than do design mandates for wiretap interfaces.

The authors argue against an easy-on-ramp for law enforcement to intercept digital communications.

What chance is there a non-law enforcement person could discover such back doors and also be so morally depraved as to take advantage of them?

What could possibly go wrong with a digital back door proposal? 😉

No lotteries for 0-day vulnerabilities but the article does mention:

Secunia, https://secunia.com/community/advisories

VulnerabilityLab, www.vulnerability-lab.com

Vupen, www.vupen.com/english/services/solutions-gov.php

ZDI, http://dvlabs.tippingpoint.com/advisories/disclosure-policy

as offering

subscription services that make available varying levels of access information about 0-day vulnerabilities to their clients.

As far as the FBI is concerned, they should adapt to changing technology and stop being a drag on communications technology.

You do know they still don’t record interviews with witnesses?

How convenient when it comes time for a trial on obstruction of justice or perjury. All the evidence is an agent’s notes of the conversation.

BTW, in case you are looking for a cybersecurity/advertising opportunity: have you seen those services that gather up software packages for comparison price shopping?

Why not a service that gathers up software packages and displays unresolved (and/or historical) hacks on those products?

With ads from security services, hackers, etc.

A topic map powered hack shopping mall as it were.

May 30, 2013

Hadoop Tutorials: Real Life Use Cases in the Sandbox

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:56 pm

Hadoop Tutorials: Real Life Use Cases in the Sandbox by Cheryle Custer.

Six (6) new tutorials from Hortonworks:

  • Tutorial 6 – Loading Data into the Hortonworks Sandbox
  • Tutorials 7 & 11 – Installing the ODBC Driver in the Hortonworks Sandbox (Windows and Mac)
  • Tutorials 8 & 9 – Accessing and Analyzing Data in Excel
  • Tutorial 10 – Visualizing Clickstream Data

You have done the first five (5).

Yes?

Hortonworks Data Platform 1.3 Release

Filed under: Hadoop,Hortonworks,MapReduce — Patrick Durusau @ 7:51 pm

Hortonworks Data Platform 1.3 Release: The community continues to power innovation in Hadoop by Jeff Sposetti.

From the post:

HDP 1.3 release delivers on community-driven innovation in Hadoop with SQL-IN-Hadoop, and continued ease of enterprise integration and business continuity features.

Almost one year ago (50 weeks to be exact) we released Hortonworks Data Platform 1.0, the first 100% open source Hadoop platform into the marketplace. The past year has been dynamic to say the least! However, one thing has remained constant: the steady, predictable cadence of HDP releases. In September 2012 we released 1.1, this February gave us 1.2 and today we’re delighted to release HDP 1.3.

HDP 1.3 represents yet another significant step forward and allows customers to harness the latest innovation around Apache Hadoop and its related projects in the open source community. In addition to providing a tested, integrated distribution of these projects, HDP 1.3 includes a primary focus on enhancements to Apache Hive, the de-facto standard for SQL access in Hadoop as well as numerous improvements that simplify ease of use.

Whatever the magic dust is for a successful open source project, the Hadoop community has it in abundance.

Distributing the Edit History of Wikipedia Infoboxes

Filed under: Data,Dataset,Wikipedia — Patrick Durusau @ 7:44 pm

Distributing the Edit History of Wikipedia Infoboxes by Enrique Alfonseca.

From the post:

Aside from its value as a general-purpose encyclopedia, Wikipedia is also one of the most widely used resources to acquire, either automatically or semi-automatically, knowledge bases of structured data. Much research has been devoted to automatically building disambiguation resources, parallel corpora and structured knowledge from Wikipedia. Still, most of those projects have been based on single snapshots of Wikipedia, extracting the attribute values that were valid at a particular point in time. So about a year ago we compiled and released a data set that allows researchers to see how data attributes can change over time.

(…)

For this reason, we released, in collaboration with Wikimedia Deutschland e.V., a resource containing all the edit history of infoboxes in Wikipedia pages. While this was already available indirectly in Wikimedia’s full history dumps, the smaller size of the released dataset will make it easier to download and process this data. The released dataset contains 38,979,871 infobox attribute updates for 1,845,172 different entities, and it is available for download both from Google and from Wikimedia Deutschland’s Toolserver page. A description of the dataset can be found in our paper WHAD: Wikipedia Historical Attributes Data, accepted for publication at the Language Resources and Evaluation journal.

How much data do you need beyond the infoboxes of Wikipedia?

And knowing what values were in the past … isn’t that like knowing prior identifiers for subjects?
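
A minimal sketch of that idea: given update records of the form (entity, attribute, new value, timestamp), collect the full value history per entity attribute. The field names are assumptions, not the released file format, so check them against the WHAD paper:

import csv
from collections import defaultdict

def value_history(path):
    """Map (entity, attribute) -> list of (timestamp, value), oldest first.
    Assumes a CSV with entity, attribute, new_value, timestamp columns;
    the real dump's layout should be checked against the WHAD paper."""
    history = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            history[(row["entity"], row["attribute"])].append(
                (row["timestamp"], row["new_value"])
            )
    for updates in history.values():
        updates.sort()  # ISO-8601 timestamps sort chronologically as strings
    return history

# Every value but the latest is, in topic map terms, a prior identifier candidate.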

Medicare Provider Charge Data

Filed under: Dataset,Health care,Medical Informatics — Patrick Durusau @ 2:47 pm

Medicare Provider Charge Data

From the webpage:

As part of the Obama administration’s work to make our health care system more affordable and accountable, data are being released that show significant variation across the country and within communities in what hospitals charge for common inpatient services.

The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent almost 7 million discharges or 60 percent of total Medicare IPPS discharges.

Hospitals determine what they will charge for items and services provided to patients and these charges are the amount the hospital bills for an item or service. The Total Payment amount includes the MS-DRG amount, bill total per diem, beneficiary primary payer claim payment amount, beneficiary Part A coinsurance amount, beneficiary deductible amount, beneficiary blood deductible amount and DRG outlier amount.

For these DRGs, average charges and average Medicare payments are calculated at the individual hospital level. Users will be able to make comparisons between the amount charged by individual hospitals within local markets, and nationwide, for services that might be furnished in connection with a particular inpatient stay.

Data are being made available in Microsoft Excel (.xlsx) format and comma separated values (.csv) format.

Inpatient Charge Data, FY2011, Microsoft Excel version
Inpatient Charge Data, FY2011, Comma Separated Values (CSV) version

A nice start towards a useful data set.
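
A quick way to poke at the CSV release is pandas. The file and column names below are assumptions based on the published layout, so adjust them to the actual download:

import pandas as pd

# File and column names are assumptions; adjust to the downloaded CSV.
df = pd.read_csv("Medicare_Provider_Charge_Inpatient_DRG100_FY2011.csv")

# Spread of average charges for one DRG across states, highest first.
heart_failure = df[df["DRG Definition"].str.contains("HEART FAILURE", case=False)]
by_state = (heart_failure.groupby("Provider State")["Average Covered Charges"]
                         .mean()
                         .sort_values(ascending=False))
print(by_state.head(10))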

The next step would be tying identifiable physicians to the medical procedures and tests they order.

The only times I have arrived at a hospital by ambulance, I never thought to ask for a comparison of their prices with other local hospitals. Nor did I see any signs advertising discounts on particular procedures.

Have you?

Let’s not pretend medical care is a consumer market, where “consumers” are penalized for not being good shoppers.

I first saw this at Nathan Yau’s Medicare provider charge data released.

Getting Started with ElasticSearch: Part 1 – Indexing

Filed under: ElasticSearch,Lucene,Solr — Patrick Durusau @ 2:35 pm

Getting Started with ElasticSearch: Part 1 – Indexing by Florian Hopf.

From the post:

ElasticSearch is gaining a huge momentum with large installations like Github and Stackoverflow switching to it for its search capabilities. Its distributed nature makes it an excellent choice for large datasets with high availability requirements. In this 2 part article I’d like to share what I learned building a small Java application just for search.

The example I am showing here is part of an application I am using for talks to show the capabilities of Lucene, Solr and ElasticSearch. It’s a simple webapp that can search on user group talks. You can find the sources on GitHub.

Some experience with Solr can be helpful when starting with ElasticSearch but there are also times when it’s best to not stick to your old knowledge.

As rapidly as Solr, Lucene and ElasticSearch are developing, old knowledge can be an issue for any of them.
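
To see how little ceremony indexing takes, here is a hedged Python sketch against the HTTP API of a local ElasticSearch node; the index, type and field names are invented:

import json
import requests

BASE = "http://localhost:9200"          # a local ElasticSearch node is assumed

talk = {
    "title": "Getting Started with ElasticSearch",
    "speaker": "Florian Hopf",
    "date": "2013-05-30",
}

# Index one document (index "talks", type "talk", id 1 are all invented names).
requests.put(f"{BASE}/talks/talk/1",
             data=json.dumps(talk),
             headers={"Content-Type": "application/json"})

# Simple query-string search over the freshly indexed document.
hits = requests.get(f"{BASE}/talks/_search", params={"q": "elasticsearch"}).json()
print(hits["hits"]["total"], "hit(s)")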

Writing a Minimal Working Example (MWE) in R

Filed under: Programming,R — Patrick Durusau @ 2:24 pm

Writing a Minimal Working Example (MWE) in R by Jared Knowles.

From the post:

How to Ask for Help using R

The key to getting good help with an R problem is to provide a minimally working reproducible example (MWRE). Making an MWRE is really easy with R, and it will help ensure that those helping you can identify the source of the error, and ideally submit to you back the corrected code to fix the error instead of sending you hunting for code that works. To have an MWRE you need the following items:

  • a minimal dataset that produces the error
  • the minimal runnable code necessary to produce the data, run on the dataset provided
  • the necessary information on the used packages, R version, and system
  • a seed value, if random properties are part of the code

Let’s look at the tools available in R to help us create each of these components quickly and easily.

R specific but the general principles apply to any support question.
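
To illustrate that the checklist travels, here is a hedged Python analogue of the same four ingredients (the failing example itself is invented):

# Minimal working example, Python edition: minimal data, minimal code,
# environment details, and a fixed seed for anything random.
import platform
import random
import sys

random.seed(42)                       # seed value, so behaviour is reproducible
data = [1, 2, 3, None, 5]             # minimal dataset that still triggers the problem

try:
    print(sum(data))                  # minimal runnable code producing the error
except TypeError as err:
    print("error:", err)              # the error message, verbatim

print("Python", sys.version.split()[0], "on", platform.platform())  # system info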

Pointers to other language/software specific instructions for support questions?

Even heavier requirements obtain in development environments:

stacktrace

Stepping up to Big Data with R and Python…

Filed under: BigData,Python,R — Patrick Durusau @ 2:08 pm

Stepping up to Big Data with R and Python: A Mind Map of All the Packages You Will Ever Need by Abhijit Dasgupta.

From the post:

On May 8, we kicked off the transformation of R Users DC to Statistical Programming DC (SPDC) with a meetup at iStrategyLabs in Dupont Circle. The meetup, titled “Stepping up to big data with R and Python,” was an experiment in collective learning as Marck and I guided a lively discussion of strategies to leverage the “traditional” analytics stack in R and Python to work with big data.

[images omitted]

R and Python are two of the most popular open-source programming languages for data analysis. R developed as a statistical programming language with a large ecosystem of user-contributed packages (over 4500, as of 4/26/2013) aimed at a variety of statistical and data mining tasks. Python is a general programming language with an increasingly mature set of packages for data manipulation and analysis. Both languages have their pros and cons for data analysis, which have been discussed elsewhere, but each is powerful in its own right. Both Marck and I have used R and Python in different situations where each has brought something different to the table. However, since both ecosystems are very large, we didn’t even try to cover everything, and we didn’t believe that any one or two people could cover all the available tools. We left it to our attendees (and to you, our readers) to fill in the blanks with favorite tools in R and Python for particular data analytic tasks.

See the post for links to preliminary maps of the two ecosystems.

I like the maps but the background seems distracting.

You?

re3data.org

Filed under: Data,Dataset,Semantic Diversity — Patrick Durusau @ 1:12 pm

re3data.org

From the post:

An increasing number of universities and research organisations are starting to build research data repositories to allow permanent access in a trustworthy environment to data sets resulting from research at their institutions. Due to varying disciplinary requirements, the landscape of research data repositories is very heterogeneous. This makes it difficult for researchers, funding bodies, publishers, and scholarly institutions to select an appropriate repository for storage of research data or to search for data.

The re3data.org registry allows the easy identification of appropriate research data repositories, both for data producers and users. The registry covers research data repositories from all academic disciplines. Information icons display the principal attributes of a repository, allowing users to identify the functionalities and qualities of a data repository. These attributes can be used for multi-faceted searches, for instance to find a repository for geoscience data using a Creative Commons licence.

By April 2013, 338 research data repositories were indexed in re3data.org. 171 of these are described by a comprehensive vocabulary, which was developed by involving the data repository community (http://doi.org/kv3).

The re3data.org search can be found at: http://www.re3data.org
The information icons are explained at: http://www.re3data.org/faq

Does this sound like any of these?

DataOne

The Dataverse Network Project

IOGDS: International Open Government Dataset Search

PivotPaths: a Fluid Exploration of Interlinked Information Collections

Quandl [> 2 million financial/economic datasets]

Just to name five (5) that came to mind right offhand.

Addressing the heterogeneous nature of data repositories by creating another, semantically different data repository, seems like a non-solution to me.

What would be useful would be to create a mapping of this “new” classification, which I assume works for some group of users, against the existing classifications.

That would allow users of the “new” classification to access data in existing repositories, without having to learn their classification systems.
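
Such a mapping can start very small. A hedged sketch, with every vocabulary term invented for illustration:

# Hypothetical crosswalk from the "new" registry's subject terms (keys) to the
# terms existing repositories already use (values); all terms are invented.
CROSSWALK = {
    "Geosciences":   {"dataone": "Earth Science", "dataverse": "Earth & Environment"},
    "Life Sciences": {"dataone": "Biology",       "dataverse": "Life Sciences"},
}

def translate(term, target_registry):
    """Rewrite a query term for the target registry, falling back to the original."""
    return CROSSWALK.get(term, {}).get(target_registry, term)

print(translate("Geosciences", "dataone"))   # -> "Earth Science"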

The heterogeneous nature of information is never vanquished but we can incorporate it into our systems.

May 29, 2013

‘Strongbox’ for Leakers Offers Imperfect Anonymity

Filed under: Cybersecurity,Security — Patrick Durusau @ 4:20 pm

‘Strongbox’ for Leakers Offers Imperfect Anonymity by Jeremy Hsu.

From the post:

Anonymous sources face a huge challenge in leaking sensitive information to journalists without leaving a digital trail for government investigators to follow. The New Yorker aims to make anonymous leaks feel slightly more secure with its new "Strongbox" solution, but the system's security still ultimately depends upon the caution of its users.

The New Yorker's drop box allows sources to upload documents anonymously and provides two-way communication between sources and journalists, according to The Guardian.

Sources are able to upload documents anonymously through the Tor network onto servers that will be kept separate from the New Yorker's main computer system. Leakers are then given a unique code name that allows New Yorker reporters or editors to contact them through messages left on Strongbox.

Strongbox is based on an open-source, anonymous in-box system called DeadDrop—the brainchild of security journalist Kevin Poulsen and Internet pioneer and activist Aaron Swartz from almost two years ago. Poulsen described how Swartz had created a stable-enough version of the DeadDrop code by December 2012 for them to set a tentative launch date. On 11 January 2013, Swartz killed himself as he faced the possibility of a 35-year prison sentence for downloading 4 million articles from the JSTOR academic database.

The Strongbox launch on 15 May comes at a time when the U.S. government has shown itself willing to go after information leakers—and possibly reporters—by any means necessary. The Associated Press has reported on how the Justice Department secretly obtained phone logs used by AP editors and reporters. In another case, a Fox News chief correspondent may face criminal charges for reporting on a classified CIA analysis of North Korea provided by a source in the State Department.

Jeremy goes on to point out that systems are only as secure as users are careful to use them properly.

And there is a technical burden to following all the rules, rules which many of us, as people, have a tendency to forget.

Just in case you are thinking about a leaking lottery like I mentioned the other day.

From a subject identity perspective, Tor is obscuring the traffic trail between your computer and another.

Curious if the same principles could be applied to content?

The only difference between a dictionary and a super-secret document is one of ordering and repetition.

So what if a worldwide adversary is scooping up all the traffic?

If it doesn’t know the correct order, the content it captures could be sports results or plans for a homemade FAE.

Rather than relying on encryption and/or point-to-point delivery of content along a single twisting trail, why not split the content across several twisting trails for reassembly at its destination?

If I had a working demo I would already be offering it for sale but I think the principle is sound.
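
As one concrete (and deliberately simple) illustration of the split-and-reassemble principle, here is an n-of-n XOR splitting sketch in Python. It says nothing about how the shares travel and is not a substitute for encryption or for Tor-style route obfuscation:

import os
from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def split(secret, n=3):
    """n-of-n splitting: every share is needed; any n-1 shares look like random noise."""
    shares = [os.urandom(len(secret)) for _ in range(n - 1)]
    shares.append(reduce(xor_bytes, shares, secret))  # last share = secret XOR the others
    return shares

def reassemble(shares):
    return reduce(xor_bytes, shares)

if __name__ == "__main__":
    shares = split(b"the correct ordering is the secret", 3)
    # send each share along a different twisting trail...
    assert reassemble(shares) == b"the correct ordering is the secret"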

Suggestions?

Why Would #1 Spy on #2?

Filed under: HPC,News,Programming — Patrick Durusau @ 2:31 pm

Confirmation: China has a 50+ Petaflop system.

That confirmation casts even more doubt on the constant drum roll of “China spying on the U.S.” allegations.

Who wants to spy on second place technology?

The further U.S.-based technology falls behind, due to the lack of investment in R&D by government and industry, the more you can expect the hysterical accusations against China and others to ramp up.

Can’t possibly be that three-month profit goals and lowering government spending led to a self-inflicted lack of R&D.

Must be someone stealing the technology we didn’t invest to invent. Has to be. 😉

The new Chinese system is a prick to the delusional American Exceptionalism balloon.

There will be others.

K-nearest neighbour search for PostgreSQL [Data Types For Distance Operator]

Filed under: K-Nearest-Neighbors,PostgreSQL,Similarity — Patrick Durusau @ 10:30 am

K-nearest neighbour search for PostgreSQL by Oleg Bartunov and Teodor Sigaev.

Excellent presentation from PGCon-2010 on the KNN index type in Postgres.

And an exception to the rule about wordy slides.

Or at least wordy slides are better for non-attendees.

KNN uses the <-> distance operator.

And the slides say:

distance operator, should be provided for data type

Looking at the PostgreSQL documentation (9.3, but the same for 9.1 and 9.2), I read:

In addition to the typical B-tree search operators, btree_gist also provides index support for <> (“not equals”). This may be useful in combination with an exclusion constraint, as described below.

Also, for data types for which there is a natural distance metric, btree_gist defines a distance operator <->, and provides GiST index support for nearest-neighbor searches using this operator. Distance operators are provided for int2, int4, int8, float4, float8, timestamp with time zone, timestamp without time zone, time without time zone, date, interval, oid, and money.
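
A hedged sketch of what that looks like in practice, using psycopg2 and an invented events table with a date column (setup statements included for completeness):

import psycopg2

conn = psycopg2.connect("dbname=test")        # connection settings are assumptions
cur = conn.cursor()

# One-time setup: btree_gist supplies the <-> distance operator for date.
cur.execute("CREATE EXTENSION IF NOT EXISTS btree_gist")
cur.execute("CREATE INDEX events_date_gist ON events USING gist (event_date)")
conn.commit()

# The five events closest in time to a given date, nearest first, via the GiST index.
cur.execute(
    "SELECT id, event_date FROM events "
    "ORDER BY event_date <-> %s::date "
    "LIMIT 5",
    ("2013-05-29",),
)
print(cur.fetchall())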

What collective subjects would you like to gather up using the defined data types for the distance operator?

How would you represent properties to take advantage of the defined data types?

Are there other data types that you would define for the distance operator?

May 28, 2013

Cascading and Scalding

Filed under: Cascading,MapReduce,Pig,Scalding — Patrick Durusau @ 4:17 pm

Cascading and Scalding by Danny Bickson.

Danny has posted some links for Cascading and Scalding, alternatives to Pig.

I continue to be curious about documentation of semantics for Pig scripts or any of its alternatives.

Or for that matter, in any medium- to large-sized MapReduce shop, how do you index those semantics?

Why Tumblr Was a Massive Steal for Yahoo [There will be a test.]

Filed under: Marketing,Topic Maps — Patrick Durusau @ 4:07 pm

Why Tumblr Was a Massive Steal for Yahoo by Adam Rifkin.

Adam makes two critical points for topic maps marketing:

First, Tumblr is an interest graph, not a social graph. That is, people are looking for content of interest to them.

Second, as Adam writes:

Writers have time but no money. Certain groups are going to be overrepresented: Students, stay-at-home moms, the underemployed, retirees. Epinions, which paid for product reviews, especially ran into issues with writers whose relationship to reviewed products lay more in the realm of fantasy than reality. Writers are also going to have the time and emotional commitment to give your site a lot of feedback about their needs and desires … many of which will be counter to the best interests of the business.

Readers have money but no time. They don’t want to spend hours combing the Internet for photos of vintage jewelry. They want to see a picture of a watch they like, and buy it now. If readers don’t find your content valuable, they’re not going to send you a long email about what they don’t like. They’ll just silently hit the back button and get gone.

Test:

In marketing a topic map, which of the following are more important?:

  1. Writers can contribute their input to a public topic map interface?
  2. Readers can purchase access to high quality content?

You have until your VC funding runs out to decide.

Four and Twenty < / > ! Baked in a Pie…

Filed under: Conferences,XML,XML Database,XML Query Rewriting,XML Schema,XQuery,XSLT — Patrick Durusau @ 2:53 pm

Balisage 2013 program is online!

From Tommie Usdin’s email:

Balisage is an annual conference devoted to the theory and practice of descriptive markup and related technologies for structuring and managing information. Participants typically include XML users, librarians, archivists, computer scientists, XSLT and XQuery programmers, implementers of XSLT and XQuery engines and other markup-related software, Topic-Map enthusiasts, semantic-Web evangelists, members of the working groups which define the specifications, academics, industrial researchers, representatives of governmental bodies and NGOs, industrial developers, practitioners, consultants, and the world’s greatest concentration of markup theorists. Discussion is open, candid, and unashamedly technical.

Major features of this year’s program include several challenges to the fundamental infrastructure of XML; case studies from government, academia, and publishing; approaches to overlapping data structures; discussions of XML’s political fortunes; and technical papers on XML, XForms, XQuery, REST, XSLT, RDF, XSL-FO, XSD, the DOM, JSON, and XPath.

Attending Balisage even once will keep you from repeating mistakes in language design.

Attending Balisage twice will mark you as a markup expert.

Attending Balisage three or more times, well, this is an open channel so we can’t go there.

But you should go to Balisage!

Send your pics from Saint Catherine Street!

The Charm of Being Small (4K)

Filed under: Graphics,Programming,Visualization — Patrick Durusau @ 10:51 am

white one – Making of

From the post:

white one is my first 4k intro and my first serious demoscene production (as far as something like that can be serious). I’m new to C coding and to sizecoding in particular, so there were a lot of things to be learned which I’ll try to summarize here. Download and run the executable (nvidia only, sorry) or watch the video capture first:

A 4k intro is an executable file of at most 4 kilobytes (4096 bytes) that generates video and audio. That is, it puts something moving on your screen and something audible on your speakers. The finished product runs for a few minutes, has some coherent theme and design and ideally, sound and visual effects complement each other. On top of that, it’s a technical challenge: It’s impossible to store 3D models, textures or sound samples in 4 kilobytes, so you have to generate these things at runtime if you need them.

Overwhelmed by the impossibility of all this I started messing around.

I had been lurking on a few demoparties, but never released anything nontrivial – i do know some OpenGL, but i am normally coding Lisp which tends to produce executables that are measured in megabytes. Obviously, that had to change if i wanted to contribute a small intro. Playing with the GL Shading Language had always been fun for me, so it was clear that something shader-heavy was the only option. And I had some experience with C from microcontroller hacking.

(…)

While researching visualizations I encountered this jewel.

Good summer fun and perhaps an incentive to have coding catch up with hardware.

Lots of hardware can make even poor code run acceptably, but imagine good code with lots of hardware.

BTW, as an additional resource, see: demoscene.info.

Similarity for Topic Maps?

Filed under: Graphs,Similarity — Patrick Durusau @ 10:10 am

The Weisfeiler-Lehman algorithm and estimation on graphs by Alex Smola.

From the post:

Imagine you have two graphs \(G\) and \(G’\) and you’d like to check how similar they are. If all vertices have unique attributes this is quite easy:

FOR ALL vertices \(v \in G \cup G’\) DO

  • Check that \(v \in G\) and that \(v \in G’\)
  • Check that the neighbors of v are the same in \(G\) and \(G’\)

This algorithm can be carried out in linear time in the size of the graph, alas many graphs do not have vertex attributes, let alone unique vertex attributes. In fact, graph isomorphism, i.e. the task of checking whether two graphs are identical, is a hard problem (it is still an open research question how hard it really is). In this case the above algorithm cannot be used since we have no idea which vertices we should match up.

The Weisfeiler-Lehman algorithm is a mechanism for assigning fairly unique attributes efficiently. Note that it isn’t guaranteed to work, as discussed in this paper by Douglas – this would solve the graph isomorphism problem after all. The idea is to assign fingerprints to vertices and their neighborhoods repeatedly. We assume that vertices have an attribute to begin with. If they don’t then simply assign all of them the attribute 1. Each iteration proceeds as follows:

(…)

Something a bit more sophisticated than comparing canonical representations in syntax.
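
The elided iteration step is easy to sketch. A minimal Python version, assuming graphs given as adjacency dicts; real implementations compress labels consistently across both graphs instead of leaning on Python’s hash:

from collections import Counter

def wl_relabel(adj, labels, iterations=3):
    """adj: node -> iterable of neighbours; labels: node -> initial attribute (use 1 if none)."""
    labels = dict(labels)
    for _ in range(iterations):
        labels = {
            v: hash((labels[v], tuple(sorted(labels[u] for u in nbrs))))
            for v, nbrs in adj.items()
        }
    return labels

def wl_similarity(adj1, lab1, adj2, lab2, iterations=3):
    """Compare histograms of the final labels; a score of 1.0 does not prove isomorphism."""
    c1 = Counter(wl_relabel(adj1, lab1, iterations).values())
    c2 = Counter(wl_relabel(adj2, lab2, iterations).values())
    overlap = sum((c1 & c2).values())
    return overlap / max(sum(c1.values()), sum(c2.values()))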

Improving the security of your SSH private key files

Filed under: Cryptography,Cybersecurity,Security — Patrick Durusau @ 9:57 am

Improving the security of your SSH private key files by Martin Kleppmann.

From the post:

Ever wondered how those key files in ~/.ssh actually work? How secure are they actually?

As you probably do too, I use ssh many times every single day — every git fetch and git push, every deploy, every login to a server. And recently I realised that to me, ssh was just some crypto voodoo that I had become accustomed to using, but I didn’t really understand. That’s a shame — I like to know how stuff works. So I went on a little journey of discovery, and here are some of the things I found.

When you start reading about “crypto stuff”, you very quickly get buried in an avalanche of acronyms. I will briefly mention the acronyms as we go along; they don’t help you understand the concepts, but they are useful in case you want to Google for further details.

Quick recap: If you’ve ever used public key authentication, you probably have a file ~/.ssh/id_rsa or ~/.ssh/id_dsa in your home directory. This is your RSA/DSA private key, and ~/.ssh/id_rsa.pub or ~/.ssh/id_dsa.pub is its public key counterpart. Any machine you want to log in to needs to have your public key in ~/.ssh/authorized_keys on that machine. When you try to log in, your SSH client uses a digital signature to prove that you have the private key; the server checks that the signature is valid, and that the public key is authorized for your username; if all is well, you are granted access.

So what is actually inside this private key file?

If you like knowing the details of any sort, this is a post for you!

Or if you start doing topic maps work of interest to hostile others, security will be a concern.
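
One small check in the spirit of the post: passphrase-protected traditional PEM keys carry an ENCRYPTED header, so you can at least confirm your key is not sitting on disk in the clear. A hedged Python sketch (the key path is an assumption):

from pathlib import Path

key_path = Path.home() / ".ssh" / "id_rsa"      # adjust to your key file
pem = key_path.read_text()

# Traditional encrypted PEM keys include a "Proc-Type: 4,ENCRYPTED" header.
if "ENCRYPTED" in pem:
    print("private key appears to be passphrase-protected")
else:
    print("private key is stored unencrypted -- consider adding a passphrase")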

Remember encryption is only one aspect of “security.” Realistic security has multiple layers.

I first saw this in Pete Warden’s Five short links.

May 27, 2013

Automatically Acquiring Synonym Knowledge from Wikipedia

Filed under: Lucene,Solr,Synonymy,Wikipedia — Patrick Durusau @ 7:36 pm

Automatically Acquiring Synonym Knowledge from Wikipedia by Koji Sekiguchi.

From the post:

Synonym search sure is convenient. However, in order for an administrator to allow users to use these convenient search functions, he or she has to provide them with a synonym dictionary (CSV file) described above. New words are created every day and so are new synonyms. A synonym dictionary might have been prepared by a person in charge with huge effort but sometimes will be left unmaintained as time goes by or his/her position is taken over.

That is a reason people start longing for an automatic creation of synonym dictionary. That request has driven me to write the system I will explain below. This system learns synonym knowledge from “dictionary corpus” and outputs “original word – synonym” combinations of high similarity to a CSV file, which in turn can be applied to the SynonymFilter of Lucene/Solr as is.

This “dictionary corpus” is a corpus that contains entries consisting of “keywords” and their “descriptions”. An electronic dictionary exactly is a dictionary corpus and so is Wikipedia, which you are familiar with and is easily accessible.

Let’s look at a method to use the Japanese version of Wikipedia to automatically get synonym knowledge.

A more complex representation of synonyms, one which includes domain or scope, would be more robust.

On the other hand, some automatic generation of synonyms is better than no synonyms at all.

Take this as a good place to start but not as a destination for synonym generation.
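
As a starting point only, and not the article’s method, here is a naive Python sketch that scores keyword pairs by cosine similarity of their descriptions and writes high-scoring pairs in the comma-separated form Lucene/Solr’s SynonymFilter reads:

import math
import re
from collections import Counter

def term_vector(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def write_synonyms(corpus, out_path, threshold=0.6):
    """corpus: keyword -> description text (e.g. parsed from a Wikipedia dump).
    Writes 'word1,word2' lines, the equivalent-terms format SynonymFilter accepts.
    O(n^2) pairing -- fine for a sketch, not for all of Wikipedia."""
    vectors = {term: term_vector(text) for term, text in corpus.items()}
    terms = list(vectors)
    with open(out_path, "w", encoding="utf-8") as out:
        for i, a in enumerate(terms):
            for b in terms[i + 1:]:
                if cosine(vectors[a], vectors[b]) >= threshold:
                    out.write(f"{a},{b}\n")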

Crawl-Anywhere

Filed under: Search Engines,Searching,Solr,Webcrawler — Patrick Durusau @ 1:24 pm

Crawl-Anywhere

From the webpage:

April 2013 – Starting version 4.0, Crawl-Anywhere becomes an open-source project. Current version is 4.0.0-alpha

Stable version 3.x is still available at http://www.crawl-anywhere.com/

(…)

Crawl Anywhere is mainly a web crawler. However, Crawl-Anywhere includes all components in order to build a vertical search engine.

Crawl Anywhere includes :

Project home page : http://www.crawl-anywhere.com/

A web crawler is a program that discovers and read all HTML pages or documents (HTML, PDF, Office, …) on a web site in order for example to index these data and build a search engine (like google). Wikipedia provides a great description of what is a Web crawler : http://en.wikipedia.org/wiki/Web_crawler.
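
For a sense of what the crawler component does at its core, here is a minimal, hedged Python sketch of a same-site breadth-first crawler. It omits politeness delays, robots.txt handling and non-HTML documents, all of which a real crawler like Crawl-Anywhere adds:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(start_url, max_pages=50):
    """Breadth-first crawl of one site, returning {url: html}."""
    site = urlparse(start_url).netloc
    seen, pages, frontier = {start_url}, {}, deque([start_url])
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        pages[url] = html                     # hand off to an indexer here
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href).split("#")[0]
            if urlparse(absolute).netloc == site and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages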

If you are gathering “very valuable intel” as in Snow Crash, a search engine will help.

It won’t do the heavy lifting, but it will help.

Zero Day / Leaker’s Lottery

Filed under: Cybersecurity,Marketing,Security — Patrick Durusau @ 1:09 pm

A lottery graphic at the Economist made me think of an alternative to brokers for zero day exploits: a Zero Day Lottery!

Take a known, reliable source of zero day exploits like “the Grugq” (see Shopping For Zero-Days: A Price List For Hackers’ Secret Software Exploits) and set up a weekly lottery for zero day exploits.

Every week without a winner rolls another zero day exploit into the final prize package.

The details would need working out, but authors of zero day exploits included in the prize would share in some percentage of the cash spent on lottery tickets.

The runner of the lottery should get, say, 20% of the bets, with some percentage of the remaining funds being used for contests to develop zero day exploits.

The same principles apply to a Leaker’s Lottery!

Except there, some of the proceeds for a leak would be split among the leakers.

Could you be a news or government agency and refuse to buy a ticket?

Or even a large block of tickets?

Consider what the Pentagon Papers would have attracted as a lottery prize.

Zero Day / Leaker’s Lotteries have the potential to put hacking/leaking on a firm financial basis.

Interested?

Journal of Data Mining & Digital Humanities

Filed under: News,Publishing — Patrick Durusau @ 10:19 am

Journal of Data Mining & Digital Humanities

From the webpage:

Data mining is an interdisciplinary subfield of computer science, involving methods at the intersection of artificial intelligence, machine learning and database systems. The Journal of Data Mining & Digital Humanities is concerned with the intersection of computing and the disciplines of the humanities, using tools provided by computing such as data visualisation, information retrieval, statistics and text mining, and publishing scholarly work beyond the traditional humanities.

The journal includes a wide range of fields in its discipline to create a platform for authors to contribute to the journal, and the editorial office promises a peer review process for submitted manuscripts to ensure the quality of publishing.

Journal of Data Mining & Digital Humanities is an Open Access journal and aims to publish the most complete and reliable source of information on discoveries and current developments, in the form of original articles, review articles, case reports, short communications, etc., in all areas of the field, making them freely available online without restrictions or subscriptions to researchers worldwide.

The journal uses an Editorial Tracking System for quality in the review process. Editorial Tracking is an online manuscript submission, review and tracking system. Review processing is performed by the editorial board members of the Journal of Data Mining & Digital Humanities or outside experts; approval by at least two independent reviewers followed by editor approval is required for acceptance of any citable manuscript. Authors may submit manuscripts and track their progress through the system, hopefully to publication. Reviewers can download manuscripts and submit their opinions to the editor. Editors can manage the whole submission/review/revise/publish process.

KDNuggets reports the first issue of JDMDH will appear in August, 2013. Deadline for submissions for the first issue: 25 June 2013.

A great venue for topic map focused papers. (When you are not writing for the Economist.)
