NLTK-like Wordnet Interface in Scala

April 16th, 2014

NLTK-like Wordnet Interface in Scala by Sujit Pal.

From the post:

I recently figured out how to set up the Java WordNet Library (JWNL) for something I needed to do at work. Prior to this, I have been largely unsuccessful at figuring out how to access Wordnet from Java, unless you count my one attempt to use the Java Wordnet Interface (JWI) described here. I think there are two main reasons for this. First, I just didn’t try hard enough, since I could get by before this without having to hook up Wordnet from Java. The second reason was the over-supply of libraries (JWNL, JWI, RiTa, JAWS, WS4j, etc), each of which annoyingly stops short of being full-featured in one or more significant ways.

The one Wordnet interface that I know that doesn’t suffer from missing features comes with the Natural Language ToolKit (NLTK) library (written in Python). I have used it in the past to access Wordnet for data pre-processing tasks. In this particular case, I needed to call it at runtime from within a Java application, so I finally bit the bullet and chose a library to integrate into my application – I chose JWNL based on seeing it mentioned in the Taming Text book (and used in the code samples). I also used code snippets from Daniel Shiffman’s Wordnet page to learn about the JWNL API.

After I had successfully integrated JWNL, I figured it would be cool (and useful) if I could build an interface (in Scala) that looked like the NLTK Wordnet interface. Plus, this would also teach me how to use JWNL beyond the basic stuff I needed for my webapp. My list of functions was driven by the examples from the Wordnet section (2.5) from the NLTK book and the examples from the NLTK Wordnet Howto. My Scala class implements most of the functions mentioned on these two pages. The following session will give you an idea of the coverage – even though it looks like a Python interactive session, it was generated by my JUnit test. I do render the Synset and Word (Lemma) objects using custom format() methods to preserve the illusion (and to make the output readable), but if you look carefully, you will notice the rendering of List() is Scala’s and not Python’s.
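
For readers who haven’t seen the NLTK side, here is a minimal Python session of the kind Sujit’s Scala class imitates, using the standard NLTK Wordnet calls (exact output depends on your Wordnet version):

```python
from nltk.corpus import wordnet as wn

# the classic examples from section 2.5 of the NLTK book
print(wn.synsets("motorcar"))   # [Synset('car.n.01')]

car = wn.synset("car.n.01")
print(car.definition())         # a motor vehicle with four wheels...
print(car.lemma_names())        # ['car', 'auto', 'automobile', ...]
print(car.hypernyms())          # [Synset('motor_vehicle.n.01')]
```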

NLTK is amazing in its own right and creating a Scala interface will give you an excuse to learn Scala. That’s a win-win situation!

Confirmation: Gov. at War with Business

April 16th, 2014

The U.S. Government: Paying to Undermine Internet Security, Not to Fix It by Julia Angwin.

The Heartbleed computer security bug is many things: a catastrophic tech failure, an open invitation to criminal hackers and yet another reason to upgrade our passwords on dozens of websites. But more than anything else, Heartbleed reveals our neglect of Internet security.

The United States spends more than $50 billion a year on spying and intelligence, while the folks who build important defense software — in this case a program called OpenSSL that ensures that your connection to a website is encrypted — are four core programmers, only one of whom calls it a full-time job.

In a typical year, the foundation that supports OpenSSL receives just $2,000 in donations. The programmers have to rely on consulting gigs to pay for their work. “There should be at least a half dozen full time OpenSSL team members, not just one, able to concentrate on the care and feeding of OpenSSL without having to hustle commercial work,” says Steve Marquess, who raises money for the project.

Is it any wonder that this Heartbleed bug slipped through the cracks?

Dan Kaminsky, a security researcher who saved the Internet from a similarly fundamental flaw back in 2008, says that Heartbleed shows that it’s time to get “serious about figuring out what software has become Critical Infrastructure to the global economy, and dedicating genuine resources to supporting that code.”

I said last week in NSA … *ucked Up …TCP/IP that the NSA was at war with the U.S. computer industry and ecommerce in general.

Turns out I was overly optimistic: the entire United States government is waging an undeclared war against the U.S. computer industry and ecommerce in general.

If I were one of the million-dollar high-tech donors to any political party, I would put everyone on notice that the money tap is off.

At least until there is verifiable progress towards building a robust architecture for ecommerce and an end to attacks on the U.S. computer industry by U.S. agencies, whether active or by neglect.

Only the sound of money will get the attention of the 530+ members of the U.S. Congress. When the money goes silent, you will have their full and undivided attention.

Regular expressions unleashed

April 16th, 2014

Regular expressions unleashed by Hans-Juergen Schoenig.

From the post:

When cleaning up some old paperwork this weekend I stumbled over a very old tutorial. In fact, I have received this little handout during a UNIX course I attended voluntarily during my first year at university. It seems that those two days have really changed my life – the price tag: 100 Austrian Schillings which translates to something like 7 Euros in today’s money.

When looking at this old thing I noticed a nice example showing how to test regular expression support in grep. Over the years I had almost forgotten this little test. Here is the idea: There is no single way to print the name of Libya’s former dictator. According to this example there are around 30 ways to do it:…
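
To give a flavor of the exercise, a single pattern can cover a good share of the variants. This is a rough approximation of my own, not the regex from the handout or the post:

```python
import re

# Rough sketch of the classic "thirty ways to spell the name" grep test
pattern = re.compile(r"\b[GKQ]h?[aeu]d{1,2}h?af{1,2}[iy]\b", re.IGNORECASE)

for name in ["Gaddafi", "Qadhafi", "Kaddafi", "Gadhafi", "Khadafy", "Gadaffi"]:
    print(name, bool(pattern.search(name)))  # all True
```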

Thirty (30) sounds a bit low to me, but it is sufficient to point out that matching all thirty (30) variants is going to give you a number of false positives when searching for news on the former dictator of Libya.

The regex to capture all thirty (30) variant forms in a PostgreSQL database is great but once you have it, now what?

Particularly if you have sorted out the dictator from the non-dictators and/or placed them in other categories.

Do you pass that sorting and classifying onto the next user or do you flush the knowledge toilet and all that hard work just drains away?

Learn regex the hard way

April 16th, 2014

Learn regex the hard way by Zed A. Shaw.

From the preface:

This is a rough in-progress dump of the book. The grammar will probably be bad, there will be sections missing, but you get to watch me write the book and see how I do things.

Finally, don’t forget that I have Learn Python The Hard Way, 2nd Edition (http://learnpythonthehardway.org), which you should read if you can’t code yet.

Exercises 1–16 (out of 27) have some content, so the book is incomplete, but it is still a goodly amount of material.

Zed has other “hard way” titles as well.

Regexes are useful in all contexts, so you won’t regret learning them or brushing up on them.

…Generalized Language Models…

April 16th, 2014

How Generalized Language Models outperform Modified Kneser Ney Smoothing by a Perplexity drop of up to 25% by René Pickhardt.

René reports on the core of his dissertation work.

From the post:

When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and interpolating longer order models (models with longer word sequences) with lower order language models. While this idea is strong and helpful it is usually applied in the same way. In order to use a shorter model the first word of the sequence is omitted. This will be iterated. The problem occurs if one of the last words of the sequence is the really rare word. In this way omitting words in the front will not help.

So the simple trick of Generalized Language models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.

Then we combine everything with Modified Kneser Ney Smoothing just like it was done with the previous smoothing methods.
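
A quick sketch of the skipping idea (my illustration, not René’s code): for a 4-gram, the n-1 = 3 generalized lower-order patterns each skip one of the earlier words, so a rare final word is never thrown away:

```python
def skip_patterns(words):
    """For an n-gram, build the n-1 lower-order patterns that each
    skip one word at positions 1..n-1, keeping the final word."""
    n = len(words)
    return [words[:i] + ["*"] + words[i + 1:] for i in range(n - 1)]

print(skip_patterns(["the", "quick", "brown", "fox"]))
# [['*', 'quick', 'brown', 'fox'],
#  ['the', '*', 'brown', 'fox'],
#  ['the', 'quick', '*', 'fox']]
```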

Unlike some white papers, webinars and demos, you don’t have to register, list your email and phone number, etc. to see both the test data and code that implements René’s ideas.

Data, Source.

Please send René useful feedback as a way to say thank you for sharing both data and code.

‘immersive intelligence’ [Topic Map-like application]

April 16th, 2014

Long: NGA is moving toward ‘immersive intelligence’ by Sean Lyngaas.

From the post:

Of the 17 U.S. intelligence agencies, the National Geospatial-Intelligence Agency is best suited to turn big data into actionable intelligence, NGA Director Letitia Long said. She told FCW in an April 14 interview that mapping is what her 14,500-person agency does, and every iota of intelligence can be attributed to some physical point on Earth.

“We really are the driver for intelligence integration because everything is somewhere on the Earth at a point in time,” Long said. “So we give that ability for all of us who are describing objects to anchor it to the Map of the World.”

NGA’s Map of the World entails much more minute information than the simple cartography the phrase might suggest. It is a mix of information from top-secret, classified and unclassified networks made available to U.S. government agencies, some of their international partners, commercial users and academic experts. The Map of the World can tap into a vast trove of satellite and social media data, among other sources.

NGA has made steady progress in developing the map, Long said. Nine data layers are online and available now, including those for maritime and aeronautical data. A topography layer will be added in the next two weeks, and two more layers will round out the first operational version of the map in August.

Not surprisingly, the National Geospatial-Intelligence Agency sees geography as the organizing principle for intelligence integration. Or as NGA Director Long says: “…everything is somewhere on the Earth at a point in time.” I can’t argue with the accuracy of that statement, save for extraterrestrial events, satellites, space-based weapons, etc.

On the other hand, you could gather intelligence by point of origin, places referenced, people mentioned (their usual locations), etc., in languages spoken by more than thirty (30) million people, and you could have a sack with intelligence in forty (40) languages. (See the List of languages by number of native speakers.)

When I say “topic map-like” application, I mean that the NGA has chosen geographic locations as the organizing principle for intelligence, as opposed to using subjects as the organizing principle, of which geographic location is only one type. Note that with a broader organizing principle, it would be easier to integrate data from other agencies, each of which has its own organizing principles for the intelligence it gathers.

I like the idea of “layers” as described in the post. In part because a topic map can exist as an additional layer on top of the current NGA layers to integrate other intelligence data on a subject basis with the geographic location system of the NGA.

Think of topic maps as being “in addition to” and not “instead of” your current integration technology.

What’s your principle for organizing intelligence? Would it be useful to integrate data organized around other principles for organizing intelligence? And still find the way back to the original data?

PS: Do you remember the management book “Who Moved My Cheese?” Moving intelligence from one system to another can result in: “Who Moved My Intelligence?,” when it can no longer be discovered by its originator. Not to mention the intelligence will lack the context of its point of origin.

Titan: Scalable Graph Database

April 15th, 2014

Titan: Scalable Graph Database by Matthias Broecheler.

Conference presentation, so it is long on imagery but short on detail. ;-)

However, useful to walk your manager through as a pitch for support to investigate further.

When that support is given, check out: http://thinkaurelius.github.io/titan/. Links to source code, other resources, etc.

Enter, Update, Exit… [D3.js]

April 15th, 2014

Enter, Update, Exit – An Introduction to D3.js, The Web’s Most Popular Visualization Toolkit by Christian Behrens.

From the webpage:

Over the past couple of years, D3, the groundbreaking JavaScript library for data-driven document manipulation developed by Mike Bostock, has become the Swiss Army knife of web-based data visualization. However, talking to other designers or developers who use D3 in their projects, I noticed that one of the core concepts of it remains somewhat obscure and is often referred to as »D3’s magic«: Data joins and selections.

Given a solid command of basic JavaScript, this article should help you to wrap your head around these two fundamental concepts and get you started using D3 for your dataviz projects.

If you encounter anyone not already using D3.js, pass this page along to them.

I first saw this in a tweet by Halftone.

GraphChi-DB [src released]

April 15th, 2014

GraphChi-DB

From the webpage:

GraphChi-DB is a scalable, embedded, single-computer online graph database that can also execute similar large-scale graph computation as GraphChi. It has been developed by Aapo Kyrola as part of his Ph.D. thesis.

GraphChi-DB is written in Scala, with some Java code. Generally, you need to know Scala quite well to be able to use it.

IMPORTANT: GraphChi-DB is early release, research code. It is buggy, it has awful API, and it is provided with no guarantees. DO NOT USE IT FOR ANYTHING IMPORTANT.

GraphChi-DB source code arrives!

Enjoy!

Wandora – New Version [TMQL]

April 15th, 2014

Wandora – New Version

From the webpage:

It is over six months since the last Wandora release. Now we are finally ready to publish a new version with some very interesting new features. Release 2014-04-15 features TMQL support and an embedded HTML browser, for example. TMQL is the topic map query language and Wandora allows the user to search, query and modify topics and associations with TMQL scripts. The embedded HTML browser expands Wandora’s internal visualization repertoire. Wandora embedded HTTP server services are now available inside the Wandora application….

Change Log, Download.

Two of the biggest changes: TMQL support and the embedded HTML browser.

Download your copy today!

I will post a review by mid-May, 2014.

Interested to hear your comments, questions and suggestions in the meantime.

BTW, the first suggestion I have is that the download file should NOT be wandora.zip but rather wandora-(date).zip if nothing else. Ditto for the source files and javadocs.

The Bw-Tree: A B-tree for New Hardware Platforms

April 15th, 2014

The Bw-Tree: A B-tree for New Hardware Platforms by Justin J. Levandoski, David B. Lomet, and, Sudipta Sengupta.

Abstract:

The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree, achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
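
The latch-free flavor of the design is easier to grasp with a toy model. The sketch below is my code, not the paper’s: it mimics the Bw-tree’s central move, where updates are prepended as immutable delta records and installed by a compare-and-swap on a mapping table, never by latching and rewriting a page:

```python
import threading

class MappingTable:
    """Maps logical page ids to the head of a page's delta chain."""
    def __init__(self):
        self.slots = {}
        self._lock = threading.Lock()  # stand-in for a hardware CAS

    def compare_and_swap(self, pid, expected, new):
        with self._lock:
            if self.slots.get(pid) is expected:
                self.slots[pid] = new
                return True
            return False

class Delta:
    """Immutable update record prepended to a delta chain."""
    def __init__(self, key, value, rest):
        self.key, self.value, self.rest = key, value, rest

def latch_free_insert(table, pid, key, value):
    # The retry loop typical of latch-free structures: read the head,
    # build a delta pointing at it, try to swing the pointer.
    while True:
        head = table.slots.get(pid)
        if table.compare_and_swap(pid, head, Delta(key, value, head)):
            return

def lookup(table, pid, key):
    node = table.slots.get(pid)
    while node is not None:
        if node.key == key:
            return node.value
        node = node.rest
    return None

table = MappingTable()
latch_free_insert(table, 0, "a", 1)
latch_free_insert(table, 0, "b", 2)
print(lookup(table, 0, "a"), lookup(table, 0, "b"))  # 1 2
```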

With easy availability of multi-core chips, what new algorithms are you going to discover while touring SICP or TAOCP?

Is that going to be an additional incentive to tour one or both of them?

Why and How to Start Your SICP Trek

April 15th, 2014

Why and How to Start Your SICP Trek by Kai Wu.

From the post:

This post was first envisioned for those at Hacker Retreat – or thinking of attending – before it became more general. It’s meant to be a standing answer to the question, “How can I best improve as a coder?”

Because I hear that question from people committed to coding – i.e. professionally for the long haul – the short answer I always give is, “Do SICP!” *

Since that never seems to be convincing enough, here’s the long answer. :) I’ll give a short overview of SICP’s benefits, then use arguments from (justified) authority and argument by analogy to convince you that working through SICP is worth the time and effort. Then I’ll share some practical tips to help you on your SICP trek.

* Where SICP = The Structure and Interpretation of Computer Programs by Hal Abelson and Gerald Sussman of MIT, aka the Wizard book.

BTW, excuse my enthusiasm for SICP if it comes across at times as monolingual theistic fanaticism. I’m aware that there are many interesting developments in CS and software engineering outside of the Wizard book – and no single book can cover everything. Nevertheless, SICP has been enormously influential as an enduring text on the nature and fundamentals of computing – and tends to pay very solid dividends on your investments of attention.

A great post with lots of suggestions on how to work your way through SICP.

What it can’t supply is the discipline to actually make your way through SICP.

I was at a Unicode Conference some years ago and met Don Knuth. I said something in the course of the conversation about reading some part of TAOCP and Don said rather wistfully that he wished he could meet someone who had read it all.

It seems sad that so many of us have dipped into it here or there but not really taken the time to explore it completely. Rather like reading Romeo and Juliet for the sexy parts and ignoring the rest.

Do you have a reading plan for TAOCP after you finish SICP?

I first saw this in a tweet by Computer Science.

Erik Meijer – Haskell – MOOC

April 15th, 2014

Erik Meijer tweeted:

Your opportunity to influence my upcoming #Haskell MOOC on EdX. Submit pull requests on the contents here: https://github.com/fptudelft/MOOC-Course-Description

What are your suggestions?

SIGBOVIK 2014

April 14th, 2014

SIGBOVIK 2014 (pdf)

From the cover page:

The Association for Computational Heresy

presents

A record of the Proceedings of

SIGBOVIK 2014

The eighth annual intercalary robot dance in celebration of workshop on symposium about Harry Q. Bovik’s 26th birthday.

Just in case news on computer security is as grim this week as last, something to brighten your spirits.

Enjoy!

I first saw this in a tweet by John Regehr.

tagtog: interactive and text-mining-assisted annotation…

April 14th, 2014

tagtog: interactive and text-mining-assisted annotation of gene mentions in PLOS full-text articles by Juan Miguel Cejuela, et al.

Abstract:

The breadth and depth of biomedical literature are increasing year upon year. To keep abreast of these increases, FlyBase, a database for Drosophila genomic and genetic information, is constantly exploring new ways to mine the published literature to increase the efficiency and accuracy of manual curation and to automate some aspects, such as triaging and entity extraction. Toward this end, we present the ‘tagtog’ system, a web-based annotation framework that can be used to mark up biological entities (such as genes) and concepts (such as Gene Ontology terms) in full-text articles. tagtog leverages manual user annotation in combination with automatic machine-learned annotation to provide accurate identification of gene symbols and gene names. As part of the BioCreative IV Interactive Annotation Task, FlyBase has used tagtog to identify and extract mentions of Drosophila melanogaster gene symbols and names in full-text biomedical articles from the PLOS stable of journals. We show here the results of three experiments with different sized corpora and assess gene recognition performance and curation speed. We conclude that tagtog-named entity recognition improves with a larger corpus and that tagtog-assisted curation is quicker than manual curation.

Database URL: www.tagtog.net, www.flybase.org.

Encouraging because the “tagging” is not wholly automated nor is it wholly hand-authored. Rather, the goal is to create an interface that draws on the strengths of automated processing as moderated by human expertise.

Annotation remains at the document level, which consigns subsequent users to mining full text, but this is definitely a step in the right direction.

3 Common Time Wasters at Work

April 13th, 2014

3 Common Time Wasters at Work by Randy Krum.

See Randy’s post for the graphic but #2 was:

Non-work related Internet Surfing

It occurred to me that “Non-work related Internet Surfing” is indistinguishable from… search. At least at arm’s length or better.

And so many people search poorly that a lack of useful results is easy to explain.

Yes?

So, what is the strategy to get the rank and file to use more efficient information systems than search?

Their non-use or non-effective use of your system can torpedo a sale just as quickly as any other cause.

Suggestions?

Clojure and Storm

April 13th, 2014

Two recent posts by Charles Ditzel, here and here, offer pointers to resources on Clojure and Storm.

For your reading convenience:

  • Storm Apache site.
  • Building an Activity Feed Stream with Storm, a recipe from the Clojure Cookbook.
  • Storm: distributed and fault-tolerant realtime computation by Nathan Marz (presentation, 2012).
  • Storm Topologies.
  • StormScreenCast2: Storm in use at Twitter (2011).

Enjoy!

Cross-Scheme Management in VocBench 2.1

April 13th, 2014

Cross-Scheme Management in VocBench 2.1 by Armando Stellato.

From the post:

One of the main features of the forthcoming VB2.1 will be SKOS Cross-Scheme Management.

I started drafting some notes about cross-scheme management here: https://art-uniroma2.atlassian.net/wiki/display/OWLART/SKOS+Cross-Scheme+Management

I think it is important to have all the integrity checks related to this aspect clear for humans, and not only have them sealed deep in the code. These notes will help users get acquainted with this feature in advance. Once completed, these will be included also in the manual of VB.

For the moment I’ve only written the introduction, some notes about data integrity and then described the checks carried out upon the most dangerous operation: removing a concept from a scheme. Together with the VB development group, we will add more information in the next days. However, if you have some questions about this feature, you may post them here, as usual (or you may use the vocbench user/developer user groups).

A consistent set of operations and integrity checks for cross-scheme are already in place for this 2.1, which will be released in the next days.

VB2.2 will focus on other aspects (multi-project management), while we foresee a second wave of facilities for cross-scheme management (such as mass-move/add/remove actions, fixing utilities, analysis of dangling concepts, corrective actions etc..) for VB2.3

I agree that:

I think it is important to have all the integrity checks related to this aspect clear for humans, and not only have them sealed deep in the code.

But I am less certain that following the integrity checks of SKOS is useful in all mappings between schemes.

If you are interested in such constraints, see Armando’s notes.

Online Python Tutor (update)

April 13th, 2014

Online Python Tutor by Philip Guo.

From the webpage:

Online Python Tutor is a free educational tool created by Philip Guo that helps students overcome a fundamental barrier to learning programming: understanding what happens as the computer executes each line of a program’s source code. Using this tool, a teacher or student can write a Python program in the Web browser and visualize what the computer is doing step-by-step as it executes the program.

As of Dec 2013, over 500,000 people in over 165 countries have used Online Python Tutor to understand and debug their programs, often as a supplement to textbooks, lecture notes, and online programming tutorials. Over 6,000 pieces of Python code are executed and visualized every day.

Users include self-directed learners, students taking online courses from Coursera, edX, and Udacity, and professors in dozens of universities such as MIT, UC Berkeley, and the University of Washington.
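
If you want a quick sense of what the step-by-step view buys you, paste a snippet like this one (my example, not from the site) and watch the frames and object arrows: the aliasing between a and b becomes impossible to miss:

```python
def append_twice(lst, item):
    lst.append(item)
    lst.append(item)
    return lst

a = [1, 2]
b = append_twice(a, 3)
print(a is b)  # True: both names point at the same list object
print(a)       # [1, 2, 3, 3]
```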

If you believe in crowd wisdom, 500,000 users is a vote of confidence in the Online Python Tutor.

I first mentioned the Online Python Tutor in LEARN programming by visualizing code execution.

Philip points to similar online tutors for Java, Ruby and Javascript.

Enjoy!

The Heartbleed Hit List:…

April 13th, 2014

The Heartbleed Hit List: The Passwords You Need to Change Right Now (Mashable)

An incomplete “hit list” for the Heartbleed bug.

Useful if you are worried about changing passwords, but also for seeing who is or was vulnerable and who’s not.

The financial sector comes up not vulnerable now and wasn’t vulnerable in the past. No exceptions.

Mashable points out banks have been warned to update their systems.

Makes me curious: is the financial sector so far behind on the technology curve that it hasn’t reached using OpenSSL, or do regulators of the financial sector know that little about the software that underpins it? Some of both? Some other explanation?

I first saw this in a tweet by MariaJesusV.

Will Computers Take Your Job?

April 13th, 2014

Probability that computers will take away your job posted by Jure Leskovec.

[Chart: jobs taken by computers]

For your further amusement, I recommend the full study, “The Future of Employment: How Susceptible are Jobs to Computerisation?” by C. Frey and M. Osborne (2013).

The lower the number, the less likely for computer replacement:

  • Logisticians – #55, more replaceable than Rehabilitation Counselors at #47.
  • Computer and Information Research Scientists – #69, more replaceable than Public Relations and Fundraising Managers at #67. (Sorry Don.)
  • Astronomers – #128, more replaceable than Credit Counselors at #126.
  • Dancers – #179? I’m not sure the authors have even seen Paula Abdul dance.
  • Computer Programmers – #293, more replaceable than Historians at #283.
  • Bartenders – #422. Have you ever told a sad story to a coin-operated vending machine?
  • Barbers – #439. Admittedly I only see barbers at a distance, but if I wanted one, I would prefer a human one.
  • Technical Writers – #526. The #1 reason why technical documentation is so poor. Technical writers are underappreciated and treated like crap. Good technical writing should be less replaceable by computers than Lodging Managers at #12.
  • Tax Examiners and Collectors, and Revenue Agents – #586. Stop cheering so loudly. You are frightening other cube dwellers.
  • Umpires, Referees, and Other Sports Officials – #684. Now cheer loudly! ;-)

If the results strike you as odd, consider this partial description of the approach taken to determine if a job could be taken over by a computer:

First, together with a group of ML researchers, we subjectively hand-labelled 70 occupations, assigning 1 if automatable, and 0 if not. For our subjective assessments, we draw upon a workshop held at the Oxford University Engineering Sciences Department, examining the automatability of a wide range of tasks. Our label assignments were based on eyeballing the O∗NET tasks and job description of each occupation. This information is particular to each occupation, as opposed to standardised across different jobs. The hand-labelling of the occupations was made by answering the question “Can the tasks of this job be sufficiently specified, conditional on the availability of big data, to be performed by state of the art computer-controlled equipment”. Thus, we only assigned a 1 to fully automatable occupations, where we considered all tasks to be automatable. To the best of our knowledge, we considered the possibility of task simplification, possibly allowing some currently non-automatable tasks to be automated. Labels were assigned only to the occupations about which we were most confident. (at page 30)

Not to mention that occupations were considered for automation on the basis of nine (9) variables.
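
In miniature, the pipeline is: hand-label 70 occupations, fit a probabilistic classifier on the nine O∗NET variables, then score every occupation. A sketch with synthetic stand-in data (the paper used a Gaussian process classifier; plain logistic regression stands in here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# 70 hand-labelled occupations, nine O*NET-derived variables each
# (synthetic stand-in data, purely illustrative)
X_labelled = rng.random((70, 9))
y = (X_labelled[:, 0] > 0.5).astype(int)   # 1 = automatable, 0 = not

clf = LogisticRegression().fit(X_labelled, y)

# score the full set of occupations (the paper ranks 702 of them)
X_all = rng.random((702, 9))
p_computerisation = clf.predict_proba(X_all)[:, 1]
print(p_computerisation[:5])
```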

Would you believe that semantics isn’t mentioned once in this paper? So now you know why I have issues with its methodology and conclusions. What do you think?

Testing Lucene’s index durability after crash or power loss

April 12th, 2014

Testing Lucene’s index durability after crash or power loss by Mike McCandless.

From the post:

One of Lucene’s useful transactional features is index durability which ensures that, once you successfully call IndexWriter.commit, even if the OS or JVM crashes or power is lost, or you kill -KILL your JVM process, after rebooting, the index will be intact (not corrupt) and will reflect the last successful commit before the crash.
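
Mike’s post describes a test harness built on exactly this idea. A generic sketch of the approach (my code, with hypothetical commands, not Mike’s harness): crash the writer at a random moment, then demand that verification still succeeds:

```python
import os, random, signal, subprocess, time

def crash_test(writer_cmd, verify_cmd, max_runtime=5.0):
    """Start a writer process, SIGKILL it mid-write (simulating a crash
    or power loss), then verify the index opens cleanly and reflects
    only the last successful commit."""
    proc = subprocess.Popen(writer_cmd)
    time.sleep(random.uniform(0.1, max_runtime))
    os.kill(proc.pid, signal.SIGKILL)
    proc.wait()
    subprocess.run(verify_cmd, check=True)  # raises if verification fails

# hypothetical commands:
# crash_test(["java", "IndexWriterStress"], ["java", "IndexVerifier"])
```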

If anyone at your startup is writing an indexing engine, be sure to pass this post from Mike along.

Ask them for a demonstration of equal durability of the index before using their work instead of Lucene.

You have enough work to do without replicating (poorly) work that already has enterprise-level reliability.

Read Access on Google Production Servers

April 12th, 2014

How we got read access on Google’s production servers

From the post:

To stay on top of the latest security alerts we often spend time on bug bounties and CTFs. When we were discussing the challenge for the weekend, Mathias got an interesting idea: What target can we use against itself?

Of course. The Google search engine!

What would be better than to scan Google for bugs other than by using the search engine itself? What kind of software tend to contain the most vulnerabilities?

  • Old and deprecated software
  • Unknown and hardly accessible software
  • Proprietary software that only a few people have access to
  • Alpha/Beta releases and otherwise new technologies (software in early stages of its lifetime)

I read recently that computer security defense is 10 years behind computer security offense.

Do you think that’s in part due to the difference in sharing of information between the two communities?

Computer offense aggressively sharing and computer defense aggressively hoarding.

Yes?

If you are interested in a less folklorish way of gathering computer security information (such as all the software versions known to have the Heartbleed SSL issue), think about using topic maps.

Reasoning that the pattern that led to the Heartbleed SSL memory leak was not unique.

As you build a list of Heartbleed-susceptible software, you have a suspect list for similar issues. Find another leak and you can associate it with all those packages, subject to verification.
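
The bookkeeping involved is trivial; the discipline of doing it is what pays. A toy sketch of the association (pattern and package names illustrative only):

```python
from collections import defaultdict

# map each vulnerability pattern to the packages known to exhibit it
pattern_to_packages = defaultdict(set)

def record(pattern, package):
    pattern_to_packages[pattern].add(package)

record("unchecked-length-before-memcpy", "openssl-1.0.1")    # Heartbleed
record("unchecked-length-before-memcpy", "embedded-tls-x")   # hypothetical

# a new leak with the same shape yields an instant suspect list
print(sorted(pattern_to_packages["unchecked-length-before-memcpy"]))
```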

BTW, a good starting point for your research, the detectify blog.

Facebook Gets Smarter with Graph Engine Optimization

April 12th, 2014

Facebook Gets Smarter with Graph Engine Optimization by Alex Woodie.

From the post:

Last fall, the folks in Facebook’s engineering team talked about how they employed the Apache Giraph engine to build a graph on its Hadoop platform that can host more than a trillion edges. While the Graph Search engine is capable of massive graphing tasks, there were some workloads that remained outside the company’s technical capabilities–until now.

Facebook turned to the Giraph engine to power its new Graph Search offering, which it unveiled in January 2013 as a way to let users perform searches on other users to determine, for example, what kind of music their Facebook friends like, what kinds of food they’re into, or what activities they’ve done recently. An API for Graph Search also provides advertisers with a new revenue source for Facebook. It’s likely the world’s largest graph implementation, and a showcase of what graph engines can do.

The company picked Giraph because it worked on their existing Hadoop implementation, including HDFS and its MapReduce infrastructure stack (known as Corona). Compared to running the computation workload on Hive, an internal Facebook test of a 400-billion edge graph ran 126x faster on Giraph, and had a 26x performance advantage, as we explained in a Datanami story last year.

When Facebook scaled its internal test graph up to 1 trillion edges, they were able to keep the processing of each iteration of the graph under four minutes on a 200-server cluster. That amazing feat was done without any optimization, the company claimed. “We didn’t cheat,” Facebook developer Avery Ching declared in a video. “This is a random hashing algorithm, so we’re randomly assigning the vertices to different machines in the system. Obviously, if we do some separation and locality optimization, we can get this number down quite a bit.”
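
The “random hashing algorithm” Ching mentions is the simplest possible vertex placement. In sketch form (illustrative only, not Facebook’s code):

```python
import zlib

def machine_for(vertex_id, num_machines=200):
    # Random (hash) partitioning: vertices land on machines with no
    # regard for locality, which is why Ching says locality optimization
    # could "get this number down quite a bit."
    return zlib.crc32(str(vertex_id).encode()) % num_machines

print(machine_for("user:42"))
```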

High level view with technical references on how Facebook is optimizing its Apache Giraph engine.

If you are interested in graphs, this is much more of a real world scenario than building “big” graphs out of uniform time slices.

PyCon US 2014 – Videos (Tutorials)

April 12th, 2014

The tutorial videos from PyCon US 2014 are online! Talks to follow.

Tutorials arranged by author for your finding convenience:

  • Blomo, Jim mrjob: Snakes on a Hadoop (see the word-count sketch after this list)

    This tutorial will take participants through basic usage of mrjob by writing analytics jobs over Yelp data. mrjob lets you easily write, run, and test distributed batch jobs in Python, on top of Hadoop. Hadoop is a MapReduce platform for processing big data but requires a fair amount of Java boilerplate. mrjob is an open source Python library written by Yelp used to process TBs of data every day.
  • Williams, G. Clifford 0 to 00111100 with web2py

    This tutorial teaches basic web development for people who have some experience with HTML. No experience with CSS or JavaScript is required. We will build a basic web application using AJAX, web forms, and a local SQL database.
  • Grisel, Olivier; Jake, Vanderplas Exploring Machine Learning with Scikit-learn

    This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.
  • Love, Kenneth Getting Started with Django, a crash course

    Getting Started With Django is a well-established series of videos teaching best practices and common approaches for building web apps to people new to Django. This tutorial combines the first few lessons into a single lesson. Attendees will follow along as I start and build an entire simple web app and, network permitting, deploy it to Heroku.
  • Ma, Eric How to formulate a (science) problem and analyze it using Python code

    Are you interested in doing analysis but don’t know where to start? This tutorial is for you. Python packages & tools (IPython, scikit-learn, NetworkX) are powerful for performing data analysis. However, little is said about formulating the questions and tying these tools together to provide a holistic view of the data. This tutorial will provide you with an introduction on how this can be done.
  • Müller, Mike Descriptors and Metaclasses – Understanding and Using Python's More Advanced Features

    Descriptors and metaclasses are advanced Python features. While it is possible to write Python programs without active knowledge of them, knowing how they work provides a deeper understanding of the language. Using examples, you will learn how they work and when to use them, as well as when it is better not to. Use cases provide working code that can serve as a base for your own solutions.
  • Vanderplas, Jake; Olivier Grisel Exploring Machine Learning with Scikit-learn

    This tutorial will offer an introduction to the core concepts of machine learning, and how they can be easily applied in Python using Scikit-learn. We will use the scikit-learn API to introduce and explore the basic categories of machine learning problems, related topics such as feature selection and model validation, and the application of these tools to real-world data sets.
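
To make the mrjob entry above concrete, here is the canonical word-count job in mrjob’s style (adapted from mrjob’s documentation; run it locally as python word_count.py input.txt):

```python
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # one record per word; mrjob handles the shuffle and sort
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```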

Tutorials or talks with multiple authors are listed under each author. (I don’t know which one you will remember.)

I am going to spin up the page for the talks so that when the videos appear, all I need do is insert the video links.

Enjoy!

Lost Boolean Operator?

April 12th, 2014

Binary Boolean Operator: The Lost Levels

From the post:

The most widely known of these four siblings is operator number 11. This operator is called the “material conditional”. It is used to test if a statement fits the logical pattern “P implies Q”. It is equivalent to !P || Q by the material implication.

I only know one language that implements this operation: VBScript.

The post has a good example of why the material conditional is useful.
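
A quick illustration of my own: Python has no implication operator either, but the material conditional falls out of not p or q and, amusingly, out of <= on booleans:

```python
def implies(p, q):
    return (not p) or q  # !P || Q, the material implication

for p in (False, True):
    for q in (False, True):
        # for booleans, p <= q computes the same truth table
        assert implies(p, q) == (p <= q)
        print(f"{p} -> {q}: {implies(p, q)}")
```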

Will your next language have a material conditional operator?

I first saw this in Pete Warden’s Five short links for April 3, 2014.

Prescription vs. Description

April 12th, 2014

Kurt Cagle posted this image on Facebook:

[Image: engineers vs. physicists]

with this comment:

The difference between INTJs and INTPs in a nutshell. Most engineers, and many programmers, are INTJs. Theoretical scientists (and I’m increasingly putting data scientists in that category) are far more INTPs – they are observers trying to understand why systems of things work, rather than people who use that knowledge to build, control or constrain those systems.

I would rephrase the distinction to be one of prescription (engineers) versus description (scientists) but that too is a false dichotomy.

You have to have some real or imagined description of a system to start prescribing for it, and any method for exploring a system has some prescriptive aspects.

The better course is to recognize that exploring or building systems has aspects of both. Making that recognition may (or may not) make it easier to discuss assumptions of either perspective that aren’t often voiced.

Being more from the descriptive side of the house, I enjoy pointing out that behind most prescriptive approaches are software and services to help you implement those prescriptions. Hardly seems like an unbiased starting point to me. ;-)

To be fair, however, the descriptive side of the house often has trouble distinguishing between important things to describe and describing everything it can to system capacity, for fear of missing some edge case. The “edge” cases may be larger than the system but if they lack business justification, pragmatics should reign over purity.

Or to put it another way: Prescription alone is too brittle and description alone is too endless.

Effective semantic modeling/integration needs to consist of varying portions of prescription and description depending upon the requirements of the project and projected ROI.

PS: The “ROI” of a project not in your domain, that doesn’t use your data, your staff, etc. is not a measure of the potential “ROI” for your project. Crediting such reports is “ROI” for the marketing department that created the news. Very important to distinguish “your ROI” from “vendor’s ROI.” Not the same thing. If you need help with that distinction, you know where to find me.

Hemingway App

April 11th, 2014

Hemingway App

We are a long way from something equivalent to Hemingway App for topic maps or other semantic technologies, but it struck me that this may not always be true.

Take it for a spin and see what you think.

What modifications would be necessary to make this concept work for a semantic technology?

Definitions Extractions from the Code of Federal Regulations

April 11th, 2014

Definitions Extractions from the Code of Federal Regulations by Mohammad M. AL Asswad, Deepthi Rajagopalan, and Neha Kulkarni. (poster)

From a description of the project:

Imagine you’re opening a new business that uses water in the production cycle. If you want to know what federal regulations apply to you, you might do a Google search that leads to the Code of Federal Regulations. But that’s where it gets complicated, because the law contains hundreds of regulations involving water that are difficult to narrow down. (The CFR alone contains 13898 references to water.) For example, water may be defined one way when referring to a drinkable liquid and another when defined as an emission from a manufacturing facility. If the regulation says your water must maintain a certain level of purity, to which water are they referring? Definitions are the building blocks of the law, and yet poring through them to find what applies to you is frustrating to an average business owner. Computer automation might help, but how can a computer understand exactly what kind of water you’re looking for? We at the Legal Information Institute think this is a pretty important challenge, and apparently Google does too.
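
The poster itself doesn’t include code, but a naive first pass at spotting CFR-style definitions gives a feel for the problem. The pattern below is mine and purely illustrative; the hard part, as the description says, is deciding which “water” applies to you:

```python
import re

# CFR definitions often read: "Water" means ... / The term "water" means ...
DEFN = re.compile(
    r'(?:[Tt]he term\s+)?["\u201c](?P<term>[^"\u201d]+)["\u201d]\s+'
    r'means\s+(?P<body>[^.]+)\.'
)

text = '"Water" means potable water as defined in part 141 of this title.'
m = DEFN.search(text)
if m:
    print(m.group("term"), "->", m.group("body"))
# Water -> potable water as defined in part 141 of this title
```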

Looking forward to learning more about this project!

BTW, this is the same Code of Federal Regulations that some members of Congress don’t think needs to be indexed.

Knowing what legal definitions apply is a big step towards making legal material more accessible.

Google Top 10 Search Tips

April 11th, 2014

Google Top 10 Search Tips by Karen Blakeman.

From the post:

These are the top 10 tips from the participants of a recent workshop on Google, organised by UKeiG and held on 9th April 2014. The edited slides from the day can be found on authorSTREAM at http://www.authorstream.com/Presentation/karenblakeman-2121264-making-google-behave-techniques-better-results/ and on Slideshare at http://www.slideshare.net/KarenBlakeman/making-google-behave-techniques-for-better-results

Ten search tips from the trenches. Makes a very nice cheat sheet.