Farming communities in Africa and South Asia are becoming increasingly vulnerable to shock as the effects of climate change become a reality. This increased vulnerability, however, comes at a time when improved technology makes critical information more accessible than ever before. aWhere Weather, an online platform offering free weather data for locations in Western, Eastern and Southern Africa and South Asia, provides instant and interactive access to highly localized weather data, instrumental for improved decision making and providing greater context in shaping policies relating to agricultural development and global health.
Weather Data in 9km Grid Cells
Weather data is collected at meteorological stations around the world and interpolated to create accurate data in detailed 9km grids. Within each cell, users can access historical, daily-observed and 8 days of daily forecasted ‘localized’ weather data for the following variables:
Precipitation
Minimum and Maximum Temperature
Minimum and Maximum Relative Humidity
Solar Radiation
Maximum and Morning Wind Speed
Growing degree days (dynamically calculated for your base and cap temperature)
These data prove essential for risk adaption efforts, food security interventions, climate-smart decision making, and agricultural or environmental research activities.
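To make the growing degree days variable in the list above concrete, here is a minimal sketch of one common formulation, in which the daily minimum and maximum temperatures are clamped to the base and cap before averaging. The function name, the clamping convention and the sample temperatures are all assumptions for illustration; the description does not say which variant aWhere uses.

```python
def growing_degree_days(t_min, t_max, base, cap):
    """Daily growing degree days by a clamped-average method (an assumed variant)."""
    # Clamp both readings into [base, cap] so extremes outside the range do not count.
    lo = min(max(t_min, base), cap)
    hi = min(max(t_max, base), cap)
    return (lo + hi) / 2.0 - base


# Hypothetical week of (min, max) temperatures in degrees Celsius.
week = [(8, 22), (12, 31), (15, 36), (5, 9), (11, 27), (14, 29), (10, 24)]
total = sum(growing_degree_days(lo, hi, base=10, cap=30) for lo, hi in week)
```

Changing `base` and `cap` changes the accumulated total, which is presumably why the platform calculates the value dynamically for each user.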
At least as a public observer, I could not determine how much “interpolation” is going into the weather data. That would have a major impact on the risk of accepting the data provided at face value.
I suspect it varies from little or no interpolation in heavily instrumented areas to quite a bit in areas with sparser readings. How much is unclear.
It may be that the amount of interpolation in the data depends on whether you use the free version or some upgraded commercial version.
Still, an interesting data source to combine with others, if you are mindful of the risks.
The schedule for CS188.2x hasn’t been announced yet.
In the meantime, you can register for CS188.1x and peruse the videos, exercises, etc. while you wait for the second part of the course.
From the description:
CS188.1x is a new online adaptation of the first half of UC Berkeley’s CS188: Introduction to Artificial Intelligence. The on-campus version of this upper division computer science course draws about 600 Berkeley students each year.
Artificial intelligence is already all around you, from web search to video games. AI methods plan your driving directions, filter your spam, and focus your cameras on faces. AI lets you guide your phone with your voice and read foreign newspapers in English. Beyond today’s applications, AI is at the core of many new technologies that will shape our future. From self-driving cars to household robots, advancements in AI help transform science fiction into real systems.
CS188.1x focuses on Behavior from Computation. It will introduce the basic ideas and techniques underlying the design of intelligent computer systems. A specific emphasis will be on the statistical and decision–theoretic modeling paradigm. By the end of this course, you will have built autonomous agents that efficiently make decisions in stochastic and in adversarial settings. CS188.2x (to follow CS188.1x, precise date to be determined) will cover Reasoning and Learning. With this additional machinery your agents will be able to draw inferences in uncertain environments and optimize actions for arbitrary reward structures. Your machine learning algorithms will classify handwritten digits and photographs. The techniques you learn in CS188x apply to a wide variety of artificial intelligence problems and will serve as the foundation for further study in any application area you choose to pursue.
Lucene’s facet module has seen some great improvements recently: sizable (nearly 4X) speedups and new features like DrillSideways. The Jira issues search example showcases a number of facet features. Here I’ll describe two recently committed facet features: sorted-set doc-values faceting, already available in 4.3, and dynamic range faceting, coming in the next (4.4) release.
To understand these features, and why they are important, we first need a little background. Lucene’s facet module does most of its work at indexing time: for each indexed document, it examines every facet label, each of which may be hierarchical, and maps each unique label in the hierarchy to an integer id, and then encodes all ids into a binary doc values field. A separate taxonomy index stores this mapping, and ensures that, even across segments, the same label gets the same id.
At search time, faceting cost is minimal: for each matched document, we visit all integer ids and aggregate counts in an array, summarizing the results in the end, for example as top N facet labels by count.
This is in contrast to purely dynamic faceting implementations like ElasticSearch’s and Solr’s, which do all work at search time. Such approaches are more flexible: you need not do anything special during indexing, and for every query you can pick and choose exactly which facets to compute.
However, the price for that flexibility is slower searching, as each search must do more work for every matched document. Furthermore, the impact on near-real-time reopen latency can be horribly costly if top-level data-structures, such as Solr’s UnInvertedField, must be rebuilt on every reopen. The taxonomy index used by the facet module means no extra work needs to be done on each near-real-time reopen.
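The split between index-time and search-time work described above can be sketched in a few lines. This is plain Python, not Lucene's API: the `Taxonomy` class, the function names and the sample documents are hypothetical stand-ins for the taxonomy index and the binary doc values field.

```python
class Taxonomy:
    """Maps each unique facet label to a stable integer id."""

    def __init__(self):
        self.ids = {}
        self.labels = []

    def id_for(self, label):
        if label not in self.ids:
            self.ids[label] = len(self.labels)
            self.labels.append(label)
        return self.ids[label]


def index_document(taxonomy, facet_paths):
    """Index time: assign an id to every level of each hierarchical label."""
    ids = set()
    for path in facet_paths:  # e.g. ("Author", "Bob")
        for depth in range(1, len(path) + 1):
            ids.add(taxonomy.id_for(path[:depth]))
    return sorted(ids)


def count_facets(taxonomy, ids_per_hit, top_n):
    """Search time: bump one counter per id, then pick the top N labels."""
    counts = [0] * len(taxonomy.labels)
    for ids in ids_per_hit:
        for i in ids:
            counts[i] += 1
    ranked = sorted(range(len(counts)), key=lambda i: (-counts[i], i))
    return [(taxonomy.labels[i], counts[i]) for i in ranked[:top_n] if counts[i] > 0]


taxo = Taxonomy()
hits = [
    index_document(taxo, [("Author", "Bob"), ("Year", "2013")]),
    index_document(taxo, [("Author", "Lisa"), ("Year", "2013")]),
    index_document(taxo, [("Author", "Bob"), ("Year", "2012")]),
]
top = count_facets(taxo, hits, top_n=3)
```

The search-time loop does nothing but increment array slots, which is where the low per-document cost comes from; the price is that the label-to-id mapping has to be decided while indexing.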
The dynamic range faceting sounds particularly useful.
It’s critical for analysts and presenters of data to share information in a way that people just get it. Enter data storytelling – a magical elixir to all your data communication woes! Well, maybe not quite. But you should be aware of recent efforts using this timeless approach to deliver information so naturally – through stories.
This exercise breaks down a structured (yet casual) introduction to data storytelling through a variety of resources. We wanted to provide a diversity of depth and inspiration. Feel free to skip around or follow our 4 week sequence. Print it and post it near the water cooler or slap it to your virtual desktop.
I don’t have a water cooler but I will post “30 Days to Data Storytelling” next to my monitors.
Whatever the subject, knowledge you can’t communicate to others is lost.
We have been blown away with the number and size of organizations who have downloaded the beta bits of this 100% open source, and native to Windows distribution of Hadoop and engaged Hortonworks and Microsoft around evolving their data architecture to respond to the challenges of enterprise big data.
With this key milestone HDP for Windows offers the millions of customers running their business on Microsoft technologies an ecosystem-friendly Hadoop-based solution that is built for the enterprise and purpose built for Windows. This release cements Apache Hadoop’s role as a key component of the next generation enterprise data architecture, across the broadest set of datacenter configurations as HDP becomes the first production-ready Apache Hadoop distribution to run on both Windows and Linux.
Additionally, customers now also have complete portability of their Hadoop applications between on-premise and cloud deployments via HDP for Windows and Microsoft’s HDInsight Service.
Two lessons here:
First, Hadoop is a very popular way to address enterprise big data.
Second, going where users are, not where they ought to be, is a smart business move.
A molecule editor, i.e. a program facilitating graphical input and interactive editing of molecules, is an indispensable part of every cheminformatics or molecular processing system. Today, when a web browser has become the universal scientific user interface, a tool to edit molecules directly within the web browser is essential. One of the most popular tools for molecular structure input on the web is the JME applet. Since its release nearly 15 years ago, however, the web environment has changed and Java applets are facing increasing implementation hurdles due to their maintenance and support requirements, as well as security issues. This prompted us to update the JME editor and port it to a modern Internet programming language – JavaScript.
Summary
The actual molecule editing Java code of the JME editor was translated into JavaScript with help of the Google Web Toolkit compiler and a custom library that emulates a subset of the GUI features of the Java runtime environment. In this process, the editor was enhanced by additional functionalities including a substituent menu, copy/paste, drag and drop and undo/redo capabilities and an integrated help. In addition to desktop computers, the editor supports molecule editing on touch devices, including iPhone, iPad and Android phones and tablets. In analogy to JME the new editor is named JSME. This new molecule editor is compact, easy to use and easy to incorporate into web pages.
Conclusions
A free molecule editor written in JavaScript was developed and is released under the terms of the permissive BSD license. The editor is compatible with JME and has practically the same user interface as well as the same web application programming interface. The JSME editor is available for download from the project web page http://peter-ertl.com/jsme/
Just in case you were having any doubts about using JavaScript to power an annotation editor.
After over a year of R&D, five milestone releases, and two release candidates, we are happy to release Neo4j 1.9 today! It is available for download effective immediately. And the latest source code is available, as always, on Github.
The 1.9 release adds primarily three things:
Auto-Clustering, which makes Neo4j Enterprise clustering more robust & easier to administer, with fewer moving parts
Cypher language improvements make the language more functionally powerful and more performant, and
New welcome pages make learning easier for new users
Searching through all your content is fine – until you get a mountain of it with similar content, differentiated only by context. Then you’ll need to understand the meaning within the content. In this post I discuss how to do this using semantic techniques…
Organisations today have realised that for certain applications it is useful to have a consolidated search approach over several catalogues. This is most often the case when customers can interact with several parts of the company – sales, billing, service, delivery, fraud checks.
This approach is commonly called Enterprise Search, or Search and Discovery, which is where your content across several repositories is indexed in a separate search engine. Typically this indexing occurs some time after the content is added. In addition, it is not possible for a search engine to understand the full capabilities of every content system. This means complex mappings are needed between content, meta data and security. In some cases, this may be retrofitted with custom code as the systems do not support a common vocabulary around these aspects of information management.
Content Search
We are all used to content search, so much so that for today’s teenagers a search bar with a common (‘Google like’) grammar is expected. This simple yet powerful interface allows us to search for content (typically web pages and documents) that contain all the words or phrases that we need. Often this is broadened by the use of a thesaurus and word stemming (plays and played stems to the verb play), and combined with some form of weighting based on relative frequency within each unit of content.
Other techniques are also applied. Metadata is extracted or implied – author, date created, modified, security classification, Dublin Core descriptive data. Classification tools can be used (either at the content store or search indexing stages) to perform entity extraction (Cheese is a food stuff) and enrichment (Sheffield is a place with these geospatial co-ordinates). This provides a greater level of description of the term being searched for over and above simple word terms.
Using these techniques, additional search functionality can be provided. Search for all shops visible on a map using a bounding box, radius or polygon geospatial search. Return only documents where these words are within 6 words of each other. Perhaps weight some terms as more important than others, or optional.
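Two of the techniques just mentioned, proximity search and bounding-box search, are simple enough to sketch. The function names, the naive whitespace tokenizer and the sample coordinates are assumptions for illustration; a real search engine works from positional and spatial indexes rather than scanning text.

```python
def within_distance(text, first, second, max_gap):
    """True if `first` and `second` occur within `max_gap` words of each other."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(abs(a - b) <= max_gap for a in first_pos for b in second_pos)


def in_bounding_box(lat, lon, south, west, north, east):
    """True if the point lies inside the box (no antimeridian handling)."""
    return south <= lat <= north and west <= lon <= east


sentence = "Cheese shops in Sheffield sell local cheese"
near = within_distance(sentence, "cheese", "sheffield", 6)
# Hypothetical box and point, in decimal degrees.
visible = in_bounding_box(53.38, -1.47, south=53.0, west=-2.0, north=54.0, east=-1.0)
```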
These techniques are provided by many of the Enterprise class search engines out there today. Even Open Source tools like Lucene and Solr are catching up with this. They have provided access to information where before we had to rely on Information and Library Services staff to correctly classify incoming documents manually, as they did back in the paper bound days of yore.
Content search only gets you so far though.
I was amening with the best of them until Adam reached the part about MarkLogic 7 going to add Semantic Web capabilities.
I didn’t see any mention of linked data replicating the semantic diversity that currently exists in data stores.
Making data more accessible isn’t going to make it less diverse.
Although making data more accessible may drive the development of ways to manage semantic diversity.
So perhaps there is a useful side to linked data after all.
Google IO wrapped up last week with a tremendous number of data-related announcements. Today’s post is going to focus on Google Compute Engine (GCE), Google’s answer to Amazon’s Elastic Compute Cloud (EC2) that allows you to create and run virtual compute instances within Google’s cloud. We have spent a good amount of time talking about GCE in the past, in particular, benchmarking it against EC2 here, here, here, and here.
The main GCE announcement at IO was, of course, the fact that now **anyone** and **everyone** can try out and use GCE. Yes, GCE instances now support up to 10 terabytes per disk volume, which is a BIG deal. However, the fact that GCE will use minute-by-minute pricing, which might not seem incredibly significant on the surface, is an absolute game changer.
Let’s say that I have a job that will take just a thousand instances each a little bit over an hour to finish (a total of just over a thousand “instance hours”). I launch my thousand instances, run the needed job, and then shut down my cloud 61 minutes later. Let’s also assume that Amazon and Google both charge about the same amount, say $0.50 per instance per hour (a relatively safe assumption) and that Amazon’s and Google’s instances have the same computational horsepower (this is not true, see my benchmark results). As Amazon charges by the hour, Amazon would charge me for two hours per instance or $1000.00 total (1000 instances x $0.50 per instance per hour x 2 hours per instance) whereas Google would only charge me $508.34 (1000 instances x $0.50 per instance per hour x 61/60 hours per instance). In this circumstance, Amazon’s hourly billing has almost doubled my costs but the impact is far worse.
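The arithmetic in that example is easy to check. A minimal sketch, assuming the same $0.50 hourly rate for both providers as the quote does, and ignoring any minimum charge either provider may apply:

```python
import math


def hourly_cost(instances, rate_per_hour, minutes):
    # Per-hour billing rounds each instance up to the next full hour.
    return instances * rate_per_hour * math.ceil(minutes / 60)


def per_minute_cost(instances, rate_per_hour, minutes):
    # Per-minute billing charges only for the minutes actually used.
    return instances * rate_per_hour * minutes / 60


hourly = hourly_cost(1000, 0.50, 61)
by_minute = per_minute_cost(1000, 0.50, 61)
```

Here `hourly` comes to $1000.00 and `by_minute` to roughly $508.33, within a cent of the quoted $508.34, a difference that looks like rounding. The gap between the two models is widest for jobs that run just past an hour boundary.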
Sean does a great job covering the impact of minute-by-minute pricing for cloud computing.
Great news for the short run and I suspect even greater news for the long run.
What happens when instances and storage become too cheap to meter?
Like domestic long distance telephone service.
When anything that can be computed is within the reach of everyone, what will be computed?
Nina Zumel and I ( John Mount ) have been working very hard on producing an exciting new book called “Practical Data Science with R.” The book has now entered the Manning Early Access Program (MEAP) which allows you to subscribe to chapters as they become available and give us feedback before the book goes into print.
Building a search on BillTrack50 is fairly straightforward, however it isn’t exactly like doing a Google search. So there’s a few things you need to keep in mind, which I’ll explain in this post. There’s also a few tips and tricks advanced users might find useful. Any bills that are introduced later and meet your search terms will be automatically added to your bill sheet (if you made a bill sheet).
Tracking “thumb on the scale” (TOTS) at the state level? BillTrack50 is a great starting point.
BillTrack50 provides surface facts, to which you can add vote trading, influence peddling and other routine legislative activities.
Metaphor Identification in Large Texts Corpora by Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, Ophir Frieder. (Neuman Y, Assaf D, Cohen Y, Last M, Argamon S, et al. (2013) Metaphor Identification in Large Texts Corpora. PLoS ONE 8(4): e62343. doi:10.1371/journal.pone.0062343)
Abstract:
Identifying metaphorical language-use (e.g., sweet child) is one of the challenges facing natural language processing. This paper describes three novel algorithms for automatic metaphor identification. The algorithms are variations of the same core algorithm. We evaluate the algorithms on two corpora of Reuters and the New York Times articles. The paper presents the most comprehensive study of metaphor identification in terms of scope of metaphorical phrases and annotated corpora size. Algorithms’ performance in identifying linguistic phrases as metaphorical or literal has been compared to human judgment. Overall, the algorithms outperform the state-of-the-art algorithm with 71% precision and 27% averaged improvement in prediction over the base-rate of metaphors in the corpus.
A deep review of current work and promising new algorithms on metaphor identification.
The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can actually tell which places actually have an unusually high occurrence of X, rather than just having more people.
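The normalization Pete describes amounts to dividing raw counts by population before comparing places. A minimal sketch with made-up place names and numbers; it does not call the DSTK, so the population figures simply stand in for whatever a lookup would return.

```python
def rate_per_100k(event_count, population):
    """Occurrences per 100,000 residents."""
    return event_count * 100_000 / population


# Hypothetical (event count, population) pairs.
places = {"Metropolis": (480, 2_400_000), "Smallville": (30, 45_000)}

by_raw_count = sorted(places, key=lambda p: places[p][0], reverse=True)
by_rate = sorted(places, key=lambda p: rate_per_100k(*places[p]), reverse=True)
```

Ranked by raw count the big city comes first; ranked by rate the small town does, which is the difference between a population map and a map of unusually high occurrence.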
If you use the DSTK (and you should), do send Pete a note of appreciation.
Forty-seven years after Nowhere Man by the Beatles, a U.S. Senate panel discovers several nowhere men.
A Wall Street Journal Technology Alert:
Apple has set up corporate structures that have allowed it to pay little or no corporate tax–in any country–on much of its overseas income, according to the findings of a U.S. Senate examination.
The unusual result is possible because the iPhone maker’s key foreign subsidiaries argue they are residents of nowhere, according to the investigators’ report, which will be discussed at a hearing Tuesday where Apple CEO Tim Cook will testify. The finding comes from a lengthy investigation into the technology giant’s tax practices by the Senate Permanent Subcommittee on Investigations, led by Sens. Carl Levin (D., Mich.) and John McCain (R., Ariz.).
Apple’s testimony also includes a call to overhaul: “Apple welcomes an objective examination of the US corporate tax system, which has not kept pace with the advent of the digital age and the rapidly changing global economy.”
Tax reform will be useful only if it is “transparent” tax reform.
Transparent tax reform means every provision with more than a $100,000 impact on any taxpayer names all the taxpayers impacted, whether they pay more or less in taxes.
We have the data, we need the will to apply the analysis.
Subgraph matching algorithms are designed to find all instances of predefined subgraphs in a large graph or network and play an important role in the discovery and analysis of so-called network motifs, subgraph patterns which occur more often than expected by chance. We present the index-based subgraph matching algorithm (ISMA), a novel tree-based algorithm. ISMA realizes a speedup compared to existing algorithms by carefully selecting the order in which the nodes of a query subgraph are investigated. In order to achieve this, we developed a number of data structures and maximally exploited symmetry characteristics of the subgraph. We compared ISMA to a naive recursive tree-based algorithm and to a number of well-known subgraph matching algorithms. Our algorithm outperforms the other algorithms, especially on large networks and with large query subgraphs. An implementation of ISMA in Java is freely available at http://sourceforge.net/projects/isma/.
From the introduction:
Over the last decade, network theory has come to play a central role in our understanding of complex systems in fields as diverse as molecular biology, sociology, economics, the internet, and others [1]. The central question in all these fields is to understand behavior at the level of the whole system from the topology of interactions between its individual constituents. In this respect, the existence of network motifs, small subgraph patterns which occur more often in a network than expected by chance, has turned out to be one of the defining properties of real-world complex networks, in particular biological networks [2]. Network motifs act as the fundamental information processing units in cellular regulatory networks [3] and they form the building blocks of larger functional modules (also known as network communities) [4]–[6]. The discovery and analysis of network motifs crucially depends on the ability to enumerate all instances of a given query subgraph in a network or graph of interest, a classical problem in pattern recognition [7], that is known to be NP complete [8].
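For readers new to the problem, here is a minimal sketch of the kind of naive recursive matcher the authors use as a baseline. It is not ISMA: the node ordering is arbitrary, there is no indexing, and the function name and sample graphs are assumptions for illustration.

```python
def find_subgraph_matches(graph, query):
    """Enumerate mappings query node -> graph node that preserve every query edge.

    Both graphs are undirected and given as {node: set_of_neighbours}.
    """
    order = list(query)
    matches = []

    def extend(mapping):
        if len(mapping) == len(order):
            matches.append(dict(mapping))
            return
        q = order[len(mapping)]
        for g in graph:
            if g in mapping.values():
                continue  # keep the mapping one-to-one
            # Every already-mapped neighbour of q must be adjacent to g.
            if all(mapping[n] in graph[g] for n in query[q] if n in mapping):
                mapping[q] = g
                extend(mapping)
                del mapping[q]

    extend({})
    return matches


graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
matches = find_subgraph_matches(graph, triangle)
```

The single triangle in this graph is reported six times, once per symmetric relabelling of the query, which hints at why exploiting the symmetry of the query subgraph matters.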
Heavy sledding but important for exploration of large graphs/networks and the subsequent representation of those findings in a topic map.
As you may remember, I created a little beer graph some time ago to experiment and have fun with beer, and graphs. And yes, I have been having LOTS of fun with it – using it to explain graph concepts to lots of not-so-technical folks, like myself. Many people liked it, and even more people had some questions about it – started thinking in graphs, basically. Which is way more than what I ever hoped for – so that’s great!
One of the questions that people always asked me was about the model. Why did I model things the way I did? Are there no other ways to model this domain? What would be the *best* way to model it? All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model – it’s the QUERY that will drive the modelling decisions.
One of the things that spurred the discussion was – probably not coincidentally – the AlcoholPercentage. Many people were expecting that to be a *property* of the Beerbrand – but instead in my beergraph, I had “pulled it out”. The main reason at the time was more coincidence than anything else, but when you think of it – it’s actually a fantastic thing to “pull things out” and normalise the data model much further than you probably would in a relational model. By making the alcoholpercentage a node of its own, it allowed me to do more interesting queries and pathfinding operations – which led to interesting beer recommendations. Which is what this is all about, right?
(…)
When I read:
All of these questions have somewhat vague answers, because as a rule, there is no *one way* to model a graph. The data does not determine the model – it’s the QUERY that will drive the modelling decisions.
or
…but instead in my beergraph, I had “pulled it out”. The main reason at the time was more coincidence than anything else, but when you think of it – it’s actually a fantastic thing to “pull things out” and normalise the data model much further than you probably would in a relational model.
I don’t feel like I’ve been vague, ever.
Here is my summary of what Rik may have meant:
“no *one way* to model a graph” -> graphs support multiple models of data
“The data does not determine the model” -> may mean you can create any arbitrary model based on any data
“…the QUERY that will drive the modeling decisions.” -> in topic map terms, what gets represented by a topic (node in a graph) is what you want to talk about (query)
“…pulled it out…”/“…pull things out…” -> represent a subject with a node (graph) or topic (topic maps).
“…normalise the data model much further…” -> The distinction from database normalization isn’t clear, may just be filler.
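One way to see what “pulling out” the alcohol percentage buys is a minimal sketch with plain dictionaries. The beer names and percentages are made up, and this is not Rik's model or Cypher, just the idea of promoting a property to a node.

```python
from collections import defaultdict

# Property style: the percentage is just a value attached to each beer.
beers = {"Beer A": 6.2, "Beer B": 8.5, "Beer C": 8.5, "Beer D": 9.0}

# Node style: each distinct percentage becomes a node linked to its beers.
edges = defaultdict(set)
for beer, abv in beers.items():
    pct_node = ("AlcoholPercentage", abv)
    edges[("BeerBrand", beer)].add(pct_node)
    edges[pct_node].add(("BeerBrand", beer))


def same_strength(beer):
    """Two-hop walk: beer -> percentage node -> other beers."""
    found = set()
    for pct_node in edges[("BeerBrand", beer)]:
        found |= {name for kind, name in edges[pct_node] if kind == "BeerBrand"}
    found.discard(beer)
    return sorted(found)


similar = same_strength("Beer B")
```

With the percentage as a node, “beers of the same strength” is a path rather than a scan over every beer's properties, which fits the point that the query drives the model.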
FuzzyLaw has gathered explanations of legal terms from members of the public in order to get a sense of what the ‘person on the street’ has in mind when they think of a legal term. By making lay-people’s explanations of legal terms available to interpreters, police and other legal professionals, we hope to stimulate debate and learning about word meaning, public understanding of law and the nature of explanation.
The explanations gathered in FuzzyLaw are unusual in that they are provided by members of the public. These people, all aged over 18, regard themselves as ‘native speakers’, ‘first language speakers’ and ‘mother tongue’ speakers of English and have lived in England and/or Wales for 10 years or more. We might therefore expect that they will understand English legal terminology as well as any member of the public might. No one who has contributed has ever worked in the criminal law system or as an interpreter or translator. They therefore bring no special expertise to the task of explanation, beyond whatever their daily life has provided.
We have gathered explanations for 37 words in total. You can see a sample of these explanations on FuzzyLaw. The sample of explanations is regularly updated. You can also read responses to the terms and the explanations from mainly interpreters, police officers and academics. You are warmly invited to add your own responses and join in the discussion of each and every word. Check back regularly to see how discussions develop and consider bookmarking the site for future visits. The site also contains commentaries on interesting phenomena which have emerged through the site. You can respond to the commentaries too on that page, contributing to the developing research project.
(…)
Have you ever wondered what the ‘person on the street’ thinks about relational databases, RDF or the Semantic Web?
Those are the folks who are being pushed content based on interpretations not of their own making.
Here’s a work experiment for you:
Take ten search terms from your local query log.
At each department staff meeting, distribute sheets with the words, requesting everyone to define the terms in their own words. No wrong answers.
Tally up the definitions per department and across the company.
From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.
We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
Of particular note is the use of an immutable graph as the core data structure for GraphX.
The authors report that GraphX performs less well than PowerGraph (GraphLab 2.1) but promise performance gains and offsetting gains in productivity.
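To give a feel for the Pregel abstraction the abstract mentions, here is a minimal single-machine sketch of a message-passing loop, used to label connected components. It is plain Python rather than GraphX or Spark code, and the function names and signatures are assumptions for illustration. Each superstep builds a new mapping of vertex values instead of mutating the old one, loosely echoing the immutable-graph idea.

```python
def pregel(values, edges, vprog, send_msg, merge_msg, max_iters=50):
    """Run supersteps until no messages are sent or max_iters is reached."""
    for _ in range(max_iters):
        inbox = {}
        for src, dst in edges:
            for target, msg in send_msg(src, values[src], dst, values[dst]):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:
            break  # no messages means the computation has converged
        # Build a fresh mapping rather than updating the old one in place.
        values = {v: vprog(val, inbox[v]) if v in inbox else val
                  for v, val in values.items()}
    return values


def connected_components(vertex_ids, undirected_edges):
    """Label every vertex with the smallest vertex id in its component."""
    edges = list(undirected_edges) + [(b, a) for a, b in undirected_edges]
    return pregel(
        {v: v for v in vertex_ids},
        edges,
        vprog=min,
        send_msg=lambda s, sv, d, dv: [(d, sv)] if sv < dv else [],
        merge_msg=min,
    )


labels = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
```

The algorithm-specific part is just the three small functions passed in, which makes the paper's claim about implementing these abstractions in very few lines easier to picture.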
I didn’t find any additional resources at AMPLab on GraphX but did find:
The popular open source project GraphLab received a major boost early this week when a new company comprised of its founding developers raised funding to develop analytic tools for graph data sets. GraphLab Inc. will continue to use the open source GraphLab to “push the limits of graph computation and develop new ideas”, but having a commercial company will accelerate development, and allow the hiring of resources dedicated to improving usability and documentation.
While social media placed graph data on the radar of many companies, similar data sets can be found in many domains including the life and health sciences, security, and financial services. Graph data is different enough that it necessitates special tools and techniques. Because tools were a bit too complex for casual users, in the past this meant graph data analytics was the province of specialists. Fortunately graph data is an area that has attracted many enthusiastic entrepreneurs and developers. The tools have improved and I expect things to get much easier for users in the future. A great place to learn more about tools for graph data is the upcoming GraphLab Workshop (on July 1st in SF).
(…)
Ben summarizes graph resources for:
Data wrangling: creating graphs
Data management and search
Graph-parallel frameworks
Machine-learning and analytics
Visualization
It would be hard to find a better starting place for investigating the buzz about graphs.
I spent quite a few summer vacations as a kid getting dragged around Europe visiting castles and churches. It is definitely an experience that I’m more thankful for now than I was at the time. One of the things that I loved most, even as a child, was seeing the stained glass windows. I have strong memories of being in Notre Dame in Paris and watching the light come in at dawn or staring at the Chartres Cathedral windows for minutes without moving.
As a boy, it wasn’t the history, the architecture or an admiration of the faith involved to build these churches. Those were concepts beyond my ability, knowledge or frankly interest at the time. What I have come to realize only in the past couple of years is that the windows were meant for me. At the base level, I needed something that could grab my attention and hold it. What I have discovered is that from this standpoint, I am no different than the illiterate masses of the Middle Ages or Renaissance. (emphasis in original)
Michel proceeds to make the art of Chartres Cathedral a lesson in data visualization and graphic presentation.
A very powerful lesson.
Does your interface treat communication with users as important?
When the San Francisco Giants won the 2010 World Series, the post-victory celebrations got out of control. Revelers smashed windows, got into fistfights and started fires. A Muni bus and the metaverse were both set alight.
To track the chaos, Eric Eberhardt, a techie from the Bay Area, tuned in to a San Francisco police scanner station on soma.fm — while also listening to music. Something about the combination of ambient music and live police chatter clicked for Eberhardt, and youarelistening.to was born.
Eberhardt’s site is a mash-up of three APIs: police scanner audio from RadioReference.com, ambient music from SoundCloud and images from Flickr. The outcome is like a real-time soundtrack to Michael Mann’s movie “Heat.” My colleague Chase Davis, interactive news assistant editor, describes it as “‘Hearts of Space’ meets ‘The Wire.’”
Definitely has potential to enrich a user experience.
Imagine studying early 21st century history and when George W. Bush or Dick Cheney shows up on your ereader, War Pigs plays in the background.
Trivia: Did you know that War Pigs was one of 165 songs that Clear Channel suggested could be inappropriate to play after 9/11? 2001 Clear Channel Memorandum.
Cat Stevens with Peace Train also made the list.
Terrorism we can survive. Those trying to protect us, I’m not so sure.
Dashboards are often created on-the-fly with data being added simply because there is some white space not being used. Different people in the company ask for different data to be displayed and soon the dashboard becomes hard to read and full of meaningless non-related information. When this happens, the dashboard is no longer useful.
This article discusses the steps that need to be taken during the design phase in order to create a useful and actionable dashboard.
Topic maps can be expressed as dashboards as well as other types of interfaces.
Whatever your interface, it needs to be driven by good design principles.
Thanks to Pieter, AnormCypher 0.4.1 supports versions earlier than Neo4j 1.9 (I didn’t realize this was an issue).
AnormCypher is a Cypher-oriented Scala library for Neo4j Server (REST). The goal is to provide a great API for calling arbitrary Cypher and parsing out results, with an API inspired by Anorm from the Play! Framework.
If you are working with a Neo4j Server this may be of interest.
During the last couple of months I have been asked a few times by colleagues and friends how to get started with Scala. People come to Scala from diverse backgrounds such as…
– Java folks looking for a better Java or just tired of waiting for Java features other modern languages such as C# already offer.
– Ruby, PHP, and programmers that come from a scripting background looking for type safety.
– People trying to bridge the best of both OOP and Functional paradigms.
Scala is a vast language full of features with a very technical community. Don’t let your first step discourage you, as you don’t need to know everything about Scala to become productive quickly. People in the mailing list will often talk about some crazy shit you don’t need to know just yet. Monads, Monoids, Combinators, Macros and things you may not even know how to pronounce… Seriously guys, as you start to learn about it, it’s gonna blow your mind. It’s gonna take some time to digest all the info but it sure is worth it. Here are a few resources / steps that may help you get started, focused on its community and not so much on the technical details of downloading and running your first Scala “hello world”.
More than a collection to bookmark for “someday,” this is a collection of resources to start following today.
I haven’t looked at all the references but from the ones I checked, I don’t think you will be disappointed.
There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.
How do you handle ad hoc exploration of data sets as part of planning a topic map?
Being able to “test” merging against data prior to implementation sounds like a good idea.
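One way to picture “testing” a merge before committing to it, as a minimal sketch only: take a small sample of nested records, pick a candidate identity key, and report which records would collapse together under that rule. The sample data, the dotted key paths and the helper names `get_path` and `would_merge` are all hypothetical, and this is plain Python, not Dremel or any tool named in the post.

```python
import json
from collections import defaultdict

# Hypothetical sample of nested records (illustrative data only).
SAMPLE = """
[
  {"id": 1, "person": {"name": "Ada Lovelace", "email": "ada@example.org"}},
  {"id": 2, "person": {"name": "A. Lovelace", "email": "ada@example.org"}},
  {"id": 3, "person": {"name": "Alan Turing", "email": "alan@example.org"}}
]
"""

def get_path(record, path):
    """Follow a dotted path such as 'person.email' into a nested dict."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

def would_merge(records, key_path):
    """Group record ids by the value found at key_path.

    Returns only the groups with more than one member, i.e. the
    records that would merge if key_path were the identity rule.
    """
    groups = defaultdict(list)
    for record in records:
        key = get_path(record, key_path)
        if key is not None:
            groups[key].append(record["id"])
    return {key: ids for key, ids in groups.items() if len(ids) > 1}

records = json.loads(SAMPLE)
print(would_merge(records, "person.email"))  # candidate rule: same email
print(would_merge(records, "person.name"))   # candidate rule: same name
```

Comparing the two reports shows how the choice of identity rule changes what merges, before any topic map is built.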
Graph analysis becomes a key component of data science. A lot of things can be modeled as graphs, but social networks are really one of the most obvious examples.
In this post, I am going to show how one could visualize one’s own LinkedIn graph, using the LinkedIn API and Gephi, a very nice piece of software for working on this type of data. If you don’t have it yet, just go to http://gephi.org/ and download it now!
My objective is to simply look at my connections (the “nodes” or “vertices” of the graph), see how they relate to each other (the “edges”) and find clusters of strongly connected users (“communities”). This somewhat emulates what is already available in the InMaps data product, but, hey, it is cool to do it ourselves, no?
The first thing to do for running this graph analysis is to be able to query LinkedIn via its API. You really don’t want to get the data by hand… The API uses the OAuth authentication protocol, which will let an application make queries on behalf of a user. So go to https://www.linkedin.com/secure/developer and register a new application. Fill in the form as required, and in the OAuth part, use this redirect URL for instance:
Great introduction to Gephi!
As a bonus, reinforces the lesson that ETL isn’t required to re-use data.
ETL may be required in some cases but in a world of data APIs those are getting fewer and fewer.
Think of it this way: Non-ETL data access means someone else is paying for maintenance, backups, hardware, etc.
How much of your IT budget is supporting duplicated data?
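To make the nodes, edges and communities of the LinkedIn example concrete, here is a minimal sketch under stated assumptions. The connection names are invented, connected components stand in as a very crude substitute for real community detection, and the CSV layout with `Source` and `Target` columns is assumed to be what Gephi’s spreadsheet import expects for an edge list. No LinkedIn API call is made.

```python
import csv
import io
from collections import defaultdict, deque

# Hypothetical connections: each pair is an edge between two nodes.
EDGES = [
    ("Ann", "Bob"), ("Bob", "Cara"), ("Ann", "Cara"),
    ("Dev", "Eli"),
]

def build_adjacency(edges):
    """Map each node to the set of nodes it shares an edge with."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency

def connected_components(adjacency):
    """Breadth-first search; each component is a crude 'community'."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

def edges_to_csv(edges):
    """Render the edge list as CSV text with Source,Target headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buffer.getvalue()

adjacency = build_adjacency(EDGES)
print(connected_components(adjacency))
print(edges_to_csv(EDGES))
```

Gephi’s own modularity statistics would find finer-grained clusters than connected components do; the point here is only the shape of the data.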
Google is preparing to debut its newly revamped Google Maps. Terming it “smart recommendations,” the new functionality of Google Maps is intended to be more interactive and custom tailored to the specific user. The more you use the map to search for locations, favorite items by starring them, and write location reviews, the more unique the map becomes. Clicking a specific business or feature will result in the map features adjusting to show roads and locations related to that place.
(…)
Previewing the new Google Maps is currently available by invite only. You can request your invite via the Preview page.
Technology could be exposing you to a broader view of the world, perhaps even as others see it.
Instead:
Apple brought us ear buds that wall us off from ambient sound and others.
Apple also brought us eye buds (iPhones) that wall us off from our visual surroundings.
Google is building brain buds to wrap you in a customized cocoon of content.
Ironic if you remember the original Macintosh commercial:
The United Nations Educational, Scientific and Cultural Organization (UNESCO) has announced that it is making available to the public free of charge its digital publications and data. This comes after UNESCO has adopted an Open Access Policy, becoming the first agency within the United Nations to do so.
The new policy implies that anyone can freely download, translate, adapt, and distribute UNESCO’s publications and data. The policy also states that from July 2013, hundreds of downloadable digital UNESCO publications will be available to users through a new Open Access Repository with a multilingual interface. The policy also seeks to apply retroactively to works that have been published.
There’s a treasure trove of information for mapping, say against the New York Times historical archives.
If presidential libraries weren’t concerned with helping former administration officials avoid accountability, digitizing presidential libraries for complete access would be another great treasure trove.