Big data has been around since the beginning of time, says Thomas Paulmichl, founder and CEO of Sigmaspecto, who adds that what has changed is how we process the information. In a talk during Big Data Week, Paulmichl encouraged people to open up their perspective on what big data is, and how it can be applied.
During the talk, he urged people to bring a human element into big data. Paulmichl demonstrated this by examining the work of the musical prodigy Mozart, who Paulmichl noted is greatly appreciated by both music scientists and the common music listener.
“When Mozart makes choices on writing a piece of work, the number of choices that he has and the kind of neural algorithms that his brain goes through to choose things is infinitesimally higher than what we call big data – it’s really small data in comparison,” he said.
Taking Mozart’s The Magic Flute as an example, Paulmichl discussed the framework that Mozart used to make his choices by examining a music sheet outlining the number of bars, the time signature, the instrument and singer voicing.
“So from his perspective, he sits down, and starts to make what we as data scientists call quantitative choices,” explained Paulmichl. “Do I put a note here, down here, do I use a different instrument; do I use a parallel voicing for different violins – so these are all metrics that his brain has to decide.”
Exploring the mathematics of the music, Paulmichl concluded that in looking at The Magic Flute, Mozart had 4.72391E+21 creative variations (and then some) that he could have taken with the direction of it over the course of the piece. “We’re not talking about a trillion dataset; we’re talking about a sextillion or more,” he says, adding that this is a very limited cut of the quantitative choice that his brain makes at every composition point.
“[A] sextillion or more…” puts the question of processing a trillion triples into perspective.
Another musical analogy?
Triples are the one finger version of Jingle Bells*:
*The gap is greater than the video represents but it is still amusing.
If you fancy your application as handling data at velocity with a capital V, you need to see the movie of half a second of stock trades.
The rate is slowed down so you can see the trades at millisecond intervals.
From the post:
In the movie, one can observe how High Frequency Traders (HFT) jam thousands of quotes at the millisecond level, and how every exchange must process every quote from the others for proper trade through price protection. This complex web of technology must run flawlessly every millisecond of the trading day, or arbitrage (HFT profit) opportunities will appear. However, it is easy for HFTs to cause delays in one or more of the connections between each exchange. Yet if any of the connections are not running perfectly, High Frequency Traders tend to profit from the price discrepancies that result.
Jason reviews presentations at a recent Data Science MD meeting:
Starting off the night, Joey Echeverria, a Principal Solutions Architect, first discussed a big data architecture and how key components of a relational data management system can be replaced with current big data technologies. With Twitter being increasingly popular with marketing teams, analyzing Twitter data becomes a perfect use case to demonstrate a complete big data pipeline.
(…)
Following Joey, Sean Busbey, a Solutions Architect at Cloudera, discussed working with Mahout, a scalable machine learning library for Hadoop. Sean first introduced the three C’s of machine learning: classification, clustering, and collaborative filtering. With classification, learning from a training set is supervised, and new examples can be categorized. Clustering allows examples to be grouped together with common features, while collaborative filtering allows new candidates to be suggested.
Great summaries, links to additional resources and the complete slides.
In the last decade a new generation of telescopes and sensors has allowed the production of a very large amount of data and astronomy has become a data-rich science; this transition is often labeled as “data revolution” and “data tsunami”. The first locution puts emphasis on the expectations of the astronomers while the second stresses, instead, the dramatic problem arising from this large amount of data, which is no longer computable with traditional approaches to data storage, data reduction and data analysis. In a new age, new instruments are necessary, as happened in the Bronze Age when mankind left the old instruments made out of stone to adopt the new, better ones made with bronze. Everything changed, even the social structure. In a similar way, this new age of Astronomy calls for a new generation of tools, for a new methodological approach to many problems, and for the acquisition of new skills. The attempts to find a solution to these problems fall under the umbrella of a new discipline which originated from the intersection of astronomy, statistics and computer science: Astroinformatics (Borne, 2009; Djorgovski et al., 2006).
Talend today announced the availability of version 5.3 of its next-generation integration platform, a unified environment that scales the integration of data, application and business processes. With version 5.3, Talend allows any integration developer to develop on big data platforms without requiring specific expertise in these areas.
“Hadoop and NoSQL are changing the way people manage and analyze data, but up until now, it has been difficult to work with these technologies. The general lack of skillsets required to manage these new technologies continues to be a significant barrier to mainstream adoption,” said Fabrice Bonan, co-founder and chief technical officer, Talend. “Talend v5.3 delivers on our vision of providing innovative tools that hide the underlying complexity of big data, turning anyone with integration skills into expert big data developers.”
User-Friendly Tools for 100 Percent MapReduce Code
Talend v5.3 generates native Hadoop code and runs data transformations directly inside Hadoop for scalability. By leveraging MapReduce’s architecture for highly distributed data processing, data integration developers can build their jobs on Hadoop without the need for specialist programming skills.
Graphical Mapper for Complex Processes
The new graphical mapping functionality targeting big data, and especially the Pig language, allows developers to graphically build data flows to take source data and transform it using a visual mapper. For Hadoop developers familiar with Pig Latin, this mapper enables them to develop, test and preview their data jobs within a GUI environment.
Additional NoSQL Support
Talend 5.3 adds support for NoSQL databases in its integration solutions, Talend Platform for Big Data and Talend Open Studio for Big Data, with a new set of connectors for Couchbase, CouchDB and Neo4j. Built on Talend’s open source integration technology, Talend Open Studio for Big Data is a powerful and versatile open source solution for big data integration that natively supports Apache Hadoop, including connectors for Hadoop Distributed File System (HDFS), HCatalog, Hive, Oozie, Pig, Sqoop, Cassandra, Hbase and MongoDB – in addition to the more than 450 connectors included natively in the product. The integration of these platforms into Talend’s big data solution enables customers to use these new connectors to migrate and synchronize data between NoSQL databases and all other data stores and systems.
Of particular interest is their data integration package, which reportedly sports 450+ connectors to various data sources.
Unless you are interested in coding all new connectors for the same 450+ data sources.
My previous startup, Unbound Concepts, created a machine learning algorithm that determined the textual complexity (e.g. reading level) of children’s literature. Our approach started as a natural language processing problem — designed to pull out language features to train our algorithms, and then quickly became a big data problem when we realized how much literature we had to go through in order to come up with meaningful representations. We chose to combine NLTK and Hadoop to create our Big Data NLP architecture, and we learned some useful lessons along the way. This series of posts is based on a talk done at the April Data Science DC meetup.
Think of this post as the Cliff Notes of the talk and the upcoming series of posts so you don’t have to read every word … but trust me, it’s worth it.
If you can’t wait for the future posts, Benjamin’s presentation from April is here. Amusing but fairly sparse slides.
These Virtual School courses will be delivered to sites nationwide using high-definition videoconferencing technologies, allowing students to participate at a number of convenient locations where they will be able to work with a cohort of fellow computational scientists, have access to local experts, and interact in real time with course instructors.
The Data Intensive Summer School focuses on the skills needed to manage, process, and gain insight from large amounts of data. It targets researchers from the physical, biological, economic, and social sciences who need to deal with large collections of data. The course will cover the nuts and bolts of data-intensive computing, common tools and software, predictive analytics algorithms, data management, and non-relational database models.
The Proven Algorithmic Techniques for Many-core Processors summer school will present students with the seven most common and crucial algorithm and data optimization techniques to support successful use of GPUs for scientific computing.
Studying many current GPU computing applications, the course instructors have learned that the limits of an application’s scalability are often related to some combination of memory bandwidth saturation, memory contention, imbalanced data distribution, or data structure/algorithm interactions. Successful GPU application developers often adjust their data structures and problem formulation specifically for massive threading and execute their threads leveraging shared on-chip memory resources for bigger impact. The techniques presented in the course can improve performance of applicable kernels by 2-10X in current processors while improving future scalability.
As I mentioned in my previous post, I plan to write a series of posts about the study of large data sets, both the ways that high dimensional data has traditionally been studied and the topology that has recently been applied to this area. For anyone who has experience thinking about abstract geometric objects (as I assume most of the readers of this blog do) the concepts should seem pretty straightforward, and the difficulty is mostly in translation. So I will start with a post that focusses on defining terms. (Update: I’ve started a second blog The Shape of Data to look into these topics in more detail.)
The study of large data sets in the abstract generally goes by two names: Data mining is the field that grew out of statistics, which considers ways to organize and summarize high dimensional data so that it can be understood by humans. Machine Learning is the subfield of computer science (particularly artificial intelligence) that looks for ways to have computers organize and summarize data, with the goal of having the computer make decisions. These two fields have a lot in common and I will not try to distinguish between them. There are also names for the application of these methods in different sciences, such as Bioinformatics and Cheminformatics. There are also rather notorious applications in marketing, which allow stores to know what you’re going to buy before you know.
We are given a collection of data, usually a set of ordered n-tuples, that come from a science experiment or surveys, or the data that retailers collect about you every time you use your credit card, etc. Some of the entries can be thought of as labels – the code number for the particular experiment, for example. The remaining coordinates/dimensions are often called features. If these features are numerical, then we can think of them as defining vectors in a Euclidean space, and this gives us our first glimpse of geometry. However, for high dimensional data, the Euclidean metric turns out to be problematic, so we will often want to use a different metric. The Euclidean metric is also problematic for binary features such as the presence of different genes in an organism.
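To make the point about binary features concrete, here is a minimal sketch in Python. The gene presence/absence vectors and the function names are illustrative assumptions, not taken from the original post. It compares the Euclidean metric with two metrics often used for binary data, Hamming and Jaccard distance.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hamming(a, b):
    """Number of positions at which two vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - (features present in both) / (features present in either)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either

# Hypothetical presence (1) / absence (0) of six genes in three organisms.
org_a = [1, 1, 0, 0, 0, 0]
org_b = [1, 0, 0, 0, 0, 0]
org_c = [0, 0, 0, 0, 0, 0]

# Euclidean distance rates both pairs as equally far apart ...
d_ab, d_bc = euclidean(org_a, org_b), euclidean(org_b, org_c)
# ... while Jaccard distance separates "shares a gene" from "shares nothing".
j_ab, j_bc = jaccard_distance(org_a, org_b), jaccard_distance(org_b, org_c)
print(d_ab, d_bc, j_ab, j_bc)
```

In this toy case the Euclidean metric cannot tell a pair that shares a gene from a pair that shares none, which is one way the metric can mislead on presence/absence data.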
FuturICT is a visionary project that will deliver new science and technology to explore, understand and manage our connected world. This will inspire new information and communication technologies (ICT) that are socially adaptive and socially interactive, supporting collective awareness.
Revealing the hidden laws and processes underlying our complex, global, socially interactive systems constitutes one of the most pressing scientific challenges of the 21st Century. Integrating complexity science with ICT and the social sciences, will allow us to design novel robust, trustworthy and adaptive technologies based on socially inspired paradigms. Data from a variety of sources will help us to develop models of techno-socioeconomic systems. In turn, insights from these models will inspire a new generation of socially adaptive, self-organised ICT systems. This will create a paradigm shift and facilitate a symbiotic co-evolution of ICT and society. In response to the European Commission’s call for a ‘Big Science’ project, FuturICT will build a largescale, pan European, integrated programme of research which will extend for 10 years and beyond.
Did you know that the term “semantic” appears only twice in the FuturICT Project Outline? And both times as in the “semantic web?”
Not a word of how models, data sources, paradigms, etc., with different semantics are going to be wedded into a coherent whole.
View it as an opportunity to deliver FuturICT results using topic maps beyond this project.
In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et al.) which in turn is based on the seminal FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog about PCSA, a direct precursor to LogLog counting which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.
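For readers who want the intuition in code, the following is a minimal, unoptimized sketch of the HyperLogLog idea in Python. It is not AK's implementation. The class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are assumptions made for illustration, and the sketch includes only the small-range (linear counting) correction.

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog: m = 2**p registers, each holding a max 'rank'."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form applies for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Assumption: the first 64 bits of SHA-1 stand in for a uniform hash.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                 # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic mean of 2**register across all registers, scaled by alpha * m**2.
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i % 20_000}")   # 20,000 distinct values, each seen five times
estimate = hll.count()
print(round(estimate))
```

The memory use is fixed at m small registers no matter how many items are added, and duplicates never change a register, which is why the sketch suits high-throughput distinct counting. With m registers the expected relative error is roughly 1.04 / sqrt(m).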
Human classification alone, unable to handle the enormous quantity of project data, requires the support of automated machine-based strategies. In collaborative annotation, humans and machines work together, merging editorial strengths in semantics and pattern recognition with the machine strengths of scale and algorithmic power. Discovery informatics can be used to generate common data models, taxonomies and ontologies. A proposed project of massive scale, the Large Synoptic Survey Telescope (LSST) project, will systematically observe the southern sky over 10 years, collecting petabytes of data for analysis. The combined work of professional and citizen scientists will be needed to tag the discovered astronomical objects. The tag set will be generated through informatics and the collaborative annotation efforts of humans and machines. The LSST project will demonstrate the development and application of a classification scheme that supports search, curation and reuse of a digital repository.
A persuasive call to arms to develop “collaborative annotation:”
Humans and machines working together to produce the best possible classification label(s) is collaborative annotation. Collaborative annotation is a form of human computation [1]. Humans can see patterns and semantics (context, content and relationships) more quickly, accurately and meaningfully than machines. Human computation therefore applies to the problem of annotating, labeling and classifying voluminous data streams.
And more specifically for the Large Synoptic Survey Telescope (LSST):
The discovery potential of this data collection would be enormous, and its long-term value (through careful data management and curation) would thus require (for maximum scientific return) the participation of scientists and citizen scientists as well as science educators and their students in a collaborative knowledge mark-up (annotation and tagging) data environment. To meet this need, we envision a collaborative tagging system called AstroDAS (Astronomy Distributed Annotation System). AstroDAS is similar to existing science knowledge bases, such as BioDAS (Biology Distributed Annotation System, www.biodas.org).
As you might expect, semantic diversity is going to be present with “collaborative annotation.”
Semantic Monotony (aka Semantic Web) has failed for machines alone.
No question it will fail for humans + machines.
Are you ready to step up to the semantic diversity of collaborative annotation (humans + machines)?
In a recent blog post I mentioned the 4 reasons for using Hadoop for data science. In this blog post I would like to dive deeper into the last of these reasons: data agility.
In most existing data architectures, based on relational database systems, the data schema is of central importance, and needs to be designed and maintained carefully over the lifetime of the project. Furthermore, whatever data fits into the schema will be stored, and everything else typically gets ignored and lost. Changing the schema is a significant undertaking, one that most IT organizations don’t take lightly. In fact, it is not uncommon for a schema change in an operational RDBMS system to take 6-12 months if not more.
Hadoop is different. A schema is not needed when you write data; instead the schema is applied when using the data for some application, thus the concept of “schema on read”.
If a schema is supplied “on read,” how is data validation accomplished?
I don’t mean in terms of datatypes such as string, integer, double, etc. Those are trivial forms of data validation.
How do we validate the semantics of data when a schema is supplied on read?
Mistakes do happen in RDBMS systems but with a schema, which defines data semantics, applications can attempt to police those semantics.
I don’t doubt that schema “on read” supplies a lot of useful flexibility, but how do we limit the damage that flexibility can cause?
For example, many years ago, area codes (for telephones) in the USA were tied to geographic exchanges. Data from the era still exists in the bowels of some data stores. That is no longer true in many cases.
Let’s assume I have older data that has area codes tied to geographic areas and newer data that has area codes that are not. Without a schema to define the area code data in both cases, how would I know to treat the area code data differently?
I concede that schema “on read” can be quite flexible.
On the other hand, let’s not discount the value of schema “on write” as well.
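The area code question above can be sketched in a few lines of Python. Everything here is hypothetical: the records, the field names, the cutoff year and the lookup table are invented for illustration and do not describe any real data store or Hadoop API.

```python
# Raw records as they might sit in a schema-less store (all values invented).
raw_records = [
    {"phone": "212-555-0100", "recorded": 1985},
    {"phone": "212-555-0199", "recorded": 2012},
]

# Hypothetical cutoff: before this year an area code is assumed to imply a place.
GEOGRAPHIC_UNTIL = 2000
# Illustrative lookup table for the older, geographic interpretation.
AREA_CODE_REGION = {"212": "New York, NY"}

def read_with_schema(record):
    """Apply a schema at read time, choosing the semantics by record age."""
    area_code = record["phone"].split("-")[0]
    if record["recorded"] < GEOGRAPHIC_UNTIL:
        region = AREA_CODE_REGION.get(area_code)
    else:
        region = None  # the area code no longer implies a location
    return {"area_code": area_code, "region": region}

interpreted = [read_with_schema(r) for r in raw_records]
for row in interpreted:
    print(row)
```

Note where the semantic rule lives: in the reading code, not in the data. A second application that reads the same records without the date check will silently assign a region to the newer record, which is the kind of damage the question is about.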
Data from social media and Ushahidi-style crowdsourcing platforms have emerged as possible ways to leverage cellphones to prevent conflict. But in the world of Big Data, the amount of information generated from these is too small to use in advanced data-mining techniques and “machine-learning” techniques (where algorithms adjust themselves based on the data they receive).
But there is another way cellphones could be leveraged in conflict settings: through the various types of data passively generated every time a device is used. “Phones can know,” said Professor Alex “Sandy” Pentland, head of the Human Dynamics Laboratory and a prominent computational social scientist at MIT, in a Wall Street Journal article. He says data trails left behind by cellphone and credit card users—“digital breadcrumbs”—reflect actual behavior and can tell objective life stories, as opposed to what is found in social media data, where intents or feelings are obscured because they are “edited according to the standards of the day.”
The findings and implications of this, documented in several studies and press articles, are nothing short of mind-blowing. Take a few examples. It has been shown that it was possible to infer whether two people were talking about politics using cellphone data, with no knowledge of the actual content of their conversation. Changes in movement and communication patterns revealed in cellphone data were also found to be good predictors of getting the flu days before it was actually diagnosed, according to MIT research featured in the Wall Street Journal. Cellphone data were also used to reproduce census data, study human dynamics in slums, and examine community-wide financial coping strategies in the aftermath of an earthquake or crisis.
Very interesting post on the potential uses for cell phone data.
You can imagine what I think could be correlated with cellphone data using a topic map so I won’t bother to enumerate those possibilities.
I did want to comment on the concern about privacy, or re-identification from cellphone data as Emmanuel calls it in his post.
Governments, who have declared they can execute any of us without notice or a hearing, are the guardians of that privacy.
That causes me to lack confidence in their guarantees.
Discussions of privacy should assume governments already have unfettered access to all data.
The useful questions become: How do we detect their misuse of such data? and How do we make them heartily sorry for that misuse?
For cell phone data, open access will give government officials more reason for pause than the ordinary citizen.
Less privacy for individuals but also less privacy for access, bribery, contract padding, influence peddling, and other normal functions of government.
In the U.S.A., we have given up our rights to public trial, probable cause, habeas corpus, protections against unreasonable search and seizure, to be free from touching by strangers, and several others.
What’s the loss of the right to privacy for cellphone data compared to catching government officials abusing their offices?
We present a multi-purpose genetic algorithm, designed and implemented with GPGPU / CUDA parallel computing technology. The model was derived from our CPU serial implementation, named GAME (Genetic Algorithm Model Experiment). It was successfully tested and validated on the detection of candidate Globular Clusters in deep, wide-field, single band HST images. The GPU version of GAME will be made available to the community by integrating it into the web application DAMEWARE (DAta Mining Web Application REsource), a public data mining service specialized on massive astrophysical data. Since genetic algorithms are inherently parallel, the GPGPU computing paradigm leads to a speedup of a factor of 200x in the training phase with respect to the CPU based version.
In case you are curious about the application of genetic algorithms in a low signal/noise situation with really “big” data, this is a good starting point.
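If genetic algorithms are new to you, here is a minimal serial sketch in Python. It is not GAME or its GPU version: the toy fitness function (count the 1 bits), the parameter values and the function names are all assumptions chosen to keep the example short.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1 bits (higher is better)."""
    return sum(bits)

def evolve(pop_size=30, length=20, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    initial_best = max(fitness(ind) for ind in population)
    for _ in range(generations):
        # Each individual's fitness is computed independently of the others;
        # this is the step that lends itself to parallel (e.g. GPU) evaluation.
        ranked = sorted(population, key=fitness, reverse=True)
        next_gen = [ranked[0][:]]              # elitism: keep the best unchanged
        while len(next_gen) < pop_size:
            # Tournament selection: best of three random individuals, twice.
            parent1 = max(rng.sample(population, 3), key=fitness)
            parent2 = max(rng.sample(population, 3), key=fitness)
            cut = rng.randint(1, length - 1)   # single-point crossover
            child = parent1[:cut] + parent2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_gen.append(child)
        population = next_gen
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best))
```

Because the best individual is carried over unchanged each generation, the best fitness never decreases. In a real application the fitness function dominates the run time, which is why evaluating the whole population in parallel pays off.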
Makes me curious about the “noise” in other communications.
The “signal” is fairly easy to identify in astronomy, but what about in text or speech?
I suppose “background noise, music, automobiles” would count as “noise” on a tape recording of a conversation, but is there “noise” in a written text?
Or noise in a conversation that is clearly audible?
If we have 100% signal, how do we explain failing to understand a message in speech or writing?
Every 14 minutes, somewhere in the world, an ad exec strides on stage with the same breathless declaration:
“Data is the new oil!”
It’s exciting stuff for marketing types, and it’s an easy equation: big data equals big oil, equals big profits. It must be a helpful metaphor to frame something that is not very well understood; I’ve heard it over and over and over again in the last two years.
The comparison, at the level it’s usually made, is vapid. Information is the ultimate renewable resource. Any kind of data reserve that exists has not been lying in wait beneath the surface; data are being created, in vast quantities, every day. Finding value from data is much more a process of cultivation than it is one of extraction or refinement.
Jer’s last point, “more a process of cultivation than it is one of extraction or refinement,” and his last recommendation:
…we need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely.
resonates the most with me.
Everyone can apply the same processes to oil and get out largely the same results.
Data, on the other hand, cannot be processed or analyzed until some user assigns it values.
Data, and the results of analysis of data, have value only because of the assignment of meaning by some user.
imMens: Real-time Visual Querying of Big Data by Zhicheng Liu, Biye Jiang and Jeffrey Heer.
Abstract:
Data analysts must make sense of increasingly large data sets, sometimes with billions or more records. We present methods for interactive visualization of big data, following the principle that perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records. We first describe a design space of scalable visual summaries that use data reduction methods (such as binned aggregation or sampling) to visualize a variety of data types. We then contribute methods for interactive querying (e.g., brushing & linking) among binned plots through a combination of multivariate data tiles and parallel query processing. We implement our techniques in imMens, a browser-based visual analysis system that uses WebGL for data processing and rendering on the GPU. In benchmarks imMens sustains 50 frames-per-second brushing & linking among dozens of visualizations, with invariant performance on data sizes ranging from thousands to billions of records.
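The principle that scalability should depend on the chosen resolution rather than the number of records can be illustrated with a small binned aggregation sketch in Python. This is not imMens code. The function name, the bin count and the random test data are assumptions for illustration only.

```python
import random

def binned_counts(values, lo, hi, bins):
    """Reduce any number of values in [lo, hi] to a fixed-size list of counts."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi (or rounding at the edge) lands in the last bin.
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    return counts

rng = random.Random(42)
small = [rng.uniform(0, 100) for _ in range(1_000)]
large = [rng.uniform(0, 100) for _ in range(100_000)]

small_bins = binned_counts(small, 0, 100, 20)
large_bins = binned_counts(large, 0, 100, 20)

# Both summaries have the same size, whatever the record count.
print(len(small_bins), len(large_bins))
```

Once the data are binned, a histogram drawn from either summary costs the same to render. Only the one-time aggregation pass depends on the number of records.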
The emphasis on “real-time” with “big data” continues.
Impressive work but I wonder if there is a continuum of “big data” for “real-time” access, analysis and/or visualization?
Some types of big data are simple enough for real-time analysis, but other types are less so and there are types of big data where real-time analysis is inappropriate.
What I don’t know is what factors you would evaluate to place one big data set at one point on that continuum and another data set at another. Closer to one end or the other.
Research that you are aware of on the appropriateness of “real-time” analysis of big data?
Specifically, a Big Data system has four properties:
It uses local storage to be fast but inexpensive
It uses clusters of commodity hardware to be inexpensive
It uses free software to be inexpensive
It is open source to avoid expensive vendor lock-in
It has been raining all day but I had to laugh when I saw Russell’s definition of “a Big Data system.”
Does it remind you of any particular player in the Big Data pack?
That’s one way to build market share: define yourself to be the measuring stick.
Let’s walk through the list and see what comments or alternatives suggest themselves:
It uses local storage to be fast but inexpensive
[What? No cloud? Have you compared all the cost of local hardware against the cloud?]
It uses clusters of commodity hardware to be inexpensive
[Wonder why NCSA built Blue Waters "from Cray hardware, operates at a sustained performance of more than 1 petaflop (1 quadrillion calculations per second) and is capable of peak performance of 11.61 petaflops (11.6 quadrillion calculations per second)." Must not be "big data."]
It uses free software to be inexpensive
[They say that so often. I wonder what they are using as a basis for comparison? LaTeX versus MS Word? Have you paid anyone to typeset a paper in LaTeX versus asking your staff to type it in MS Word?]
It is open source to avoid expensive vendor lock-in
[Actually it is open formats that avoid vendor lock-in, expensive or otherwise]
I enjoy a bit of marketing fluff as much as the next person but it should at least be plausible.
Simpson’s paradox is best illustrated by the University of California, Berkeley sex discrimination case. Taken in the aggregate, admissions to the graduate school appeared to greatly favor men. Taken by department, no department discriminated against women and most favored admission of women. Same data, different level of examination. That is Simpson’s paradox.
Abstract:
This article describes an applet that facilitates investigation of Simpson’s Paradox in the context of a number of real and hypothetical data sets. The applet builds on the Baker-Kramer graphical representation for Simpson’s Paradox. The implementation and use of the applet are explained. This is followed by a description of how the applet has been used in an introductory statistics class and a discussion of student responses to the applet.
In probability and statistics, Simpson’s paradox, or the Yule–Simpson effect, is a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. This result is often encountered in social-science and medical-science statistics,[1] and is particularly confounding when frequency data are unduly given causal interpretations.[2] Simpson’s Paradox disappears when causal relations are brought into consideration.
A cautionary tale about the need to understand data sets and how combining them may impact outcomes of statistical analysis.
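A small worked example in Python shows how the reversal arises. The numbers below are hypothetical, chosen to make the arithmetic easy. They are not the Berkeley admissions figures, and the department labels are placeholders.

```python
# department -> group -> (applicants, admitted); hypothetical numbers only.
data = {
    "A": {"men": (100, 80), "women": (20, 18)},   # a department that admits most
    "B": {"men": (20, 4), "women": (100, 30)},    # a department that admits few
}

def dept_rate(dept, group):
    applicants, admitted = data[dept][group]
    return admitted / applicants

def aggregate_rate(group):
    applicants = sum(d[group][0] for d in data.values())
    admitted = sum(d[group][1] for d in data.values())
    return admitted / applicants

for dept in data:
    print(dept, "men", dept_rate(dept, "men"), "women", dept_rate(dept, "women"))
print("all men", aggregate_rate("men"), "all women", aggregate_rate("women"))
```

Within each department the women's admission rate is higher (0.9 against 0.8, and 0.3 against 0.2), yet in the aggregate the men's rate is higher (0.7 against 0.4). The reversal comes from who applied where: most of the women in this toy data applied to the department that admits few applicants.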
The Journal of Statistics Education (JSE) disseminates knowledge for the improvement of statistics education at all levels, including elementary, secondary, post-secondary, post-graduate, continuing, and workplace education. It is distributed electronically and, in accord with its broad focus, publishes articles that enhance the exchange of a diversity of interesting and useful information among educators, practitioners, and researchers around the world. The intended audience includes anyone who teaches statistics, as well as those interested in research on statistical and probabilistic reasoning. All submissions are rigorously refereed using a double-blind peer review process.
Manuscripts submitted to the journal should be relevant to the mission of JSE. Possible topics for manuscripts include, but are not restricted to: curricular reform in statistics, the use of cooperative learning and projects, innovative methods of instruction, assessment, and research (including case studies) on students’ understanding of probability and statistics, research on the teaching of statistics, attitudes and beliefs about statistics, creative and tested ideas (including experiments and demonstrations) for teaching probability and statistics topics, the use of computers and other media in teaching, statistical literacy, and distance education. Articles that provide a scholarly overview of the literature on a particular topic are also of interest. Reviews of software, books, and other teaching materials will also be considered, provided these reviews describe actual experiences using the materials.
In addition JSE also features departments called “Teaching Bits: A Resource for Teachers of Statistics” and “Datasets and Stories.” “Teaching Bits” summarizes interesting current events and research that can be used as examples in the statistics classroom, as well as pertinent items from the education literature. The “Datasets and Stories” department not only identifies interesting datasets and describes their useful pedagogical features, but enables instructors to download the datasets for further analysis or dissemination to students.
Associated with the Journal of Statistics Education is the JSE Information Service. The JSE Information Service provides a source of information for teachers of statistics that includes the archives of EDSTAT-L (an electronic discussion list on statistics education), information about the International Association for Statistical Education, and links to many other statistics education sources.
If you are going to talk about big data, of necessity you are also going to talk about statistics.
Real-Time, Granular, Online Access to Complex Manuals Improves Efficiency and Transparency While Reducing Costs
MarkLogic Corporation, the provider of the MarkLogic® Enterprise NoSQL database, today announced that the U.S. Patent and Trademark Office (USPTO) has launched the Reference Document Management Service (RDMS), which uses MarkLogic for real-time searching of detailed, specific, up-to-date content within patent and trademark manuals. RDMS enables real-time search of the Manual of Patent Examining Procedure (MPEP) and the Trademark Manual of Examination Procedures (TMEP). These manuals provide a vital window into the complexities of U.S. patent and trademark laws for inventors, examiners, businesses, and patent and government attorneys.
The thousands of examiners working for USPTO need to be able to quickly locate relevant instructions and procedures to assist in their examinations. The RDMS is enabling faster, easier searches for these internal users.
Having the most current materials online also means that the government can reduce reliance on printed manuals that quickly go out of date. USPTO can also now create and publish revisions to its manuals more quickly, allowing them to be far more responsive to changes in legislation.
Additionally, for the first time ever, the tool has also been made available to the public increasing the MPEP and TMEP accessibility globally, furthering the federal government’s efforts to promote transparency and accountability to U.S. citizens. Patent creators and their trusted advisors can now search and reference the same content as the USPTO examiners, in real time — instead of having to thumb through a printed reference guide.
The date on this report was March 26, 2013.
I don’t know if the USPTO is just playing games but searching their site for “Reference Document Management Service” produces zero “hits.”
Searching for “RDMS” produces four (4) “hits,” none of which were pointers to an interface.
Maybe it was too transparent?
The value-add proposition I was going to suggest was mapping the results of searching into some coherent presentation, like TaxMap.
And/or linking the results of searches into current literature in rapidly developing fields of technology.
Guess both of those opportunities will have to wait for basic searching to be available.
If you have a status update on this announced but missing project please ping me.
One of the key challenges in making use of Big Data lies in finding ways of dealing with heterogeneity, diversity, and complexity of the data, while its volume and velocity forbid solutions available for smaller datasets as based, e.g., on manual curation or manual integration of data. Semantic Web Technologies are meant to deal with these issues, and indeed since the advent of Linked Data a few years ago, they have become central to mainstream Semantic Web research and development. We can easily understand Linked Data as being a part of the greater Big Data landscape, as many of the challenges are the same. The linking component of Linked Data, however, puts an additional focus on the integration and conflation of data across multiple sources.
Workshop Topics
In this symposium, we will explore the many opportunities and challenges arising from transferring and adapting Semantic Web Technologies to the Big Data quest. Topics of interest focus explicitly on the interplay of Semantics and Big Data, and include:
the use of semantic metadata and ontologies for Big Data,
the use of formal and informal semantics,
the integration and interplay of deductive (semantic) and statistical methods,
methods to establish semantic interoperability between data sources
ways of dealing with semantic heterogeneity,
scalability of Semantic Web methods and tools, and
semantic approaches to the explication of requirements from eScience applications.
The W3C is late to the party as evidenced by semantic heterogeneity becoming “…central to mainstream Semantic Web research and development” after the advent of Linked Data.
I suppose better late than never.
At least if they remember that:
Users experience semantic heterogeneity in data and in the means used to describe and store data.
Whatever solution is crafted, its starting premise must be to capture semantics as seen by some defined user.
Otherwise, it is capturing the semantics of designers, authors, etc., which may or may not be valuable to some particular user.
RDF is a good example of capturing someone else’s semantics.
As the volume of data stored in the enterprise continues to grow, organizations see this information as representing a substantial portion of their assets. With tools such as Hadoop for Windows, businesses are unlocking the value of this data, Anthony Saxby, Microsoft U.K.’s data platform product marketing manager, said in a recent talk at Computing’s Big Data Summit 2013. According to Microsoft’s research, half of all organizations think their data represents 50 to 75 percent of their total value.
The challenge in unlocking this value is technology, Saxby said, according to Computing. Much of this information is internally siloed or separated from the external data sources that it could be combined with to create more effective, monetized results. Today’s businesses want to bring together unstructured and structured data to create new insights. With tools such as Hadoop, this type of analysis is increasingly possible. For instance, record label EMI uses a variety of data types across 25 countries to determine how to market music artists in different geographies.
The headline reminded me of Bilbo Baggins:
I don’t know half of you half as well as I should like; and I like less than half of you half as well as you deserve.
As the narrator notes:
This was unexpected and rather difficult.
I don’t follow the WSJ as closely as some but what of inventories, brick and mortar assets, accounts receivable, employees, IP, etc.?
Not that I doubt the value of data.
I do doubt the ability of businesses that manage by catch phrases like “big data,” “silos,” “unstructured and structured data,” “Hadoop,” to realize its value.
Hadoop will figure in successful projects to “unlock data,” but only where it is used as a tool and not a magic bullet.
A clear understanding of data and its sources, and of how to measure ROI from its use, are only two of the keys to successful use of any data tool.
Piling up data freed from internal silos upon data from external sources results in a big heap of data.
Impressive to the uninformed but it won’t increase your bottom line.
An excellent article in the Wall Street Journal, “Big Data, Big Blunders,” discussed five mistakes commonly made by enterprises when initiating their first Big Data projects. The technology hype cycle, which reminds me a lot of The Wizard of Oz, is a contributing factor in these blunders. I’ll briefly summarize the WSJ’s points, and will suggest, based on my experience helping clients, why enterprises make these blunders.
Rick summarizes these points from the WSJ story:
Data for Data’s Sake
Talent Gap
Data, Data Everywhere
Infighting
Aiming Too High
Rick says that advocates of new technologies promise to solve problems with prior technology advances, leading to unrealistic expectations.
I agree but there is a persistent failure to recognize the uncertainty principle for data.
How would you know if data is clean and uniform?
By your use case for the data. Yes?
That would explain why data scientists estimate they spend 60-80% of their time munging data (cleaning, transforming, etc.).
They are making data clean and uniform for their individual use cases.
And they do that task over and over again.
The definition of clean and uniform data is like the uncertainty principle in physics.
You can have clean and uniform data for one purpose, but making it so makes it dirty and non-uniform for another purpose.
Unless a technology outlines how it obtains clean and uniform data, from its perspective, it has told you only part of the cost of its use.
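A minimal sketch of that trade-off, in Python. The company names and both cleaning functions are invented for illustration and come from no particular tool. The same raw records are cleaned twice, once for matching records about the same company and once for a use that needs the names as entered:

```python
import re

# Hypothetical raw records; none of these come from a real dataset.
RAW_NAMES = ["  Acme Corp. ", "ACME Corporation", "acme corp", "Acme Corp. (UK)"]


def clean_for_matching(name: str) -> str:
    """Aggressive cleaning: collapse variants so they count as one company."""
    s = name.lower()
    s = re.sub(r"\(.*?\)", "", s)                  # drop qualifiers such as "(UK)"
    s = re.sub(r"[^a-z0-9 ]", "", s)               # drop punctuation
    s = re.sub(r"\b(corp|corporation)\b", "", s)   # drop legal suffixes
    return " ".join(s.split())                     # normalize whitespace


def clean_for_legal(name: str) -> str:
    """Conservative cleaning: only normalize whitespace, keep the name as entered."""
    return " ".join(name.split())


matching_view = {clean_for_matching(n) for n in RAW_NAMES}
legal_view = {clean_for_legal(n) for n in RAW_NAMES}

print(len(matching_view), "distinct name(s) for matching")
print(len(legal_view), "distinct name(s) for legal use")
```

The matching view is clean and uniform for counting, but it has thrown away the "(UK)" distinction that the legal view depends on. Neither view is clean for the other's purpose.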
NSF: Summary Submission Deadline – April 22, 2013.
Aiming to make the most of the explosion of Big Data and the tools needed to analyze it, the Obama Administration announced a "National Big Data Research and Development Initiative" on March 29, 2012. To launch the initiative, six Federal departments and agencies announced more than $200 million in new commitments that, together, promise to greatly improve and develop the tools, techniques, and human capital needed to move from data to knowledge to action. The Administration is also working to "liberate" government data and voluntarily-contributed corporate data to fuel entrepreneurship, create jobs, and improve the lives of Americans in tangible ways.
As we enter the second year of the Big Data Initiative, the Administration is encouraging multiple stakeholders including federal agencies, private industry, academia, state and local government, non-profits, and foundations, to develop and participate in Big Data innovation projects across the country. Later this year, the Office of Science and Technology Policy (OSTP), NSF, and other agencies in the Networking and Information Technology R&D (NITRD) program plan to convene an event that highlights high-impact collaborations and identifies areas for expanded collaboration between the public and private sectors. The Administration is particularly interested in projects and initiatives that:
Advance technologies that support Big Data and data analytics;
Educate and expand the Big Data workforce;
Develop, demonstrate and evaluate applications of Big Data that improve key outcomes in economic growth, job creation, education, health, energy, sustainability, public safety, advanced manufacturing, science and engineering, and global development;
Demonstrate the role that prizes and challenges can play in deriving new insights from Big Data; and
Foster regional innovation.
Please submit a two-page summary of projects to BIGDATA@nsf.gov. The summary should identify:
The goal of the project, with metrics for evaluating the success or failure of the project;
The multiple stakeholders that will participate in the project and their respective roles and responsibilities;
Initial financial and in-kind resources that the stakeholders are prepared to commit to this project; and
A principal point of contact for the partnership.
The submission should also indicate whether the NSF can post the project description to a public website. This announcement is posted solely for information and planning purposes; it does not constitute a formal solicitation for grants, contracts, or cooperative agreements.
Doesn’t look like individuals are included, “…federal agencies, private industry, academia, state and local government, non-profits, and foundations….”
Does anyone have a government or non-profit I could borrow to propose a topic map-based Big Data innovation project?
Thanks!
Phrased humorously but that’s a serious request.
I have a deep interest in the promotion of topic maps and funded projects are a good type of promotion.
Other people see a topic map-based project getting funded and think having a topic map was part of being funded. That leads to more topic map-based applications, and hence a chance at more topic map-based projects being funded.
Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science. (Matt Wood, principal data scientist for Amazon Web Services)
Think about that for a moment.
Snowflakes are unique. Can the same be said about your data science projects?
Would that explain the 80% figure of data science time being spent on cleaning, ETL, and similar tasks with data?
Is it that data never gets clean or are you cleaning the same data over and over again?
The next frontier is making that data reproducible, said Matt Wood, principal data scientist for Amazon Web Services, at GigaOM’s Structure:Data 2013 event Wednesday.
In short, it’s great to get a result from your number crunching, but if the result is different next time out, there’s a problem. No self-respecting scientist would think of submitting the findings for a trial or experiment unless she is able to show that it will be the same after multiple runs.
“Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat, it’s snowflake data science.” Wood told attendees in New York. “Reproducibility becomes a key arrow in the quiver of the data scientist.”
The next frontier is making sure that people can reproduce, reuse and remix their data which provides a “tremendous amount of value,” Wood noted. (emphasis added)
I like that: Reproduce, Reuse, Remix data.
That’s going to require robust and granular handling of subject identity.
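One small piece of reproducibility can be sketched in a few lines of Python. This is an illustration under my own assumptions, not anything Wood presented: the function name, the toy data, and the choice of a seeded sample plus a SHA-256 fingerprint of the input are all hypothetical.

```python
import hashlib
import json
import random


def run_analysis(data, seed):
    """Toy 'analysis': the mean of a random sample, with the inputs recorded."""
    rng = random.Random(seed)          # private, seeded generator
    sample = rng.sample(data, k=5)     # the same seed gives the same sample
    fingerprint = hashlib.sha256(json.dumps(data).encode("utf-8")).hexdigest()
    return {
        "seed": seed,                  # recorded so the run can be repeated
        "data_sha256": fingerprint,    # identifies exactly which data was used
        "sample_mean": sum(sample) / len(sample),
    }


DATA = list(range(100))                # stand-in for a real dataset
first = run_analysis(DATA, seed=2013)
second = run_analysis(DATA, seed=2013)
print(first == second)
```

Recording the seed and a fingerprint of the data says which run and which bytes produced a result. It does not say what the data is about, which is where subject identity comes in.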
A more detailed recap will follow soon but here’s a very quick hats off to the about 150 data scientists, civic hackers, visual analytics savants, poverty specialists, and fraud/anti-corruption experts that made the Big Data Exploration at Washington DC over the weekend such an eye-opener. We invite you to explore the work that the volunteers did (these are rough documents and will likely change as you read them so it’s okay to hold off if you would rather wait for a ‘final’ consolidated document). The projects that the volunteers worked on include:
Scraping Websites to Collect Consumption and Price Data – what can researchers studying poverty in countries learn from openly available crowdsourced daily price data, and by scraping price data from supermarket websites?
Analyzing World Bank Supplier Profiles – can the Bank and other agencies include publicly available data to gain a broader, more comprehensive understanding of their suppliers and use the information as proxies for risk management?
UNDP Resource Allocation – can UNDP use staffing and program budget data to infer what skillsets mix and match the best in projects?
Great meeting and projects but I would suggest a different sort of “big data.”
Requiring recipients to grant reporting access to all bank accounts where funds will be transferred, and requiring the same of any entity paid out of those accounts, down to the point where transfers over 90 days total less than $1,000 for any entity (or related entity), would be a better start.
With the exception of the “related entity” information, banks already keep transfer of funds information as a matter of routine business. It would be “big data” that is rich in potential for spotting fraud and waste.
The reporting banks should also be required to deliver other banking records they have on the accounts where funds are transferred and other activity in those accounts.
Before crying “invasion of privacy,” remember World Bank funding is voluntary.
As is acceptance of payment from World Bank funded projects. Anyone and everyone is free to decline such funding and avoid the proposed reporting requirements.
“Big data” to track fraud and waste is already collected by the banking industry.
The question is whether we will use that “big data” to effectively track fraud and waste or wait for particularly egregious cases to come to light?
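The follow-the-money rule proposed above can be sketched in Python. This is a rough illustration under loud assumptions: the entity names and amounts are invented, all transfers are taken to fall within a single 90-day window, "less than $1,000" is read as the cut-off for the total paid to an entity, and "related entity" aggregation is left out entirely.

```python
from collections import defaultdict, deque

THRESHOLD = 1000  # dollars, the cut-off suggested in the proposal


def entities_to_report(transfers, recipient, threshold=THRESHOLD):
    """Follow payments outward from the funded recipient.

    transfers: (payer, payee, amount) tuples, assumed to fall within one
    90-day window. An entity is reportable if a reportable payer sent it
    a total of at least `threshold` in that window.
    """
    paid = defaultdict(lambda: defaultdict(float))
    for payer, payee, amount in transfers:
        paid[payer][payee] += amount           # total per payer/payee pair

    reportable = {recipient}
    queue = deque([recipient])
    while queue:                               # breadth-first walk of payments
        payer = queue.popleft()
        for payee, total in paid.get(payer, {}).items():
            if total >= threshold and payee not in reportable:
                reportable.add(payee)
                queue.append(payee)
    return reportable


# Hypothetical transfers for illustration only.
TRANSFERS = [
    ("grantee", "contractor", 50000),
    ("contractor", "supplier", 600),
    ("contractor", "supplier", 700),    # 1,300 in total, so supplier is followed
    ("contractor", "cafe", 40),         # under the threshold, the trail stops
    ("supplier", "consultant", 999),    # under the threshold, the trail stops
]

result = entities_to_report(TRANSFERS, "grantee")
print(sorted(result))
```

The inputs are nothing more than payer, payee, and amount, which is the transfer information banks already keep as a matter of routine.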