One Week of Harassment on Twitter

January 29th, 2015

One Week of Harassment on Twitter by Anita Sarkeesian.

From the post:

Ever since I began my Tropes vs Women in Video Games project, two and a half years ago, I’ve been harassed on a daily basis by irate gamers angry at my critiques of sexism in video games. It can sometimes be difficult to effectively communicate just how bad this sustained intimidation campaign really is. So I’ve taken the liberty of collecting a week’s worth of hateful messages sent to me on Twitter. The following tweets were directed at my @femfreq account between 1/20/15 and 1/26/15.

Setting the limited vocabularies of the posters to one side, one hundred and fifty-six (156) hate messages is an impressive number. I pay no more attention to postings by illiterates than I do to cat pictures, but I can understand why that volume would get to be a drag.

Many others have commented more usefully on the substance of this topic than I can, but as a technical matter, how would you:

  • Begin to ferret out the origins and backgrounds of these posters?
  • Automate response networks (use your imagination about the range of responses)?
  • Automate filtering for an account under such attacks? (A minimal filtering sketch follows below.)
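
On the last question, here is a minimal sketch of what automated filtering might look like, assuming tweets arrive as simple dictionaries. The names, thresholds and block-list are all hypothetical; a production filter would pull from the Twitter API (for example via a library such as tweepy) and would likely add a trained classifier:

```python
# Hypothetical harassment filter sketch (not tied to any real Twitter client).
ABUSIVE_TERMS = {"<slur>", "<threat>"}   # placeholders; a real list would be curated
MIN_ACCOUNT_AGE_DAYS = 30                # throwaway accounts are a common harassment vector

def is_probably_abusive(tweet):
    """tweet: dict with 'text', 'account_age_days', 'follower_count'."""
    text = tweet["text"].lower()
    if any(term in text for term in ABUSIVE_TERMS):
        return True
    if tweet["account_age_days"] < MIN_ACCOUNT_AGE_DAYS and tweet["follower_count"] < 5:
        return True
    return False

incoming = [
    {"text": "Great talk today!", "account_age_days": 900, "follower_count": 250},
    {"text": "<threat> ...", "account_age_days": 2, "follower_count": 0},
]
visible = [t for t in incoming if not is_probably_abusive(t)]
print(visible)   # only the first tweet survives the filter
```

Crude rules like these are easy to evade, which is why the "response network" and provenance questions above matter just as much as filtering.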

Lacking any type of effective governance structure (think of any unexplored and ungoverned territory), security and safety on the Internet is a question of alliances for mutual protection. Eventually governance will evolve for the Internet, but since that will require relinquishing some national sovereignty, I don’t expect to see it in our lifetimes.

In the meantime, we need stop-gap measures that can set the tone for the governance structures that will eventually evolve.

Suggestions?

PS: Some people urge petitioning current governments for protection. Since their interests are in inherent conflict with the first truly transnational artifact (the Internet), I don’t see that as being terribly useful. I prefer whatever other stick comes to hand.

I first saw this in a tweet by kottke.org.

MapR Offers Free Hadoop Training and Certifications

January 29th, 2015

MapR Offers Free Hadoop Training and Certifications by Thor Olavsrud.

From the post:

In an effort to make Hadoop training for developers, analysts and administrators more accessible, Hadoop distribution specialist MapR Technologies Tuesday unveiled a free on-demand training program. Another track for HBase developers will be added later this quarter.

“This represents a $50 million, in-kind contribution to the Hadoop community,” says Jack Norris, CMO of MapR. “The focus is overcoming what many people consider the major obstacle to the adoption of big data, particularly Hadoop.”

The developer track is about building big data applications in Hadoop. The topics range from the basics of Hadoop and related technologies to advanced topics like designing and developing MapReduce and HBase applications with hands-on labs. The courses include:

  • Hadoop Essentials. This course, which is immediately available, provides an introduction to Hadoop, the ecosystem, common solutions and use cases.
  • Developing Hadoop Applications. This course is also immediately available and focuses on designing and writing effective Hadoop applications with MapReduce and YARN.
  • HBase Schema Design and Modeling. This course will become available in February and will focus on architecture, schema design and data modeling on HBase.
  • Developing HBase Applications. This course will also debut in February and focuses on real-world application design in HBase (Time Series and Social Application examples).
  • Hadoop Data Analysis – Drill. Slated for debut in March, this course covers interactive SQL on Hadoop for structured, semi-structured and nested data.
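
To give a flavor of what the developer track covers, here is a toy word-count job written in the Hadoop Streaming style in Python. This is my own sketch, not material from the MapR courses, and the file name and jar path below are placeholders:

```python
#!/usr/bin/env python
# Toy word count for Hadoop Streaming (hypothetical file name: wordcount.py).
# The mapper emits (word, 1) pairs; the reducer sums the counts per word.
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word.lower()}\t1")

def reducer():
    # Hadoop Streaming sorts mapper output by key before the reducer sees it
    pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

You would run it under Hadoop Streaming with something like hadoop jar hadoop-streaming.jar -input ... -output ... -mapper "wordcount.py map" -reducer "wordcount.py reduce", or test the logic locally with cat input.txt | ./wordcount.py map | sort | ./wordcount.py reduce.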

I remember how expensive the Novell training classes were back in the NetWare 4.11 days. (Yes, that has been a while.)

I wonder whose software will come to mind after completing the MapR training courses and passing the certification exams?

That’s what I think too. Send kudos to MapR for this effort!

Looking forward to seeing some of you at Hadoop certification exams later this year!

I first saw this in a tweet by Kirk Borne.

‘Open Up’ Digital Democracy Commission’s Report published

January 29th, 2015

‘Open Up’ Digital Democracy Commission’s Report published

From the post:

The Speaker’s Commission on Digital Democracy has published its report ‘Open Up’. The report recommends how Parliament can use digital technology to help it to be more transparent, inclusive, and better able to engage the public with democracy.

  • Read the Digital Democracy Commission’s full report
  • Read the Summary of the report
  • Read the plain language version of the report (PDF 355 KB)
  • Information on events for the launch of the report
Commenting on the report, the Rt Hon John Bercow MP, Speaker of the House of Commons, said:

    “I set up the Digital Democracy Commission to explore how Parliament could make better use of digital technology to enhance and improve its work. I am very grateful to all those who contributed to the Commission’s work, and have been particularly struck by the enthusiastic contributions from those who expressed a desire to participate in the democratic process, but felt that barriers existed that prevented them from doing so.

    This report provides a comprehensive roadmap to break down barriers to public participation. It also makes recommendations to facilitate better scrutiny and improve the legislative process.

    In a year where we reflect on our long democratic heritage, it is imperative that we look also to the future and how we can modernise our democracy to meet the changing needs of modern society.”

    … (emphasis in the original)

    Do you think I should forward the U.S. Congress the full report or the plain language summary? ;-)

    I was particularly encouraged by the methodology of the report:

    We asked people to tell us their views online or in person and we heard from a wide a range of people. They included not just experts, MPs and interest groups, but members of the public—people of different ages and backgrounds and people with varying levels of interest in politics and the work of Parliament.

I wonder if that has ever occurred to the various groups drafting standards for IT in government? To ask an actual citizen? They aren’t rare, or so I have been told.

Whatever sort of government you want or want to preserve, this is a good lesson in how to feel the pulse of the average citizen. Even more useful if you are interested in democratic institutions.

PS: There is an IT rumor that Texas tried legislative transparency a number of years ago, for maybe a day or two. It was so transparent, and so disruptive of the usual skullduggery of the legislature, that they jerked the system. I heard the story from more than one very reliable source with first-hand knowledge of the project. I suspect there is documentation in the possession of some office at the Texas legislature to corroborate that rumor. Anyone feeling leaky?

    If you put the Texas legislature in jail, the odds of imprisoning an innocent depend on whether it was bring your child to work day or not.

    WorldWideScience.org (Update)

    January 28th, 2015

    I first wrote about WorldWideScience.org in a post dated October 17, 2011.

    A customer story from Microsoft: WorldWide Science Alliance and Deep Web Technologies made me revisit the site.

    My original test query was “partially observable Markov processes” which resulted in 453 “hits” from at least 3266 found (2011 results). Today, running the same query resulted in “…1,342 top results from at least 25,710 found.” The top ninety-seven (97) were displayed.

    A current description of the system from the customer story:


    In June 2010, Deep Web Technologies and the Alliance launched multilingual search and translation capabilities with WorldWideScience.org, which today searches across more than 100 databases in more than 70 countries. Users worldwide can search databases and translate results in 10 languages: Arabic, Chinese, English, French, German, Japanese, Korean, Portuguese, Russian, and Spanish. The solution also takes advantage of the Microsoft Audio Video Indexing Service (MAVIS). In 2011, multimedia search capabilities were added so that users could retrieve speech-indexed content as well as text.

    The site handles approximately 70,000 queries and 1 million page views each month, and all traffic, including that from automated crawlers and search engines, amounts to approximately 70 million transactions per year. When a user enters a search term, WorldWideScience.org instantly provides results clustered by topic, country, author, date, and more. Results are ranked by relevance, and users can choose to look at papers, multimedia, or research data. Divided into tabs for easy usability, the interface also provides details about each result, including a summary, date, author, location, and whether the full text is available. Users can print the search results or attach them to an email. They can also set up an alert that notifies them when new material is available.

Automated searching and translation can’t give you the semantic nuances possible with human authoring, but they certainly can provide you with the source materials to build a specialized information resource with such semantics.

    Very much a site to bookmark and use on a regular basis.

Links for subjects mentioned above without links of their own:

    Deep Web Technologies

    Microsoft Translator

    Bughunter cracks “absolute privacy” Blackphone – by sending it a text message

    January 28th, 2015

    Bughunter cracks “absolute privacy” Blackphone – by sending it a text message by Paul Ducklin.

    From the post:

    Serial Aussie bugfinder Mark Dowd has been at it again.

    He loves to look for security flaws in interesting and important places.

    This time, he turned his attention to a device that most users acquired precisely because of its security pedigree, namely the Blackphone.

    What Dowd found is that text messages received by a Blackphone are processed by the messaging software in an insecure way that could lead to remote code execution.

    Simply put, the sender of a message can format it so that instead of being decoded and displayed safely as text, the message tricks the phone into processing and executing it as if it were a miniature program.

    Dowd’s paper is a great read if you’re a programmer, because it explains the precise details of how the exploit works, which just happens to make it pretty obvious what the programmers did wrong.

    That means his article can help you avoid this sort of error in your own code.
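
Dowd’s paper has the real details. As a loose analogy only (this is not the Blackphone flaw, just the same general class of mistake), here is a Python sketch of a receiver processing an attacker-controlled message with too much trust, next to a safer alternative:

```python
import json
import pickle

def handle_message_unsafe(raw_bytes):
    # DANGEROUS: pickle will happily run constructors chosen by the sender,
    # so a crafted message becomes a "miniature program" executed on the receiver.
    return pickle.loads(raw_bytes)

def handle_message_safe(raw_bytes):
    # Safer: JSON is parsed as pure data, and fields are validated before use.
    msg = json.loads(raw_bytes.decode("utf-8"))
    if not isinstance(msg, dict) or not isinstance(msg.get("text"), str):
        raise ValueError("malformed message")
    return msg["text"]

print(handle_message_safe(b'{"text": "hello"}'))
```

The lesson is the same one Dowd draws: treat every byte that arrives over the network as hostile until it has been parsed and validated as plain data.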

    Don’t get too excited because Blackphone has already issued a patch for the problem.

    On the other hand, Paul’s lay explanation of the exploit could lead to a hard copy demonstration of the bug for educating purchasers of programming services. Imagine a contract that specifies the resulting software is free from this specific type of defect. That can only happen with better educated consumers of software programming services.

Are there existing hard copy demonstrations of common software bugs? Where a person can fill out a common form, such as Paul’s change-of-address example, and see the problem with the data they have entered?

    Beyond this particular exploit, what other common exploits are subject to similar analogies?

    This could be an entirely new market for security based educational materials, particularly for online and financial communities.

    NIST developing database to help advance forensics

    January 27th, 2015

    NIST developing database to help advance forensics by Greg Otto.

    From the post:

While the National Institute of Standards and Technology has been spending a lot of time advancing the technology behind forensics, the agency can only go so far. With all of the ways people can be identified, researchers still lack sufficient data that would allow them to further already existing technology.

    To overcome that burden, NIST has been working on a catalog that would help the agency, academics and other interested parties discover data sets that will allow researchers to further their work. The Biometric and Forensic Research Database Catalog aims to be a one-stop shop for those looking to gather enough data or find better quality data for their projects.

    Not all national agencies in the United States do a bad job. Some of them, NIST being among them, do very good jobs.

Take the Biometric and Forensic Research Database Catalog (BDbC), for example. Forensic data is hard to find and, to cure that problem, NIST has created a curated data collection that is available for anyone to search online.

Perhaps the U.S. Axis of Surveillance (FBI/DEA/CIA/NSA, etc.) doesn’t understand the difference between a data vacuum cleaner and a librarian. Fortunately, any fool can run a data vacuum cleaner, or the Axis of Surveillance would have no results at all.

    Fortunately, Erica Firment can help the Axis of Surveillance with the difference:

    Why you should fall to your knees and worship a librarian

    Ok, sure. We’ve all got our little preconceived notions about who librarians are and what they do.

    Many people think of librarians as diminutive civil servants, scuttling about “Sssh-ing” people and stamping things. Well, think again buster.

    Librarians have degrees. They go to graduate school for Information Science and become masters of data systems and human/computer interaction. Librarians can catalog anything from an onion to a dog’s ear. They could catalog you.

    Librarians wield unfathomable power. With a flip of the wrist they can hide your dissertation behind piles of old Field and Stream magazines. They can find data for your term paper that you never knew existed. They may even point you toward new and appropriate subject headings.

    People become librarians because they know too much. Their knowledge extends beyond mere categories. They cannot be confined to disciplines. Librarians are all-knowing and all-seeing. They bring order to chaos. They bring wisdom and culture to the masses. They preserve every aspect of human knowledge. Librarians rule. And they will kick the crap out of anyone who says otherwise.

    Everybody has a favorite line but mine is:

    People become librarians because they know too much.

    There is a corollary which Erica doesn’t mention:

    People resort to data vacuuming because they know too little. A condition that data vacuuming cannot fix.

    Think of it as being dumb and getting dumber.

    There are solutions to that problem but since the intelligence community isn’t paying me, it isn’t worth writing them down.

    PS: Go to the Library Avengers store for products by Erica.

    The DEA is Stalking You!

    January 27th, 2015

When I wrote about Waze earlier today, in Google asked to muzzle Waze ‘police-stalking’ app, I had no idea that the Wall Street Journal had dropped the hammer on yet another mass surveillance program.

    In U.S. Justice Department Built Secret, Nationwide License-Plate Tracking Database (Car & Driver) by Robert Sorokanich, Robert reports:

    Bad news for anyone who values privacy: The Wall Street Journal reports that the U.S. Justice Department has been secretly expanding its license-plate scanning program to create a real-time national vehicle tracking database monitoring hundreds of millions of motorists.

    WSJ pulls no punches in describing the program, calling it nothing less than “a secret domestic intelligence-gathering program.” The program, established by the Drug Enforcement Agency in 2008, originated as a way of tracking down and seizing cars, money, and other assets involved in drug trafficking in areas of Arizona, California, Nevada, New Mexico, and Texas where illicit drugs are funneled across the border.

    The program uses camera systems at strategic points on major U.S. highways to record time, location, and direction of vehicle travel. Some locations take photos of drivers and passengers, which are sometimes detailed enough to confirm identity, WSJ reports.

    Perhaps more chillingly, the documents reviewed by the news outlets indicate that the DEA has also employed license-plate-reading technology to create a “far-reaching, constantly updating database of electronic eyes scanning traffic on the roads to steer police toward suspects.”

    My first reaction was as a close friend often says, “…it’s hard to be cynical enough.”

    My second reaction is that I need to get a WSJ subscription so I can check for late breaking news in areas of interest.

    But more to the point, we should all start tracking all police, everywhere and posting that data to Waze. In particular we need to track DEA, FBI, NSA, CIA, all elected and appointed federal officials, etc.

You see, I happen to trust the town and county police where I live, and the state police almost as much. I’m sure they would disagree politically with some of the things I say, but for the most part they are doing a thankless job for lower pay than I would take for the same work. Where my trust of the police and government breaks down is once you move above the state level.

    Not to deny there are bad apples in every lot, but as you go up to the national level, the percentage of bad apples increases rapidly. What agenda they are seeking to serve I cannot say but I do know it isn’t one that is consistent with the Constitution or intended to benefit any ordinary citizens.

    Turn your cellphone cameras on and legally park outside every known DEA, FBI, etc. office and photograph everyone coming or going. Obey all laws and instructions from law enforcement officials. Then post all of your photos and invite others to do the same.

Actually, I would call up my local police and ask for their assistance in tracking DEA, FBI, etc. agents. The local police don’t need interference from people who don’t understand the local community. You may find the local police are your best allies in ferreting out overreaching by the federal government.

    The police (read local police) aren’t the privacy problem. The privacy problem is with federal data vacuums and police wannabes who think people are the sum of their data. People are so much more than that, ask your local police if you don’t believe me.

    Data Science and Hadoop: Predicting Airline Delays – Part 3

    January 27th, 2015

    Data Science and Hadoop: Predicting Airline Delays – Part 3 by Ofer Mendelevitch and Beau Plath.

    From the post:

    In our series on Data Science and Hadoop, predicting airline delays, we demonstrated how to build predictive models with Apache Hadoop, using existing tools. In part 1, we employed Pig and Python; part 2 explored Spark, ML-Lib and Scala.

    Throughout the series, the thesis, theme, topic, and algorithms were similar. That is, we wanted to dismiss the misconception that data scientists – when applying predictive learning algorithms, like Linear Regression, Random Forest or Neural Networks to large datasets – require dramatic changes to the tooling; that they need dedicated clusters; and that existing tools will not suffice.

    Instead, we used the same HDP cluster configuration, the same machine learning techniques, the same data sets, and the same familiar tools like PIG, Python and Scikit-learn and Spark.

    For the final part, we resort to Scalding and R. R is a very popular, robust and mature environment for data exploration, statistical analysis, plotting and machine learning. We will use R for data exploration, graphics as well as for building our predictive models with Random Forest and Gradient Boosted Trees. Scalding, on the other hand, provides Scala libraries that abstract Hadoop MapReduce and implement data pipelines. We demonstrate how to pre-process the data into a feature matrix using the Scalding framework.

    For brevity I shall spare summarizing the methodology here, since both previous posts (and their accompanying IPython Notebooks) expound the steps, iteration and implementation code. Instead, I would urge that you read all parts as well as try the accompanying IPython Notebooks.

Finally, for this last installment in the series, on Scalding and R, read its IPython Notebook for implementation details.

    Given the brevity of this post, you are definitely going to need Part 1 and Part 2.
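
The series itself uses R for the Random Forest and Gradient Boosted Tree models and Scalding for feature generation. As a rough stand-in using the Part 1 toolchain (Python and scikit-learn), training a delay classifier on a feature matrix looks roughly like this, with synthetic data in place of the real airline features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the airline feature matrix (departure hour, distance, etc.)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0.5).astype(int)  # 1 = delayed

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

The point of the series holds either way: the same familiar modeling code runs against features prepared on the Hadoop cluster, without a dedicated toolchain.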

    The data science world could use more demonstrations like this series.

    LDAvis: Interactive Visualization of Topic Models

    January 27th, 2015

    LDAvis: Interactive Visualization of Topic Models by Carson Sievert and Kenny Shirley.

    From the webpage:

    Tools to create an interactive web-based visualization of a topic model that has been fit to a corpus of text data using Latent Dirichlet Allocation (LDA). Given the estimated parameters of the topic model, it computes various summary statistics as input to an interactive visualization built with D3.js that is accessed via a browser. The goal is to help users interpret the topics in their LDA topic model.

    From the description:

    This video (recorded September 2014) shows how interactive visualization is used to help interpret a topic model using LDAvis. LDAvis is an R package which extracts information from a topic model and creates a web-based visualization where users can interactively explore the model. More details, examples, and instructions for using LDAvis can be found here — https://github.com/cpsievert/LDAvis

    Excellent exploration of a data set using LDAvis.
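
LDAvis itself is an R package built on D3.js, but the model it visualizes can be fit with any LDA implementation. Here is a minimal sketch in Python with scikit-learn; the toy documents are mine, and get_feature_names_out assumes a reasonably recent scikit-learn release:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "topic models summarize large text corpora",
    "latent dirichlet allocation assigns topics to documents",
    "galaxies and clusters appear in astronomy catalogs",
    "telescopes observe galaxies across the spectrum",
]
vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)                      # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):      # top terms per topic
    top = [terms[i] for i in weights.argsort()[-4:][::-1]]
    print(f"topic {k}: {top}")
```

LDAvis takes exactly this kind of fitted model (topic-term and document-topic distributions) and turns it into the interactive browser view shown in the video.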

With all due respect to “agile” programming, modeling before you understand a data set isn’t a winning proposition.

    Eigenvectors and eigenvalues: Explained Visually

    January 27th, 2015

Eigenvectors and eigenvalues: Explained Visually by Victor Powell and Lewis Lehe.

    Very impressive explanation/visualization of eigenvectors and eigenvalues. What is more, it concludes with pointers to additional resources.
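
If you want to check the intuition numerically after working through the visualization, NumPy’s eig does the computation directly (the matrix below is just an arbitrary symmetric example of mine):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
values, vectors = np.linalg.eig(A)

for lam, v in zip(values, vectors.T):   # eigenvectors are the columns
    # An eigenvector is only scaled by the matrix: A v == lambda v
    print(f"lambda = {lam:.1f}, v = {v}, A v == lambda v: {np.allclose(A @ v, lam * v)}")
```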

    This is only a part of a larger visualization of algorithms projects at: Explained Visually.

    Looking forward to seeing more visualizations on this site.

    Coding is not the new literacy

    January 27th, 2015

    Coding is not the new literacy by Chris Granger.

    From the post:

    Despite the good intentions behind the movement to get people to code, both the basic premise and approach are flawed. The movement sits on the idea that "coding is the new literacy," but that takes a narrow view of what literacy really is.

    If you ask google to define literacy it gives a mechanical definition:

    the ability to read and write.

    This is certainly accurate, but defining literacy as interpreting and making marks on a sheet of paper is grossly inadequate. Reading and writing are the physical actions we use to employ something far more important: external, distributable storage for the mind. Being literate isn't simply a matter of being able to put words on the page, it's solidifying our thoughts such that they can be written. Interpreting and applying someone else's thoughts is the equivalent for reading. We call these composition and comprehension. And they are what literacy really is.

    Before you assume that Chris is going to diss programming, go read his post.

Chris is arguing for a skill set that will make anyone a much better programmer and spill over into other analytical tasks as well.

    Take the title as a provocation to read the post. By the end of the post, you will have learned something valuable or have been reminded of something valuable that you already knew.

    Enjoy!

    Business Analytics Error: Learn from Uber’s Mistake During the Sydney Terror Attack

    January 27th, 2015

    Business Analytics Error: Learn from Uber’s Mistake During the Sydney Terror Attack by RK Paleru.

    From the post:

    Recently, as a sad day of terror ended in Sydney, a bad case of Uber’s analytical approach to pricing came to light – an “algorithm based price surge.” Uber’s algorithm driven price surge started overcharging people fleeing the Central Business District (CBD) of Sydney following the terror attack.

    I’m not sure the algorithm got it wrong. If you asked me to drive into a potential war zone to ferry strangers out, I suspect a higher fee than normal is to be expected.

    The real dilemma for Uber is that not all ground transportation has surge price algorithms. When buses, subways, customary taxis, etc. all have surge price algorithms, the price hikes won’t appear to be abnormal.

    One of the consequences of an algorithm/data-driven world is that factors known or unknown to you may be driving the price or service. To say it another way, your “expectations” of system behavior may be at odds with how the system will behave.
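
To make the point concrete, surge pricing in the abstract is just a demand/supply multiplier with a cap. The sketch below is purely illustrative; it is not Uber’s actual algorithm or parameters:

```python
def surge_multiplier(ride_requests, available_drivers, cap=4.0):
    """Illustrative only: multiplier grows with the demand/supply ratio, capped."""
    if available_drivers == 0:
        return cap
    return max(1.0, min(ride_requests / available_drivers, cap))

# A normal evening vs. a sudden spike in requests as an area empties out
print(surge_multiplier(ride_requests=80, available_drivers=100))   # 1.0
print(surge_multiplier(ride_requests=600, available_drivers=100))  # 4.0 (capped)
```

An algorithm like that has no notion of “terror attack”; it only sees the ratio spike, which is exactly the mismatch between system behavior and human expectations described above.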

    The inventory algorithm at my local drugstore thought a recent prescription was too unusual to warrant stocking. My drugstore had to order it from a regional warehouse. Just-in-time inventory I think they call it. That was five (5) days ago. That isn’t “just-in-time” for the customer (me) but that isn’t the goal of most cost/pricing algorithms. Particularly when the customer has little choice about the service.

    I first saw this in a tweet by Kirk Borne.

    Nature: A recap of a successful year in open access, and introducing CC BY as default

    January 27th, 2015

    A recap of a successful year in open access, and introducing CC BY as default by Carrie Calder, the Director of Strategy for Open Research, Nature Publishing Group/Palgrave Macmillan.

    From the post:

We’re pleased to start 2015 with an announcement that we’re now using Creative Commons Attribution license CC BY 4.0 as default. This will apply to all of the 18 fully open access journals Nature Publishing Group owns, and will also apply to any future titles we launch. Two society-owned titles have introduced CC BY as default today and we expect to expand this in the coming months.

    This follows a transformative 2014 for open access and open research at Nature Publishing Group. We’ve always been supporters of new technologies and open research (for example, we’ve had a liberal self-archiving policy in place for ten years now. In 2013 we had 65 journals with an open access option) but in 2014 we:

    • Built a dedicated team of over 100 people working on Open Research across journals, books, data and author services
    • Conducted research on whether there is an open access citation benefit, and researched authors’ views on OA
    • Introduced the Nature Partner Journal series of high-quality open access journals and announced our first ten NPJs
    • Launched Scientific Data, our first open access publication for Data Descriptors
    • And last but not least switched Nature Communications to open access, creating the first Nature-branded fully open access journal

    We did this not because it was easy (trust us, it wasn’t always) but because we thought it was the right thing to do. And because we don’t just believe in open access; we believe in driving open research forward, and in working with academics, funders and other publishers to do so. It’s obviously making a difference already. In 2013, 38% of our authors chose to publish open access immediately upon publication – in 2014, this percentage rose to 44%. Both Scientific Reports and Nature Communications had record years in terms of submissions for publication.

    Open access is on its way to becoming the expected model for publishing. That isn’t to say that there aren’t economies and kinks to be worked out, but the fundamental principles of open access have been widely accepted.

    Not everywhere of course. There are areas of scholarship that think self-isolation makes them important. They shun open access as an attack on their traditions of “Doctor Fathers” and access to original materials as a privilege. Strategies that make them all the more irrelevant in the modern world. Pity because there is so much they could contribute to the public conversation. But a public conversation means you are not insulated from questions that don’t accept “because I say so” as an adequate answer.

If you are working in such an area or know of one, press for emulation of Nature and the many other efforts to provide open access to both primary and secondary materials. There are many areas of the humanities that already follow that model, but not all. Let’s keep pressing until open access is the default for all disciplines.

    Kudos to Nature for their ongoing efforts on open access.

I first saw the news about this Nature post in a tweet by Ethan White.

    Message of Ayatollah Seyyed Ali Khamenei To the Youth in Europe and North America

    January 27th, 2015

    #LETTER4U Message of Ayatollah Seyyed Ali Khamenei To the Youth in Europe and North America

    Unlike many news sources I will not attempt to analyze this message from Ayatollah Seyyed Ali Khamenei.

    You should read the message for yourself and not rely on the interpretations of others.

    Ayatollah Seyyed Ali Khamenei’s request is an honorable one and should be granted. You will find it an exercise in attempting (one never really succeeds) to understand the context of another. That is one of the key skills in creating topic maps that traverse the contextual boundaries of departments, enterprises, government offices and cultures.

    It isn’t easy to stray from one’s own cultural context but even making the effort is worthwhile.

    Google asked to muzzle Waze ‘police-stalking’ app

    January 27th, 2015

    Google asked to muzzle Waze ‘police-stalking’ app by Lisa Vaas.

    From the post:

    GPS trackers on vehicles; stingray devices to siphon mobile phone IDs and their owners’ locations; gunshot-detection sensors; license plate readers: these are just some of the types of surveillance technologies used by law enforcement, often without warrants.

    Now, US police are protesting the fact that citizens are using technology to track them, and they want Google to pull the plug on it.

    The technology being used to track police – regardless of whether they’re on their lunch break, assisting with a broken-down vehicle on the highway, or hiding in wait to nab speeders – is part of a popular mobile app, Waze, that Google picked up in 2013.

    Don’t you find it interesting that law enforcement has no apparent objection to mass surveillance and stalking of citizens but quickly rallies when effective crowd sourcing creates surveillance of the police?

    Lisa seizes on the highly unusual killing of two New York police officers in Brooklyn last December to make this a police safety issue. Random events are going to happen whether citizens report police locations or not.

And if we are going to be “data-driven,” the number of line-of-duty deaths for police officers has been going down for the past three years: 2011 – 171, 2012 – 122, 2013 – 102. So if Waze is having an impact on officer safety, it isn’t showing up in the data. Those numbers were collected by the National Law Enforcement Officers Memorial Fund.

    Shouldn’t policy decisions on surveillance/stalking be driven by data? If you oppose Waze, where is your data?

    Those are simple enough questions.

    Forward this post to your local newspaper and police department. If there is data to oppose Waze, let’s everyone see it.

    Yes?

    PS: You can find more information on Waze at: https://www.waze.com/. Not only do I hope Waze keeps posting the location of police officers but I hope they add politicians, bankers, CIA/NSA staff to their map. Effective crowd sourcing may be our only defense against government overreaching. Help keep everyone free with your location contributions.

    Chandra Celebrates the International Year of Light

    January 26th, 2015

    Chandra Celebrates the International Year of Light by Janet Anderson and Megan Watzke.

    From the webpage:

    The year of 2015 has been declared the International Year of Light (IYL) by the United Nations. Organizations, institutions, and individuals involved in the science and applications of light will be joining together for this yearlong celebration to help spread the word about the wonders of light.

    In many ways, astronomy uses the science of light. By building telescopes that can detect light in its many forms, from radio waves on one end of the “electromagnetic spectrum” to gamma rays on the other, scientists can get a better understanding of the processes at work in the Universe.

    NASA’s Chandra X-ray Observatory explores the Universe in X-rays, a high-energy form of light. By studying X-ray data and comparing them with observations in other types of light, scientists can develop a better understanding of objects likes stars and galaxies that generate temperatures of millions of degrees and produce X-rays.

    To recognize the start of IYL, the Chandra X-ray Center is releasing a set of images that combine data from telescopes tuned to different wavelengths of light. From a distant galaxy to the relatively nearby debris field of an exploded star, these images demonstrate the myriad ways that information about the Universe is communicated to us through light.

    SNR 0519-69.0: When a massive star exploded in the Large Magellanic Cloud, a satellite galaxy to the Milky Way, it left behind an expanding shell of debris called SNR 0519-69.0. Here, multimillion degree gas is seen in X-rays from Chandra (blue). The outer edge of the explosion (red) and stars in the field of view are seen in visible light from Hubble.

    Cygnus A: This galaxy, at a distance of some 700 million light years, contains a giant bubble filled with hot, X-ray emitting gas detected by Chandra (blue). Radio data from the NSF’s Very Large Array (red) reveal “hot spots” about 300,000 light years out from the center of the galaxy where powerful jets emanating from the galaxy’s supermassive black hole end. Visible light data (yellow) from both Hubble and the DSS complete this view.

    There are more images but one of the reasons I posted about Chandra is that the online news reports I have seen all omitted the most important information of all: Where to find more information!

    At the bottom of this excellent article on Chandra (which also doesn’t appear as a link in the news stories I have read), you will find:

    For more information on “Light: Beyond the Bulb,” visit the website at http://lightexhibit.org

    For more information on the International Year of Light, go to http://www.light2015.org/Home.html

    For more information and related materials, visit: http://chandra.si.edu

    For more Chandra images, multimedia and related materials, visit: http://www.nasa.gov/chandra

    Granted it took a moment or two to insert the hyperlinks but now any child or teacher or anyone else who wants more information can avoid the churn and chum of searching and go directly to the sources for more information.

    That doesn’t detract from my post. On the contrary, I hope that readers find that sort of direct linking to more resources helpful and a reason to return to my site.

Granted, I don’t have advertising and won’t, so keeping people at my site is of no financial advantage to me. But if I have to trap people into remaining at my site, it must not be a very interesting one. Yes?

    Why Internet Memory Is Important – Auschwitz

    January 26th, 2015

    After posting a note about Jill Lepore’s essay The Cobweb: Can the Internet be archived?, I found a great example of why memory and sources (like a footnote) are important.

Today, 26 January 2015, is the 70th anniversary of the liberation of Auschwitz. The Telegraph gave this lead-in to its reprinting of the obituary of Rudolf Vrba:

    Rudolf Vrba escaped from Auschwitz in 1944 and was one of the first people to give first-hand evidence of the gas chambers, mass murder and plans to exterminate a million Jews. Nearly 70 years on from the liberation of the concentration camp, the Telegraph looks back on his legacy

    So horrific was the testimony from Rudolf Vrba, that the members of the Jewish Council in Hungary couldn’t quite believe what they were hearing.

Vrba and Alfred Wetzler, who escaped with him in April 1944, drew up a detailed plan of Auschwitz and its gas chambers, providing compelling evidence of what had previously been considered embellishment. It has since emerged that reports from inside Auschwitz, compiled by the Polish Underground State and the Polish Government in Exile and written by Jan Karski and Witold Pilecki among others, had in fact reached some Western allies before 1944, but action had not been taken.

    Vrba and Wetzler’s detailed, first-hand report about how Nazis were systematically killing Jews was compiled into the Wetzler-Vrba report and sent shockwaves around the world when it was circulated and picked up by international media in 1944.

    It still took some weeks before the report was accepted and credited after it was written – something that Vrba said had contributed to the deaths of an estimated 50,000 Hungarian Jews. Just weeks before their escape, German forces had invaded Hungary, and Jews there were already being shipped to Auschwitz. It wasn’t until the report made the headlines in international media that Hungary stopped the deportation in July of 1944.

    Ahead of the 70th anniversary of the liberation of Auschwitz on Monday 26th January, here is the Telegraph’s obituary of Vrba, who died in 2006, and is credited for opening the world’s eyes to the horrors of Auschwitz:

The obituary is very moving, but you need to read The Auschwitz Protocol / The Vrba-Wetzler Report to get a true sense of the horror that was Auschwitz.

The report is all the more chilling because of its lack of hype and its matter-of-fact tone. Quite different from the news we experience every day.

Remembering an event such as Auschwitz is important, not to relive old wrongs but to attempt to avoid repeating those same wrongs again. Remembering Auschwitz did not prevent any of the bloodiness of the second half of the 20th century, which, if anything, exceeded the bloodiness of the first half when famine, drought, disease and human neglect or malice are taken into account.

    But Auschwitz will live on in the memories of survivors and their children. Equally important, it will live on as a well documented event. Dislodging it from the historical record will take more than time.

    Can the same be said about many of the events and reports of events that now live only in digital media? We have done badly enough with revisionist history on actual events (see who defeated Germany). How much worse will we do when “history” can simply disappear? (As much already has from government archives no doubt.)

Preserving discovery and analysis of the content of archives presumes there are archives to be mined for subjects and relationships between them. Talk to your local librarian about how best to support long-term archiving in your organization, locality and national government. The history we lose could well be your own.

    I first saw the basis for this post in Vintage Infodesign [105].

    The Cobweb: Can the Internet be archived?

    January 26th, 2015

    The Cobweb: Can the Internet be archived? by Jill Lepore.

    From the post:

    Malaysia Airlines Flight 17 took off from Amsterdam at 10:31 A.M. G.M.T. on July 17, 2014, for a twelve-hour flight to Kuala Lumpur. Not much more than three hours later, the plane, a Boeing 777, crashed in a field outside Donetsk, Ukraine. All two hundred and ninety-eight people on board were killed. The plane’s last radio contact was at 1:20 P.M. G.M.T. At 2:50 P.M. G.M.T., Igor Girkin, a Ukrainian separatist leader also known as Strelkov, or someone acting on his behalf, posted a message on VKontakte, a Russian social-media site: “We just downed a plane, an AN-26.” (An Antonov 26 is a Soviet-built military cargo plane.) The post includes links to video of the wreckage of a plane; it appears to be a Boeing 777.

    Two weeks before the crash, Anatol Shmelev, the curator of the Russia and Eurasia collection at the Hoover Institution, at Stanford, had submitted to the Internet Archive, a nonprofit library in California, a list of Ukrainian and Russian Web sites and blogs that ought to be recorded as part of the archive’s Ukraine Conflict collection. Shmelev is one of about a thousand librarians and archivists around the world who identify possible acquisitions for the Internet Archive’s subject collections, which are stored in its Wayback Machine, in San Francisco. Strelkov’s VKontakte page was on Shmelev’s list. “Strelkov is the field commander in Slaviansk and one of the most important figures in the conflict,” Shmelev had written in an e-mail to the Internet Archive on July 1st, and his page “deserves to be recorded twice a day.”

    On July 17th, at 3:22 P.M. G.M.T., the Wayback Machine saved a screenshot of Strelkov’s VKontakte post about downing a plane. Two hours and twenty-two minutes later, Arthur Bright, the Europe editor of the Christian Science Monitor, tweeted a picture of the screenshot, along with the message “Grab of Donetsk militant Strelkov’s claim of downing what appears to have been MH17.” By then, Strelkov’s VKontakte page had already been edited: the claim about shooting down a plane was deleted. The only real evidence of the original claim lies in the Wayback Machine.

If you aren’t a daily user of the Internet Archive (home of the Wayback Machine), you are missing out on a very useful resource.
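
The Internet Archive also exposes a simple availability API for the Wayback Machine, which makes the kind of lookup Jill describes easy to script. A minimal sketch follows; the API shape is as documented at the time of writing, and example.com is a placeholder URL:

```python
import json
import urllib.parse
import urllib.request

def closest_snapshot(url, timestamp=None):
    """Return the closest archived snapshot record for a URL, or None."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    if timestamp:                       # e.g. "20140717" for July 17, 2014
        api += "&timestamp=" + timestamp
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

print(closest_snapshot("example.com", "20140717"))
```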

    Jill tells the story about the archive, its origins and challenges as well as I have heard it told. Very much worth your time to read.

    Hopefully after reading the story you will find ways to contribute/support the Internet Archive.

    Without the Internet Archive, the memory of the web would be distributed, isolated and in peril of erasure and neglect.

    I am sure many governments and corporations wish the memory of the web could be altered, let’s disappoint them!

    New Member of the Axis of Evil – Greece

    January 26th, 2015

In case you haven’t heard, Greece has a new government, a leftist government. How convenient that CNN today ran Add this to Greece’s list of problems: It’s an emerging hub for terrorists.

I won’t repeat the bogeyman rumors reported by CNN, but suffice it to say that the story is a first step towards establishing that Greece can’t control its borders and so is a highway for terrorists.

It doesn’t take a lot of imagination to realize who might want to “assist” Greece in controlling its borders. Assist as in “insist” on Greece controlling its borders. Should it fail to do so, well, there are always international coalitions willing to assist with such duties.

    The U.S. Dept. of Fear jumped on this yesterday. A great Twitter account to follow if you are interested in the smoke and mirrors that are the illusion of fighting terrorism.

    PS: Tell me, do you know if the Dulles brothers had any grandchildren? You may remember their efforts in Guatemala and Honduras on behalf of United Fruit Company. That included overthrowing governments, etc. Independence from the United States is possible, but ask Vietnam, at what cost?

    Cost To Be A Terrorist Hits Rock Bottom

    January 26th, 2015

    The cost of being a terrorist has dropped dramatically since 9/11 but it hit a new low when a free Twitter account was used to ground two planes with bomb threats. Both planes were “escorted” by F-16 fighters to safe landings. No bombs were found.

    The @KingZortic twitter account, as of last Saturday, is reported to have eleven (11) tweets, ten of which were threats.

You can find more details at: F-16s Scrambled to Escort Jets After Twitter Bomb Threat. You can find the same account, with varying verbiage, at any number of media outlets. I just happened upon that one first.

Do ten tweets counting as “credible evidence” tell you something about the confidence of government officials in their airport security systems?

Being a tweet-literate terrorist allows you to avoid the unpleasantness of terrorism camps, being traced to such camps, travel expenses, the camp fees and extra charges for ammunition, food, etc.

    No, under no circumstances should you become a tweeting terrorist, but on the other hand, you should not become a terrorist that uses cruise missiles to attack wedding parties either.

    How such activities will be treated depends on your national government and who you are terrorizing.

Because the bar to being a terrorist is so low (in Atlanta, a free Twitter account; in Paris, easily obtained automatic weapons), as is the bar to being declared one, everyone should back away from terrorism as an instrument of state or near-state policy.

    That includes Western powers that are even now conducting terrorist campaigns in the Middle East. What else would you call it when a cruise missile or bomb kills? It is no less terrorizing than a car bomb or an AK-47. The trite line about trying to avoid civilian casualties is further evidence of Western moral arrogance, deciding who will live and who will die.

    Terrorism and the war on terrorism are equally wasteful of resources, lives and the economies of nations. What other priorities should replace terrorism and the war on terrorism I don’t know. What I do know is that the current efforts for and against terrorism are waste, waste pure and simple.

    Humpty-Dumpty on Being Data-Driven

    January 26th, 2015

    What’s Hampering Corporate Efforts to be Data-Driven? by Michael Essany.

Michael summarizes a survey from Teradata that reports:

    • 47% of CEOs, or about half, believe that all employees have access to the data they need, while only 27% of other respondents agree.
    • 43% of CEOs think relevant data are captured and made available in real time, as opposed to 29% of other respondents.
    • CEOs are also more likely to think that employees extract relevant insights from data – 38% of them hold this belief, as compared to 24% of the rest of respondents.
    • 53% of CEOs think data utilization has made decision-making less hierarchical and further empowered employees, as compared to only 36% of the employees themselves.
    • 51% of CEOs believe data availability has improved employee engagement, satisfaction and retention, while only 35% of the rest agree.

As marketing literature, Teradata’s survey is targeted at laying the failure to become “data-driven” at the door of CEOs.

But Teradata didn’t ask, or Michael did not report, the answers to several other relevant questions:

What are the characteristics of a business that can benefit from being “data-driven”? If you are going to promote being “data-driven,” shouldn’t there be data establishing that being “data-driven” benefits a business? Real data, not the PowerPoint-slide, hand-wavy stuff.

Who signs the check for the enterprise is a more relevant question than the CEO’s opinion about being “data-driven,” IT in general, or global warming.

    And as Humpty-Dumpty would say, in a completely different context: “The question is, which is to be master, that’s all!”

I suppose as marketing glam it’s not bad, but it’s not all that impressive either. Data-driven marketing should be based on hard data and case studies with references. Upstairs/downstairs differences in perception hardly qualify as hard data.

    I first saw this in a tweet by Kirk Borne.

    Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs

    January 26th, 2015

    Machine Learning Etudes in Astrophysics: Selection Functions for Mock Cluster Catalogs by Amir Hajian, Marcelo Alvarez, J. Richard Bond.

    Abstract:

Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunyaev-Zel'dovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.

    From the introduction:

In many cases the number of unknown parameters is so large that explicit rules for deriving the selection function do not exist. A sample of the objects does exist (the very objects in the observed catalog) however, and the observed sample can be used to express the rules for the selection function. This “learning from examples” is the main idea behind classification algorithms in machine learning. The problem of selection functions can be re-stated in the statistical machine learning language as: given a set of samples, we would like to detect the soft boundary of that set so as to classify new points as belonging to that set or not. (emphasis added)

    Does the sentence:

    In many cases the number of unknown parameters is so large that explicit rules for deriving the selection function do not exist.

    sound like they could be describing people?

I mention this as a reason why you should read broadly in machine learning in particular and IR in general.

What if all the known data about known terrorists, sans all the idle speculation by intelligence analysts, were gathered into a data set? Machine learning on that data set could then be tested against a simulation of potential terrorists, to help avoid the biases of intelligence analysts.

    Lest the undeserved fixation on Muslims blind security services to other potential threats, such as governments bent on devouring their own populations.
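
Back to the paper’s actual technique: one-class classification learns the boundary of a set from positive examples only. Here is a toy sketch of the idea with scikit-learn’s OneClassSVM, using synthetic numbers of mine in place of the real ROSAT cluster features:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Stand-in "observed catalog" features (think X-ray flux, richness), 2-D for brevity
observed = rng.normal(loc=[1.0, 0.5], scale=0.3, size=(200, 2))
# Stand-in "mock simulation": half resembles the observed set, half does not
mock = np.vstack([rng.normal(loc=[1.0, 0.5], scale=0.3, size=(50, 2)),
                  rng.normal(loc=[3.0, 3.0], scale=0.3, size=(50, 2))])

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(observed)
selected = mock[clf.predict(mock) == 1]   # +1 means "resembles the observed catalog"
print(f"{len(selected)} of {len(mock)} mock clusters pass the learned selection function")
```

The classifier plays the role of the selection function: it was never told the selection rules, it inferred them from the observed sample alone.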

    I first saw this in a tweet by Stat.ML.

    Understanding Context

    January 25th, 2015

    Understanding Context by Andrew Hinton.

    From the post:

    Technology is destabilizing the way we understand our surroundings. From social identity to ubiquitous mobility, digital information keeps changing what here means, how to get there, and even who we are. Why does software so easily confound our perception and scramble meaning? And how can we make all this complexity still make sense to our users?

    Understanding Context — written by Andrew Hinton of The Understanding Group — offers a powerful toolset for grasping and solving the challenges of contextual ambiguity. By starting with the foundation of how people perceive the world around them, it shows how users touch, navigate, and comprehend environments made of language and pixels, and how we can make those places better.

    Understanding Context is ideal for information architects, user experience professionals, and designers of digital products and services of any scope. If what you create connects one context to another, you need this book.

    Amazon summarizes in part:

    You’ll discover not only how to design for a given context, but also how design participates in making context.

    • Learn how people perceive context when touching and navigating digital environments
    • See how labels, relationships, and rules work as building blocks for context
    • Find out how to make better sense of cross-channel, multi-device products or services
    • Discover how language creates infrastructure in organizations, software, and the Internet of Things
    • Learn models for figuring out the contextual angles of any user experience

This book is definitely going on my birthday wish list at Amazon. (There, done!)

    Looking forward to a slow read and in the meantime, will start looking for items from the bibliography.

My question, of course, is: after expending all the effort to discover and/or design a context, how do I pass that context on to another?

    To someone coming from a slightly different context? (Assuming always that the designer is “in” a context.)

    From a topic map perspective, what subjects do I need to represent to capture a visual context? Even more difficult, what properties of those subjects do I need to capture to enable their discovery by others? Or to facilitate mapping those subjects to another context/domain?

    Definitely a volume I would assign as reading for a course on topic maps.

    I first saw this in a tweet by subjectcentric.

    Introducing Espresso – LinkedIn’s hot new distributed document store

    January 25th, 2015

    Introducing Espresso – LinkedIn’s hot new distributed document store by Aditya Auradkar.

    From the post:

    Espresso is LinkedIn’s online, distributed, fault-tolerant NoSQL database that currently powers approximately 30 LinkedIn applications including Member Profile, InMail (LinkedIn’s member-to-member messaging system), portions of the Homepage and mobile applications, etc. Espresso has a large production footprint at LinkedIn with over a dozen clusters in use. It hosts some of the most heavily accessed and valuable datasets at LinkedIn serving millions of records per second at peak. It is the source of truth for hundreds of terabytes (not counting replicas) of data.

    Motivation

    To meet the needs of online applications, LinkedIn traditionally used Relational Database Management Systems (RDBMSs) such as Oracle and key-value stores such as Voldemort – both serving different use cases. Much of LinkedIn requires a primary, strongly consistent, read/write data store that generates a timeline-consistent change capture stream to fulfill nearline and offline processing requirements. It has become apparent that many, if not most, of the primary data requirements of LinkedIn do not require the full functionality of monolithic RDBMSs, nor can they justify the associated costs.

    A must read if you are concerned with BigData and/or distributed systems.

    A refreshing focus on requirements, as opposed to engineering by slogan, “all the world’s a graph.”

Looking forward to more details on Espresso as they emerge.

    I first saw this in a tweet by Martin Kleppmann.

    A practical introduction to functional programming

    January 25th, 2015

    A practical introduction to functional programming by Mary Rose Cook.

    From the post:

    Many functional programming articles teach abstract functional techniques. That is, composition, pipelining, higher order functions. This one is different. It shows examples of imperative, unfunctional code that people write every day and translates these examples to a functional style.

    The first section of the article takes short, data transforming loops and translates them into functional maps and reduces. The second section takes longer loops, breaks them up into units and makes each unit functional. The third section takes a loop that is a long series of successive data transformations and decomposes it into a functional pipeline.

    The examples are in Python, because many people find Python easy to read. A number of the examples eschew pythonicity in order to demonstrate functional techniques common to many languages: map, reduce, pipeline.
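
In the spirit of the article (my own toy example, not copied from it), here is the basic move of translating an everyday imperative loop into map and reduce:

```python
from functools import reduce

names = ["Mary", "Isla", "Sam"]

# Imperative: accumulate a total by mutating a variable inside a loop
total = 0
for name in names:
    total += len(name)

# Functional: describe the transformation (map) and the combination (reduce)
total_fp = reduce(lambda acc, n: acc + n, map(len, names), 0)

assert total == total_fp
print(total_fp)  # 11
```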

    After spending most of the day with poor documentation, this sort of post is a real delight. It took more effort than the stuff I was reading today but it saves every reader time, rather than making them lose time.

    Perhaps I should create an icon to mark documentation that will cost you more time than searching a discussion list for the answer.

    Yes?

    I first saw this in a tweet by Gianluca Fiore.

    Comparative Oriental Manuscript Studies: An Introduction

    January 25th, 2015

    Comparative Oriental Manuscript Studies: An Introduction edited by: Alessandro Bausi (General editor), et al.

The “homepage” of this work enables you to download the entire volume or individual chapters, depending upon your interests. It provides a lengthy introduction to codicology, palaeography, textual criticism and text editing, and, of special interest to library students, cataloguing as well as conservation and preservation.

    Alessandro Bausi writes in the preface:

    Thinking more broadly, our project was also a serious attempt to defend and preserve the COMSt-related fields within the academic world. We know that disciplines and fields are often determined and justified by the mere existence of an easily accessible handbook or, in the better cases, sets of handbooks, textbooks, series and journals. The lack of comprehensive introductory works which are reliable, up-to-date, of broad interest and accessible to a wide audience and might be used in teaching, has a direct impact on the survival of the ‘small subjects’ most of the COMSt-related disciplines pertain to. The decision to make the COMSt handbook freely accessible online and printable on demand in a paper version at an affordable price was strategic in this respect, and not just meant to meet the prescriptions of the European Science Foundation. We deliberately declined to produce an extremely expensive work that might be bought only by a few libraries and research institutions; on the other hand, a plain electronic edition only to be accessed and downloaded as a PDF file was not regarded as a desirable solution either. Dealing with two millennia of manuscripts and codices, we did not want to dismiss the possibility of circulating a real book in our turn.

    It remains, hopefully, only to say,

    Lector intende: laetaberis

    John Svarlien says: A rough translation is: “Reader, pay attention. You will be happy you did.”

    We are all people of books. It isn’t possible to separate present-day culture, and what came before it, from books. Even people who shun reading books are shaped by forces that can be traced back to books.

    But books did not suddenly appear as mass-printed paperbacks in airport lobbies and grocery store checkout lines. There is a long history of books before printing, reaching back to the formation of the earliest codices.

    This work is an introduction to the fascinating world of studying manuscripts and codices prior to the invention of printing. When nearly every copy of a work is different from every other copy, you can imagine the debates over which copy is the “best” copy.

    Imagine some versions of “Gone with the Wind” ending with:

    • Frankly, my dear, I don’t give a damn. (traditional)
    • Ashley and I don’t give a damn. (variant)
    • Cheat Ashley out of his business I suppose. (variant)
    • (Lacks a last line due to mss. damage.) (variant)

    The “text” of yesteryear lacked the uniform sameness of the printed “text” of today.

    When you think of your “favorite” verse in the Bible, its familiar wording is likely a “majority” reading, but hardly the only one.

    With the advent of the printing press, texts could be produced uniformly and in mass quantities.

    With the advent of electronic texts, whether through editing or digital corruption, we are moving back towards non-uniform texts.

    Will we see the birth of digital codicology and its allied fields for digital texts?

    PS: Please forward the notice of this book to your local librarian.

    I first saw this in a tweet by Kirk Lowery.

    Crawling the WWW – A $64 Question

    January 24th, 2015

    Have you ever wanted to crawl the WWW? To make a really comprehensive search? Waiting for a private power facility and server farm? You need wait no longer!

    Ross Fairbanks, in WikiReverse data pipeline details, describes the creation of WikiReverse:

    WikiReverse is a reverse web-link graph for Wikipedia articles. It consists of approximately 36 million links to 4 million Wikipedia articles from 900,000 websites.

    You can browse the data at WikiReverse or download it from S3 as a torrent.
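
    To make the idea of a reverse link graph concrete, here is a minimal sketch that inverts (source page, Wikipedia article) link records into an inbound index. The tab-separated input format and file name are assumptions for illustration, not necessarily the layout of the downloadable data:

    ```python
    import csv
    from collections import defaultdict

    def build_reverse_graph(path):
        """Invert (source_url, wikipedia_article) records into
        wikipedia_article -> set of source_urls."""
        inbound = defaultdict(set)
        with open(path, newline="") as handle:
            for source_url, article in csv.reader(handle, delimiter="\t"):
                inbound[article].add(source_url)
        return inbound

    graph = build_reverse_graph("links.tsv")  # hypothetical file name
    # Print the ten articles with the most inbound links.
    for article, sources in sorted(graph.items(),
                                   key=lambda item: len(item[1]),
                                   reverse=True)[:10]:
        print(len(sources), article)
    ```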

    The first thought that struck me was the data set would be useful for deciding which Wikipedia links are the default subject identifiers for particular subjects.

    My second thought was what a wonderful starting place to find links with similar content strings, for the creation of topics with multiple subject identifiers.

    My third thought was, $64 to search a CommonCrawl data set!

    You can do a lot of searches at $64 each before you approach the cost of a server farm, much less a server farm plus a private power facility.

    True, it won’t be interactive but then few searches at the NSA are probably interactive. ;-)

    The true upside is that you are freed from the tyranny of PageRank and hidden algorithms by which vendors attempt to guess what is best for them and, secondarily, what is best for you.

    Take the time to work through Ross’ post and develop your skills with the CommonCrawl data.

    Tooling Up For JSON

    January 24th, 2015

    I needed to explore a large (5.7MB) JSON file and my usual command line tools weren’t a good fit.

    Casting about, I discovered Jshon: Twice as fast, 1/6th the memory. From the home page for Jshon:

    Jshon parses, reads and creates JSON. It is designed to be as usable as possible from within the shell and replaces fragile adhoc parsers made from grep/sed/awk as well as heavyweight one-line parsers made from perl/python. Requires Jansson

    Jshon loads json text from stdin, performs actions, then displays the last action on stdout. Some of the options output json, others output plain text meta information. Because Bash has very poor nested data structures, Jshon does not try to return a native Bash data structure as a typical library would. Instead, Jshon provides a history stack containing all the manipulations.

    The big change in the latest release is switching everything from pass-by-value to pass-by-reference. In a typical use case (processing AUR search results for ‘python’) by-ref is twice as fast and uses one sixth the memory. If you are editing json, by-ref also makes your life a lot easier as modifications do not need to be manually inserted through the entire stack.

    Jansson is described as: “…a C library for encoding, decoding and manipulating JSON data.” It builds with the usual ./configure, make, make install. Jshon has no configure or install script, so just run make and put the binary somewhere in your path.

    Under Bugs you will read: “Documentation is brief.”

    That’s for sure!

    Still, it has enough examples that with some practice you will find this a handy way to explore JSON files.
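
    If you would rather stay in Python for the same kind of quick orientation (types, key counts and array lengths, roughly what Jshon reports from the shell), the standard library is enough. A small sketch, with the file name assumed for illustration:

    ```python
    import json

    def summarize(value, name="root", depth=0, max_depth=2):
        """Print the type, and the size where one applies, of each node
        near the top of a JSON document."""
        indent = "  " * depth
        if isinstance(value, dict):
            print(f"{indent}{name}: object with {len(value)} keys")
            if depth < max_depth:
                for key, child in value.items():
                    summarize(child, key, depth + 1, max_depth)
        elif isinstance(value, list):
            print(f"{indent}{name}: array of length {len(value)}")
            if depth < max_depth and value:
                summarize(value[0], name + "[0]", depth + 1, max_depth)
        else:
            print(f"{indent}{name}: {type(value).__name__}")

    with open("data.json") as handle:  # hypothetical file name
        summarize(json.load(handle))
    ```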

    Enjoy!

    History Depends On Who You Ask, And When

    January 24th, 2015

    You have probably seen the following graphic but it bears repeating:

    [Graphic: French survey results on which nation contributed most to the defeat of Nazi Germany, comparing responses from 1945, 1994 and 2004]

    The image is from: Who contributed most to the defeat of Nazi Germany in 1945?

    From the post:

    A survey conducted in May 1945 across the whole French territory, now released (confirming a survey of Parisians in September 1944), showed that interviewees appeared well aware of the power relations and the role of the allies in the war, despite the censorship and the difficulty of accessing reliable information under enemy occupation.

    A clear majority (57%) believed that the USSR was the nation that had contributed most to the defeat of Germany, while the United States and England gathered 20% and 12% respectively.

    But what is truly astonishing is that this vision of public opinion reversed very dramatically with time, as shown by two surveys conducted in 1994 and 2004. In 2004, 58% of the population were convinced that the USA had played the biggest role in the Second World War and only 20% were aware of the leading role of the USSR in defeating the Nazis.

    This is a very clear example of how propaganda adjusted a whole nation’s perception of history, the evaluation of the fundamental contributions to the allied victory in World War II.

    Whether this change in attitude was the result of “propaganda” or some less directed social process I cannot say.

    What I do find instructive is that over sixty (60) years, less than one lifetime, public perception of the “truth” can change that much.

    How much greater are the odds that the “truth” of events one hundred years ago is different from the one we hold now.

    To say nothing of the “truth” of events several thousand years ago, for which we have only a handful of reports, reports that have been edited to suit particular agendas.

    Or we have physical relics that occur at a single location, without any contemporaneous documentation, which we understand not in their ancient context but in ours.

    That should not dissuade us from writing histories, but it should make us cautious about taking action based on historical “truths.”

    I most recently saw this in a tweet by Anna Pawlicka.

    A first look at Spark

    January 24th, 2015

    A first look at Spark by Joseph Rickert.

    From the post:

    Apache Spark, the open-source, cluster computing framework originally developed in the AMPLab at UC Berkeley and now championed by Databricks is rapidly moving from the bleeding edge of data science to the mainstream. Interest in Spark, demand for training and overall hype is on a trajectory to match the frenzy surrounding Hadoop in recent years. Next month's Strata + Hadoop World conference, for example, will offer three serious Spark training sessions: Apache Spark Advanced Training, SparkCamp and Spark developer certification with additional spark related talks on the schedule. It is only a matter of time before Spark becomes a big deal in the R world as well.

    If you don't know much about Spark but want to learn more, a good place to start is the video of Reza Zadeh's keynote talk at the ACM Data Science Camp held last October at eBay in San Jose that has been recently posted.

    After reviewing the high points of Reza Zadeh's presentation, Joseph points to more than four additional hours of videos on using Spark and R together.
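
    The talks center on R, but the core abstraction is the same from any of Spark's language bindings: a resilient distributed dataset transformed by map-style operations. A minimal word-count sketch in PySpark, with the input file name assumed for illustration:

    ```python
    from pyspark import SparkContext

    # Local mode is enough to experiment; a cluster only changes the master URL.
    sc = SparkContext("local[*]", "WordCount")

    counts = (sc.textFile("notes.txt")               # hypothetical input file
                .flatMap(lambda line: line.split())  # one record per word
                .map(lambda word: (word, 1))         # pair each word with 1
                .reduceByKey(lambda a, b: a + b))    # sum the counts per word

    # Ten most frequent words.
    for word, count in counts.takeOrdered(10, key=lambda pair: -pair[1]):
        print(word, count)

    sc.stop()
    ```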

    A nice collection for getting started with Spark and seeing how to use a standard tool (R) with an emerging one (Spark).

    I first saw this in a tweet by Christophe Lalanne.