Data Preservation « Another Word For It

Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

May 21, 2018

Contrived Russian Facebook Ad Data

Filed under: Data Preservation,Data Quality,Data Science,Facebook,Politics — Patrick Durusau @ 2:16 pm

When I first read about Facebook Ads: Exposing Russia’s Effort to Sow Discord Online: The Internet Research Agency and Advertisements, a release of alleged Facebook ads by Democrats of the House Permanent Select Committee on Intelligence, I should have just ignored it.

But any number of people whose opinions I respect seem deadly certain that Facebook ads, purchased by Russians, had a tipping impact on the 2016 presidential election. At the least, I should look at the purported evidence offered by House Democrats. The reporting I have seen on the release indicates at best a skimming of the data, if it was read at all.

It wasn’t until I started noticing oddities in a sample of the data I had cleaned that the full import of the following notice struck me:

Redactions Completed at the Direction of Ranking Member of the US House Permanent Select Committee on Intelligence

That statement appears in every PDF file. Moreover, if you check the properties of any of the PDF files, you will find a creation date in May of 2018.
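If you would rather check creation dates in bulk than open each file's properties dialog, something like the following may do. It is only a sketch: it assumes the PDF stores its Info dictionary uncompressed and unencrypted, with /CreationDate written as a literal string, which is common but not guaranteed. The function name is my own invention for illustration.

```python
import re

# PDF dates look like D:YYYYMMDDHHmmSS, optionally followed by a timezone.
# Assumption: the Info dictionary is stored as plain, uncompressed bytes.
_CREATION = re.compile(rb"/CreationDate\s*\(D:(\d{4})(\d{2})?(\d{2})?")

def pdf_creation_date(raw: bytes):
    """Return (year, month, day) from the first /CreationDate entry, or None."""
    match = _CREATION.search(raw)
    if match is None:
        return None
    year = int(match.group(1))
    # Month and day are optional in the PDF date format; default each to 1.
    month = int(match.group(2)) if match.group(2) else 1
    day = int(match.group(3)) if match.group(3) else 1
    return (year, month, day)

# Usage with a hypothetical file name:
#   with open("some_ad.pdf", "rb") as handle:
#       print(pdf_creation_date(handle.read()))
```

A result of None means only that this simple scan found nothing, not that the file lacks a creation date.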

I had been wondering why Facebook would deliver ad data to Congress as PDF files. Just seemed odd, something nagging in the back of my mind. Terribly inefficient way to deliver ad data.

The “redaction” notice and creation dates make it clear that the so-called Facebook ad PDFs are wholly creations of the House Permanent Select Committee on Intelligence, not of Facebook.

I bring up that break in the data chain because, without knowing the content of the original data from Facebook, there is no basis for evaluating the accuracy of the data being delivered by Congressional Democrats. It may or may not bear any resemblance to the data from Facebook.

Rather than a blow against whoever the Democrats think is responsible, this is a teaching moment about the provenance of data. If there is a gap, such as the one here, the only criterion for judging the data is whether you like the results. If so, it’s good data; if not, it’s bad data.

Why so-called media watch-dogs on “fake news” and mis-information missed such an elementary point isn’t clear. Perhaps you should ask them.

While cleaning the data for October of 2016, my suspicions were reinforced by the following:

Doesn’t it strike you as odd that both the exclusion targets and ad targets are the same? Granting it’s only seven instances in this one data sample of 135 ads, but that’s enough for me to worry about the process of producing the files in question.
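For anyone who wants to hunt for the same oddity in another month, here is a rough sketch. It assumes you have already extracted each ad into a dictionary with "id", "targeted" and "excluded" keys. Those field names are hypothetical and do not come from the PDFs.

```python
def same_target_and_exclusion(ads):
    """Return ids of ads whose excluded audience equals the targeted audience.

    Assumption: each ad is a dict with hypothetical keys "id", "targeted"
    and "excluded", the last two holding lists of audience labels.
    """
    flagged = []
    for ad in ads:
        # Normalize case and whitespace so trivial differences don't hide a match.
        targeted = {label.strip().lower() for label in ad.get("targeted", [])}
        excluded = {label.strip().lower() for label in ad.get("excluded", [])}
        # Ignore ads with no targeting at all; two empty sets prove nothing.
        if targeted and targeted == excluded:
            flagged.append(ad["id"])
    return flagged
```

Comparing sets ignores ordering and duplicates, which seems the right call when the question is simply whether an ad excludes the very people it targets.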

If you decide to invest any time in this artifice of congressional Democrats, study the distribution of the so-called ads. I find it less than credible that August of 2017 had one ad placed by (drum roll), the Russians! FYI, July 2017 had only seven.
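Counting ads per month takes only a few lines. This sketch assumes you have pulled an ISO-style date string (YYYY-MM-DD) out of each ad, which is my assumption about your cleaned data and not a description of the PDFs.

```python
from collections import Counter

def ads_per_month(dates):
    """Return sorted (month, count) pairs from ISO date strings.

    Assumption: every entry starts with YYYY-MM, e.g. "2017-08-02".
    """
    counts = Counter(date[:7] for date in dates)
    return sorted(counts.items())
```

Months with no ads at all will not appear in the output, so compare the result against the full range of months you expect.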

Being convinced the Facebook ad files from Congress are contrived representations with some unknown relationship to Facebook data, I abandoned the idea of producing a clean data set.

Resources:

PDFs produced by Congress, relationship to Facebook data unknown.

Cleaned July, 2015 data set by Patrick Durusau.

Text of all the Facebook ads (uncleaned), September 2015 – August 2017 (missing June 2017) by Patrick Durusau. (1.2 MB vs. their 8 GB.)

Serious pursuit of any theory of ads influencing the 2016 presidential election has the following minimal data requirements:

All the Facebook content posted for the relevant time period.
Identification of paid ads and by what group, organization, government they were placed.

Assuming that data is available, similarity measures of paid versus user content and measures of exposure should be undertaken.
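As one illustration of a similarity measure, here is a bag-of-words cosine similarity between two pieces of text. It is a deliberately crude sketch, my choice for illustration and not a recommendation of the measure a real study should use. The tokenizer is an assumption as well.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Count lowercase word tokens (a simplistic, assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors, from 0.0 to 1.0."""
    counts_a, counts_b = term_counts(text_a), term_counts(text_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    # Empty text has no direction; treat it as similar to nothing.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Scoring every paid ad against a sample of user posts from the same period would give a first, rough picture of how much the paid content stood out. Exposure would have to be measured separately.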

Notice that none of the foregoing “proves” influence on an election. Those are all preparatory steps towards testing theories of influence: on whom, and to what extent?


August 17, 2017

If You See Something, Save Something (Poke A Censor In The Eye)

Filed under: Archives,Data Preservation,Data Replication,Preservation,Web History — Patrick Durusau @ 8:48 pm

If You See Something, Save Something – 6 Ways to Save Pages In the Wayback Machine by Alexis Rossi.

From the post:

In recent days many people have shown interest in making sure the Wayback Machine has copies of the web pages they care about most. These saved pages can be cited, shared, linked to – and they will continue to exist even after the original page changes or is removed from the web.

There are several ways to save pages and whole sites so that they appear in the Wayback Machine. Here are 6 of them.
…

In the comments, Ellen Spertus mentions a 7th way: Donate to the Internet Archive!

It’s the age of censorship, by governments, DMCA, the EU (right to be forgotten), Facebook, Google, Twitter and others.

Poke a censor in the eye, see something, save something to the Wayback Machine.

The Wayback Machine can’t stop all censorship, so save local and remote copies as well.
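A minimal sketch of doing both from a script follows. It assumes the Wayback Machine's "Save Page Now" address of the form https://web.archive.org/save/ followed by the page URL. That service may rate-limit or change, so treat the code as a starting point. The function names are mine.

```python
import urllib.request

# Assumption: the "Save Page Now" prefix; its behavior may change over time.
SAVE_PREFIX = "https://web.archive.org/save/"

def wayback_save_url(page_url):
    """Build the address that asks the Wayback Machine to capture page_url."""
    return SAVE_PREFIX + page_url

def request_wayback_capture(page_url):
    """Ask for a capture and return the HTTP status (performs a network call)."""
    with urllib.request.urlopen(wayback_save_url(page_url)) as response:
        return response.status

def save_local_copy(page_url, path):
    """Fetch the page and write its raw bytes to a local file."""
    with urllib.request.urlopen(page_url) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling request_wayback_capture and save_local_copy for the same URL gives you a remote and a local copy in one pass.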

Keep poking until all censors go blind.


February 10, 2015

Data as “First Class Citizens”

Filed under: Annotation,Data,Data Preservation,Data Repositories,Documentation — Patrick Durusau @ 7:34 pm

Data as “First Class Citizens” by Łukasz Bolikowski, Nikos Houssos, Paolo Manghi, Jochen Schirrwagen.

The guest editorial to D-Lib Magazine, January/February 2015, Volume 21, Number 1/2, introduces a collection of articles focusing on data as “first class citizens,” saying:

Data are an essential element of the research process. Therefore, for the sake of transparency, verifiability and reproducibility of research, data need to become “first-class citizens” in scholarly communication. Researchers have to be able to publish, share, index, find, cite, and reuse research data sets.

The 2nd International Workshop on Linking and Contextualizing Publications and Datasets (LCPD 2014), held in conjunction with the Digital Libraries 2014 conference (DL 2014), took place in London on September 12th, 2014 and gathered a group of stakeholders interested in growing a global data publishing culture. The workshop started with invited talks from Prof. Andreas Rauber (Linking to and Citing Data in Non-Trivial Settings), Stefan Kramer (Linking research data and publications: a survey of the landscape in the social sciences), and Dr. Iain Hrynaszkiewicz (Data papers and their applications: Data Descriptors in Scientific Data). The discussion was then organized into four full-paper sessions, exploring orthogonal but still interwoven facets of current and future challenges for data publishing: “contextualizing and linking” (Semantic Enrichment and Search: A Case Study on Environmental Science Literature and A-posteriori Provenance-enabled Linking of Publications and Datasets via Crowdsourcing), “new forms of publishing” (A Framework Supporting the Shift From Traditional Digital Publications to Enhanced Publications and Science 2.0 Repositories: Time for a Change in Scholarly Communication), “dataset citation” (Data Citation Practices in the CRAWDAD Wireless Network Data Archive, A Methodology for Citing Linked Open Data Subsets, and Challenges in Matching Dataset Citation Strings to Datasets in Social Science) and “dataset peer-review” (Enabling Living Systematic Reviews and Clinical Guidelines through Semantic Technologies and Data without Peer: Examples of Data Peer Review in the Earth Sciences).

We believe these investigations provide a rich overview of current issues in the field, by proposing open problems, solutions, and future challenges. In short they confirm the urgent and fascinating demands of research data publishing.

The only stumbling point in this collection of essays is the notion of data as “First Class Citizens.” Not that I object to a catchy title but not all data is going to be equal when it comes to first class citizenship.

Take Semantic Enrichment and Search: A Case Study on Environmental Science Literature, for example. Great essay on using multiple sources to annotate entities once disambiguation had occurred. But some entities are going to have more annotations than others and some entities may not be recognized at all. Not to mention it is rather doubtful that the markup containing those entities was annotated at all.

Are we sure we want to exclude from data the formats that contain the data? Isn’t a format a form of data? As well as the instructions for processing that data? Perhaps not in every case, but shouldn’t data and the formats that hold data be equally treated as first class citizens? I am mindful that hundreds of thousands of people saw the pyramids being built but we have not one accurate report on the process.

Will the lack of that one accurate report deny us access to data quite skillfully preserved in a format that is no longer accessible to us?

While I support the cry for all data to be “first class citizens,” I also support a very broad notion of data to avoid overlooking data that may be critical in the future.


November 7, 2013

Hot Topics: The DuraSpace Community Webinar Series

Filed under: Archives,Data Preservation,DSpace,Preservation — Patrick Durusau @ 7:14 pm

Hot Topics: The DuraSpace Community Webinar Series

From the DuraSpace about page:

DuraSpace supported open technology projects provide long-term, durable access to and discovery of digital assets. We put together global, strategic collaborations to sustain DSpace and Fedora, two of the most widely-used repository solutions in the world. More than fifteen hundred institutions use and help develop these open source software repository platforms. DSpace and Fedora are directly supported with in-kind contributions of development resources and financial donations through the DuraSpace community sponsorship program.

Like most of you, I’m familiar with DSpace and Fedora but I wasn’t familiar with the “Hot Topics” webinar series. I was following a link from Recommended! “Metadata and Repository Services for Research Data Curation” Webinar by Imma Subirats, when I encountered the “Hot Topics” page.

Series Six: Research Data in Repositories
Series Five: VIVO–Research Discovery and Networking
Series Four: Research Data Management Support
Series Three: Get a Head on Your Repository with Hydra End-to-End Solutions
Series Two: Managing and Preserving Audio and Video in your Digital Repository
Series One: Knowledge Futures: Digital Preservation Planning

Each series consists of three (3) webinars, all with recordings, most with slides as well.

Warning: Data curation doesn’t focus on the latest and coolest GPU processing techniques.

But, in ten to fifteen years when GPU techniques are like COBOL is now, good data curation will enable future students to access those techniques.

I think that is worthwhile.

You?


March 22, 2013

Making It Happen:…

Filed under: Data,Data Preservation,Preservation — Patrick Durusau @ 9:45 am

Making It Happen: Sustainable Data Preservation and Use by Anita de Waard.

Great set of overview slides on why research data should be preserved.

Not to mention making the case that semantic diversity, in systems for capturing research data, between researchers, etc., needs to be addressed by any proffered solution.

If you don’t know Anita de Waard’s work, search for “Anita de Waard” on Slideshare.

As of today, I am getting one hundred and forty (140) presentations.

All of which you will find useful on a variety of data related topics.


March 19, 2013

… Preservation and Stewardship of Scholarly Works, 2012 Supplement

Filed under: Archives,Bibliography,Curation,Data Preservation,Digital Library,Librarian/Expert Searchers — Patrick Durusau @ 12:46 pm

Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement by Charles W. Bailey, Jr.

From the webpage:

In a rapidly changing technological environment, the difficult task of ensuring long-term access to digital information is increasingly important. The Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, 2012 Supplement presents over 130 English-language articles, books, and technical reports published in 2012 that are useful in understanding digital curation and preservation. This selective bibliography covers digital curation and preservation copyright issues, digital formats (e.g., media, e-journals, research data), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns.

It is a supplement to the Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which covers over 650 works published from 2000 through 2011. All included works are in English. The bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings.

The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

The bibliography is available under a Creative Commons Attribution-NonCommercial 3.0 Unported License.

Supplement to “the” starting point for research on digital curation.


February 8, 2013

History SPOT

Filed under: Data Preservation,Database,History,Natural Language Processing — Patrick Durusau @ 5:15 pm

History SPOT

I discovered this site via a post entitled: Text Mining for Historians: Natural Language Processing.

From the webpage:

Welcome to History SPOT. This is a subsite of the IHR [Institute of Historical Research] website dedicated to our online research training provision. On this page you will find the latest updates regarding our seminar podcasts, online training courses and History SPOT blog posts.

Currently offered online training courses (free registration required):

Designing Databases for Historians
Podcasting for Historians
Sources for British History on the Internet
Data Preservation
Digital Tools
InScribe Palaeography

Not to mention over 300 podcasts!

Two thoughts:

First, a good way to learn about the tools historians use and the expectations they have of them. That should help you prepare an answer to: “What do topic maps have to offer over X technology?”

Second, I rather like the site and its module orientation. A possible template for topic map training online?


January 16, 2013

Research Data Curation Bibliography

Filed under: Archives,Curation,Data Preservation,Librarian/Expert Searchers,Library — Patrick Durusau @ 7:56 pm

Research Data Curation Bibliography (version 2) by Charles W. Bailey.

From the introduction:

The Research Data Curation Bibliography includes selected English-language articles, books, and technical reports that are useful in understanding the curation of digital research data in academic and other research institutions. For broader coverage of the digital curation literature, see the author's Digital Curation Bibliography: Preservation and Stewardship of Scholarly Works, which presents over 650 English-language articles, books, and technical reports.

The "digital curation" concept is still evolving. In "Digital Curation and Trusted Repositories: Steps toward Success," Christopher A. Lee and Helen R. Tibbo define digital curation as follows:

Digital curation involves selection and appraisal by creators and archivists; evolving provision of intellectual access; redundant storage; data transformations; and, for some materials, a commitment to long-term preservation. Digital curation is stewardship that provides for the reproducibility and re-use of authentic digital data and other digital assets. Development of trustworthy and durable digital repositories; principles of sound metadata creation and capture; use of open standards for file formats and data encoding; and the promotion of information management literacy are all essential to the longevity of digital resources and the success of curation efforts.

This bibliography does not cover conference papers, digital media works (such as MP3 files), editorials, e-mail messages, interviews, letters to the editor, presentation slides or transcripts, unpublished e-prints, or weblog postings. Coverage of technical reports is very selective.

Most sources have been published from 2000 through 2012; however, a limited number of earlier key sources are also included. The bibliography includes links to freely available versions of included works. If such versions are unavailable, italicized links to the publishers' descriptions are provided.

Such links, even to publisher versions and versions in disciplinary archives and institutional repositories, are subject to change. URLs may alter without warning (or automatic forwarding) or they may disappear altogether. Inclusion of links to works on authors' personal websites is highly selective. Note that e-prints and published articles may not be identical.

An archive of prior versions of the bibliography is available.

If you are a beginning library student, take the time to know the work of Charles Bailey. He has consistently made a positive contribution for researchers from very early in the so-called digital revolution.

To the extent that you want to design topic maps for data curation, long or short term, the 200+ items in this bibliography will introduce you to some of the issues you will be facing.


December 17, 2012

Data and Software Preservation for Open Science (DASPOS)

Filed under: BigData,Data Preservation,HEP - High Energy Physics,Software Preservation — Patrick Durusau @ 11:36 am

I first read about this in Preserving Science Data and Software for Open Science:

One of the emerging, and soon to be defining, characteristics of science research is the collection, usage and storage of immense amounts of data. In fields as diverse as medicine, astronomy and economics, large data sets are becoming the foundation for new scientific advances. A new project led by University of Notre Dame researchers will explore solutions to the problems of preserving data, analysis software and computational work flows, and how these relate to results obtained from the analysis of large data sets.

Titled “Data and Software Preservation for Open Science (DASPOS),” the National Science Foundation-funded $1.8 million program is focused on high energy physics data from the Large Hadron Collider (LHC) and the Fermilab Tevatron.

The research group, which is led by Mike Hildreth, a professor of physics; Jarek Nabrzyski, director of the Center for Research Computing with a concurrent appointment as associate professor of computer science and engineering; and Douglas Thain, associate professor of computer science and engineering, also will survey and incorporate the preservation needs of other research communities, such as astrophysics and bioinformatics, where large data sets and the derived results are becoming the core of emerging science in these disciplines.

Preservation of data and software semantics. Sounds like topic maps!

Materials you may find useful:

Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics (May 2012, Omitted the last 40 authors so I am omitting the first 50 authors. See the paper for the complete list.)

Data Preservation in High Energy Physics (December 2009, forerunner to the 2012 report)

DASPOS: Common Formats? by Mike Hildreth (slides, 19 November 2012)

DASPOS Overview by Mike Hildreth (slides, 20 November 2012)

Perhaps the most important statement from the 20 November slides:

A “scouting party”: push forward in what looks like a good direction without worrying about full world-wide consensus

I have participated in, seen, and read about any number of projects and, well, this is quite refreshing.

Starting a project with final answers, or developing them prematurely, is a guarantee of poor results.

Both science and the humanities explore to find answers. Why should developing standards be any different?

A great deal to be learned here, even if you are just listening in on the conversations.
