Another Word For It Patrick Durusau on Topic Maps and Semantic Diversity

January 25, 2014

Exporting GraphML from Neo4j

Filed under: GraphML,Graphs,Neo4j — Patrick Durusau @ 3:09 pm

I created a graph database in Neo4j 2.0 directly from a Twitter stream. To get better display capabilities, I wanted to export the database with neo4j-shell-tools and load the result into Gephi.

Well, the export did create an XML file. Unfortunately, not a “well-formed” one. 🙁

The first error was that the “&” character was not written with an entity. The “&” characters were in the Twitter text stream but should have been replaced upon export as XML. Michael Hunger responded quite quickly with a revision to neo4j-shell-tools to get me past that issue. (The new version also replaces < and > in the text flow. Be careful if you have markup inside processing instructions stored in a Neo4j database. Admittedly an edge case.)
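As a minimal sketch of the kind of escaping involved (this is not the neo4j-shell-tools code, which is written in Java), Python's standard library can replace the three characters before text is written into an element. The helper name `xml_text` and the sample tweet are made up for illustration.

```python
from xml.sax.saxutils import escape


def xml_text(value: str) -> str:
    """Escape &, < and > so the value is safe as XML character data."""
    # escape() handles & before < and >, so the ampersands it
    # introduces itself are not escaped a second time.
    return escape(value)


tweet = "R&D <3 graphs > tables"  # hypothetical tweet text
print(f'<data key="text">{xml_text(tweet)}</data>')
# <data key="text">R&amp;D &lt;3 graphs &gt; tables</data>
```

Escaping at write time like this is blind to context, which is why markup deliberately stored in the database gets escaped along with everything else.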

A problem that remains unresolved is that the GraphML export file carries a UTF-8 declaration but in fact contains high ASCII characters.
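One way to locate this kind of mismatch, sketched here under the assumption that the export is small enough to read into memory, is to decode the raw bytes as UTF-8 and record where decoding fails. The function name `find_invalid_utf8` and the file name in the usage note are hypothetical.

```python
def find_invalid_utf8(data: bytes, limit: int = 10) -> list[tuple[int, int]]:
    """Return (offset, byte value) for up to `limit` bytes that break UTF-8 decoding."""
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start  # err.start is relative to the slice
            problems.append((bad, data[bad]))
            pos = bad + 1  # skip the offending byte and keep looking
    return problems


sample = b"ok \x80 and \xdc"
print(find_invalid_utf8(sample))
# [(3, 128), (9, 220)]
```

For a real file, something like `find_invalid_utf8(open("export.graphml", "rb").read())` would give byte offsets to inspect. Skipping a single byte after each failure is a simplification, so a truncated multi-byte sequence may be reported more than once.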

Here are four examples that are part of what I posted to the Neo4j mailing list. Each example is preceded by an XML comment about the improper character at that node.

<code><!-- Node n16, see "non SGML character number 128_" immediately following "BBSeedfund" -->
<node id="n16" labels="User" > @SBSSeedfund • Looking into …</data></node>
<!-- Node n26 - "ÜT" non SGML character number 156 - special ASCII character -->
<node id="n26" labels="User" ><data key="labels">…<data key="location">ÜT: 51.450038,6.802151</data>…</node>
<!-- Node n35 - ≠ non SGML character number 137 -->
<node id="n35" labels="User" >… RT ≠ endorsement</data>…</node>
<!-- Node n58 - ™ non SGML character number 132 -->
<node id="n58" labels="User" >CONFERENCE™ is the …</data></node>
</code>

One solution is to open the file in an XML editor and use search/replace to eliminate the offending characters.

A better solution is to grab a copy of HTML Tidy for HTML5 (experimental) and use it to eliminate the high ASCII characters.

HTML Tidy converts high ASCII into entities, so you will have some odd-looking display text.

I used a config.txt file with the following settings:

input-encoding: ascii
output-xml: yes
input-xml: yes
show-warnings: yes
numeric-entities: yes

I set input-encoding: ascii because the UTF-8 encoding declaration from Neo4j isn’t correct. And with that setting, HTML Tidy automatically replaces high ASCII with entities.
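The effect of those settings can be approximated in a few lines of Python. This is a rough sketch and not what HTML Tidy does internally: it assumes every byte above 127 should be read as a single Latin-1 character and written out as a numeric character reference. The function name `high_bytes_to_char_refs` is made up for illustration.

```python
def high_bytes_to_char_refs(data: bytes) -> bytes:
    """Read each byte as a Latin-1 character and write anything above 127 as &#NNN;."""
    # Latin-1 maps every byte value to exactly one code point, so this never fails.
    text = data.decode("latin-1")
    # Non-ASCII characters become decimal character references; ASCII is untouched.
    return text.encode("ascii", errors="xmlcharrefreplace")


sample = b'<data key="location">\xdcT: 51.450038,6.802151</data>'
print(high_bytes_to_char_refs(sample).decode("ascii"))
# <data key="location">&#220;T: 51.450038,6.802151</data>
```

Because each byte is handled on its own, a character that was stored as a multi-byte UTF-8 sequence comes out as several separate references, which is one source of odd display text after this kind of cleanup.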

That made the file acceptable to Gephi.
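Before handing a cleaned file to Gephi, a quick well-formedness check can save a round trip. The following is a minimal sketch using Python's standard XML parser; `is_well_formed` is a hypothetical helper, and it only tests well-formedness, not conformance to the GraphML schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> tuple[bool, str]:
    """Return (True, "") if the bytes parse as XML, else (False, parser message)."""
    try:
        ET.fromstring(data)
        return True, ""
    except ET.ParseError as err:
        # The message includes the line and column of the first problem.
        return False, str(err)


print(is_well_formed(b"<node>R &amp; D</node>")[0])  # True
print(is_well_formed(b"<node>R & D</node>")[0])      # False: bare ampersand
```

Passing raw bytes rather than a decoded string lets the parser honor the file's own encoding declaration, so bytes that contradict a UTF-8 declaration are reported as errors too.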

While I understand Neo4j being liberal in terms of what it accepts for input, it needs to work on exporting well-formed XML.
