This is the first in an ongoing series of technical blogs from the DocMaps Project, a foundational internet-infrastructure effort aimed at improving trust and velocity in distributed research ecosystems.
Preprint articles are rapidly becoming the first line of impact for cutting-edge science. The opportunity for researchers to move more quickly, by building on extremely recent discoveries that have not gone through journal processes, is undermined by an equally fast rise in untrustworthy science generated by disreputable journals and artificial intelligence. DocMaps is a community-endorsed framework for capturing valuable context about the processes used to create documents in a machine-readable way. With DocMaps, we hope to improve the systemic clarity around preprint science and enable rapid, trustworthy discovery and reporting about these documents.
Docmaps are gaining traction among preprint servers and aggregators dedicated to enabling open science, with endorsements and integrations from platforms including eLife/Sciety, bioRxiv, EMBO, and CSHL. As large preprint repositories automate their creation of Docmaps, every other participant in open science stands to benefit more and more from adopting the protocol too. Some of these organizations, including Knowledge Futures Inc, custodian of the Project, are able to dedicate ongoing development effort to keeping shared tooling up to date with the needs of the community.
DocMaps is open-source, and built on modern web technologies — and that goes as well for all the software needed to create, consume, and distribute DocMaps. Although journals are likely to benefit from adopting DocMaps, they are not in an exclusive position to control the next generation’s vehicle for trust in science.
DocMaps is a lightweight protocol. Its core is defined as a standard vocabulary of terms and reference structures for representing complexity. Its native data format is JSON-LD, which can interoperate with an underlying graph-like datastore or with a JSON-based tabular data store, so heterogeneous collaborators can communicate with the same Docmaps naturally.
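As a minimal sketch of what that interoperability looks like in practice (assuming Python with the rdflib library, which bundles a JSON-LD parser, and a hypothetical docmap.jsonld file), the same document can be consumed either way:

import json
from rdflib import Graph

raw = open("docmap.jsonld").read()          # hypothetical docmap file

as_json = json.loads(raw)                   # plain-JSON / tabular consumers
as_graph = Graph().parse(data=raw, format="json-ld")  # graph consumers

print(as_json["type"])                      # "docmap"
print(len(as_graph))                        # number of RDF triples parsed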
Why build this new open system? Historically, the peer review system has appeared sufficiently trustworthy, reliable, scalable, and pollution-resistant. Whatever strength this systemic reliability ever had, it is fast eroding.
The first tide to rise here has been the emergence, in the last decade or so, of predatory journals. These journals exploit the complicated incentive structures inherent in professionalized, competitive research. The challenges of detecting predatory publication behaviors have been the subject of study.
Second, there is growing evidence of partisan interference in science, funded by political or private parties to produce the appearance of evidence supporting their worldview or otherwise influence people. Incomplete processes may be promoted as final results; phony publications can be debunked or disavowed in a public forum without any discoverable and durable record of that dissent.
Third, it is projected that increasingly sophisticated Large Language Models will be able to cheaply generate scientific publications that “look good” and can be rubber-stamped by inattentive or corrupt peer reviewers/journals.
In all these cases, peer review and scientific consensus can be thought of as a classical Byzantine adversarial problem. A thorough theoretical threat analysis from this perspective can be a topic for future writing if this audience would find it entertaining or instructive.
Finally, there are recently emerging domains where the slow iterations that journal publications typically undergo are simply too slow. Researchers who rely on cutting-edge research are already going directly to preprints for their field’s latest results. Mechanisms for preprint peer review are largely bespoke and do not interoperate (unless they are built with docmaps). These researchers could move faster with confidence if efforts to reproduce results or analyze publications were shared in an interoperable way. What would it take to filter the chaos surrounding the recent superconductor preprint and establish an easily-consumed public record of the reasons for or against crediting the results?
The basic goal of a docmap is to record the chain of custody of a scientific document. It can record all the steps taken throughout the document’s life, whether very granular (every draft, every comment) or simply the milestones “Preprint Posted” and “Journal Article Published.”
As part of reading a Step in a docmap, you can also learn that the step caused a change in the document’s status. For example, one may claim that due to the addition of reviews, the document’s status is now Peer Reviewed.
And more: you can pass a docmap through an open-source visualizer tool and see a concise timeline of your document with relevant steps and statuses acquired. You can reformat the relevant metadata for your own purposes, such as presentation or creating search indexes on the corresponding document.
Most importantly, because a docmap is machine-readable, it is a vehicle for more complex computation about the processes that created documents.
Imagine a scenario where we have:
• a preprint aggregator and search engine,
• two preprint servers, each of which has a process for some quality assurance,
• an editorial journal that provides peer review of preprints and republishes them as journal articles.
In this situation, any of the parties may offer DocMaps data about preprints, to each other or to an outside consumer. As a researcher, you might be searching the aggregator for preprints that have been peer reviewed. Not only can the aggregator use the docmaps to build efficient search algorithms that can sort by the quality metrics given by the servers, and filter by peer-reviewed status, but they can even serve the docmap to you directly as a form of “bill of custody”, perhaps simplified as a timeline showing you exactly what was meant by “peer-reviewed” in the case of a given article.
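As an illustration only (field names follow the concrete example later in this post; the “peer-reviewed” status string is hypothetical, since statuses are defined by the asserting parties), such a filter might look like:

def is_peer_reviewed(docmap: dict) -> bool:
    # Walk each step of the docmap and look for an assertion of
    # peer-reviewed status on any item.
    for step in docmap.get("steps", {}).values():
        for assertion in step.get("assertions", []):
            if assertion.get("status") == "peer-reviewed":
                return True
    return False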
Let’s assume you only visit the webpage of the aggregator while doing this search. This is how you interact with the trust posture of the document and its docmap. You are trying to be convinced of these kinds of statements:
a) The article I’m looking at actually did get the quality score assigned by Preprint Server 1
b) The peer reviews contained commentary that reasonably constitute a peer review
c) The peer reviews were really submitted in relation to this article
d) The peer reviews are from a reputable source
e) The peer review events were approving the article, not requesting changes
f) The article has not been tampered with since these events occurred
First, you know from the green lock icon in your browser that your connection to the aggregator is secure. That means you can trust that the claims you see were transmitted from the specific server/host/website you are visiting: trust is tightly coupled to message transmission. The aggregator knows this, and so either takes responsibility for all these statements and other docmaps metadata, or disclaims them by stating where they came from and telling the reader to do their own due diligence about the source.
Neither of these options is very inviting to an aggregator. To take responsibility for the content, they need to build a robust system for validating all the content they ever ingest and serve to visitors, and limit their sources of data to trustworthy partners. This “the buck stops here” approach is preferable for visitors and is called a “centralized trust” or permissioned trust system. It resembles conventional peer review, and is why we are seeing the issues described above. In reality, many of these claims are not even robustly accounted for, leaving significant attack surface if anyone could find a clear way to benefit from such an attack.
Without public key cryptographic extensions to docmaps, the second option — delegating to the reader the responsibility of verifying every claim — is intractable. In many cases, readers ignore the fine print and operate as though their data source is authoritative regarding its hosted content. Although DocMaps has no ambition to solve the problem of platform responsibility which is currently under substantial public debate, the project does aim to provide efficient means of distinguishing the author of a claim from an intermediary who passes it on.
The key contribution of an integrated cryptosystem extension to docmaps is that proofs, which in this case are probably signatures by the authority making the claim, can be attached to every individual claim in a tamper-proof way. This provides three key benefits.
First, many of the statements of interest can be proven computationally without any user input, including statements {c, e, f} above.
Second, based on a preconfigured trusted identity whitelist (either a well-known one or one tailored to platform requirements), the claims asserted by the editorial journal can be automatically accepted (or rejected, if you discredit that group). This covers statements {a, d} above. Though this idea of a trusted identity whitelist may sound like a big assumption, it is no more demanding than the already ubiquitous Public Key Infrastructure around DNS and Root Certificate Authorities (CAs). In fact, the system requirements for docmaps are less strict than those for a CA system.
More generally, decisions about trust can be made at the point in time when sense is made of the claims. Sense is normally made by the final reader, and DocMaps enables the final reader to make trust decisions, including merging details into the document history from a source that the preprint server or aggregator would rather have left out for corrupt reasons. Sense may also be made by the aggregator: if they are building an index of peer-reviewed preprints, they have to determine which signatories to accept for purposes of adding that label. Sense may be made by the preprint servers, if they want to offer a default trust document that includes information from sources with whom they have partnerships.
Docmaps, when cryptographic mechanisms are introduced at the level of individual and durable claims, enable all these cases. For example, one consequence of this general decoupling of content from message is that the aggregator, if they don’t undertake the interpretive burden of labeling articles as peer-reviewed, can safely pass docmaps data received from heterogeneous sources on to their visitors. The visitors are empowered to filter or make sense of that data. It may turn out that this granular verification is more burdensome than useful, and that centralized trust authorities are valuable in the ecosystem; if so, the decoupling enables community leaders or aggregators to assume that role without forcing them to do so.
Note that claim {b} above was not addressed. Importantly, that claim includes subjective language that is out of scope of the kind of cryptographic proving I am proposing to integrate. However, I invite you to imagine mechanisms that could build on an integrated cryptosystem in DocMaps which can handle this subjectivity, which will be the subject of a future post in this series.
The full specification for the system I propose will be forthcoming in the RFCs for the Project. However, I will lay out here the foundational design thinking involved, including how these planned extensions influence the design and best practices of DocMaps today.
Blank nodes & signed graphs
The really interesting problem germane to this blog is how to deal with signature schemes across graph data. Docmaps data is natively stored in RDF (Linked Data), which represents a directed and labeled graph. A powerful and complicating feature of RDF and JSON-LD (see above) is its allowance for “blank nodes”: ephemeral graph nodes that represent the existence of some node at a certain juncture in a graph when we don’t know anything else about it. Instead of having permanent identifiers, they are assigned ephemeral names, which start with the marker _: . When a node lacks a stable identifier, it is given a blank node identifier as needed, but that identifier is meaningless out of context. For example, a query engine for RDF data might have to include blank nodes in each query response, numbering them in each response from _:b1, _:b2 on up, even though these identifiers refer to a different set of nodes in each query you sent. This gets especially confusing because blank nodes can have identifiers that look meaningful: problems arise, for example, if you assume _:alice always refers to the same thing.
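To make this concrete, consider this toy JSON-LD document (FOAF vocabulary, not docmaps; my own illustration), whose nested object carries properties but has no @id:

{
  "@context": {
    "name": "http://xmlns.com/foaf/0.1/name",
    "knows": "http://xmlns.com/foaf/0.1/knows"
  },
  "name": "Alice",
  "knows": { "name": "Bob" }
}

An RDF parser must mint blank node labels to express this as triples, for example:

_:b0 <http://xmlns.com/foaf/0.1/name> "Alice" .
_:b0 <http://xmlns.com/foaf/0.1/knows> _:b1 .
_:b1 <http://xmlns.com/foaf/0.1/name> "Bob" .

Another parse of the very same document could legitimately emit _:b7 and _:b42 instead; the labels carry no meaning outside a single response.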
Well-documented problems arise when applying graph algorithms where blank nodes are present [1]. The problem of signing graphs or sub-graphs with blank nodes has been studied specifically [2]. In general, the hardness of problems on RDF data reduces to the hardness of those problems on a graph of blank nodes present in that data.
For our purposes, I would illustrate the issue like this: to validate a signature against a piece of data, we need to show that the data in question matches, bit-for-bit, the sequence of bits that was signed. But graph data has no natural representation as a sequence. Algorithms exist for ordering known graph components into a reproducible sequence; however, when there are no restrictions on blank nodes, no known algorithm can do this canonicalization & serialization in polynomial time.
But wait! It is trivial to create such an algorithm for graphs with the constraint that there is only one blank node. (I invite the reader to try.) It turns out that the computational infeasibility of a graph canonicalization and serialization algorithm stems from certain possible arrangements of densely connected groups of blank nodes. If we can assume these shapes do not occur in subgraphs we wish to sign and verify — OR we can prevent them from happening entirely — subgraph signing becomes computationally feasible.
Empirically, I have not found evidence of these complicating cases in docmaps data. Part of the RFC for this cryptosystem will include definitive remarks as to whether they can be prevented entirely.
So let’s examine an example of docmaps data. Here is the JSON-LD; we will view the same content as an RDF graph below:
{
  "@context": "https://w3id.org/docmaps/context.jsonld",
  "type": "docmap",
  "id": "https://docmaps-project.github.io/ex/docmap_for/10.1002/essoar.10500703.1",
  "publisher": {
    "id": "https://docmaps-project.github.io/ex/publisher/unspecified"
  },
  "created": "2023-07-06T20:08:51.687Z",
  "updated": "2023-07-06T20:08:51.687Z",
  "first-step": "_:b-4570fe1a504059297222b9911079be94166763cc",
  "steps": {
    "_:b-4570fe1a504059297222b9911079be94166763cc": {
      "inputs": [
        {
          "published": "2019-02-13T00:00:00.000Z",
          "doi": "10.1002/essoar.10500703.1",
          "type": "preprint"
        }
      ],
      "actions": [
        {
          "participants": [
            {
              "actor": {
                "type": "person",
                "name": "Ekelund, Robin"
              },
              "role": "author"
            },
            {
              "actor": {
                "type": "person",
                "name": "Eriksson, Patrick"
              },
              "role": "author"
            }
          ],
          "outputs": [
            {
              "published": "2019-02-01T00:00:00.000Z",
              "doi": "10.1016/j.jqsrt.2018.11.013",
              "type": "journal-article"
            }
          ]
        }
      ],
      "assertions": [
        {
          "status": "published",
          "item": "10.1016/j.jqsrt.2018.11.013"
        }
      ]
    }
  }
}
In this docmap, we have a single Step, pointing to a single Input and Action. Note the abundance of blank nodes, which are the unlabeled nodes. This is typical in graphy serializations of JSON-LD, where many objects have sub-objects that have properties but aren’t named (for example, the assertions key in the docmap points to “something with a status and an item,” but that something has no ID). Notice that the blank nodes generally form trees rather than cycles, which is preferable from the computational standpoint discussed above.
First of all, what would we like to sign?
It is most useful to create signatures for data we can commit to. A record of an event in the past is a good candidate. Let’s try to canonicalize and serialize the graphy contents of the action. (Note that actions being blank nodes is currently common, but it doesn’t need to be; indeed, I will be recommending assigning IRIs (identifiers) to Action nodes as a further best practice.) The sub-graph containing just the data in the actions field has six blank nodes, and I have included an assignment of identifiers to them below; keep in mind that a graph that shuffles all the blank node labels must be recognizable to analytical software, such as the signature scheme, as the same graph as this one. This assignment was arbitrarily given during graph construction on my laptop. Also, note that some predicates (edge labels) are omitted for readability.
We can sign any in-memory representation of this graph that can be definitively tested for equality with another representation of the graph. Since RDF is natively represented as triples, we can try working with that format. Here is the subset of triples that describe this graph from the action on down (with prefixes expanded):
_:b2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:b2 <http://xmlns.com/foaf/0.1/name> "Ekelund, Robin" .
_:b3 <http://purl.org/spar/pro/isHeldBy> _:b2 .
_:b3 <http://purl.org/spar/pro/withRole> <http://purl.org/spar/pro/author> .
_:b4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:b4 <http://xmlns.com/foaf/0.1/name> "Eriksson, Patrick" .
_:b5 <http://purl.org/spar/pro/isHeldBy> _:b4 .
_:b5 <http://purl.org/spar/pro/withRole> <http://purl.org/spar/pro/author> .
_:b6 <http://prismstandard.org/namespaces/basic/2.0/publicationDate> "2019-02-01T00:00:00.000Z"^^<http://www.w3.org/2001/XMLSchema#date> .
_:b6 <http://prismstandard.org/namespaces/basic/2.0/doi> "10.1016/j.jqsrt.2018.11.013" .
_:b6 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/spar/fabio/JournalArticle> .
_:b7 <http://purl.org/spar/pro/isDocumentContextFor> _:b3 .
_:b7 <http://purl.org/spar/pro/isDocumentContextFor> _:b5 .
_:b7 <http://purl.org/spar/pwo/produces> _:b6 .
This is a nice serialization of this graph, but it isn’t the “only” one, because the triples can appear in any order and the blank node labels can be reassigned arbitrarily. Can we fix this? One naive algorithm for canonicalizing these triples might be: sort all the triples bitwise as if all the blank node labels were removed; then relabel all blank nodes starting from _:b0 in order of appearance in the text.
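Here is a minimal sketch of that naive procedure in Python (my own illustration, not part of any DocMaps tooling):

import re

BLANK = re.compile(r"_:[^\s]+")  # matches blank node labels like _:b7

def naive_canonicalize(ntriples: str) -> str:
    lines = [line for line in ntriples.splitlines() if line.strip()]
    # Sort triples bitwise as if the blank node labels were removed...
    lines.sort(key=lambda line: BLANK.sub("_:", line))
    # ...then relabel blank nodes _:b0, _:b1, ... in order of appearance.
    mapping = {}
    def relabel(match):
        label = match.group(0)
        if label not in mapping:
            mapping[label] = "_:b%d" % len(mapping)
        return mapping[label]
    return "\n".join(BLANK.sub(relabel, line) for line in lines)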
That algorithm runs in O(m * n * log(n)) with n triples of length bounded by m. Such an algorithm doesn’t actually work in all cases; a correct algorithm is more extensive and runs in O(m * n^2) in the ideal case that, like our example, the data under question is a tree. In fact, it does not even canonicalize our small example correctly; there are multiple valid renderings of our graph by this algorithm. (As an exercise, consider why this is. See if you can convince yourself that no algorithm can exist which canonicalizes a graph’s triple form in O(m * n * log(n)).) However, let us pretend for concision that this is a correct algorithm, producing this canonical form:
_:b0 <http://purl.org/spar/pro/isDocumentContextFor> _:b1 .
_:b0 <http://purl.org/spar/pro/isDocumentContextFor> _:b2 .
_:b2 <http://purl.org/spar/pro/isHeldBy> _:b3 .
_:b1 <http://purl.org/spar/pro/isHeldBy> _:b4 .
_:b0 <http://purl.org/spar/pwo/produces> _:b5 .
_:b1 <http://purl.org/spar/pro/withRole> <http://purl.org/spar/pro/author> .
_:b2 <http://purl.org/spar/pro/withRole> <http://purl.org/spar/pro/author> .
_:b5 <http://prismstandard.org/namespaces/basic/2.0/doi> "10.1016/j.jqsrt.2018.11.013" .
_:b5 <http://prismstandard.org/namespaces/basic/2.0/publicationDate> "2019-02-01T00:00:00.000Z"^^<http://www.w3.org/2001/XMLSchema#date> .
_:b5 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://purl.org/spar/fabio/JournalArticle> .
_:b3 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:b4 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
_:b4 <http://xmlns.com/foaf/0.1/name> "Ekelund, Robin" .
_:b3 <http://xmlns.com/foaf/0.1/name> "Eriksson, Patrick" .
With a canonical form like this, it is easy to produce a consistent hash reduction: the SHA256 checksum is c979514d294f59d6c740a20c0a3465161e3878ba11eca26f8e50fb66ce675cd7. That is easy enough to sign: M = SIGN(private_key, sha256(canonicalize(graph.nt))). To ask whether a given identity has signed a graph under question, with access to this signature we can then verify: ACCEPT? = ( sha256(canonicalize(another_graph.nt)) == VERIFY(public_key, M) ).
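For concreteness, here is a minimal sketch of that sign/verify flow using Ed25519 from the Python cryptography library (the key pair, the file name, and the reuse of naive_canonicalize from above are illustrative assumptions; note that Ed25519 verification checks a signature against the message rather than recovering the hash, so the acceptance test is structured slightly differently from the pseudocode):

import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def digest(canonical_ntriples: str) -> bytes:
    # sha256(canonicalize(graph.nt)) from the pseudocode above
    return hashlib.sha256(canonical_ntriples.encode("utf-8")).digest()

private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

canonical = naive_canonicalize(open("action_subgraph.nt").read())  # hypothetical file
M = private_key.sign(digest(canonical))  # M = SIGN(private_key, sha256(...))

# ACCEPT?: verify() raises InvalidSignature if the data doesn't match M
try:
    public_key.verify(M, digest(canonical))
    accept = True
except InvalidSignature:
    accept = False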
In the situational example with an editorial journal, aggregator, and preprint servers, the aggregator may wish to make the entire docmap in our concrete example available, but the journal may be the party who takes responsibility for the assertion of the Action that produces a Journal Article. (Note that the eventual docmap would most likely have multiple steps, not just the one step in the example.) Under the cryptographically extended DocMaps protocol, the Action node might have these additional fields:
{ "participants": ... ,
"outputs": ... ,
"integrity": {
"sha256": "c979514d294f59d6c740a20c0a3465161e3878ba11eca26f8e50fb66ce675cd7",
"signature": <M>,
"public_key": "ed25519 ...."
}
}
With this information included, the aggregator can now decide whether the public key matches an identity on record of a trusted party before indexing the manuscript into its “peer-reviewed” search. Likewise, the final reader may have their own trusted public key list, and would be able to disregard the “peer-reviewed” status if they felt the journal was disreputable.
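That decision could be as small as the following sketch, where the whitelist contents and the verify_signature helper (wrapping the canonicalize/hash/verify steps sketched earlier) are hypothetical:

# Hypothetical whitelist: public keys of parties this aggregator accepts
# as authorities on peer review.
TRUSTED_KEYS = {
    "ed25519 AAAA....",  # e.g., the editorial journal's published key
}

def accept_peer_review(action: dict, verify_signature) -> bool:
    integrity = action.get("integrity")
    if integrity is None:
        return False  # unsigned actions are not indexed as peer-reviewed
    if integrity.get("public_key") not in TRUSTED_KEYS:
        return False  # signed, but not by an identity we trust
    return verify_signature(action, integrity)  # cryptographic check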
Note that the exact presentation of this information which I will recommend in the RFC is still under research. It is likely to be more elaborate than this. For example, the assertions in a step will most likely be more interesting than the actions alone from an integrity point of view; and the assertions may rely on multiple actions, giving rise to a need for some way to bridge actions with integrity information into more abstract claims to verify.
This is an area I am still actively researching in the context of DocMaps data. So far I have not yet seen problematically structured data, but it is easy to imagine. For example, if the RDF dataset is not a pure docmaps dataset but heterogeneous, the foaf:Person nodes attached to the action may have other edges to other things the dataset knows about them. Carefully constructed queries will be required to choose which statements to include in a signature, and to avoid including cycles of blank nodes that might be attached to those nodes.
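As one sketch of such a query (using the rdflib library; my own illustration, not DocMaps tooling), we can extract only the tree rooted at an action node, refusing to re-expand nodes we have already visited so that blank-node cycles are not silently swept into the subgraph we intend to sign:

from rdflib import Graph, BNode

def extract_tree(g: Graph, root) -> Graph:
    sub = Graph()
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # a revisit means a cycle or shared node; stop descending
        visited.add(node)
        for p, o in g.predicate_objects(node):
            sub.add((node, p, o))
            if isinstance(o, BNode):
                stack.append(o)  # only follow edges into blank nodes
    return sub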
Luckily, it is very ergonomic in JSON-LD to express structures with trees of blank nodes, and less so to express other structures. To create structures in JSON that aren’t trees, you have to refer to elements by name, and referring to blank nodes by name is unusual. (It is, however, currently common practice with step names, which are typically blank nodes.) A change to this pattern will be part of the RFC, and as an interim measure I currently recommend using IRIs to refer to steps.
How can we address these? There are several possible approaches:
• standards for structuring docmaps (the worst choice; flexibility matters),
• coupling signatures to the lowest-level elements possible,
• disallowing signatures on graphs with blank nodes in certain situations,
• extracting subgraphs, signing them in a standard encoded form, and attaching that signed sub-graph to the main graph.
In the meantime, there are best practices you can adopt today:
• Start assigning stable IRIs to things you might later want to sign. IRIs should be URLs, but can simply include a GUID. More info about IRIs and docmaps servers is in the API Server RFC.
• Start assigning IRIs to Steps, even if they are unstable.
• Keep your docmaps data as tree-like as possible. Avoid using inverse RDF statements to the ones named in the DocMaps @context document (i.e., be consistent about using pwo:hasStep rather than pwo:isStepOf).
• Use a consistent structure for elements you might wish to sign.
• Avoid extremely deep trees of data where possible.
• Use versioning: keep old entities around, but create new nodes in your graph or database.
PKI management
Any Public-Key Infrastructure (PKI) system involves the use of secret keys in the possession of someone wishing to make a tamper-proof, attributable claim. Institutions issuing docmaps data at scale will maintain one or more of these keys, and the compromise of such a key would enable an attacker to make claims that appear to be authentic (“this article is peer reviewed”). There are several strategies to mitigate this risk. At a design level, it should be expected that signing keys have a short enough lifetime to require regular rotation, and bit rot due to very-long-lived signing keys should be accounted for and presented within the docmap. In the event of a major key breach, a docmaps datastore should expect to re-issue signatures for old data that is still valid.
Design of a flawless PKI is out of scope of this project, but choice of a good scheme that balances the various needs and risks of the project will inform the forthcoming RFC’s recommendation of how identities are managed.
Drift towards centralization
While DocMaps is designed to work in an unpermissioned, decentralized way, managing large numbers of identities and signing keys, as well as making decisions about whom to trust, is laborious. The majority of DocMaps readers are unlikely to want to customize trust settings and may prefer to delegate their trust configuration to a general “consensus” or a specific trust authority. In this situation, the de facto trust infrastructure may come to resemble today’s global TLS Certificate Authority situation, where a small number of technocrats control the keys to the kingdom (and global secure browsing is possible because of it).
Echo chamber problem
If trust delegation becomes widespread, and not one but a small number of bodies emerge as competing standards, this type of integrity checking could be exploited to turn insular communities further inward. Although DocMaps can’t prevent the flow of information, it may become easier to “opt out” of valuable content and participate in the creation of competing realities. However, I consider this to be a fundamental risk of empowering people, and I enthusiastically invite discussion as to how best to balance it.
Is this compatible with anonymous peer review?
Yes. Peer review groups may still take responsibility for anonymizing their reviewers and attesting to the reviews as a collective. Furthermore, recent advancements in cryptography even permit individuals to sign documents and prove statements about themselves, such as possession of a certain credential or membership in a group, while remaining otherwise unidentifiable.