Technical review and vision at the end of year one
This winter, the DocMaps Project full-time team will scale back and go into maintenance mode. The Project’s open-source repositories will remain available on GitHub and NPM and will continue to accept contributions. We actively welcome organizations invested in DocMaps who would like to take over stewardship of the roadmap.
This post primarily serves as a discussion of my observations and assessment of the preprint science community’s infrastructural needs with regard to DocMaps, and the important themes around which I recommend organizing future work on the Project.
But first, how I think about this year’s progress:
We coordinated an RFC-based protocol governance model. The RFC repository can serve as a site for public discussion about changes to vocabulary, paradigms, and abstractions built on top of the minimal DocMaps exchange format specification.
We built a Typescript SDK and parsing/utility library! This library has been announced elsewhere. For today’s purposes, the important thing is that the library establishes a shared source of truth about how to use DocMaps in Typescript. It is still the case that multiple languages and different custom encoders are in use at different shops that deal with docmaps. However, this library sets a new standard for interoperability. The more we can all come to use the same core libraries for basic encoding work, the less ambiguous and more effortlessly aligned our DocMaps work will become.
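To make the interoperability point concrete, here is a minimal TypeScript sketch of the kind of structural check a shared library centralizes. The type shapes and the `looksLikeDocmap` function are simplified assumptions invented for this post, not the SDK’s actual API.

```typescript
// Hypothetical, simplified shapes for illustration only -- not the
// real @docmaps/sdk types or codecs.
interface DocmapAction {
  outputs: { type: string; doi?: string }[];
}

interface Docmap {
  type: string; // expected to be "docmap"
  publisher: { name: string };
  steps: Record<string, { actions: DocmapAction[] }>;
}

// A minimal structural check, standing in for a real codec's decode():
// every consumer applying the same check gets the same yes/no answer.
function looksLikeDocmap(value: unknown): value is Docmap {
  const v = value as Docmap;
  return (
    typeof v === "object" && v !== null &&
    v.type === "docmap" &&
    typeof v.publisher?.name === "string" &&
    typeof v.steps === "object" && v.steps !== null
  );
}

const sample = {
  type: "docmap",
  publisher: { name: "Example Review Group" },
  steps: { "_:b0": { actions: [{ outputs: [{ type: "review" }] }] } },
};

console.log(looksLikeDocmap(sample)); // true
console.log(looksLikeDocmap({}));     // false
```

When every shop routes incoming JSON through the same validation entry point, disagreements about what counts as a well-formed docmap surface in one place instead of in each consumer’s parser.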
We wrote a recommendation for an HTTP-based API for servers to interoperate. This spec describes how dataset owners can surface multiple verbs and endpoints together for further seamless interoperation. We have an in-progress reference implementation that is built on top of SPARQL-based graph datastores.
We designed and built a WebComponents widget that visualizes a DocMap, and which can be embedded on any website that wants to give viewers at-a-glance insight into the amount of editorial and peer review activity that has been done on a document. This widget is designed to work out-of-the-box with all servers conforming to the API spec, and can be made to work with nonconforming servers.
Four infrastructure organizations — CSHL, eLife, EMBO, and ePMC — are operating DocMaps technology, supporting a dozen or more review groups and orgs in issuing DocMaps content relating to thousands of preprint artifacts in their daily work.
A pathway forward for DocMaps, given an amorphous community of stakeholders, needs to remain grounded in the goals and purposes of the Project, even as the needs of stakeholders change. My recommendations for where to go next therefore fall into two parts: how we can make the tools work better for the known aspects of the problem space, and what we have learned about emerging and forthcoming demands on the Project’s capabilities.
I frame my thinking on this project around the concept of separating “signal” from “noise”. A major theme of DocMaps is to enable preprint science that is both fast and trustworthy. This is complicated in an AI world where (almost) any imaginable document can be presumed to exist, and therefore the existence of any document is not in itself significant. In signal-processing terms, this pollution is an explosion of the noise term relative to the signal. The key is to create documents that are highly unlikely to exist unless significant findings have occurred. This depends on sources of trust outside the realm of computation, and on computable programs that can operate on arguments made by trusted parties.
Other projects are thinking in similar terms.
I have written more extensively about this theme in a previous post. Concisely, I argue that DocMaps themselves need to support offline authenticated message passing and verification. We currently delegate all authenticity information to the HTTPS layer, which means that a DocMap, or part of one (such as a description of a single publication event), can never be taken away from its point of origin and remain integral and trustworthy. This problem is typical of all publishing platforms, and addressing it is a substantial value proposition for DocMaps. Since cryptographic signatures are defined in part by the improbability of their arising from noise, they are good primitives for this work. An AI can never forge a believable cryptographic signature.
A related roadmap item for public comment is here.
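As a sketch of the primitive involved, the following TypeScript uses Node’s built-in Ed25519 support to sign and verify the bytes of one invented docmap step entirely offline. In practice the payload would need a canonical serialization (e.g. JCS or URDNA2015); plain `JSON.stringify` stands in here, and the step contents are made up.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// A publisher's keypair; readers would hold only the public key.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// Stand-in for a canonicalized serialization of one docmap step.
const step = JSON.stringify({
  actions: [{ outputs: [{ type: "review", published: "2023-11-01" }] }],
});

// Ed25519 signs the raw message directly (no separate digest step).
const signature = sign(null, Buffer.from(step), privateKey);

// Any reader holding the public key can verify the claim without
// contacting the publisher; an altered payload fails verification.
const ok = verify(null, Buffer.from(step), publicKey, signature);
const tampered = verify(null, Buffer.from(step + " "), publicKey, signature);
console.log(ok, tampered); // true false
```

The point is that trust travels with the bytes: the signed step remains verifiable after being copied, aggregated, or re-served by a third party, which HTTPS-origin trust cannot provide.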
A DocMap is like a “manifest” or bill of materials, describing the history of a document and (ideally) proving that those statements are authentic and come from a source that a reader may wish to believe. However, they currently do not do much work to help in making sense of those materials. There is room in a docmap for a publisher to write, for example, “this document is now Peer Reviewed”, but that claim must be taken or left along with the rest of the document based on the author’s signature. This means that sense-making that relies on reasoning about statements from two different sources must be done by the reader, and these programs should be publishable to the reader as well.
A related roadmap item for public comment is here.
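For illustration, a publishable sense-making program could be as small as the following TypeScript rule, which accepts a claim about a subject only when two distinct trusted sources assert it. The claim shape, identifiers, and the two-source threshold are all assumptions invented for this example.

```typescript
// Hypothetical claim shape: a statement attributed to a source,
// presumed already authenticated (e.g. by signature verification).
interface Claim {
  source: string;    // identity the claim is attributed to
  subject: string;   // e.g. a preprint DOI
  assertion: string; // e.g. "peer-reviewed"
}

// A reader-side rule combining statements from independent sources:
// the subject counts as peer-reviewed only if at least two distinct
// trusted sources say so.
function corroborated(claims: Claim[], subject: string, trusted: Set<string>): boolean {
  const sources = new Set(
    claims
      .filter(c => c.subject === subject && c.assertion === "peer-reviewed" && trusted.has(c.source))
      .map(c => c.source),
  );
  return sources.size >= 2;
}

const claims: Claim[] = [
  { source: "review-group-a", subject: "doi:10.1234/x", assertion: "peer-reviewed" },
  { source: "review-group-b", subject: "doi:10.1234/x", assertion: "peer-reviewed" },
  { source: "review-group-a", subject: "doi:10.1234/y", assertion: "peer-reviewed" },
];
const trusted = new Set(["review-group-a", "review-group-b"]);
console.log(corroborated(claims, "doi:10.1234/x", trusted)); // true
console.log(corroborated(claims, "doi:10.1234/y", trusted)); // false
```

Publishing a rule like this alongside the data lets readers apply, inspect, or substitute the reasoning, rather than having each reader reinvent it.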
In the success case, we expect docmaps to get larger and more colorful, with heterogeneous contents. Programmatic sense-making as described above will probably struggle to keep pace symbolically, but it is a natural application of language models (large or small). Recent technology permits the work done by such models to be verified in the same way as statements made by a cryptographic identity. We expect this technique to provide valuable, trustworthy semantic automation to the DocMaps ecosystem.
A related roadmap item for public comment is here.
There is work to do to support identifying and de-identifying the participants in a scientific process while maintaining trust. There is already growing consensus in some preprint science communities on the value of seamless integration with ORCID attribution for authors, which DocMaps needs to support. However, a complication is that DocMaps verification must remain offline — this integration must allow a publisher to include information from ORCID’s databases that is trustworthy without separate contact with ORCID by the reader. This is currently unsupported because, as described above about DocMaps server integrity, the ORCID API does not serve cryptographic information, instead relying on HTTPS to be the source of trust.
A related issue is preserving the anonymity of peer reviewers while proving that their reviews were written by qualified individuals, especially in a decentralized fashion. This theme is of moderate interest but requires significant further exploration. I have left some breadcrumbs on the roadmap item below indicating how I would go about looking into this.
DocMaps technology is currently built around JSON-LD because it can be handled by both RDF-native tools and plain JSON-based tools. However, this flexibility has been a source of some precarity, because RDF is viewed unfavorably by many organizations due to its unfamiliarity. Most organizations, for example, store DocMaps data in relational databases using SQL and compose JSON-LD out of the related objects; whereas in the core DocMaps OSS libraries, we have strived to be RDF-native where possible, and our server implementation uses a SPARQL-based datastore.
This inconsistency means we cannot expect consistent behavior across all cases, and support will likely be a challenge. I anticipate that core tooling will be obliged at some point to support relational databases, and unless significant interest emerges in the RDF space in the next few years, it may not be practical to support triplestores in the backend long-term. Relatedly, the algorithms we rely on in the ETL and server code to generate DocMaps from graph-shaped data need to be written more abstractly to handle both of these cases if these tools are to gain more adoption from OSS users.
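To illustrate why both backends are plausible, the following TypeScript flattens a simplified JSON-LD-like fragment into subject–predicate–object triples; the same rows could live in a triplestore or in a three-column SQL table. The fragment, the example.org identifiers, and the flattening rules are simplified assumptions for this sketch, not the DocMaps vocabulary or a spec-compliant JSON-LD algorithm.

```typescript
// One row per statement: the common denominator between an RDF
// triplestore and a relational (subject, predicate, object) table.
type Triple = [subject: string, predicate: string, object: string];

// Flatten a simplified JSON-LD-like node into triples. Nested objects
// become linked nodes, identified by "@id" or a fresh blank node id.
function toTriples(node: Record<string, unknown>, subject: string): Triple[] {
  const triples: Triple[] = [];
  for (const [key, value] of Object.entries(node)) {
    if (key === "@id") continue;
    if (typeof value === "object" && value !== null) {
      const child = value as Record<string, unknown>;
      const childId = (child["@id"] as string) ?? `_:b${triples.length}`;
      triples.push([subject, key, childId]);
      triples.push(...toTriples(child, childId));
    } else {
      triples.push([subject, key, String(value)]);
    }
  }
  return triples;
}

// Hypothetical fragment with invented identifiers.
const fragment = {
  "@id": "https://example.org/docmaps/1",
  type: "docmap",
  publisher: { "@id": "https://example.org/groups/a", name: "Group A" },
};
const triples = toTriples(fragment, fragment["@id"]);
console.log(triples.length); // 3, including the publisher link
```

Writing the generation algorithms against a neutral representation like this, rather than against a specific datastore’s API, is one way to keep both the SQL and SPARQL paths supportable.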
I have been collecting stories and speculations from preprint science leaders as to how today’s scientific process has changed and will continue to change. Preprint science is a big step into the unknown for collective knowledge-making, because it decouples the process from the economics of journals, but we still mostly think about our process in journal-like terms, such as peer review. Since it is open and fast, preprint science has potential for rapid evolution. I see two major areas of interest here for DocMaps, as we strive to provide valuable infrastructure for trust in science.
First, many conversations have touched on the rise of multimodal publication objects — that is to say, preprints that are videos, interactives, or other artifacts besides traditional papers. This is true for inputs as well as outputs — papers that cite news sources, cultural text such as social media, and so on. DocMaps has a general goal to support these expanding dimensions but does not yet have a clear vision for what that will look like. Thinking again of signal-to-noise, it is possible that where the signal is — where the science happens — will increasingly drift in this direction. We should not as a community be too attached to our current forms, although a critical eye to how new and old forms serve us is warranted. The science is what matters, as is the growth of a shared body of knowledge and shared sense of reality.
Second, flexible research methods give rise to flexible research roles. The duality between creative roles (researcher, author) and curatorial roles (reviewer, editor) is not always as clear as it once might have been. Acknowledgment paragraphs frequently include editorial readers who have served a peer-review-like commentary purpose “in the middle” of a scientific process. But with preprints, it is not particularly clear when a process has ended — we go from the middle to the middle in many processes, and the notion of ending at publication is just a formality. This formality is further subverted by post-publication review activity. DocMaps in its best mode makes this complexity more manageable, and so is likely to only speed this up. Simultaneously, the Project is for the moment committed to inherited terminology that is itself brittle to this change. Therefore it will be an interesting challenge for this Project to both facilitate and adapt to less strictly separated roles.
Rising complexity of document processes (especially processes with skew in the time domain) implies that all forms of creativity and curation look increasingly like essential parts of the scientific process. I and other collaborators foresee a direct relationship between this trendline and decreasing rigidity in roles.