

Improving interdisciplinary understanding using AI

As with all IT jobs, we’re encouraged to use “more AI” without anybody being entirely sure what it can bring to the party.

I’ve been learning a bit about using it to support coding. It’s great at doing the grunt work of making a test file with everything mocked correctly and every path through the code covered by a test, although you still need to check the tests make sense.

But I’ve also been thinking about what data we have access to in our team’s role of “Research Application Support”, and the obvious one is our research output metadata. Our Pure/EPrints systems contain 200,000+ records, so there’s a lot to play with.

I took the title, authors and abstract from 100 records and played with various instructions to the AI. I ran the most successful ones over all 100: one to get a layperson’s summary of what each record is about and what makes it interesting or novel, and a second to pull out glossary terms and explain them to a lay person.
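
For the curious, here’s roughly how that extraction might look. This is a sketch, not what I actually ran, and the export URL is a made-up example, since JSON export paths vary between EPrints installations (and Pure has its own REST API):

    import requests

    # Sketch only: pull title and abstract for a batch of records.
    # The export URL pattern below is hypothetical; real EPrints
    # installations expose JSON export under installation-specific paths.
    BASE = "https://eprints.example.ac.uk/cgi/export/eprint/{id}/JSON/record.js"

    def fetch_record(eprint_id):
        """Fetch one record and keep only the fields the LLM will see."""
        data = requests.get(BASE.format(id=eprint_id), timeout=30).json()
        return {
            "title": data.get("title", ""),
            "abstract": data.get("abstract", ""),
            "creators": data.get("creators", []),
        }

    records = [fetch_record(i) for i in range(1, 101)]  # the 100-record sample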

What makes it interesting is that the glossary terms are explained in the context of the paper in question, so a word like “drive” gets a contextual definition rather than a dictionary one.

Demo

Try my demo of 100 EPrints records with AI definitions, summary and analysis.

Good bits

Overall it seems to work very well. The “definitions” are the stand-out feature: the fact that the LLM defines terms in the context of the paper, rather than giving a dictionary definition, is a massive win.

Issues

The following phrase from an abstract means nothing to me, and I’m annoyed the AI didn’t give us a definition for “carcinogenesis”:

USP7 also promotes carcinogenesis through aberrant activation of the Wnt signalling pathway and stabilization of HIF-1α

It gave us a contextual definition for “Wnt signalling pathway”, but the explanation is hardly understandable:

A signaling pathway involved in embryonic development and tumorigenesis, aberrant activation of which by USP7 can promote carcinogenesis.

It’s also important to remember that it only had access to the title and abstract. Where this is very slim, e.g. for “Dataset for: High Temperature Secondary Lithium-ion Batteries Operating Between 25 deg C and 150 deg C”, its analysis described the dataset as “valuable”, which seems irresponsible to me: it just made that up, taking the word “valuable” from my question rather than from anything in the data I gave it. It does much the same for “Dynamic Topology Optimization Model DataSet”. If we used this part at all, we would want to engineer prompts that were very conservative: nothing should appear in the summary that isn’t entirely backed by the text of the title and abstract. More realistically, we should probably lose the analysis or change the question entirely.

The definitions are sometimes hit and miss on what gets defined. Some “hard” terms are not defined, and some of what is defined is kinda obvious, or tautological. I might update the prompt to ban tautologies like: “midwifery leaders: professionals who hold leadership positions in the field of midwifery”.
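
A possible extra line for the role instructions, untested, might be something like:

Never define a term using the term itself, and skip any definition that merely restates the term in other words.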

Prompts

Everybody always likes to ask what prompt you used. Unlike a chatbot interface, when using the API you can specify a role as well as a prompt. The prompt was the record title and abstract. My “role” instructions for definitions were:

Given a research paper’s metadata, you return a list of the domain-specific terms in the title or abstract.

By domain-specific we mean terms whose usage differs from the common-use meaning. For example, “theory” in science doesn’t mean the same as “theory” in common speech.

The descriptions should be short and aimed at an intelligent high-school student.

Output the list only, with no additional comments.

Please output each definition as follows:

TERM: term text
DESCRIPTION: description text

(Repeat as needed.)

List as many definitions as is useful. Only list definitions that are not commonly known. Definitions should be on a single line.
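
To give a sense of how the role and prompt fit together, here is a minimal sketch of the definitions call, assuming an OpenAI-style chat API. The client, model name and parser below are my illustration, not the exact code I ran:

    from openai import OpenAI  # assuming an OpenAI-style chat API

    client = OpenAI()
    DEFINITIONS_ROLE = "..."  # the role instructions quoted above

    def get_definitions(title, abstract, model="gpt-4o-mini"):
        """Send the role plus record text, then parse TERM/DESCRIPTION pairs."""
        # model name is a placeholder, not necessarily what was used
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": DEFINITIONS_ROLE},
                {"role": "user", "content": f"{title}\n\n{abstract}"},
            ],
        )
        definitions, term = [], None
        for line in response.choices[0].message.content.splitlines():
            if line.startswith("TERM:"):
                term = line[len("TERM:"):].strip()
            elif line.startswith("DESCRIPTION:") and term is not None:
                definitions.append({
                    "term": term,
                    "description": line[len("DESCRIPTION:"):].strip(),
                })
                term = None
        return definitions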

And for the summary I used the following. I tried making two calls, one for the summary and one for the analysis, but I found that tended to repeat much the same or similar text in both; asking for them together encouraged it to include in the analysis only content that wasn’t already in the summary.

Given the title and abstract of a research output, you provide a very concise summary suitable for an intelligent person educated to high-school level to understand what the research was about, including a single concise sentence about why this research is notable and interesting. Do not use the words notable or interesting. Do not use run-on sentences. Please keep the text very easy to understand. Please output each result as follows:

SUMMARY: summary text
NOTABLE: notable text
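
Parsing that output is the same trick with different prefixes. A sketch, reusing the hypothetical client from the definitions example:

    SUMMARY_ROLE = "..."  # the summary instructions quoted above

    def get_summary(title, abstract, model="gpt-4o-mini"):
        """One call returns both parts; pull them out by line prefix."""
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": SUMMARY_ROLE},
                {"role": "user", "content": f"{title}\n\n{abstract}"},
            ],
        )
        result = {"summary": "", "notable": ""}
        for line in response.choices[0].message.content.splitlines():
            if line.startswith("SUMMARY:"):
                result["summary"] = line[len("SUMMARY:"):].strip()
            elif line.startswith("NOTABLE:"):
                result["notable"] = line[len("NOTABLE:"):].strip()
        return result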

Obviously if running

An EPrints Plugin?

I’d love to make a plugin for EPrints which did this when a record was created and stored the results with the record, gave the creator of the record, and editors, the chance to correct and/or approve definitions, showed them on the public web with that context (approved by a human / generated by AI, not checked), and gave anybody reading the site the chance to flag iffy definitions for review.

I’ve talked to a few people about it and have some good feature suggestions:

  • Definitions are created when the abstract and title are first entered, and are saved as part of the record.
  • The author has the option to approve, edit and delete the AI-generated definitions of terms, and to add their own.
  • The reader is told whether each definition is human written, AI created and human edited, AI created and human approved, or AI created and unreviewed.
  • One friend suggested any website visitor should be able to flag any AI definition they thought needed review.
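
To make those provenance states concrete, here is one hypothetical way a stored definition might be modelled; all the names here are my invention, not an existing EPrints schema:

    from dataclasses import dataclass
    from enum import Enum

    class Provenance(Enum):
        """Who has touched a definition; shown to the reader alongside it."""
        HUMAN_WRITTEN = "human written"
        AI_HUMAN_EDITED = "AI created, human edited"
        AI_HUMAN_APPROVED = "AI created, human approved"
        AI_UNREVIEWED = "AI created, unreviewed"

    @dataclass
    class Definition:
        term: str
        description: str
        provenance: Provenance = Provenance.AI_UNREVIEWED
        flagged_for_review: bool = False  # any site visitor can set this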

My thinking is that this could be quite a simple EPrints plugin:

  • add a trigger to create a definition section in the record
  • editing definitions is a bit more complicated, as it’s not quite clear where it would come in the workflow. AI lookups are slow so you kinda want them running in the background, but you don’t want a background process editing an eprint record while you are entering it into the system.
  • probably some way to manually trigger generation on individual records, or large numbers of records.
  • I’m thinking that if we have summary & analysis they appear in the same table as the definitions, but with ‘magic’ names (see the sketch after this list). That way they get the same “edited, approved” style flags as the AI-generated definitions.
  • The rendering of the definitions would need to appear on the summary (AKA “abstract”) page. It could either be done with a custom new way to print fields from the EPrints citations, or as a glossary at the bottom of the page with some javascript to make the abstract interactively highlight words and phrases with a gloss.
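
And a sketch of the ‘magic names’ idea, building on the earlier hypothetical examples; the underscore convention is my invention:

    # Store the summary and analysis text in the same table as the
    # definitions, under reserved term names, so they inherit the same
    # provenance flags as the AI-generated definitions.
    MAGIC_SUMMARY = "_summary"
    MAGIC_NOTABLE = "_notable"

    def build_definition_table(title, abstract):
        rows = get_definitions(title, abstract)
        parts = get_summary(title, abstract)
        rows.append({"term": MAGIC_SUMMARY, "description": parts["summary"]})
        rows.append({"term": MAGIC_NOTABLE, "description": parts["notable"]})
        return rows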

Final thoughts

My thinking is that this massively reduces the cognitive load of understanding what a paper from outside your discipline is about, and of deciding whether it might be worth the effort of delving into.

 
