Show HN: Atlas of Water Science via generative AI
wateratlas.webapp.csiro.au

This is our Atlas of Water Science: a globe mapping the water science that colleagues have done over the past few years. Roughly 300 papers were analysed. The pipeline sends open-access papers to an LLM, asks it to extract the locations relevant to the science in a form suitable for a geocoder, then uses a geocoder to get lat/lons. When several papers land on the same spot we jitter the locations a bit to differentiate them. gpt-4o-mini did most of the analysis, and some of the features on the website were written directly by Claude. There are some errors where geocoding went wrong or locations were extracted incorrectly.
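The jitter step described above can be sketched as a small random offset in metres converted to degrees; this is a minimal illustration, not the Atlas's actual code, and the 500 m radius is an assumption:

```python
import math
import random

def jitter(lat: float, lon: float, radius_m: float = 500.0) -> tuple[float, float]:
    """Offset a coordinate by up to radius_m metres in a random direction."""
    bearing = random.uniform(0.0, 2.0 * math.pi)
    dist = random.uniform(0.0, radius_m)
    # One degree of latitude is ~111,320 m; a degree of longitude
    # shrinks by cos(latitude) away from the equator.
    dlat = dist * math.cos(bearing) / 111_320
    dlon = dist * math.sin(bearing) / (111_320 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```

Applying this to each paper that shares a geocoded point keeps overlapping markers clickable without noticeably moving them on a globe-scale map.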
We are working on this at CSIRO, Australia’s science agency. I think it’s a reasonably novel integration of geospatial ideas with LLMs even if the actual application here is a bit niche.
Nice looking site!
Has the code been open-sourced? How could I recreate this visualisation for other keywords?
The backend is still a mess of code, so no. It's not too hard to do, though. The prompt I used to extract locations is:
"The texts provided are environmental science papers. They often (but not always) include references to locations where the science is relevant; for example, a study might be of soil around a small town, in which case the town would be the relevant location. Extract all locations that are relevant to or the subject of the science done, and do not extract any locations that relate to institutions, organisations, or laboratories. So, for example, exclude the locations of government departments and CSIRO laboratories. If there are no relevant locations, please return an empty array. Each location should be extracted in a form suitable for calling the Nominatim geocode API in Python via geopy. Also, extract a short context string that describes the context in which this location is referenced. Please provide the output in JSON format."
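Parsing the JSON the prompt asks for might look like the sketch below; the field names and the sample response are illustrative assumptions, not the Atlas's actual schema, and the real pipeline would get `response_text` back from a gpt-4o-mini chat completion carrying the prompt above plus the paper text:

```python
import json

# Illustrative response shape; the real pipeline's field names may differ.
sample = """{
  "locations": [
    {"location": "Narrabri, New South Wales, Australia",
     "context": "Soil study around the town of Narrabri"}
  ]
}"""

def parse_locations(response_text: str) -> list[tuple[str, str]]:
    """Return (geocodable query, context) pairs from the model's JSON output."""
    data = json.loads(response_text)
    return [(loc["location"], loc["context"]) for loc in data.get("locations", [])]
```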
Then I passed it through both Nominatim and the Google geocoder. Google worked better.
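One way to combine the two geocoders is to inject them as callables and fall back on a miss; a sketch, assuming each callable (e.g. geopy's `Nominatim(...).geocode` or `GoogleV3(...).geocode`) returns `None` when it finds nothing:

```python
from typing import Callable, Optional

def geocode_with_fallback(query: str,
                          primary: Callable[[str], Optional[object]],
                          fallback: Callable[[str], Optional[object]]):
    """Try the primary geocoder; fall back to the second on a miss."""
    result = primary(query)
    return result if result is not None else fallback(query)
```

Injecting the geocoders also makes the miss/fallback logic easy to unit-test with stubs, without hitting either service.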
One thing that didn't work great in the prompt above was excluding the location of places where the authors worked. They sometimes got included anyway.
> One thing that didn't work great in the prompt above was excluding the location of places where the authors worked. They sometimes got included anyway.
Have you tried adding the institutions as an explicit property in the JSON response and just ignoring the second list?
I’ve had much better luck having LLMs explicitly choose a different label when working with similar types of entities than asking the LLM to exclude them via prompting. This way you can also spot ambiguity if the LLM adds a location to both arrays.
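That labelling idea could look like the sketch below: ask the model for two arrays, keep only the study locations, and flag anything it put in both lists for review. The field names here are assumptions, not an actual schema:

```python
import json

def split_entities(response_text: str) -> tuple[set[str], set[str]]:
    """Return (study locations minus institutions, ambiguous entries)."""
    data = json.loads(response_text)
    study = set(data.get("study_locations", []))
    institutions = set(data.get("institution_locations", []))
    ambiguous = study & institutions  # model labelled it both ways: review it
    return study - institutions, ambiguous
```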
I have not done that, but I like that strategy, not just for this use case but as a general idea: replace exclusion with finer-grained categorisation. One thing I did do is use a regex to preprocess the papers and remove bibliographies, which were a really big source of noise. The titles of referenced papers would often mention a location that was not directly relevant to the paper itself.
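That bibliography-stripping step could be as simple as truncating at the reference heading; a sketch, assuming the section starts with a line like "References" or "Bibliography" (real paper layouts vary, so a heuristic like this will miss some):

```python
import re

# Assumption: the reference section begins with a heading on its own line.
BIB_HEADING = re.compile(r"^\s*(references|bibliography)\s*$",
                         re.IGNORECASE | re.MULTILINE)

def strip_bibliography(text: str) -> str:
    """Drop everything from the first reference-style heading onward."""
    match = BIB_HEADING.search(text)
    return text[:match.start()] if match else text
```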
The Atlas is also trying to answer the question "Can we build inaccurate and incomplete systems with LLMs that are still useful?".
Cool project. Note that you can force structured output now instead of asking for json: https://platform.openai.com/docs/guides/structured-outputs
Thanks, structured output makes a lot more sense. The pydantic approach at the link looks straightforward.
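With structured outputs, the extraction schema could be expressed as a pydantic model and passed as the response format; a sketch under assumptions (the model and field names are illustrative, and the commented-out `parse` call follows the linked OpenAI docs rather than the Atlas's code):

```python
from pydantic import BaseModel

class ExtractedLocation(BaseModel):
    query: str    # geocodable string, e.g. "Narrabri, New South Wales, Australia"
    context: str  # why the paper mentions this place

class ExtractedLocations(BaseModel):
    study_locations: list[ExtractedLocation]
    institution_locations: list[ExtractedLocation]

# The call would then look roughly like:
# completion = client.beta.chat.completions.parse(
#     model="gpt-4o-mini",
#     messages=[...],
#     response_format=ExtractedLocations,
# )
# locations = completion.choices[0].message.parsed
```

This guarantees the response validates against the schema, so the JSON-parsing and "please return an empty array" parts of the prompt become unnecessary.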