Code and datasets for the paper "GeoGalactica: A Scientific Large Language Model in Geoscience"
To clarify potential confusion, we hereby state that the model presented in the manuscript "GeoGalactica: A Scientific Large Language Model in Geoscience" is not associated with the DDE Program, nor supported by DDE-related funding. We apologize for any unintentional misunderstanding and inconvenience, and we are committed to preventing future misunderstandings.
GeoGalactica results from further pre-training Galactica, a top-performing LLM trained on a large corpus of scientific documents. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach: we specialize an open-sourced LLM for geoscience by further pre-training the model on a vast amount of geoscience text, and then supervised fine-tuning (SFT) the resulting model on our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model in the geoscience domain.
- Paper: https://github.com/geobrain-ai/geogalactica
- Data: https://huggingface.co/datasets/daven3/geobench, https://huggingface.co/datasets/daven3/geosignal, and https://github.com/zthang/geotools (see the loading sketch after this list)
- Model: https://huggingface.co/geobrain-ai/geogalactica
- Checkpoints: https://huggingface.co/geobrain-ai/geogalactica-ckpt
- Plot: https://github.com/dbylynn/GeoGalactica_Analysis
- Sciparser: https://github.com/davendw49/sciparser
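The evaluation and instruction-tuning datasets linked above are hosted on the Hugging Face Hub, so they can be loaded directly with the `datasets` library. The snippet below is a minimal sketch and not part of this repository; check the dataset cards for the actual split and column names.

```python
# Minimal sketch (assumption: the datasets load with the standard `datasets` API
# under the ids linked above).
from datasets import load_dataset

geobench = load_dataset("daven3/geobench")    # benchmark data for evaluation
geosignal = load_dataset("daven3/geosignal")  # instruction-tuning (SFT) data

# Inspect the available splits and columns before use.
print(geobench)
print(geosignal)
```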
A simple script is provided (`tools/prediction/demo.py`) for running the model to generate output text for a single input. Note that the required memory exceeds 140 GB. The folder `example_data` shows the data file format used during training.
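For reference, the sketch below shows one way to load the released weights with the Hugging Face `transformers` causal-LM API. It is an assumption based on Galactica's architecture, not the official `demo.py`, and the prompt and generation settings are purely illustrative. Because full-precision weights need more than 140 GB of memory, half precision with `device_map="auto"` (multiple GPUs or offloading) is assumed.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "geobrain-ai/geogalactica"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # halve the memory footprint
    device_map="auto",          # shard the 30B parameters across available devices
)

# Hypothetical prompt; not from the paper.
prompt = "What are the main drivers of plate tectonics?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```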
This project was founded by Acemap at Shanghai Jiao Tong University, led by Zhouhan Lin, with a group of students including Cheng Deng* (student leader), Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Beiya Dai, Qiyuan Chen, Yuanyuan Shi, and Zhongmou He, supervised by Zhouhan Lin, Junxian He, Xinbing Wang, and Chenghu Zhou.
GeoGalactica draws on the following open-source projects. We want to express our gratitude and respect to the researchers behind them.
- Facebook Galactica: https://galactica.org/
- Facebook LLaMA: https://github.com/facebookresearch/llama
- Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
- alpaca-lora by @tloen: https://github.com/tloen/alpaca-lora
- alpaca-gpt4 by Chansung Park: tloen/alpaca-lora#340
- K2 by Cheng Deng: https://github.com/davendw49/k2
We would also like to express our appreciation for the data processing and annotation efforts of the students at CAS.
GeoGalactica is a research preview intended for non-commercial use only, subject to the model license of Galactica and the Terms of Use of the data generated by OpenAI. Please contact us if you find any potential violations. The code is released under the Apache License 2.0. The GeoSignal and GeoBench data are open-sourced by K2.