SafeAeroBERT

SafeAeroBERT: A Safety-Informed Aviation-Specific Langauge Model

Available at: https://huggingface.co/NASA-AIML/MIKA_SafeAeroBERT

base-bert-uncased model first further pre-trained on the set of Aviation Safety Reporting System (ASRS) documents up to November of 2022 and National Trasportation Safety Board (NTSB) accident reports up to November 2022. A total of 2,283,435 narrative sections are split 90/10 for training and validation, with 1,052,207,104 tokens from over 350,000 NTSB and ASRS documents used for pre-training.

The model was trained on two epochs using AutoModelForMaskedLM.from_pretrained with a learning_rate=1e-5, and total batch size of 128 for just over 32100 training steps.

An earlier version of the model was evaluted on a downstream binary document classification task by fine-tuning the model with AutoModelForSequenceClassification.from_pretrained. SafeAeroBERT was compared to SciBERT and base-BERT on this task, with the following performance:

Classification Metrics
Contributing Factor	Metric	BERT	SciBERT	SafeAeroBERT
Aircraft	Accuracy	0.747	0.726	0.740
	Precision	0.716	0.691 -	0.548
	Recall	0.747	0.726	0.740
	F-1	0.719	0.699	0.629
Human Factors	Accuracy	0.608	0.557	0.549
	Precision	0.618	0.586	0.527
	Recall	0.608	0.557	0.549
	F-1	0.572*	0.426	0.400
Procedure	Accuracy	0.766	0.755	0.845
	Precision	0.766	0.762	0.742
	Recall	0.766	0.755	0.845
	F-1	0.766	0.758	0.784
Weather	Accuracy	0.807	0.808	0.871
	Precision	0.803	0.769	0.759
	Recall	0.807	0.808	0.871
	F-1	0.805	0.788	0.811

More infomation on training data, evaluation, and intended use can be found in the original publication

Citation: Sequoia R. Andrade and Hannah S. Walsh. “SafeAeroBERT: Towards a Safety-Informed Aerospace-Specific Language Model,” AIAA 2023-3437. AIAA AVIATION 2023 Forum. June 2023.