Scanning Project Blue Book (and a cool case we found)


Hi all,

My (very) small team and I are working on scanning the Project Blue Book case files using GPT-4-vision for Optical Character Recognition (OCR) and we want to let the r/UFOs community know about it. Other OCR text scanning software has performed very poorly on most of these documents so far, but OpenAI’s model opens up a new opportunity to further digitize this data.

We are certain that due to the enormous hassle of looking through these 10,000+ case files, the general public is unaware of some amazing case stories.

We hope by scanning the files, public access to them is increased and further processing (summarization, categorization, etc.) is possible. We also hope to provide the data in a searchable format for ease and accessibility to the public.

What is Project Blue Book?

For those who don’t know, Project Blue Book represented a part of a comprehensive series of investigations into unidentified flying objects (UFOs) carried out by the United States Air Force. Commencing in 1952 and ending in 1970, it marked the third such study of UFOs, following Project Sign (1947) and Project Grudge (1949). The documents we have include Sign, Grudge, and Blue Book.

The primary objectives of Project Blue Book encompassed:

Assessing whether UFOs posed a national security threat Conducting a scientific analysis of data related to UFOs

The project was headed by Dr. J. Allen Hynek, who started as a staunch skeptic on the UFO matter, only later to become a famous public advocate for the study of UFOs. He claimed that the Air Force was also using the project to discredit UFO stories.

The Project Blue Book case files were declassified and made publicly accessible by the United States Air Force in the early 1970s. After the termination of Project Blue Book in January 1970, the U.S. Air Force initially transferred its records to the U.S. National Archives. The files were made accessible online in 2015.

Progress so far

Locating the original Project Blue Book files posed some initial challenges. They were previously hosted on the famous Black Vault website, but the link is dead.

Fortunately, we discovered that the files are still accessible on the Internet Archive at this link: Project Blue Book Archive.

After the downloads finished we were excited to see some copy-pasteable text. We started processing the files, but to our dismay, the vast majority of the text in the docs looked like this when copied and pasted:

• (

• • ~ . 6 .. •

• ””J-·’ J • • ‘ : ‘(

• -·— —-. -.. .. —-··· ——-·-. -· —– •

‘ .. . . –

:·t·-· .. ·,14′ —-·———. —–·—–··—·—-·—-· ——·-

—- . ‘ . —-…. ——·. ———-·—· ——-….

.. ,,,”‘ J • • • ‘) .

Gibberish.

That’s because the documents look like this:

Incident #266, March 4, 1949, Red Bank, New Jersey

That’s when we decided to turn to GPT-4, the most advanced model from OpenAI that also forms the basis of ChatGPT. The model GPT-4-vision can describe images and read documents etc. We tested out some examples and it looks beautiful!!

Here is a snippet of information from this file scanned with GPT-4-vision.

( PROJECT STORK T O ) -Incident# 266

ATE [possibly ATTN.] NO. ______ ( ) DATE OF INFO 11 March 1957

AFD NO. _______ ( ) SOURCE Red Bank, New Jersey

REPORT NO. ______ ( ) SOURCE Civilians

TIME & DATE TITLE OF EVENT

TYPE OF SIGHTING _______ COLOR Dull white

SHAPE Round SOUND Very Faint

SIZE 30″ Diameter Plate ALTITUDE 1500 Elevation

DURATION (TIME OBSER.) 60 Seconds

NO. OF OBJECTS 1 TYPE OF OBSERVATION Ground

MANEUVERS Made Alt and Slight Turn

(MONITOR AFSWP FOR MIV 339

( 26 Jan 52)

[Barcode-like image]

Aircraft

Not gibberish and a huge improvement!!

Notice, though, that it’s *not perfect*. The date is wrong in the top right, and “Project Stork” is not a thing. But, we are working on improvements and we are very happy with the quality of most of the data!

Plus, we can always reference the original document for confirmation, and data on other pages in the reports will help keep things consistent.

Another thing we are experimenting with is getting text-based information when photos appear in the documents. This would make it possible to search for documents with certain types of photos.

File 2853, November 1, 1945, Tom’s River, New Jersey

GPT-4 tells us this:

1. The top-left photograph is highly overexposed or washed out, making the details very difficult to discern. There appears to be a blurry landscape or background with indistinct features.

^ Correct!

2. The top-right photograph depicts a tree line with bare branches, and there’s a dark elongated object in the sky just above the trees. The object has a somewhat horizontal orientation and is much darker than the surrounding sky.

^ Okay… that’s actually a staple.

3. The middle photograph shows a landscape with what appears to be three dark, unidentified aerial objects in the sky. The objects are irregular in shape, and their positioning does not conform to any known aircraft formation. The background of the photo includes a landscape with a visible horizon, suggesting the photo was taken in an open area.

^ Nice description!

Why do this?

Mass interest in the topic: we want to make this huge set of cases accessible to researchers and enthusiasts Demonstrate how this method of document parsing can be used and share techniques for accuracy. Could be applied to any old set of scanned documents with difficult-to-read text. We encourage you to think of other document sets that can be made more accessible with these methods! Get text-based information about photos in the documents for searchability

Finding interesting cases (we already found a cool one!)

One of the cool things we can do with text data is send it to chatGPT to summarize cases and make it easy to find interesting case files.

For example, we already found a fascinating case from the former Watson Laboratories in Red Bank, NJ, a 1940’s air defense laboratory specializing in military radar and detection systems.

We had never heard of this case and could not find any information about it online. This is what chatGPT told us about it:

main event: Multiple civilian employees of Watson Laboratories reported observing a disc-shaped, snow white object moving at a very high rate of speed in the sky on March 4, 1949. The object was seen making rapid maneuvers and disappearing from sight after about a minute of observation. interesting points: The object’s observed maneuvers included making a loop-like turn and heading north-west at a jet-like rate of speed, without any apparent propulsion vapor or heat waves. The observers were qualified personnel, including radio mechanics and an administrative assistant, employed at the laboratories for a minimum of two years, and were considered qualified observers.

With these summaries, we can easily pinpoint cases of interest, look at the original document for verification, and investigate for more details.

We love this case. It really sounds like it should be added to the pattern of UFO sightings above above sensitive military faciities, right? This was a high tech air-defense lab and some of the workers themselves told the Air Force about seeing a high-speed flying white disc above the facility?!

Not only that, this is the very lab that created the very high-altitude atmospheric balloons designed to detect nuclear tests that debunkers claimed were the remains that were discovered at Roswell. So the people working on the top-secret balloons that you claim were found at Roswell are also out here seeing flying discs above their own facilities 2 years later? What the hell?

If anyone, workers at that specific lab would know what an airplane looks and sounds like (hell, maybe even experimental craft). Was it caught on their radar systems? This definitely deserves further investigation.

Look for cases matching those in the media now

If a new type of case comes out like the recent “Jellyfish” UFO, it would be great to immediately search all 10,000+ bluebook case files to see if anyone else spotted similar phenomena during the blue book years. Documenting and cataloging them could take a mere 30 mins to an hour instead of months.

Similarly, while maybe they wouldn’t be called Tic Tac UFO’s, are there cases that match the 2004 Nimitz incident in any way? With this project we might be able to find out more easily!

Time and money

The only roadblocks now are only time and money. There are close to 130,000 pages in this collection.

Doing some quick math based on the cost of parsing the initial tranche of data, we think this will cost around $3,000-$4,500 USD. OpenAI has been decreasing prices over time, so this may improve, and we plan to try out various cost-cutting techniques.

Github repo

We plan to make the files publicly available, and as a sign of that we have uploaded over 1100 pages of parsed reports so far to our github.

https://github.com/dansterdam/blue_book_scanner

Look in the folder data/scanned_casefiles for the data.

Current outlook

We are an independent free-time-only team and are here to just talk about our project and provide updates, but if there are some interested people here, donations would directly help us to continue the work! In any case, we will continue the scanning with limited funds and return here to post some more info on casefiles.

The pace of this process will not be lightning-fast, we can process only a few hundred pages per hour, but it’s leaps and bounds ahead of manually transcribing the documents. Despite the time and financial commitment required, we’re absolutely thrilled about the potential outcomes.

We think this could generate some interest and excitement in the community when people gain access to intriguing UFO reports from their own towns across the United States and beyond. These reports have, until now, been a challenge to explore (some might even say an enormous pain in the ass) but we’re on the cusp of changing that!

Donations link!

https://www.buymeacoffee.com/projectbluebook

Outcomes

Project Blue Book’s 10,000+ case files (130k+ pages) scanned into text format Periodic updates on the process and uploads of newly scanned data Data uploaded to our github in an organized format for ease of download and searchability Nice to have: data indexed to a database and made available on a website for ease of browsing and searching. this would incur recurring costs and definitely take some more time and expertise than we have at the moment, but we really want to get here Suggestions are welcome! But we don’t guarantee more than the top three points 🙂

tl;dr

We’re digitizing and scanning over 130,000 pages of Project Blue Book UFO files using GPT-4-vision for accuracy and accessibility. Our goal is to make these files easily available for public research. We’ve faced challenges with the high volume of pages and costs, but have made progress and uploaded some files on GitHub which we will continue to do. We’re aiming to hopefully build a searchable database of these UFO reports. We’ve already found an intriguing case about a UFO sighting over a military facility in NJ in 1949 and hope to find more, maybe even with similarities to current-day sightings.

submitted by /u/daf1115
[link] [comments]  

Read More