Public Dataset Mirrors

ADL Glossary of Extremism & Hate Symbols Database This database provides an overview of many of the terms and individuals used by or associated with movements and groups that subscribe to and/or promote...
CDC datasets All the datasets of the CDC, before censoring took place.
cdc.gov all web rip A rip of the CDC website for use with Kiwix software, made before major updates and redactions to the site.
Copyright and Artificial Intelligence Part 3: Generative AI Training, pre-publication version The 108-page report provides the Office’s detailed take on how U.S. copyright law, particularly the fair use doctrine, should apply to the use of...
Coronavirus.gov The CDC's official coronavirus site, coronavirus.gov, and related sites including covid.gov, covidtests.gov, and covidtest.gov.
EPA Integrated Risk Information System (IRIS)/CompTox Webarchives Full webarchive of the EPA IRIS portal, and full downloads of the CompTox base data.
EPA Research Portal Webarchive Attempt at a full webarchive of epa/research/ and all subsections (air-research, chemical-research, etc.). Includes linked files (PDFs, video, etc.).
FlyBase FlyBase (http://www.flybase.org) houses information about the structure and function of the Drosophila genome.
fmcs.gov-resources Mirror of https://www.fmcs.gov/resources/ and all subpages (plain HTML, no images or CSS), packed into documents.tar.zst. Includes a copy of the fmcsinfo...
globalchange.gov rip Archive of the globalchange.gov website.
Mouse Models of Human Cancer Database (MMHCdb) Pulled using browsertrix-crawler, plus a custom shell script that uses wget to fetch the CSV files.
National Science Foundation website National Science Foundation website
NIST Atomic Structure Database This database provides access and search capability for NIST critically evaluated data on atomic energy levels, wavelengths, and transition probabilities...
NOAA Oceanic Climate Data Records S3 Buckets NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and...
NOAA Weather Data A complete mirror of ftp://ftp.ncei.noaa.gov/pub/data/noaa/
NOAA webrips Web rips of all subdomains listed as dataset parts.
NOAA's Weather And Climate Toolkit 4.8.1 NOAA's Weather and Climate Toolkit software, full version 4.8.1, for macOS and Windows, Intel x86 and ARM AArch64, also without JE. Includes PDFs of the landing and FAQ pages.
Office of Homeland Security Statistics The whole website of the Office of Homeland Security Statistics, including PDFs and other data.
Regenbogenportal The 'Regenbogenportal' (Rainbow Portal) was a public resource about diversity in sexuality and gender. Editorial responsibility, as well as funding, has...
ResearchConnections.org WGET Scrape This is a wget scrape of https://researchconnections.org/. I've removed the CAPTCHA resources, as well as the search result files generated by wget during...
TM SGNL Source code and website screenshots from the TeleMessage site. The source code and most pages referring to the Signal archiver app have been removed.
US Institute of Peace website: Publications This is a limited scrape of www.usip.org just before it went offline on 2025-03-19.
USIP Podcast Network This includes all podcasts that are part of the USIP Podcast Network, in .mp3 format with .json metadata. The network has 5 shows: Events at USIP, On Peace,...
USPTO - PatentsView Database Tables PatentsView is an award-winning visualization, data dissemination, and analysis platform that focuses on intellectual property (IP) data. Support for the...
USPTO PatentsView GitHub This is a scrape of the PatentsView GitHub.
USPTO PatentsView site PatentsView is an award-winning visualization, data dissemination, and analysis platform that focuses on intellectual property (IP) data. Support for the...