| ADL Glossary of Extremism & Hate Symbols Database | This database provides an overview of many of the terms and individuals used by or associated with movements and groups that subscribe to and/or promote... |
| CDC datasets | All the datasets of the CDC, before censoring took place. |
| cdc.gov all web rip | A rip of the CDC website for use with Kiwix software, made before major updates and redactions were applied to the site. |
| Copyright and Artificial Intelligence Part 3: Generative AI Training, pre-publication version | The 108-page report provides the Office’s detailed take on how U.S. copyright law, particularly the fair use doctrine, should apply to the use of... |
| Coronavirus.gov | The CDC's official coronavirus site, coronavirus.gov, and related sites including covid.gov, covidtests.gov, and covidtest.gov. |
| EPA Integrated Risk Information System (IRIS)/CompTox Webarchives | Full web archive of the EPA IRIS portal, plus full downloads of the CompTox base data. |
| EPA Research Portal Webarchive | Attempt at a full web archive of epa/research/ and all its subsections (air-research, chemical-research, etc.). Includes linked files (PDFs, videos, etc.). |
| Flybase | FlyBase (http://www.flybase.org) houses information about the structure and function of the Drosophila genome. |
| fmcs.gov-resources | Mirror of https://www.fmcs.gov/resources/ and all subpages (plain HTML, no images or CSS), compressed into documents.tar.zst. Includes a copy of the fmcsinfo... |
| globalchange.gov rip | Archive of the globalchange.gov website |
| Mouse Models of Human Cancer Database (MMHCdb) | Pulled using browsertrix-crawler, plus a custom shell script that uses wget to fetch the CSV files. |
| National Science Foundation website | National Science Foundation website |
| NIST Atomic Structure Database | This database provides access and search capability for NIST critically evaluated data on atomic energy levels, wavelengths, and transition probabilities... |
| NOAA Oceanic Climate Data Records S3 Buckets | NOAA's Climate Data Records (CDRs) are robust, sustainable, and scientifically sound climate records that provide trustworthy information on how, where, and... |
| NOAA Weather Data | A complete mirror of ftp://ftp.ncei.noaa.gov/pub/data/noaa/ |
| NOAA webrips | Web rips of all subdomains listed as dataset parts. |
| NOAA's Weather And Climate Toolkit 4.8.1 | Full version 4.8.1 of NOAA's Weather and Climate Toolkit software, for macOS and Windows, Intel x86 and ARM AArch64, also without JE. Includes PDFs of the landing and FAQ pages. |
| Office of Homeland Security Statistics | The whole website of the Office of Homeland Security Statistics, including PDFs and other data. |
| Regenbogenportal | The 'Regenbogenportal' (Rainbow Portal) was a public resource about diversity in sexuality and gender. Editorial responsibility, as well as funding, has... |
| ResearchConnections.org WGET Scrape | This is a WGET scrape of https://researchconnections.org/. I've removed the captcha resources, as well as the search result files generated by WGET during... |
| TM SGNL | The source code and website screenshots from the TeleMessage site. The source code and most pages referring to the Signal archiver app have been removed. |
| US Institute of Peace website: Publications | This is a limited scrape of www.usip.org just before it went offline on 2025-03-19. |
| USIP Podcast Network | This includes all podcasts that are part of the USIP Podcast Network, in .mp3 format with .json metadata. The network has 5 shows: Events at USIP, On Peace,... |
| USPTO - PatentsView Database Tables | PatentsView is an award-winning visualization, data dissemination, and analysis platform that focuses on intellectual property (IP) data. Support for the... |
| USPTO PatentsView Github | This is a scrape of the PatentsView GitHub. |
| USPTO patentsview site | PatentsView is an award-winning visualization, data dissemination, and analysis platform that focuses on intellectual property (IP) data. Support for the... |