Data Harvesting
Metadata must be harvested in order to build the supporting database and the vector database table. The metadata is harvested from the three sources referenced on the metadata sources page. For the proof of concept, the metadata from each source is collected, and a subset of the fields in each source's records is stored as a collection of JSON documents in a single JSON file. The fields used in each document object are:
- id: string
- title: string
- description: string
- keywords: list of keyword strings
- src: string describing where the document record originated (fsgeodata, rda, datahub)
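For illustration, a single harvested document object might look like the following (all values here are hypothetical):

```json
{
  "id": "3f6c9a…",
  "title": "Forest Inventory and Analysis Plot Data",
  "description": "Plot-level forest inventory measurements.",
  "keywords": ["forest", "inventory", "analysis"],
  "src": "fsgeodata"
}
```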
Overview
The cli.py module provides a command-line interface for managing USFS (United States Forest Service) metadata catalog operations. It handles downloading, parsing, and storing metadata from multiple USFS data sources, including FSGeoData, DataHub, and RDA (Research Data Archive).
Dependencies
The CLI requires the following Python packages:
typer # CLI framework
python-dotenv # Environment variable management
requests # HTTP requests
beautifulsoup4 # HTML/XML parsing
sentence-transformers # Text embeddings
langchain-text-splitters # Text chunking
Directory Structure
The CLI creates and uses the following directory structure for storing temporary files and output files:
tmp/
└── catalog/ # Default output directory (DEST_OUTPUT_DIR)
Commands Reference
load_catalog_data
Downloads and processes metadata from all configured sources (FSGeoData, DataHub, and RDA).
python -m catalog.cli load_catalog_data
What it does:
1. Downloads metadata from FSGeoData (XML files)
2. Downloads metadata from DataHub (JSON)
3. Downloads metadata from RDA (JSON)
4. Parses all metadata into a unified format
5. Merges documents and removes duplicates
6. Reports the total number of unique assets found
Output:
- Downloaded files stored in tmp/catalog/
- Console output showing progress and asset counts
Processing details:
- Combines title, description, keywords, and source into searchable text
- Splits the text into manageable chunks for embedding (see the sketch below)
- Maintains metadata for each chunk, including:
  - Document ID
  - Chunk type and index
  - Original title and description
  - Keywords and source
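The exact chunking parameters are not documented here; a minimal sketch of this step, assuming langchain's RecursiveCharacterTextSplitter with illustrative chunk sizes, might look like:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(doc: dict) -> list[dict]:
    """Split one harvested document into chunks with per-chunk metadata.

    Sketch only: the chunk_size/chunk_overlap values are illustrative,
    not the values used by cli.py.
    """
    # Combine the searchable fields into one text blob.
    text = " ".join(
        [doc["title"], doc["description"], " ".join(doc["keywords"]), doc["src"]]
    )

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_text(text)

    # Attach metadata to each chunk so it can be traced back to its document.
    return [
        {
            "doc_id": doc["id"],
            "chunk_type": "text",
            "chunk_index": i,
            "title": doc["title"],
            "description": doc["description"],
            "keywords": doc["keywords"],
            "src": doc["src"],
            "text": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```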
Utility Functions
create_output_dir()
Creates the output directory (tmp/catalog) if it doesn't exist.
hash_string(s: str) -> str
Generates a SHA-256 hash of the input string. Used to create unique document IDs.
# Example
doc_id = hash_string("forest inventory analysis")
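The hashing itself is a standard-library one-liner; a plausible implementation, assuming a SHA-256 digest of the UTF-8 encoded string, is:

```python
import hashlib

def hash_string(s: str) -> str:
    # SHA-256 hex digest of the UTF-8 encoded input; used as a stable document ID.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()
```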
strip_html_tags(text: str) -> str
Removes HTML tags from text and replaces newlines with spaces.
# Example
clean_text = strip_html_tags("<p>Forest <b>data</b></p>")
# Returns: "Forest data"
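Since beautifulsoup4 is already a dependency, a minimal sketch of this helper might use it to drop the tags before flattening newlines:

```python
from bs4 import BeautifulSoup

def strip_html_tags(text: str) -> str:
    # Parse the fragment, keep only the visible text, then replace newlines with spaces.
    plain = BeautifulSoup(text, "html.parser").get_text()
    return plain.replace("\n", " ").strip()
```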
get_keywords(item: dict) -> list
Extracts and cleans keywords from metadata items.
# Example
keywords = get_keywords({"keywords": "forest, inventory, analysis"})
# Returns: ["forest", "inventory", "analysis"]
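A sketch of this helper, assuming keywords arrive as a comma-separated string as in the example above:

```python
def get_keywords(item: dict) -> list:
    # Split a comma-separated keyword string and strip whitespace from each entry;
    # returns an empty list when the field is missing.
    raw = item.get("keywords", "")
    return [kw.strip() for kw in raw.split(",") if kw.strip()]
```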
merge_docs(*docs: List[Dict]) -> List[Dict]
Merges multiple document lists, removing duplicates based on document ID.
# Example
unique_docs = merge_docs(fsgeodata_docs, datahub_docs, rda_docs)
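Deduplication can be done with a dictionary keyed on document ID, keeping the first occurrence; a minimal sketch:

```python
from typing import Dict, List

def merge_docs(*docs: List[Dict]) -> List[Dict]:
    # Keep the first document seen for each ID; later duplicates are dropped.
    seen: Dict[str, Dict] = {}
    for doc_list in docs:
        for doc in doc_list:
            seen.setdefault(doc["id"], doc)
    return list(seen.values())
```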
find_duplicate_documents(documents: list) -> list
Identifies duplicate documents in a list based on their IDs.
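A sketch of duplicate detection using a Counter over document IDs (whether the real function returns IDs or whole documents is not specified here):

```python
from collections import Counter

def find_duplicate_documents(documents: list) -> list:
    # Count occurrences of each document ID and return the documents whose
    # ID appears more than once.
    counts = Counter(doc["id"] for doc in documents)
    return [doc for doc in documents if counts[doc["id"]] > 1]
```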
FSGeoData Functions
_download_fsgeodata_metadata()
- Scrapes the FSGeoData website for metadata links
- Downloads XML metadata files
- Stores files in tmp/catalog/
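The scraping details are not documented here; a hedged sketch, assuming the datasets page links to its XML metadata files via anchors ending in ".xml", could look like:

```python
import os
import requests
from bs4 import BeautifulSoup

FSGEODATA_URL = "https://data.fs.usda.gov/geodata/edw/datasets.php"
DEST_OUTPUT_DIR = "tmp/catalog"

def _download_fsgeodata_metadata() -> None:
    # Fetch the datasets page and collect links that look like XML metadata files.
    # Assumption: metadata links end in ".xml"; the real selector may differ.
    page = requests.get(FSGEODATA_URL, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    xml_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".xml")]

    os.makedirs(DEST_OUTPUT_DIR, exist_ok=True)
    for href in xml_links:
        # Resolve relative links against the EDW site before downloading (assumption).
        url = href if href.startswith("http") else f"https://data.fs.usda.gov/geodata/edw/{href}"
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        filename = os.path.join(DEST_OUTPUT_DIR, os.path.basename(href))
        with open(filename, "wb") as f:
            f.write(resp.content)
```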
_parse_fsgeodata_metadata() -> list
- Reads all XML files from the output directory
- Extracts title, abstract, and keywords
- Returns list of metadata dictionaries with:
  - id: Hash of the title
  - title: Dataset title
  - description: Abstract text
  - metadata_source_url: Source URL
  - keywords: List of theme keywords
  - src: "fsgeodata"
_fsgeodata() -> list
Main function that downloads and parses FSGeoData metadata.
DataHub Functions
_download_datahub_metadata()
- Downloads DCAT-US 1.1 JSON from USFS DataHub
- Stores as datahub_metadata.json
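A minimal sketch of this download, assuming the DCAT feed URL listed under Data Sources below:

```python
import os
import requests

DATAHUB_URL = "https://data-usfs.hub.arcgis.com/api/feed/dcat-us/1.1.json"
DEST_OUTPUT_DIR = "tmp/catalog"

def _download_datahub_metadata() -> None:
    # Fetch the DCAT-US 1.1 feed and write it to the output directory verbatim.
    resp = requests.get(DATAHUB_URL, timeout=60)
    resp.raise_for_status()
    os.makedirs(DEST_OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(DEST_OUTPUT_DIR, "datahub_metadata.json"), "w", encoding="utf-8") as f:
        f.write(resp.text)
```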
_parse_datahub_metadata() -> list
- Parses the DataHub JSON file
- Extracts metadata for each dataset
- Returns list of metadata dictionaries with:
  - id: Hash of the title
  - title: Dataset title
  - identifier: Unique identifier
  - description: Cleaned description
  - url: Dataset URL
  - keywords: List of keywords
  - src: "datahub"
_datahub() -> list
Main function that downloads and parses DataHub metadata.
RDA Functions
_download_rda_metadata()
- Downloads JSON from Research Data Archive web service
- Stores as rda_metadata.json
_parse_rda_metadata() -> list
- Parses the RDA JSON file
- Extracts metadata for each dataset
- Returns list of metadata dictionaries with:
  - id: Hash of the title
  - title: Dataset title
  - identifier: Unique identifier
  - description: Cleaned description
  - url: Dataset URL
  - keywords: List of keywords
  - src: "rda"
_rda() -> list
Main function that downloads and parses RDA metadata.
Data Sources
FSGeoData
- URL: https://data.fs.usda.gov/geodata/edw/datasets.php
- Format: XML metadata files
- Content: Geospatial datasets from the Forest Service
DataHub
- URL: https://data-usfs.hub.arcgis.com/api/feed/dcat-us/1.1.json
- Format: DCAT-US 1.1 JSON
- Content: USFS datasets available through ArcGIS Hub
RDA (Research Data Archive)
- URL: https://www.fs.usda.gov/rds/archive/webservice/datagov
- Format: JSON
- Content: Research datasets from the Forest Service
Data Flow
graph TD
A[Start] --> B[Create Output Directory]
B --> C{Command}
C --> D[Download FSGeoData XML]
D --> E[Download DataHub JSON]
E --> F[Download RDA JSON]
F --> G[Parse All Metadata]
G --> H[Merge & Deduplicate]
H --> I[Report Statistics]
Usage Examples
Complete Workflow
- Download all metadata:
python -m catalog.cli load_catalog_data
Output example:
Loading all catalog data.
Downloading all fsgeodata metadata.
Found 523 fsgeodata assets.
Done downloading all fsgeodata metadata!
-----
Downloading all datahub metadata.
Found 312 datahub assets.
Done downloading all datahub metadata!
-----
Downloading all rda metadata.
Found 89 rda assets.
Done downloading all rda metadata!
Done loading all catalog data!
Total unique assets: 924
Error Handling
The CLI currently uses basic error handling. Consider these improvements for production:
- Network errors: Add retry logic for failed downloads
- File I/O errors: Handle missing directories and permission issues
- Database errors: Add connection error handling and transaction management
- Parsing errors: Handle malformed XML/JSON gracefully
Performance Considerations
Future Enhancements
Potential improvements to consider:
- Progress bars: Add visual progress indicators for long-running operations
- Logging: Implement structured logging instead of print statements
- Configuration: Move hardcoded values to configuration files
- Validation: Add data validation for parsed metadata
- Incremental updates: Support updating only changed metadata
- Export formats: Add options to export metadata in different formats
- Search functionality: Add commands to search the embedded documents
Related Modules
- catalog.db: Database operations and vector storage
- catalog.schema: Data models (USFSDocument)
- catalog.api: API service implementation
- catalog.llm: Language model integration