Data Harvesting
Metadata must be harvested in order to build the supporting database and the vector database table. The metadata is harvested from the three sources referenced on the metadata sources page. For the proof of concept, the metadata from each source is collected, and a subset of the fields in each source's records is stored as a collection of JSON documents in a single JSON file. The fields used in each document object are:
- id: string
- title: string
- description: string
- keywords: list of keyword strings
- src: string describing where the document record originated (fsgeodata, rda, datahub)
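For illustration, a single harvested document object might look like the following (all values here are hypothetical):

```json
{
  "id": "3f6c9a…",
  "title": "Forest Inventory and Analysis Plot Data",
  "description": "Plot-level forest inventory measurements.",
  "keywords": ["forest", "inventory", "analysis"],
  "src": "fsgeodata"
}
```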
Overview
The cli.py module provides a command-line interface for managing USFS (United States Forest Service) metadata catalog operations. It handles downloading, parsing, and storing metadata from multiple USFS data sources, including FSGeoData, DataHub, and RDA (Research Data Archive).
Dependencies
The CLI requires the following Python packages:
typer # CLI framework
python-dotenv # Environment variable management
requests # HTTP requests
beautifulsoup4 # HTML/XML parsing
sentence-transformers # Text embeddings
langchain-text-splitters # Text chunking
Directory Structure
The CLI creates and uses the following directory structure for storing temporary files and output files:
tmp/
└── catalog/ # Default output directory (DEST_OUTPUT_DIR)
Commands Reference
load_catalog_data
Downloads and processes metadata from all configured sources (FSGeoData, DataHub, and RDA).
python -m catalog.cli load_catalog_data
What it does:
1. Downloads metadata from FSGeoData (XML files)
2. Downloads metadata from DataHub (JSON)
3. Downloads metadata from RDA (JSON)
4. Parses all metadata into a unified format
5. Merges documents and removes duplicates
6. Reports the total number of unique assets found
Output:
- Downloaded files stored in tmp/catalog/
- Console output showing progress and asset counts
Processing details:
- Combines title, description, keywords, and source into searchable text
- Splits the text into manageable chunks for embedding (see the sketch below)
- Maintains metadata for each chunk, including:
  - Document ID
  - Chunk type and index
  - Original title and description
  - Keywords and source
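The exact chunking parameters are not documented here; a minimal sketch of this step, assuming langchain's RecursiveCharacterTextSplitter with illustrative chunk sizes, might look like:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_document(doc: dict) -> list[dict]:
    """Split one harvested document into chunks with per-chunk metadata.

    Sketch only: the chunk_size/chunk_overlap values are illustrative,
    not the values used by cli.py.
    """
    # Combine the searchable fields into one text blob.
    text = " ".join(
        [doc["title"], doc["description"], " ".join(doc["keywords"]), doc["src"]]
    )

    splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=64)
    chunks = splitter.split_text(text)

    # Attach metadata to each chunk so it can be traced back to its document.
    return [
        {
            "doc_id": doc["id"],
            "chunk_type": "text",
            "chunk_index": i,
            "title": doc["title"],
            "description": doc["description"],
            "keywords": doc["keywords"],
            "src": doc["src"],
            "text": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```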
Utility Functions
create_output_dir()
Creates the output directory (tmp/catalog) if it doesn't exist.
hash_string(s: str) -> str
Generates a SHA-256 hash of the input string. Used to create unique document IDs.
# Example
doc_id = hash_string("forest inventory analysis")
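The hashing itself is a standard-library one-liner; a plausible implementation, assuming a SHA-256 digest of the UTF-8 encoded string, is:

```python
import hashlib

def hash_string(s: str) -> str:
    # SHA-256 hex digest of the UTF-8 encoded input; used as a stable document ID.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()
```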
strip_html_tags(text: str) -> str
Removes HTML tags from text and replaces newlines with spaces.
# Example
clean_text = strip_html_tags("<p>Forest <b>data</b></p>")
# Returns: "Forest data"
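Since beautifulsoup4 is already a dependency, a minimal sketch of this helper might use it to drop the tags before flattening newlines:

```python
from bs4 import BeautifulSoup

def strip_html_tags(text: str) -> str:
    # Parse the fragment, keep only the visible text, then replace newlines with spaces.
    plain = BeautifulSoup(text, "html.parser").get_text()
    return plain.replace("\n", " ").strip()
```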
get_keywords(item: dict) -> list
Extracts and cleans keywords from metadata items.
# Example
keywords = get_keywords({"keywords": "forest, inventory, analysis"})
# Returns: ["forest", "inventory", "analysis"]
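A sketch of this helper, assuming keywords arrive as a comma-separated string as in the example above:

```python
def get_keywords(item: dict) -> list:
    # Split a comma-separated keyword string and strip whitespace from each entry;
    # returns an empty list when the field is missing.
    raw = item.get("keywords", "")
    return [kw.strip() for kw in raw.split(",") if kw.strip()]
```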
merge_docs(*docs: List[Dict]) -> List[Dict]
Merges multiple document lists, removing duplicates based on document ID.
# Example
unique_docs = merge_docs(fsgeodata_docs, datahub_docs, rda_docs)
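Deduplication can be done with a dictionary keyed on document ID, keeping the first occurrence; a minimal sketch:

```python
from typing import Dict, List

def merge_docs(*docs: List[Dict]) -> List[Dict]:
    # Keep the first document seen for each ID; later duplicates are dropped.
    seen: Dict[str, Dict] = {}
    for doc_list in docs:
        for doc in doc_list:
            seen.setdefault(doc["id"], doc)
    return list(seen.values())
```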
find_duplicate_documents(documents: list) -> list
Identifies duplicate documents in a list based on their IDs.
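A sketch of duplicate detection using a Counter over document IDs (whether the real function returns IDs or whole documents is not specified here):

```python
from collections import Counter

def find_duplicate_documents(documents: list) -> list:
    # Count occurrences of each document ID and return the documents whose
    # ID appears more than once.
    counts = Counter(doc["id"] for doc in documents)
    return [doc for doc in documents if counts[doc["id"]] > 1]
```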
FSGeoData Functions
_download_fsgeodata_metadata()
- Scrapes the FSGeoData website for metadata links
- Downloads XML metadata files
- Stores files in tmp/catalog/
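The scraping details are not documented here; a hedged sketch, assuming the datasets page links to its XML metadata files via anchors ending in ".xml", could look like:

```python
import os
import requests
from bs4 import BeautifulSoup

FSGEODATA_URL = "https://data.fs.usda.gov/geodata/edw/datasets.php"
DEST_OUTPUT_DIR = "tmp/catalog"

def _download_fsgeodata_metadata() -> None:
    # Fetch the datasets page and collect links that look like XML metadata files.
    # Assumption: metadata links end in ".xml"; the real selector may differ.
    page = requests.get(FSGEODATA_URL, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    xml_links = [a["href"] for a in soup.find_all("a", href=True) if a["href"].endswith(".xml")]

    os.makedirs(DEST_OUTPUT_DIR, exist_ok=True)
    for href in xml_links:
        # Resolve relative links against the EDW site before downloading (assumption).
        url = href if href.startswith("http") else f"https://data.fs.usda.gov/geodata/edw/{href}"
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        filename = os.path.join(DEST_OUTPUT_DIR, os.path.basename(href))
        with open(filename, "wb") as f:
            f.write(resp.content)
```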
_parse_fsgeodata_metadata() -> list
- Reads all XML files from the output directory
- Extracts title, abstract, and keywords
- Returns list of metadata dictionaries with:
  - id: Hash of the title
  - title: Dataset title
  - description: Abstract text
  - metadata_source_url: Source URL
  - keywords: List of theme keywords
  - src: "fsgeodata"
_fsgeodata() -> list
Main function that downloads and parses FSGeoData metadata.
DataHub Functions
_download_datahub_metadata()
- Downloads DCAT-US 1.1 JSON from USFS DataHub
- Stores as datahub_metadata.json
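A minimal sketch of this download, assuming the DCAT feed URL listed under Data Sources below:

```python
import os
import requests

DATAHUB_URL = "https://data-usfs.hub.arcgis.com/api/feed/dcat-us/1.1.json"
DEST_OUTPUT_DIR = "tmp/catalog"

def _download_datahub_metadata() -> None:
    # Fetch the DCAT-US 1.1 feed and write it to the output directory verbatim.
    resp = requests.get(DATAHUB_URL, timeout=60)
    resp.raise_for_status()
    os.makedirs(DEST_OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(DEST_OUTPUT_DIR, "datahub_metadata.json"), "w", encoding="utf-8") as f:
        f.write(resp.text)
```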
_parse_datahub_metadata() -> list
- Parses the DataHub JSON file
- Extracts metadata for each dataset
- Returns list of metadata dictionaries with:
  - id: Hash of the title
  - title: Dataset title
  - identifier: Unique identifier
  - description: Cleaned description
  - url: Dataset URL
  - keywords: List of keywords
  - src: "datahub"
_datahub() -> list
Main function that downloads and parses DataHub metadata.
RDA Functions
_download_rda_metadata()
- Downloads JSON from Research Data Archive web service
- Stores as rda_metadata.json
_parse_rda_metadata() -> list
- Parses the RDA JSON file
- Extracts metadata for each dataset
- Returns list of metadata dictionaries with:
  - id: Hash of the title
  - title: Dataset title
  - identifier: Unique identifier
  - description: Cleaned description
  - url: Dataset URL
  - keywords: List of keywords
  - src: "rda"
_rda() -> list
Main function that downloads and parses RDA metadata.
Data Sources
FSGeoData
- URL: https://data.fs.usda.gov/geodata/edw/datasets.php
- Format: XML metadata files
- Content: Geospatial datasets from the Forest Service
DataHub
- URL: https://data-usfs.hub.arcgis.com/api/feed/dcat-us/1.1.json
- Format: DCAT-US 1.1 JSON
- Content: USFS datasets available through ArcGIS Hub
RDA (Research Data Archive)
- URL: https://www.fs.usda.gov/rds/archive/webservice/datagov
- Format: JSON
- Content: Research datasets from the Forest Service
Data Flow
graph TD
A[Start] --> B[Create Output Directory]
B --> C{Command}
C --> D[Download FSGeoData XML]
D --> E[Download DataHub JSON]
E --> F[Download RDA JSON]
F --> G[Parse All Metadata]
G --> H[Merge & Deduplicate]
H --> I[Report Statistics]
Usage Examples
Complete Workflow
- Download all metadata:
python -m catalog.cli load_catalog_data
Output example:
Loading all catalog data.
Downloading all fsgeodata metadata.
Found 523 fsgeodata assets.
Done downloading all fsgeodata metadata!
-----
Downloading all datahub metadata.
Found 312 datahub assets.
Done downloading all datahub metadata!
-----
Downloading all rda metadata.
Found 89 rda assets.
Done downloading all rda metadata!
Done loading all catalog data!
Total unique assets: 924
Error Handling
The CLI currently uses basic error handling. Consider these improvements for production:
- Network errors: Add retry logic for failed downloads
- File I/O errors: Handle missing directories and permission issues
- Database errors: Add connection error handling and transaction management
- Parsing errors: Handle malformed XML/JSON gracefully
Performance Considerations
Future Enhancements
Potential improvements to consider:
- Progress bars: Add visual progress indicators for long-running operations
- Logging: Implement structured logging instead of print statements
- Configuration: Move hardcoded values to configuration files
- Validation: Add data validation for parsed metadata
- Incremental updates: Support updating only changed metadata
- Export formats: Add options to export metadata in different formats
- Search functionality: Add commands to search the embedded documents
Related Modules
- catalog.db: Database operations and vector storage
- catalog.schema: Data models (USFSDocument)
- catalog.api: API service implementation
- catalog.llm: Language model integration