Version 1.0.0 makes big changes to the schema to make it much easier to extract clean, useful data from our `fetch`/`parse`/`normalize` stages.
High-level goal
The downstream pipelines that consume our data have to adapt to the wide variety of scraped data formats. To help us along, we are going to impose more structure on the schema formats so that the consumers of our data have to deal with fewer edge cases. In most cases, these changes should also make it easier to write correct ingestion stages.
Specific changes
- New pydantic enums to reduce ambiguity:
  - `State` for describing US states and territories, with USPS two-character abbreviations.
  - `ContactType`, with `"general"` and `"booking"` options.
  - `DayOfWeek` for the days Monday through Sunday and "public holidays".
  - `VaccineType`, with options for Pfizer/BioNTech, Moderna, Johnson & Johnson, and Oxford/AstraZeneca vaccines.
  - `VaccineSupply`, with options for vaccine stock status.
  - `WheelchairAccessLevel`, with various options for describing the wheelchair accessibility of the location.
  - `VaccineProvider`, for common parent organizations such as retail pharmacy chains.
  - `LocationAuthority`, for other authorities that identify locations, such as Google Places.
- Format validation for certain fields:
  - `Address`:
    - `zip` must be a ZIP or ZIP+4 code, if present.
    - `state` must be a `State`, if present.
  - `LatLng`:
    - `latitude` must be between -90 and 90, inclusive, if present.
    - `longitude` must be between -180 and 180, inclusive, if present.
  - `Contact`:
    - `contact_type` must be from the `ContactType` enum, if present.
    - `phone` must be in the format of a 10 or 11 digit US phone number, if present.
    - `website` must be an HTTP/HTTPS URL, if present.
    - `email` must be formatted as an email address, if present.
  - `OpenHour`:
    - `day` must be a `DayOfWeek`.
    - `open` has been renamed `opens` for better parallelism with `closes`.
  - `Vaccine`:
    - `vaccine` must be a `VaccineType`.
    - `supply_level` must be from the `VaccineSupply` enum, if present.
  - `Organization`:
    - `id` should be from the `VaccineProvider` enum, if possible, but may be a string or empty. Using an enum value makes it easier for consumers to interpret the value.
    - `id` must use only lowercase alphanumeric characters and underscores.
  - `Link`:
    - `authority` should be a `LocationAuthority`/`VaccineProvider`, if possible, but may be a string or empty. Using an enum value makes it easier to use these links to match locations.
    - `authority` must use only lowercase alphanumeric characters and underscores.
    - `uri` must be a URL, if present.
  - `Source`:
    - `source` must use only lowercase alphanumeric characters and underscores.
    - `id` must not use a space or colon. These must be replaced with another character, such as a dash.
    - `fetched_from_uri` must be a URL, if present.
  - `Location`:
    - `id` must consist of only lowercase alphanumeric characters and underscores, with precisely one colon. The colon should separate the part of the ID that reflects the data source and the part of the ID that reflects the specific location.
- Additional requirements:
  - Each `Contact` should have precisely one field (`phone`, `website`, `email`, `other`). Do not coalesce several of these into a single method.
  - The `opens` value should be before or the same as the `closes` value on an `OpenDate`.
  - The `opens` value should be before or the same as the `closes` value on an `OpenHour`.
  - The `id` of a `Location` must be prefixed with the source name (specified in `Location.source.source`).
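
To make a few of these rules concrete, here is a rough, standard-library-only sketch of how an ingestion stage might pre-check its output before emitting it. The function names and regular expressions are illustrative assumptions, not the schema's actual pydantic validators, and they cover only the ZIP, coordinate, `Location.id`, single-contact-method, and opening-hours rules described above.

```python
import re
from datetime import time

# Illustrative patterns only; the real schema may use different expressions.
ZIP_RE = re.compile(r"[0-9]{5}(-[0-9]{4})?")           # ZIP or ZIP+4
LOCATION_ID_RE = re.compile(r"[a-z0-9_]+:[a-z0-9_]+")  # exactly one colon

# The four contact-method fields named in the additional requirements.
CONTACT_FIELDS = ("phone", "website", "email", "other")


def is_valid_zip(zip_code: str) -> bool:
    """True for a 5-digit ZIP or a ZIP+4 such as 94110-1234."""
    return ZIP_RE.fullmatch(zip_code) is not None


def is_valid_lat_lng(latitude: float, longitude: float) -> bool:
    """True when both coordinates fall inside their inclusive ranges."""
    return -90 <= latitude <= 90 and -180 <= longitude <= 180


def is_valid_location_id(location_id: str, source_name: str) -> bool:
    """True when the ID is '<source>:<local part>' and the prefix is the source name."""
    if LOCATION_ID_RE.fullmatch(location_id) is None:
        return False
    prefix, _, _ = location_id.partition(":")
    return prefix == source_name


def has_single_contact_method(contact: dict) -> bool:
    """True when exactly one of the contact-method fields is set."""
    return sum(1 for field in CONTACT_FIELDS if contact.get(field)) == 1


def hours_in_order(opens: time, closes: time) -> bool:
    """True when `opens` is before or the same as `closes`."""
    return opens <= closes
```

With these helpers, `is_valid_location_id("example_source:site_42", "example_source")` passes, while `"example_source:site:42"` is rejected because it contains two colons, and a contact dictionary that sets both `phone` and `email` fails `has_single_contact_method` and should be split into two `Contact` entries.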