# Data Cleaning Scripts for BPEA Paper "Measuring the Labor Market at the Onset of the COVID-19 Crisis"

The order in which the scripts appear indicates potential dependencies: a script may rely on output from those that come before it.
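
As a rough illustration, the sketch below orders a set of script names by the numeric stage embedded in the file name (the `0` in `data_0_raw_clean.R`). Treating that number as a proxy for run order is an assumption, not something this README states, and `stage_of()` and `order_scripts()` are hypothetical helpers that are not part of the repository. Scripts without a stage number (for example `ppp_unzip.R`) are placed first, and scripts sharing a stage keep the order in which they were supplied.

```r
# Hypothetical helper: extract the numeric stage from names that follow the
# assumed pattern <prefix>_<stage>_<rest>, e.g. "data_0_raw_clean.R" -> 0.
# Names without a stage number (e.g. "ppp_unzip.R") yield NA.
stage_of <- function(scripts) {
  m <- regmatches(scripts, regexec("^[a-z]+_([0-9]+)_", scripts))
  vapply(
    m,
    function(x) if (length(x) == 2) as.integer(x[2]) else NA_integer_,
    integer(1)
  )
}

# Hypothetical helper: sort by stage, putting unnumbered scripts first.
# Ties keep their input order, so list same-stage scripts as in the tables.
order_scripts <- function(scripts) {
  scripts[order(stage_of(scripts), na.last = FALSE)]
}

homebase <- c("data_2_firm_ind_geo.R", "data_1_raw.R", "data_0_raw_clean.R")
run_order <- order_scripts(homebase)
print(run_order)

# To run them in that order (the directory is an assumption):
# for (s in run_order) source(file.path("path/to/scripts", s))
```

The stage number alone cannot order scripts within the same stage, so the listing order in the tables below remains the reference for those.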

## Crosswalks
| Script | Description |
|:----------|:----------|
| `cw_geo_nber.R` | Reads in and organizes NBER crosswalk for county, MSA, and state codes. |
| `cw_geo_zip.R` | Reads in and organizes HUD ZIP-FIPS crosswalk. |
| `cw_naics.R` | Reads in and organizes NAICS codes. |
| `cw_date_st_reg.R` | Reads in and organizes dates of stay-at-home orders and reopen orders. |
| `cw_date_school.R` | Reads in and organizes dates of school closure. |
| `cw_date_ui.R` | Reads in and organizes dates of UI distribution. |

## Homebase Data
| Script | Description |
|:----------|:----------|
| `data_0_raw_clean.R` | Functions that conduct the most basic cleaning of Homebase data. |
| `data_1_cw_geo_raw.R` | Outputs raw crosswalks for MSA and other geographical variables. |
| `data_1_cw_geo.R` | Manually fixes MSA and state for some observations and merges with the NBER geo crosswalk. These geographical variables have been deprecated. |
| `data_1_cw_geo_improved.R` | Improves Homebase geographical variables (county FIPS, MSA, and state codes) based on ZIP codes and the HUD crosswalk. |
| `data_1_cw_owner.R` | Produces a crosswalk from Homebase establishments to owners. |
| `data_1_raw.R` | Appends raw data together and conducts basic cleaning (no subsetting). Memory and time intensive. |
| `data_1_raw_update.R` | Appends raw data with daily updates (no subsetting). |
| `data_1_sel_firm_year.R` | Selects firms in the base period for 2018-2020. |
| `data_2_firm_ind_geo.R` | Aggregates data to firm-ind-geo level. |

## Homebase Worker Survey
| Script | Description |
|:----------|:----------|
| `ws_1_quest.R` | Reads in responses to each question. |
| `ws_1_raw.R` | Reads in and cleans raw survey data (including converting responses to factors where possible). |
| `ws_1_userid_var_sel.R` | Subsets respondents to those who are in the base period and associated with baseline firms. |
| `ws_2_hours_match.R` | Produces hours data for respondents who are in the base period and associated with baseline firms. |
| `ws_0_f_qsubset.R` | Defines the conditions for each question; used to produce tables. |
| `wso_1_tab_allQ.Rmd` | Produces a table for each question. |
| `wso_1_crosstab_Qsel.Rmd` | Produces crosstabs for selected questions. |

## SafeGraph Data
| Script | Description |
|:----------|:----------|
| `sg_1_core_poi.R` | Reads in Core POI files. |
| `sg_1_core_poi_merge.R` | Merges all versions of Core POI together based on locid, zip, naics. |
| `sg_1_visit_raw.R` | Reads in raw weekly patterns files from 2019-12-30 to 2020-05-11. Memory and time intensive. |
| `sg_1_visit_raw_update.R` | Reads in raw weekly patterns files after 2020-05-11 and appends them to data for earlier dates. Memory and time intensive. |
| `sg_1_visit_sum_stats.R` | Reads in and combines normalization statistics and metadata. |
| `sg_1_visit_sel_loc.R` | Finds the number of visits to locations in the base period. |
| `sg_2_visit_agg.R` | Aggregates to location-date level and merges with Core POI. |

## Other Data
| Script | Description |
|:----------|:----------|
| `cbp_1_raw.R` | Reads in and cleans County Business Patterns (CBP) data at the state level. |
| `ppp_unzip.R` | Unzips PPP data. |
| `ppp_1_code.R` | Finds the levels of selected variables in the PPP data. |
| `ppp_2_stc.R` | Categorizes states based on PPP amount (and also median UI replacement rates). |
| `kr_1_raw.R` | Reads in the raw Kronos data. |
