Data Library on Occupational Characteristics
This page provides access to the datasets used in "The Evolution of Work in the United
States" and "New Technologies and the Labor Market," and explains how they were constructed.
The datasets were built from text originally published in the Boston Globe, New York Times,
and Wall Street Journal. They cover the years 1940 to 2000 and contain information on
occupations' skill requirements, technology usage, work activities, work styles, and other
job characteristics.
Downloadable Data
A) By job title and year (338MB)
B) By job title (268MB)
C) By SOC code and year (57MB)
D) By SOC code (1.3MB)
E) By OCC code and year (37MB)
F) By OCC code (0.9MB)
Documentation
1) Codebook/Overview of the dataset
2) Description of the initial text cleaning
Python Notebook related to this step.
3) Details of the LDA procedure to classify ads as job ads vs. other types of ads
LDA results
Python Notebook related to this step.
4) Details of the procedure to discern job titles and the boundaries between ads
A list of job title words.
Python Notebook related to this step.
5) Mapping between occupational characteristics and words/phrases
Mapping between job titles, SOC codes, and OCC codes
An explanation for these mappings
Python Notebook: estimating the CBOW model
Python Notebook: retrieving SOC codes from the CBOW model.
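Illustrative sketch (not the authors' code): the last step above retrieves SOC codes from a CBOW word-embedding model. One plausible way to do this is to average the word vectors in a job title and pick the SOC title whose averaged vector is closest by cosine similarity. The example below assumes that approach. The vectors are invented toy values, the two SOC entries are a tiny illustrative subset, and the names TOY_VECTORS, SOC_TITLES, and nearest_soc are hypothetical. The linked notebooks document the actual procedure.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
# A real CBOW model would learn these from the newspaper ad text.
TOY_VECTORS = {
    "registered": [0.9, 0.1, 0.0],
    "nurse": [0.8, 0.2, 0.1],
    "nurses": [0.8, 0.2, 0.0],
    "retail": [0.1, 0.9, 0.1],
    "salesperson": [0.0, 0.8, 0.2],
    "salespersons": [0.1, 0.8, 0.2],
    "staff": [0.3, 0.3, 0.3],
}

# Tiny illustrative subset of SOC codes and titles (assumed for the example).
SOC_TITLES = {
    "29-1141": "registered nurses",
    "41-2031": "retail salespersons",
}


def title_vector(title, vectors):
    """Average the vectors of the words in `title` that the model knows."""
    words = [w for w in title.lower().split() if w in vectors]
    if not words:
        return None
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_soc(job_title, soc_titles=SOC_TITLES, vectors=TOY_VECTORS):
    """Return the SOC code whose title vector is most similar to `job_title`."""
    query = title_vector(job_title, vectors)
    if query is None:
        return None  # no word in the job title is in the vocabulary
    best_code, best_sim = None, -1.0
    for code, soc_title in soc_titles.items():
        target = title_vector(soc_title, vectors)
        if target is None:
            continue
        sim = cosine(query, target)
        if sim > best_sim:
            best_code, best_sim = code, sim
    return best_code


if __name__ == "__main__":
    for title in ["staff nurse", "retail salesperson", "unknown title"]:
        print(title, "->", nearest_soc(title))
```

With these toy vectors, "staff nurse" is matched to the registered nurses entry and "retail salesperson" to the retail salespersons entry. A title with no in-vocabulary words returns None, so the caller can decide how to handle it.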
This is the version of the data library that was up to date until May 15, 2019.
The downloadable data on this page were last updated April 7, 2018.
A more recent version of the data was posted here on May 15, 2019.