Data Library on Occupational Characteristics


          This page provides access to and explains the construction of the datasets used in "The
          Evolution of Work in the United States" and "New Technologies and the Labor Market." These
          datasets were constructed from text originally published in the Boston Globe, New York Times,
          and Wall Street Journal. The datasets contain information on occupations' skill requirements,
          technology usage, work activities, work styles, and other job characteristics from 1940
          to 2000.

Downloadable Data

          A) By job title and year (388MB)
          B) By job title (308MB)
          C) By SOC code and year (73MB)
          D) By SOC code (1.5MB)
          E) By OCC code and year (46MB)
          F) By OCC code (1.0MB)
          G) By job title, source, and year (405MB)
          H) By job title and source (313MB)

Documentation

          1) Codebook/Overview of the dataset
          2) Description of the initial text cleaning
              Python Notebook related to this step.
          3) Details of the LDA procedure to classify ads as job ads vs. other types of ads
              LDA results
              Python Notebook related to this step.
          4) Details of the procedure to identify job titles and the boundaries between ads
              A list of job title words.
              Python Notebook related to this step.
          5) Mapping between occupational characteristics and words/phrases
              Mapping between job titles, SOC codes, and OCC codes
              Description of the variables in the mapping between job titles and occupation codes.
              An explanation of these mappings
              Python Notebook: estimating the CBOW model
              Python Notebook: retrieving SOC codes from the CBOW model.
             


The downloadable data were last updated May 15, 2019.
For the first version of the data, available from July 1, 2017 to April 6, 2018, see the following link.
For the second version of the data, available from April 7, 2018 to May 15, 2019, see the following link.
For a summary of the differences between the first and second versions, see the following document.
For a summary of the differences between the second and third versions, see the following document.