Historical API

Sometimes it is preferable to retrieve all history or a daily update feed instead of directly querying the dataset. This is often the case for clients who use the API strictly to download a daily feed.

Dataset History List Endpoint

To retrieve a list of available history for a dataset:

GET /datasets/<DATASET_ID>/history/ HTTP/1.1
Authorization: token 01234567890123456789
X-API-Version: 20151130
Accept: application/json
curl -L "https://data.thinknum.com/datasets/<DATASET_ID>/history/" \
     -H 'Accept: application/json' \
     -H 'Authorization: token 01234567890123456789' \
     -H 'X-API-Version: 20151130'

For example, for the "Traction" dataset you would get a response similar to the following, indicating that there are 4 days of daily history and 2 months of monthly history available for download:

{
  "total": 5,
  "id": "traction",
  "history": [
    "2019-01-04",
    "2019-01-03",
    "2019-01-02",
    "2019-01-01",
    "2018-12",
    "2018-11"
  ]
}

At the beginning of every month, the last month's daily history is combined into a single monthly file.
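Because daily entries use the YYYY-MM-DD format and monthly entries use YYYY-MM, the two kinds can be separated by inspecting the date string. A minimal sketch, using the "Traction" response above as sample data:

```python
# Split a history list into daily (YYYY-MM-DD) and monthly (YYYY-MM) entries.
# The sample list mirrors the "Traction" example response above.
history = [
    "2019-01-04",
    "2019-01-03",
    "2019-01-02",
    "2019-01-01",
    "2018-12",
    "2018-11",
]

daily = [d for d in history if len(d) == 10]    # YYYY-MM-DD entries
monthly = [d for d in history if len(d) == 7]   # YYYY-MM entries

print(daily)    # the 4 daily files
print(monthly)  # the 2 monthly files
```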

Dataset History Download Endpoint

Once you have identified the dataset and historical day/month you're interested in, you can view the metadata for the historical file:

GET /datasets/<DATASET_ID>/history/<HISTORY_DATE> HTTP/1.1
Authorization: token 01234567890123456789
X-API-Version: 20151130
Accept: application/json
curl -L "https://data.thinknum.com/datasets/<DATASET_ID>/history/<HISTORY_DATE>" \
     -H 'Accept: application/json' \
     -H 'Authorization: token 01234567890123456789' \
     -H 'X-API-Version: 20151130'

For example, the metadata for the "2018-12" historical file for the "Traction" dataset would have a response similar to the following, indicating that there are 3,464,374 rows in the historical file:

{
  "date_updated": "2018-12",
  "status": 200,
  "total": 3464374,
  "id": "traction"
}
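The same metadata request can be assembled from Python. A sketch of a small helper that builds the URL and headers (the token below is a placeholder for your own, and the helper name is ours, not part of the API):

```python
API_HOST = "https://data.thinknum.com"

def history_metadata_request(dataset_id, history_date, token, api_version="20151130"):
    """Build the URL and headers for the history metadata endpoint."""
    url = f"{API_HOST}/datasets/{dataset_id}/history/{history_date}"
    headers = {
        "Accept": "application/json",
        "Authorization": f"token {token}",
        "X-API-Version": api_version,
    }
    return url, headers

url, headers = history_metadata_request("traction", "2018-12", "01234567890123456789")
# With the requests library you could then fetch the metadata:
# metadata = requests.get(url, headers=headers).json()
```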

To download a CSV of the historical data, change the "Accept" header to "text/csv":

GET /datasets/<DATASET_ID>/history/<HISTORY_DATE> HTTP/1.1
Authorization: token 01234567890123456789
X-API-Version: 20151130
Accept: text/csv
curl -L "https://data.thinknum.com/datasets/<DATASET_ID>/history/<HISTORY_DATE>" \
     -H 'Accept: text/csv' \
     -H 'Authorization: token 01234567890123456789' \
     -H 'X-API-Version: 20151130' \
     -o '<HISTORY_DATE>.csv'
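The download can also be scripted with the requests library. A sketch (the token is a placeholder; streaming the response to disk avoids holding a multi-million-row file in memory at once):

```python
import requests

def download_history_csv(dataset_id, history_date, token, dest_path):
    """Stream a historical CSV file to disk."""
    url = f"https://data.thinknum.com/datasets/{dataset_id}/history/{history_date}"
    headers = {
        "Accept": "text/csv",
        "Authorization": f"token {token}",
        "X-API-Version": "20151130",
    }
    with requests.get(url, headers=headers, stream=True) as r:
        r.raise_for_status()
        with open(dest_path, "wb") as fh:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
                fh.write(chunk)

# Example call (requires a valid token):
# download_history_csv("traction", "2018-12", "01234567890123456789", "2018-12.csv")
```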

Gzip compression is also available to speed up downloads and consume less bandwidth. Enable it through the standard HTTP "Accept-Encoding" header:

GET /datasets/<DATASET_ID>/history/<HISTORY_DATE> HTTP/1.1
Authorization: token 01234567890123456789
X-API-Version: 20151130
Accept: text/csv
Accept-Encoding: gzip
curl -L "https://data.thinknum.com/datasets/<DATASET_ID>/history/<HISTORY_DATE>" \
     -H 'Accept: text/csv' \
     -H 'Accept-Encoding: gzip' \
     -H 'Authorization: token 01234567890123456789' \
     -H 'X-API-Version: 20151130' \
     -o '<HISTORY_DATE>.csv' \
     --compressed

❗️

If you use cURL to download historical files, you must use version 7.58.0 or later due to a bug listed under CVE-2018-1000007.

History CSV Data Format

The data for historical updates is provided in standard CSV format with a header row.

Each record contains a unique identifier column, allowing you to sync your datastore with any additions or updates in the Thinknum dataset.
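One way to apply such a sync is to key your datastore on the unique identifier and overwrite records as rows arrive. A minimal in-memory sketch; the column name "id" and the sample data are hypothetical, since the actual unique identifier column varies by dataset:

```python
import csv
import io

# Hypothetical sample: "id" stands in for whatever the dataset's
# unique identifier column is actually called.
csv_text = (
    "id,company,value\r\n"
    "1,Acme,10\r\n"
    "2,Globex,20\r\n"
)

# Existing datastore, keyed by the unique identifier.
store = {"1": {"id": "1", "company": "Acme", "value": "9"}}

reader = csv.DictReader(io.StringIO(csv_text))
for row in reader:
    store[row["id"]] = row  # insert new rows, overwrite updated ones

print(store["1"]["value"])  # the existing record is updated from "9" to "10"
```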

How to parse History CSV using Python

Once you have downloaded a history CSV file, you can parse it with Python.

To parse the file with the "csv" module, which is part of the Python standard library:

import csv
headers = []
rows = []
with open('/mnt/.../job-listings.csv', 'r') as handle:
    reader = csv.reader(
        handle,
        delimiter=',',
        quotechar='"',
        escapechar='\\',
    )
    headers = next(reader)
    for row in reader:
        rows.append(row)

To parse the file with "pandas" instead, you will need to install the pandas library separately:

import pandas as pd
df = pd.read_csv(
    '/mnt/.../job-listings.csv',
    delimiter=',',
    quotechar='"',
    escapechar='\\'
)

History Test Script (python3)

The script below downloads the last 15 dates for a specified dataset and verifies that every row can be parsed by checking that each row has the same number of columns.

import codecs
import json
import requests
import contextlib
import csv

api_host = 'https://data.thinknum.com'
api_version = '20151130'
api_client_id = 'CLIENT_ID GOES HERE'
api_client_secret = 'CLIENT_SECRET GOES HERE'  # THIS KEY SHOULD NEVER BE PUBLICLY ACCESSIBLE
dataset_id = 'linkedin_company'

# ### STEP 1: Authorization
# Setup and send request to get an authorization token

payload = {
    'version': api_version,
    'client_id': api_client_id,
    'client_secret': api_client_secret
}
request_url = api_host + '/api/authorize'
r = requests.post(request_url, data=payload)

if r.status_code != 200:
    raise Exception('Failed to authorize: ' + r.text)

token_data = json.loads(r.text)

api_auth_token = token_data['auth_token']
api_auth_expires = token_data['auth_expires']
api_auth_headers = {
    "X-API-Version": api_version,
    "Authorization": f"token {api_auth_token}"
}

print('Authorization Token', api_auth_token)

# ### STEP 2: Sample history list endpoint
# Gets a list of all available history for dataset

request_url = f'{api_host}/datasets/{dataset_id}/history'
r = requests.get(request_url, headers=api_auth_headers)

if r.status_code != 200:
    raise Exception('Failed to GET: ' + r.text)

datasets_list = json.loads(r.text)

print('Available History', json.dumps(datasets_list['history']))

# ### STEP 3: Sample history file endpoint
# Test the history for the last 15 dates

dates = datasets_list['history'][0:15]

for date_to_fetch in dates:

    request_url = f'{api_host}/datasets/{dataset_id}/history/{date_to_fetch}'
    history_headers = api_auth_headers.copy()
    history_headers['Accept'] = 'text/csv'
    history_headers['Accept-Encoding'] = 'gzip'

    r = requests.get(request_url, headers=history_headers, stream=True)

    if r.status_code != 200:
        raise Exception('Failed to GET: ' + r.text)

    header_count = 0
    row_count = 0
    with contextlib.closing(r) as csv_stream:
        reader = csv.reader(
            codecs.iterdecode(csv_stream.iter_lines(), 'utf-8'),
            delimiter=',',
            quotechar='"',
            doublequote=False,
            escapechar='\\',
            lineterminator='\r\n'
        )
        for row in reader:
            row_count += 1
            col_count = len(row)

            if not header_count:
                header_count = col_count

            if header_count != col_count:
                raise Exception('Found row that was not parsed correctly', row_count, header_count, col_count)

    print(date_to_fetch, row_count)