Methodology

This page documents how the data behind this visualization was collected, cleaned, transformed, and aggregated. Every number shown in the dashboard can be traced back to the steps described below.

Data source

CAE Historical Database

The primary dataset is the Base de Datos Histórica published by Comisión INGRESA — the government body that administers Chile's state-backed student loan program (CAE).

The file is a semicolon-delimited .txt export with ~12 million rows and 42 columns, one row per student–year loan record, covering every CAE cohort from 2006 to 2024.

Key fields include: student identifier, registered gender, family region and income quintile, secondary-school dependency, institution name and type, program name, loan amounts requested and reference tuition, loan year, and beneficiary/outcome status.

Data cleaning

Loading & Normalization

The raw file was read using Python / pandas with latin1 encoding and PyArrow as the parsing engine.

  • Column names and all string cells were stripped of leading/trailing whitespace.
  • Monetary fields (arancel_solicitado and arancel_referencia) contained non-numeric characters (dots, commas, spaces) that were removed before casting to integer.
  • Year columns contained floating-point artifacts from the source export (a year stored as a decimal such as 2.006); values were multiplied by 1 000, rounded, and cast to nullable integers to recover the correct year.
  • Records with missing or zero tuition values were retained in the database but excluded from per-row monetary calculations.
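
The steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `year_cols` parameter, and the `anio` column in the usage note are hypothetical, and the frame is assumed to have been loaded with every cell as a string (for instance via `pd.read_csv(path, sep=";", encoding="latin1", engine="pyarrow", dtype=str)`).

```python
import pandas as pd

# Monetary fields named in the text.
MONEY_COLS = ["arancel_solicitado", "arancel_referencia"]


def clean_frame(df: pd.DataFrame, year_cols: list[str]) -> pd.DataFrame:
    """Normalize a raw frame whose cells are strings (hypothetical helper)."""
    out = df.copy()

    # Strip whitespace from column names and from every string cell.
    out.columns = [str(c).strip() for c in out.columns]
    for col in out.columns:
        is_text = (pd.api.types.is_object_dtype(out[col])
                   or pd.api.types.is_string_dtype(out[col]))
        if is_text:
            out[col] = out[col].str.strip()

    # Monetary fields: drop every non-digit character, then cast to a
    # nullable integer so empty or missing cells become <NA>.
    for col in MONEY_COLS:
        digits = out[col].str.replace(r"\D", "", regex=True)
        out[col] = pd.to_numeric(digits, errors="coerce").astype("Int64")

    # Year fields: undo the decimal artifact (2.006 -> 2006).
    for col in year_cols:
        scaled = pd.to_numeric(out[col], errors="coerce") * 1000
        out[col] = scaled.round().astype("Int64")

    return out


def has_valid_tuition(df: pd.DataFrame) -> pd.Series:
    """True only where both tuition fields are present and greater than zero."""
    mask = (df["arancel_solicitado"] > 0) & (df["arancel_referencia"] > 0)
    return mask.fillna(False).astype(bool)
```

Rows where `has_valid_tuition` is False stay in the frame; they would only be filtered out at the point where a per-row monetary calculation is made, for example `cleaned[has_valid_tuition(cleaned)]` after `cleaned = clean_frame(raw, year_cols=["anio"])`.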
Inflation adjustment

Adjusting to 2024 CLP

Because the dataset spans 19 cohort years (2006 to 2024), nominal loan amounts are not directly comparable across time. Each record's requested tuition was multiplied by a year-specific cumulative inflation factor to express amounts in 2024 Chilean Pesos (CLP):

adjusted = arancel_solicitado × (1 + inflation_factor)

The factors were set so that the 2024 factor is 0, which gives a multiplier of 1.0 (no adjustment). The 2006 factor is ~1.10, meaning a 2006 loan amount must grow by ~110 % (a multiplier of ~2.10) to match its 2024 purchasing-power equivalent. Factors decrease monotonically from 2006 toward zero in 2024.
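
As a worked illustration of the formula, the sketch below applies it to the two endpoint years. Only the 2006 value (~1.10) comes from the text; the 2024 factor of 0 is an assumption that follows from "no adjustment" under the `(1 + inflation_factor)` form, and the names `INFLATION_FACTOR` and `to_2024_clp` are hypothetical. Intermediate-year factors are not reproduced here and would have to come from the actual factor table.

```python
# Cumulative inflation factors keyed by loan year.
# Only the endpoints are filled in; the real table has one entry per year.
INFLATION_FACTOR = {
    2006: 1.10,  # approximate value quoted in the text
    2024: 0.0,   # assumption: base year, so the multiplier (1 + 0) is 1.0
}


def to_2024_clp(amount_clp: float, loan_year: int) -> float:
    """Express a nominal CLP amount in 2024 CLP: amount * (1 + factor)."""
    factor = INFLATION_FACTOR[loan_year]  # KeyError if the year is not listed
    return amount_clp * (1 + factor)
```

Under these assumptions a 1 000 000 CLP request from 2006 is expressed as roughly 2 100 000 CLP in 2024 terms, while a 2024 request is left unchanged.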

Currency

CLP → USD Conversion

Dollar figures shown in the dashboard use a fixed reference exchange rate of 950 CLP per USD.

This single rate is applied uniformly to all years after inflation adjustment, so comparisons across cohorts reflect real purchasing power rather than historical exchange-rate fluctuations. It is an approximation intended for readability, not financial precision.

Metrics

How Each Indicator Is Calculated

  • New loans granted — count of records where tipo_beneficiario = "NUEVO BENEFICIARIO", grouped by year, region, income quintile, and gender.
  • Total borrowed (USD) — sum of inflation-adjusted tuition requests divided by 950, for the selected filters.
  • Average loan (USD) — mean of inflation-adjusted amounts divided by 950, weighted by transaction count.
  • % of tuition financed — mean of arancel_solicitado / arancel_referencia, restricted to values in the range (0, 1] to exclude data-entry errors.
  • Average years financed — mean number of distinct loan years associated with each student identifier.
  • Graduation rate — SUM(egresos) / SUM(total_program_enrollments); counts records whose final status is graduation (egreso).
  • Desertion rate — SUM(deserciones) / SUM(total_program_enrollments); counts records whose final status is formal dropout (deserción).
  • Student paths — each unique student–program combination is classified as: direct graduation, graduation after suspension, direct desertion, desertion after suspension, or career change (within or across institutions).
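
Several of the indicators above can be sketched on a toy table. The column names `student_id`, `year`, and `adjusted_2024_clp`, the "RENOVANTE" status value, and all the numbers are hypothetical and chosen only to make the arithmetic easy to follow; the average loan is computed here as a plain per-record mean, which is one reading of "weighted by transaction count".

```python
import pandas as pd

CLP_PER_USD = 950  # fixed reference rate stated in the Currency section


def summarize(df: pd.DataFrame) -> dict:
    """Compute a few dashboard indicators for an already-filtered frame."""
    # New loans granted: count of new-beneficiary records per year.
    new_loans = (
        df[df["tipo_beneficiario"] == "NUEVO BENEFICIARIO"]
        .groupby("year")
        .size()
    )

    # % of tuition financed: keep only ratios in the range (0, 1].
    ratio = df["arancel_solicitado"] / df["arancel_referencia"]
    valid_ratio = ratio[(ratio > 0) & (ratio <= 1)]

    return {
        "new_loans_by_year": new_loans.to_dict(),
        "total_borrowed_usd": df["adjusted_2024_clp"].sum() / CLP_PER_USD,
        "average_loan_usd": df["adjusted_2024_clp"].mean() / CLP_PER_USD,
        "pct_tuition_financed": valid_ratio.mean(),
        # Average years financed: distinct loan years per student, averaged.
        "avg_years_financed": df.groupby("student_id")["year"].nunique().mean(),
    }


toy = pd.DataFrame({
    "student_id": ["A", "A", "B", "C"],
    "year": [2023, 2024, 2024, 2024],
    "tipo_beneficiario": ["NUEVO BENEFICIARIO", "RENOVANTE",
                          "NUEVO BENEFICIARIO", "NUEVO BENEFICIARIO"],
    "arancel_solicitado": [950_000, 1_900_000, 1_425_000, 3_000_000],
    "arancel_referencia": [1_900_000, 1_900_000, 1_900_000, 2_000_000],
    "adjusted_2024_clp": [950_000, 1_900_000, 1_900_000, 2_850_000],
})
result = summarize(toy)
```

In this toy table the last record has a requested-to-reference ratio of 1.5, so it falls outside (0, 1] and is left out of the tuition-financed mean while still counting toward the loan totals.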
Limitations

Caveats & Known Constraints

  • The exchange rate (950 CLP/USD) is fixed and does not reflect year-by-year fluctuations in the peso–dollar rate.
  • Family income quintile is self-reported by applicants at the time of the loan application and may not reflect current economic status.
  • Geographic classification is based on the student's family region of origin, not the location of the institution attended.
  • Outcome statuses (graduation, desertion) are reported by institutions to INGRESA and may include reporting lags of one to two years for recent cohorts.
  • Records with missing quintile, region, or gender values are included in database totals but excluded from breakdowns that require those fields, which may cause small discrepancies between aggregate and disaggregated totals.
  • The dataset is administrative by nature and reflects only students who held an active CAE loan; students who repaid early, transferred to gratuidad (free tuition), or whose records were merged may be undercounted in later years.
  • The dataset records reference tuition (arancel_referencia) rather than the actual tuition charged by each institution. Loan approval rates were therefore derived by comparing the amount requested (arancel_solicitado) against the reference tuition. Lower-income students tend to request a higher share of tuition costs yet face lower approval rates.
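
The caveat about missing quintile, region, or gender values can be illustrated with a small pandas example. The column names `quintil` and `loans` and the four rows are hypothetical; the point is only that a grouped breakdown silently drops rows whose grouping key is missing, so its parts can sum to less than the aggregate.

```python
import pandas as pd

records = pd.DataFrame({
    "quintil": [1, 1, 2, None],  # one record has no quintile
    "loans": [1, 1, 1, 1],
})

# Aggregate total counts every record, including the one with no quintile.
aggregate_total = int(records["loans"].sum())

# groupby drops missing keys by default (dropna=True), so this breakdown
# covers only the three records that have a quintile.
by_quintile = records.groupby("quintil")["loans"].sum()
breakdown_total = int(by_quintile.sum())

# Passing dropna=False keeps the missing group and reconciles the totals.
with_missing = records.groupby("quintil", dropna=False)["loans"].sum()
reconciled_total = int(with_missing.sum())
```

Here the gap between `aggregate_total` and `breakdown_total` is exactly the number of records with a missing quintile.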