1. Context

The Ministry of National Development (MND) is responsible for planning Singapore's land use and ensuring the provision of affordable and accessible public housing. In recent years, HDB resale prices have experienced upward pressure, driven by a combination of supply-demand dynamics and location-based preferences. One key factor influencing buyer behaviour is proximity to desirable primary schools.

Under the admission framework administered by the Ministry of Education (MOE), priority is given to students living within specified distance bands, especially within 1km of a school. As a result, flats located near "good" primary schools are often perceived to command higher resale prices as households seek to secure admission advantages.

This motivates the core policy question: does proximity to "good" primary schools produce a measurable resale premium after controlling for other housing and location factors?

2. Scope

2.1 Problem

MND currently lacks a robust, data-driven estimate of the impact of proximity to "good" primary schools on HDB resale prices. Existing analyses often do not adequately control for confounding factors such as flat characteristics, transport accessibility, and neighbourhood amenities, making it difficult to isolate the school-proximity effect.

Without a clear estimate, MND faces challenges in assessing whether school-driven demand contributes to resale-market distortions, and therefore in designing proportionate policy responses. There is also a need for a natural-language interface that allows non-technical policy officers to explore model predictions and findings without coding.

2.2 Success Criteria

Success is evaluated along two dimensions:

From a technical standpoint, the project should produce credible estimates of the price premium associated with proximity to high-demand primary schools.

Specifically:

Estimated school-proximity effects should be statistically significant at conventional levels (e.g. p < 0.05).

The 95% confidence interval for the estimated premium should be reasonably tight, with a total width below 5 percentage points, indicating sufficient precision for policy interpretation.

Results should be robust across specifications, meaning that the sign and magnitude of the school effect do not change drastically when additional controls are introduced.
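The two quantitative criteria above (significance and interval width) can be expressed as a simple check. The sketch below is illustrative only: the `PremiumEstimate` container, its field names, and the example numbers are assumptions rather than project outputs, and the robustness criterion is not covered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PremiumEstimate:
    """Hypothetical container for one estimated school premium, in percent."""
    point_pct: float
    ci_low_pct: float   # lower bound of the 95% confidence interval
    ci_high_pct: float  # upper bound of the 95% confidence interval
    p_value: float


def meets_technical_criteria(est: PremiumEstimate,
                             alpha: float = 0.05,
                             max_ci_width_pp: float = 5.0) -> bool:
    """Return True if the estimate is significant and its 95% CI is narrow enough."""
    significant = est.p_value < alpha
    ci_width_pp = est.ci_high_pct - est.ci_low_pct
    return significant and ci_width_pp < max_ci_width_pp


# Illustrative values only, not results from this project.
precise = PremiumEstimate(point_pct=2.0, ci_low_pct=0.5, ci_high_pct=3.5, p_value=0.01)
noisy = PremiumEstimate(point_pct=2.0, ci_low_pct=-2.0, ci_high_pct=6.0, p_value=0.30)
```

Here `meets_technical_criteria(precise)` passes both checks, while `noisy` fails on both significance and interval width.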

From a policy perspective, success is defined by the ability to generate actionable and interpretable insights. Outputs should be presented in a form that is understandable to non-technical stakeholders, including estimated premiums in both percentage and dollar terms.

2.3 Assumptions

First, the concept of a “good” primary school is classified using observable proxies rather than a single definitive metric, including:

offers the Gifted Education Programme (GEP) or holds Special Assistance Plan (SAP) status;

oversubscribed in Phase 2B of the Primary 1 registration exercise;

affiliated to a secondary school and also oversubscribed in Phase 2C.

This definition was designed to reflect both institutional prestige from official programmes and resources, and revealed demand from past balloting data.
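The three proxies above can be written as a small classification rule. This is a minimal sketch that assumes a school counts as "good" when it satisfies any one of the three proxies; the text does not state how they are combined, and the `School` fields are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class School:
    """Hypothetical per-school record; field names are illustrative."""
    name: str
    has_gep: bool
    has_sap: bool
    oversubscribed_2b: bool
    has_affiliated_secondary: bool
    oversubscribed_2c: bool


def is_good_school(s: School) -> bool:
    """Assumption: a school is 'good' if ANY of the three proxies holds."""
    prestige = s.has_gep or s.has_sap
    demand_2b = s.oversubscribed_2b
    affiliated_and_demanded = s.has_affiliated_secondary and s.oversubscribed_2c
    return prestige or demand_2b or affiliated_and_demanded


# Made-up examples, not real schools.
example_a = School("Example A", False, True, False, False, False)   # SAP only
example_b = School("Example B", False, False, False, True, False)   # affiliated, not oversubscribed
```

Under this rule `example_a` is classified as good, while `example_b` is not, because affiliation alone does not qualify without Phase 2C oversubscription.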

Second, in the OneMap-based routing pipeline, OneMap API calls are used mainly for nearest-distance MRT and mall features, while corresponding 10-minute threshold-count features are computed using Euclidean approximations to keep runtime and API usage tractable. This is a trade-off between route realism and computational or quota constraints.

Third, causal interpretation remains conditional. While modelling techniques like regression discontinuity design aim to approximate causal inference, they cannot fully eliminate selection effects or unobserved confounders. Hence, results should be interpreted as conditional approximations or local treatment effects, rather than definitive causal estimates.

3. Methodology

3.1 Technical Assumptions

The project makes some technical assumptions:

Spatial layers are normalized to WGS84 for ingestion and projected to SVY21 (EPSG:3414) where meter-based operations are required. This ensures consistent buffering and distance logic.

School boundaries are constructed by joining school points to URA master-plan land-use polygons, with de-duplication rules for repeated URA object IDs. 1km and 2km Euclidean buffers are subsequently generated around these boundaries. This step assumes that data can be shared across agencies in Singapore, which requires collaboration between URA, MOE and HDB.

For routing features, the OneMap implementation uses candidate pre-filtering by Euclidean radius and nearest-candidate cap (k). The latest optimization keeps OneMap calls for nearest mall and MRT distances while computing 10-minute count features from Euclidean thresholds. Additional deduplication groups repeated origin coordinates to reduce repeated API calls.
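The pre-filtering and deduplication logic described above can be sketched as follows. This is not the project's implementation: `route_fn` stands in for whatever wraps the OneMap routing call, and the default `radius_m` and `k` values are placeholders. Coordinates are assumed to be in metres (for example SVY21).

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in metres


def shortlist(origin: Point, candidates: Sequence[Point],
              radius_m: float, k: int) -> List[Point]:
    """Keep candidates within radius_m of origin, then the k closest of those."""
    with_dist = [(math.dist(origin, c), c) for c in candidates]
    in_range = [(d, c) for d, c in with_dist if d <= radius_m]
    in_range.sort(key=lambda pair: pair[0])
    return [c for _, c in in_range[:k]]


def nearest_routed_distance(origins: Sequence[Point],
                            candidates: Sequence[Point],
                            route_fn: Callable[[Point, Point], float],
                            radius_m: float = 2000.0,
                            k: int = 3) -> List[Optional[float]]:
    """Routed distance to the nearest candidate for each origin.

    Repeated origin coordinates are routed once and reused, so the number of
    route_fn calls is at most k per distinct origin.
    """
    cache: Dict[Point, Optional[float]] = {}
    for origin in set(origins):
        routed = [route_fn(origin, c)
                  for c in shortlist(origin, candidates, radius_m, k)]
        cache[origin] = min(routed) if routed else None
    return [cache[o] for o in origins]
```

An origin with no candidate inside the radius yields `None` rather than triggering any routing call, which is one way to keep quota usage bounded.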

Model-wise, resale price is modelled as log(resale_price) to stabilise variance and to permit an approximate percentage interpretation of a coefficient beta through exp(beta)-1. In the OLS specifications, time effects are absorbed through month fixed effects and location effects through town fixed effects.
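The exp(beta)-1 conversion can be illustrated with a short helper. This is a sketch only: the function names are introduced here for illustration, and `reference_price` is whatever price level the reader chooses to evaluate the premium at (for instance a local mean resale price).

```python
import math


def log_coef_to_pct(beta: float) -> float:
    """Convert a coefficient from a log-price model into a percentage effect."""
    return math.expm1(beta) * 100.0  # expm1(beta) == exp(beta) - 1


def pct_to_dollars(pct: float, reference_price: float) -> float:
    """Express a percentage effect in dollars at a chosen reference price."""
    return pct / 100.0 * reference_price


# A coefficient of log(1.05) corresponds to a 5% premium.
five_pct = log_coef_to_pct(math.log(1.05))
```

For small coefficients the percentage effect is close to 100 × beta, which is why log-price coefficients are often read directly as approximate percentages.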

3.2 Data

| Feature | Description | Source |
| --- | --- | --- |
| MRT Station Exits | A GeoJSON that provides vector coordinates of points showing the locations of exits of all MRT stations in Singapore. This is joined with a separate dataset that contains MRT/LRT line information. | MRT locations: data.gov.sg MRT dataset; MRT lines: Kaggle MRT/LRT stations in Singapore |
| Bus Stops | A GeoJSON that provides vector coordinates of bus stop locations in Singapore. | data.gov.sg bus stop locations |
| Shopping Malls | A GeoJSON or coordinate dataset that provides vector coordinates of shopping mall locations. | Kaggle shopping mall coordinates |
| Supermarkets | A GeoJSON that provides vector coordinates of supermarket locations in Singapore. | data.gov.sg supermarket locations |
| Hawker Centres | A GeoJSON that provides vector coordinates of NEA market and food centre locations in Singapore. | data.gov.sg hawker centre locations |
| Parks | A GeoJSON that provides vector coordinates of parks in Singapore. | data.gov.sg parks dataset |
| URA Master Plan | A GeoJSON layer that provides vectorised land parcel designation. | data.gov.sg URA Master Plan dataset |
| HDB Existing Buildings | A GeoJSON that provides vector objects of built HDB buildings. | data.gov.sg HDB existing buildings dataset |

Current coverage statistics in generated artifacts:

  • Total schools in subscription table: 179
  • "Good schools" selected: 44
  • Shopping centres mapped: 155
  • MRT exits tagged: 597
  • HDB polygons loaded: 13,386
  • Resale address points matched to HDB polygons: 9,568
  • Unmatched address points: 28

3.3 Experimental Design

The modelling strategy in this project separates price prediction from school-premium estimation.

A hedonic pricing model was first developed as the main predictive framework for HDB resale prices. Because the feature space contains many correlated engineered features and categorical controls, Ridge regression was used as the final predictive model, since its L2 regularisation helps stabilise coefficient estimates under multicollinearity. A reduced-feature specification with floor-area rebalancing was ultimately selected because it achieved the strongest overall holdout performance.

Concurrently, an OLS hedonic specification was retained for interpretability and diagnostics. From there, school-related coefficients were inspected, and we traced how their sign and magnitude changed as additional controls were introduced. This showed that the pooled hedonic school coefficient was highly unstable across specifications, making it unsuitable as our main estimate of the school premium.

Thus, the main method used to estimate the school premium was a school-specific regression discontinuity design (RDD) around the 1 km school boundary. This provides a more local comparison of flats just inside and outside the admission threshold, making it better suited than the pooled hedonic model for identifying school-related price effects. A pooled interaction-based RDD was then used to compare whether the boundary premium differs between good and non-good schools.
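The idea of a local comparison at the 1 km boundary can be sketched as a basic local linear RDD. The version below is a simplified, uncontrolled variant with a uniform kernel: it fits a separate straight line on each side of the cutoff within a bandwidth and reports the gap between the two fitted values at the boundary. The function names and defaults are assumptions for illustration, and the sketch omits the controls and standard errors a full analysis would need.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    if len(xs) < 2:
        raise ValueError("need at least two observations on each side")
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct distances on each side")
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


def boundary_jump(dist_m: Sequence[float], log_price: Sequence[float],
                  cutoff_m: float = 1000.0, bandwidth_m: float = 100.0) -> float:
    """Estimated jump in log price at the cutoff (inside minus outside)."""
    in_x: List[float] = []
    in_y: List[float] = []
    out_x: List[float] = []
    out_y: List[float] = []
    for d, y in zip(dist_m, log_price):
        x = d - cutoff_m  # centre the running variable at the boundary
        if -bandwidth_m <= x <= 0:
            in_x.append(x)
            in_y.append(y)
        elif 0 < x <= bandwidth_m:
            out_x.append(x)
            out_y.append(y)
    a_inside, _ = fit_line(in_x, in_y)
    a_outside, _ = fit_line(out_x, out_y)
    # Because x is centred, each intercept is the fitted value at the boundary.
    return a_inside - a_outside
```

The returned value is in log points, so it can be converted to an approximate percentage premium with exp(jump)-1.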

The experimental workflow has two layers: feature engineering and modelling.

At feature-engineering level, the sequence is:

Build school-boundary entities by joining school points to URA polygons.

Construct 1km and 2km school buffers and classify school tier.

Match resale address points to HDB polygons.

For each polygon-linked address, compute school exposure counts (school_count_, good_school_count_) by buffer intersection.

Compute accessibility features (nearest mall and MRT walking distance, and nearby amenity counts).

Export a transaction-level table with all engineered covariates.
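The school exposure step in this sequence can be approximated with a simple distance count. Note that this sketch swaps the polygon-buffer intersection described above for point-to-point Euclidean distances, purely to keep the example dependency-free; the output key names are hypothetical and are not the project's column names. Coordinates are assumed to be in metres.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in metres


def count_within(origin: Point, points: Sequence[Point], radius_m: float) -> int:
    """Number of points within radius_m of origin (Euclidean distance)."""
    return sum(1 for p in points if math.dist(origin, p) <= radius_m)


def school_exposure(flat_xy: Point,
                    schools: Sequence[Tuple[Point, bool]],
                    radii_m: Sequence[float] = (1000.0, 2000.0)) -> Dict[str, int]:
    """Count all schools and 'good' schools near a flat for each radius.

    schools is a sequence of (location, is_good) pairs.
    """
    all_xy = [xy for xy, _ in schools]
    good_xy = [xy for xy, is_good in schools if is_good]
    exposure: Dict[str, int] = {}
    for r in radii_m:
        label = f"{int(r)}m"
        exposure[f"n_schools_{label}"] = count_within(flat_xy, all_xy, r)
        exposure[f"n_good_schools_{label}"] = count_within(flat_xy, good_xy, r)
    return exposure


# Made-up coordinates for illustration.
example = school_exposure(
    (0.0, 0.0),
    [((500.0, 0.0), True), ((1500.0, 0.0), False), ((0.0, 2500.0), True)],
)
```

With these made-up inputs, one good school falls inside 1 km, a second (non-good) school is added at 2 km, and the third school lies outside both radii.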

At modelling level (Hedonic-Model branch), three complementary strategies are used:

Predictive and interpretable hedonic models: Ridge for predictive stability and OLS for coefficient interpretation;

School-specific boundary RDD: local linear specifications with multiple bandwidths and controlled versus uncontrolled variants;

Town-level regressions: separate models for heterogeneous premium estimation by town.

The hedonic model thus provides broad price prediction and association patterns, while the RDD serves as the main local framework for estimating school premiums.

Method alternatives were considered but not prioritized in this phase:

| Candidate approach | Why not primary in this phase |
| --- | --- |
| Single pooled OLS only | High interpretability but weaker predictive stability under multicollinearity |
| Tree boosting as core model | Strong predictive power but weaker direct coefficient interpretability for policy-facing effect decomposition |
| Full causal design only (no predictive model) | Better identification focus but loses practical forecasting and residual diagnostics benefits |
| One universal treatment premium | Empirically inconsistent with town-level heterogeneity observed in outputs |

These alternatives were assessed conceptually, but did not improve model usefulness enough relative to the chosen stack in this project phase.

4. Findings

4.1 Results

School-Specific RDD Performance

| School Name | Group | Inside_n | Outside_n | Premium ($) | Mean Premium (%) |
| --- | --- | --- | --- | --- | --- |
| Admiralty Primary School | good | 832 | 261 | 7512.058518 | 1.475288 |
| Ahmad Ibrahim Primary School | non-good | 465 | 737 | -2995.15129 | -0.6386197 |
| Ai Tong School | good | 331 | 428 | -12607.09518 | -2.2545344 |
| Alexandra Primary School | non-good | 730 | 357 | 9192.199613 | 1.197531 |

The results show substantial heterogeneity across schools. Most schools exhibit sizeable, positive local premiums, indicating that flats just inside the 1km boundary command higher resale prices than comparable flats just outside. Other schools show smaller or statistically weak effects, and a few show negative estimates. This heterogeneity suggests that the school premium is not uniform across Singapore, but depends heavily on the specific school and the local neighbourhood market. As a result, a single national school-premium estimate would mask meaningful local variation.

From hedonic outputs (hedonic_model/outputs/metrics.json):

| Metric | Value |
| --- | --- |
| Train R² (log scale) | 0.909 |
| Test R² (log scale) | 0.915 |
| Test RMSE (SGD) | 58,395.84 |
| Test MAE (SGD) | 43,620.93 |
| OLS premium estimate for good_school_within_1km | -1.62% |
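Because the model is fitted on log prices while the error metrics are reported in SGD, it may help to see how the two scales can be evaluated side by side. The sketch below assumes predictions are converted back to dollars with a plain exp() and no retransformation correction; whether the project applies such a correction is not stated, and the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination on whatever scale the inputs are in."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def evaluate_log_model(log_true: Sequence[float],
                       log_pred: Sequence[float]) -> Dict[str, float]:
    """R2 on the log scale; RMSE and MAE after converting back to dollars."""
    sgd_true = [math.exp(v) for v in log_true]
    sgd_pred = [math.exp(v) for v in log_pred]  # assumption: plain exp()
    errors = [t - p for t, p in zip(sgd_true, sgd_pred)]
    n = len(errors)
    return {
        "r2_log": r_squared(log_true, log_pred),
        "rmse_sgd": math.sqrt(sum(e * e for e in errors) / n),
        "mae_sgd": sum(abs(e) for e in errors) / n,
    }
```

RMSE penalises large misses more heavily than MAE, so a gap between the two (as in the table above) suggests that a minority of transactions are predicted with comparatively large errors.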

‘Good’ School Effect

To assess whether ‘good’ schools command larger local housing premiums than ‘non-good’ schools, the school-specific RDD estimates were grouped according to our school classification and compared at the preferred controlled 100m bandwidth.

Under this specification, the mean estimated local boundary premium was approximately S$8,605 for good schools, compared with S$4,803 for non-good schools. This implies that the average local boundary premium was about S$3,802 higher for good schools.

However, a higher average does not necessarily imply a statistically significant difference. To test this more formally, a pooled interaction-based RDD was estimated at the transaction level, where the treatment effect at the school boundary was allowed to vary according to whether the school was classified as ‘good’. Under the controlled 100m specification, this interaction model estimated an additional premium of approximately 0.35%, or around S$1,837 at the local mean resale price, for ‘good’ schools relative to ‘non-good’ schools. However, this difference was not statistically significant at conventional levels (p = 0.138).

Taken together, these results suggest that the premium near good schools is descriptively larger in dollar terms, but the evidence is not strong enough under the most local controlled specification to conclude that the difference between ‘good’ and ‘non-good’ schools is statistically significant. This may reflect substantial heterogeneity across schools, as well as the fact that even within narrow bandwidths, local housing resale prices remain noisy and uneven.
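The descriptive group comparison amounts to averaging school-level estimates within each group and taking the difference. The sketch below uses only the four school rows excerpted in the results table, so its output will not match the full-sample means reported above, and those rows may come from a different specification than the preferred controlled 100m one. The function name is introduced here for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# (school, group, premium in S$) copied from the four excerpted table rows.
ROWS = [
    ("Admiralty Primary School", "good", 7512.058518),
    ("Ahmad Ibrahim Primary School", "non-good", -2995.15129),
    ("Ai Tong School", "good", -12607.09518),
    ("Alexandra Primary School", "non-good", 9192.199613),
]


def mean_premium_by_group(rows: Iterable[Tuple[str, str, float]]) -> Dict[str, float]:
    """Average the school-level dollar premium within each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for _school, group, premium in rows:
        grouped[group].append(premium)
    return {group: mean(values) for group, values in grouped.items()}


excerpt_means = mean_premium_by_group(ROWS)

# The reported full-sample gap is simply the difference of the two group means.
reported_gap = 8605 - 4803
```

A simple difference in means like this carries no measure of uncertainty, which is why the pooled interaction model is needed to test whether the gap is statistically distinguishable from zero.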

4.2 Discussion

From a business perspective, the results suggest that proximity to high-demand primary schools does influence HDB resale prices, but the effect is modest, uneven, and highly dependent on local context. This implies that school-related housing demand may contribute to resale price pressure in specific neighbourhoods. Thus, policy responses should be targeted and location-sensitive, rather than based on the assumption that all good schools generate the same level of housing-market distortion.

The technical results also translate into two forms of business value. First, the hedonic model achieved strong predictive performance, with an out-of-sample R² above 0.9, allowing the system to provide credible resale price estimates for policy exploration and scenario analysis. Second, the RDD findings show that while good schools have a descriptively higher average local premium, the difference is not statistically significant under the strictest local controlled specification (p = 0.138). This means that school access is plausibly relevant to resale prices and helps identify where localised pressure may exist, but its effect is likely also shaped, and often eclipsed, by other location factors. Thus, the main benefit lies in improved visibility into where school-related demand may matter, while the main cost is that policy interpretation must remain cautious and specification-aware.

4.3 Recommendations

For the next phase, the project should not yet be treated as a fully operational engine for estimating school premiums. Instead, it is better positioned as a decision-support and exploratory analysis tool. The predictive hedonic model is strong enough to support internal price estimation and scenario exploration, while the school-specific RDD results are useful for highlighting where local school-related housing pressure may be present. However, the RDD results should be used with caution rather than as a direct basis for policy intervention.

From a data and modelling perspective, several improvements could strengthen future versions. First, adopt a more tiered reporting standard for policy users by presenting pooled estimates alongside town-specific ranges and uncertainty intervals, so that results are not over-generalised into one universal premium. Second, the RDD analysis would benefit from stronger local comparability checks and, if possible, more precise geolocation than address-point matching. Third, future work could prioritise richer upstream datasets, especially variables that are currently missing but likely important for price formation, such as housing conditions or more time-sensitive school demand information.

Overall, we recommend retaining the current system as a useful internal analytical tool, while treating further deployment, richer data collection, and stronger identification or causal checks as the priority for the next project phase.

4.4 Limitations, Bias Risks, and Mitigations

This study has several limitations. First, the definition of a ‘good’ primary school is based on observable proxies and may not fully capture how households perceive school quality. Different families may value schools for different reasons, including reputation, academic outcomes, or programme offerings, some of which are not directly observable or easily quantifiable. To mitigate this, the classification was designed to combine both institutional markers and revealed demand, but it should still be interpreted as an operational grouping rather than a definitive ranking.

Second, the analysis assumes reasonable cross-agency consistency across HDB, MOE, URA, and geospatial datasets, such that land use, school locations, and building references align sufficiently for feature construction. The walking-distance variables also assume that the underlying routing and map layers are accurate and up to date. In practice, the project uses currently available geospatial layers as proxies for conditions over the 2017–2025 transaction period, which introduces possible temporal mismatch. For example, some amenities, routes, or transport connections used in the model may not have existed in the same form at the time of earlier transactions. This risk was mitigated by using consistent and well-documented data sources, but some measurement error may remain.

Finally, some housing characteristics that may materially affect resale prices are not observed in the available data, including renovation quality, interior condition, unit orientation, and view. As a result, some residual variation in resale prices remains unexplained. This limitation is partly addressed by the strong predictive performance of the final hedonic model, but it cannot be eliminated entirely without richer property-level data.

5. System Architecture

5.1 Overview

The intended architecture has four layers: frontend, FastAPI backend, offline model artifacts, and an LLM query layer. The backend (api branch) handles retrieval, filtering, and inference; artifacts are precomputed offline and loaded at runtime.

5.2 Model Serving

Models are trained offline and served from serialized artifacts (ridge_pipeline.pkl, metrics.json, ols_coefficients.csv, and RDD/town-premium tables under data/ on the api branch). Prediction requests are validated, defaults are applied for missing fields, features are engineered, and ridge inference is executed.
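The request path described above (validate, apply defaults, engineer features, run inference) can be sketched in a few functions. Everything named here is hypothetical: the field names, default values, and feature list are placeholders, and `model` is any object exposing a `predict` method that returns log prices, standing in for the deserialized Ridge pipeline.

```python
import math
from typing import Any, Dict, List

# Hypothetical required fields and defaults, for illustration only.
REQUIRED_FIELDS = ("floor_area_sqm",)
DEFAULTS: Dict[str, Any] = {"storey_mid": 8, "remaining_lease_years": 70.0}


def prepare_request(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Validate required fields, then fill in defaults for missing ones."""
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if payload["floor_area_sqm"] <= 0:
        raise ValueError("floor_area_sqm must be positive")
    return {**DEFAULTS, **payload}  # caller-supplied values override defaults


def engineer_features(request: Dict[str, Any]) -> List[float]:
    """Turn a validated request into a model-ready feature row."""
    return [
        math.log(request["floor_area_sqm"]),
        float(request["storey_mid"]),
        float(request["remaining_lease_years"]),
    ]


def predict_price(payload: Dict[str, Any], model: Any) -> float:
    """Full path: validate, default, engineer, predict, convert to dollars."""
    features = engineer_features(prepare_request(payload))
    log_price = model.predict([features])[0]
    return math.exp(log_price)
```

Keeping validation and defaulting separate from feature engineering makes it easier to report to the user which inputs were assumed rather than supplied.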

5.3 LLM Interface

flowchart LR
    U["User Question"] --> F["Frontend Chat Input"]
    F <--> L["LLM Parser"]
    L --> T["Agent Tools"]
    T --> B["Backend API"]
    B <--> D["Model/data look up"]
    B --> L
    F --> R["User Response"]

Natural-language queries are mapped to endpoint intents and routed to backend/model outputs. Flow: User -> LLM/parser -> backend -> model/artifact -> response.
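To make the intent-mapping step concrete, the sketch below replaces the LLM parser with a plain keyword matcher. It is a stand-in for illustration, not the project's parser, and the intent names, keywords, and endpoint paths are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical intents and endpoint paths, for illustration only.
INTENT_ROUTES: Dict[str, str] = {
    "predict_price": "/predict",
    "school_premium": "/rdd/school",
    "model_metrics": "/metrics",
}

INTENT_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "predict_price": ("predict", "how much", "price of"),
    "school_premium": ("premium", "school"),
    "model_metrics": ("accuracy", "rmse", "r2"),
}


def route_question(question: str) -> Optional[str]:
    """Return the endpoint whose keywords best match the question, if any."""
    text = question.lower()
    scores = {
        intent: sum(keyword in text for keyword in keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best_intent = max(scores, key=lambda intent: scores[intent])
    if scores[best_intent] == 0:
        return None  # no keyword matched; the caller should ask for clarification
    return INTENT_ROUTES[best_intent]
```

For example, `route_question("What is the premium near Ai Tong School?")` resolves to the school-premium endpoint, while a greeting with no matching keywords returns `None`.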