Held-out test set (Oct–Dec 2016). The DQN agent controls only the
battery; the grid import/export is whatever is left over after the battery and
solar meet the factory load.
How to read this dashboard
Grid power is the power exchanged with the utility:
P_grid = P_load − P_PV + P_battery. Positive = importing (you pay),
negative = exporting (you get paid).
The agent is a Deep Q-Network. Every 15 minutes it picks a battery
charge/discharge power to minimise the daily electricity bill while keeping
the grid exchange smooth and the battery within safe state-of-charge limits.
The strategy it learns is price arbitrage: charge
the battery when electricity is cheap (or negatively priced), discharge
it when electricity is expensive — shifting the factory's demand in time.
The benchmark (LP) is a perfect-foresight Linear Program: the theoretical
cheapest possible bill, dashed in the charts. The agent has no future knowledge beyond
day-ahead prices, so closing the gap to the LP is the headline result.
▶ Press Play below to animate any day step-by-step — the live panel narrates
what the agent is doing and why at each 15-minute step.
Pick a test day
Press ▶ Play to watch the agent run the battery
through the day, one 15-minute step at a time.
Live panel readouts: Price now · Agent decision · Battery SOC · Grid (agent) · LP optimum here
1 · The grid — what the utility sees
Flatter and lower is better. The agent (blue) reshapes the
raw no-battery profile (orange). The dashed LP optimum is the
best any controller could do with perfect foresight; how closely the agent hugs it is the
whole story.
2 · The agent's decisions — battery power & state of charge
Green bars above zero = charging (storing energy); below zero =
discharging (releasing it). The white line is the battery's state of charge; the dashed
line is what the LP would have done. Agent and LP filling/emptying at the same times =
the agent learned the optimal timing.
3 · Context — price, load and solar driving the decision
The agent watches the day-ahead price (red): notice the
battery charges into the price dips and discharges into the peaks.
Whole test set — cumulative electricity cost
The widening gap between the two curves is the money the agent
saves over the full Oct–Dec test period.
How the IEMS works — the whole system, end to end
Every number and equation below is taken
directly from the project code (config.py, env.py,
dqn.py, lp_benchmark.py). Nothing here is illustrative-only.
1 · What problem are we solving?
An Industrial Energy Management System (IEMS) decides, every 15 minutes, how a
factory should use a big battery to cut its electricity bill. The factory has three
energy sources/sinks tied together at one electrical bus:
a fixed industrial load it must serve (it cannot choose when to consume),
a solar PV array (≈131 kW peak) whose output depends only on the weather,
a utility grid connection with a price that changes every hour (German
day-ahead prices — they can even go negative).
The one thing we can control is a 1600 kWh battery (ESS) rated at 400 kW.
Charging it when power is cheap and discharging it when power is expensive shifts the
factory's demand in time and lowers the bill — that is the entire job of the agent.
2 · System architecture
The agent only commands the green battery flow. Everything
else is exogenous (decided by weather, the factory, and the market). The grid flow is
whatever is left over to keep the bus balanced — so by moving the battery, the agent
indirectly controls the bill.
3 · The physical model (exact equations from env.py)
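Battery state of charge (efficiency η=0.95 applied in each direction, so a full round trip returns η² ≈ 0.90): charging
ΔSOC = P_b·η·Δt/E_b, discharging ΔSOC = (P_b/η)·Δt/E_b (P_b is negative when discharging, so SOC falls),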
with E_b=1600 kWh, Δt=0.25 h. SOC is kept in [15%, 100%].
Cost of a step: price·P_grid·Δt when importing; when exporting it is
credited at β·price with β=0.9 (you sell back a bit cheaper than you buy).
Negative prices flip the sign — importing during them earns money.
Episode: one day = 96 × 15-min steps. SOC starts at 50% and a terminal penalty
forces it back to ≥50% by midnight, so the agent cannot cheat by draining the battery dry.
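
The update rules above can be sketched in a few lines of Python. This is a minimal illustration built from the equations and constants quoted in this section, not the code from env.py: the function name `battery_step`, the price unit (EUR/kWh) and the choice to reject an out-of-range setpoint instead of clipping it are assumptions made for the example.

```python
ETA = 0.95                     # efficiency, applied on charge and on discharge
E_B = 1600.0                   # battery capacity, kWh
DT = 0.25                      # step length, h
BETA = 0.9                     # export credit factor
SOC_MIN, SOC_MAX = 0.15, 1.00  # allowed state-of-charge window


def battery_step(soc, p_batt, p_load, p_pv, price):
    """Advance one 15-minute step and return (new_soc, p_grid, cost).

    p_batt > 0 charges, p_batt < 0 discharges (kW).
    price is assumed to be in EUR/kWh (an assumption of this sketch).
    """
    if p_batt >= 0:
        d_soc = p_batt * ETA * DT / E_B      # charging loses energy on the way in
    else:
        d_soc = (p_batt / ETA) * DT / E_B    # discharging drains more than it delivers
    new_soc = soc + d_soc
    if not (SOC_MIN - 1e-9 <= new_soc <= SOC_MAX + 1e-9):
        # Simplification: this sketch rejects the setpoint outright.
        raise ValueError("setpoint would leave the SOC window")

    # Bus balance: the grid covers whatever load, PV and battery leave over.
    p_grid = p_load - p_pv + p_batt

    # Imports are billed at the full price, exports credited at beta * price.
    if p_grid >= 0:
        cost = price * p_grid * DT
    else:
        cost = BETA * price * p_grid * DT
    return new_soc, p_grid, cost
```

Calling `battery_step(0.5, 200, 300, 50, -0.02)` returns a negative cost: with a negative price, the 450 kW import earns money, which is the sign flip described above.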
4 · How Deep Reinforcement Learning works here
Reinforcement learning frames control as an agent interacting with an
environment in a loop. There is no labelled "correct answer" — the agent learns
purely from a reward signal by trial and error.
The three ingredients, exactly as coded:
State sₜ (the agent's eyes — numbers): battery SOC, time-of-day &
day-of-week (as sin/cos), the current price & net load, the last 24 h of prices and
net load, and (improved preset) the next 24 h of day-ahead prices — which are
published in advance, so using them is legitimate, not cheating.
Action aₜ: a discrete battery power setpoint. The improved agent picks from
21 levels between −200 and +200 kW; the thesis agent has only 3
(charge / idle / discharge). Infeasible actions are clipped so SOC limits are never broken.
Reward rₜ = −(σ₁·cost + σ₂·penalty + σ₃·smoothing): the agent is penalised by
the euro cost of the step (σ₁=1), by any constraint violation (σ₂=10), and by jerky
step-to-step swings in grid power (σ₃=20 for improved). Maximising reward =
minimising the bill while keeping the grid draw smooth and feasible.
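
The action set and reward above can be written down compactly. The sketch below is hedged: the helper names (`action_grid`, `clip_to_soc`, `reward`) are hypothetical, the clipping rule is derived from the SOC equations in section 3 and may differ from what env.py does, and the text does not say how the penalty and smoothing terms are measured, so they are passed in as ready-made numbers.

```python
SIGMA_COST, SIGMA_PENALTY, SIGMA_SMOOTH = 1.0, 10.0, 20.0  # improved preset
ETA, E_B, DT = 0.95, 1600.0, 0.25
SOC_MIN, SOC_MAX = 0.15, 1.00


def action_grid(n_levels=21, p_max=200.0):
    """Evenly spaced setpoints in kW from -p_max (discharge) to +p_max (charge)."""
    step = 2 * p_max / (n_levels - 1)
    return [-p_max + i * step for i in range(n_levels)]


def clip_to_soc(p_batt, soc):
    """Shrink a setpoint so the next SOC stays inside [SOC_MIN, SOC_MAX]."""
    # Largest charge power that just reaches SOC_MAX in one step.
    max_charge = (SOC_MAX - soc) * E_B / (ETA * DT)
    # Largest discharge power that just reaches SOC_MIN in one step.
    max_discharge = (soc - SOC_MIN) * E_B * ETA / DT
    return max(-max_discharge, min(p_batt, max_charge))


def reward(cost, penalty, smoothing):
    """r = -(sigma1*cost + sigma2*penalty + sigma3*smoothing)."""
    return -(SIGMA_COST * cost + SIGMA_PENALTY * penalty + SIGMA_SMOOTH * smoothing)


levels = action_grid()         # 21 setpoints, 20 kW apart
thesis_levels = action_grid(3) # discharge / idle / charge
```

With 21 levels the spacing is 20 kW, so the improved agent can ease the battery in and out, which is what lets it keep the smoothing term small.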
Over hundreds of thousands of steps the agent discovers a policy — a mapping from
state to action — that earns the most reward, i.e. the cheapest, smoothest battery
schedule. The strategy it converges on is price arbitrage: charge in the cheap
hours, discharge in the expensive ones (watch it happen in the animation above).
5 · The DQN agent in detail
"DQN" = Deep Q-Network. A Q-value Q(s,a) estimates the total future reward of taking
action a in state s. If we know Q accurately, the best policy is simply "pick
the action with the highest Q." DQN learns Q with a neural network (from dqn.py):
Network: a multilayer perceptron — input state → two hidden layers of 128 ReLU
units → one Q-value per action. The improved agent adds a Dueling head (separates
"how good is this state" from "how good is each action").
Experience replay: every transition (s, a, r, s′) is stored in a 100 000-step
buffer; training samples random minibatches of 128 from it, which breaks correlations and
stabilises learning.
Target network: a slowly-updated copy of the network provides the learning target,
refreshed every 1000 gradient steps — this stops the agent chasing a moving target.
Double DQN (improved): uses the online network to choose the next action and
the target network to value it, which reduces DQN's tendency to over-estimate Q-values.
Exploration: ε-greedy — the agent acts randomly with probability ε, which decays
from 1.0 → 0.05 over the first 100 000 steps, so it explores
early and exploits later.
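
Three of these pieces are small enough to show directly. The sketch below is plain Python, not an excerpt from dqn.py, and rests on assumptions the text does not settle: a linear ε decay, a discount factor `GAMMA = 0.99`, and the common mean-subtracted way of combining the Dueling head's two streams. All function names are hypothetical.

```python
import random

EPS_START, EPS_END, EPS_DECAY_STEPS = 1.0, 0.05, 100_000
GAMMA = 0.99  # discount factor: assumed, not stated in the text


def epsilon(step):
    """Exploration rate after `step` steps (linear decay is an assumption)."""
    frac = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def select_action(q_values, step, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else argmax Q."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def dueling_q(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean(A): one common Dueling aggregation."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(reward, done, q_online_next, q_target_next):
    """Online network picks the next action, target network values it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + GAMMA * q_target_next[best]
```

In `double_dqn_target`, the action is chosen from `q_online_next` but scored with `q_target_next`. A plain DQN target would take the maximum of `q_target_next` directly, which is where the optimistic bias comes from.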
For the headline result we train 5 independent agents (different random
seeds) and report the average. A single RL run is noisy, so averaging over seeds is what
makes the numbers trustworthy.
6 · What we compare against — the two benchmarks
⊘ No-battery baseline
The factory with its solar but no battery and no control: it simply imports/exports
whatever the net load is, at the market price. This is the "do nothing" reference — the
bill the agent has to beat. Lower bound on effort, upper bound on cost.
◆ Perfect-foresight LP optimum
A Linear Program (scipy HiGHS) that is told the entire day's prices,
load and PV in advance and computes the mathematically cheapest possible battery
schedule. No real controller can beat it. The theoretical best — the target.
So the agent is sandwiched between the two: it should be far below the
no-battery baseline (big saving) and as close as possible to the LP optimum (near-perfect).
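
To make the upper benchmark concrete, here is a hedged sketch of a perfect-foresight LP for one day using `scipy.optimize.linprog` with the HiGHS solver. It is not lp_benchmark.py: the variable layout, the function name `lp_schedule`, the 400 kW power bound, the hard end-of-day SOC constraint and the EUR/kWh price unit are assumptions. It also assumes non-negative prices; with negative prices this simple formulation needs extra bounds, because importing and exporting at the same time would then look profitable to the solver.

```python
import numpy as np
from scipy.optimize import linprog

ETA, E_B, DT, BETA = 0.95, 1600.0, 0.25, 0.9
SOC_MIN, SOC_MAX, SOC_0 = 0.15, 1.00, 0.50
P_MAX = 400.0  # kW, battery rating (assumed to be the LP power bound)


def lp_schedule(price, net_load):
    """Cheapest battery schedule given the whole day in advance.

    x = [charge, discharge, grid_import, grid_export], each of length T, all >= 0.
    price in EUR/kWh (assumed >= 0), net_load = load - PV in kW.
    Returns (battery power per step, total cost).
    """
    price = np.asarray(price, dtype=float)
    net_load = np.asarray(net_load, dtype=float)
    T = len(price)
    I, Z = np.eye(T), np.zeros((T, T))

    # Objective: pay price for imports, receive beta * price for exports.
    c = np.concatenate([np.zeros(2 * T), price * DT, -BETA * price * DT])

    # Bus balance per step: import - export - charge + discharge = net_load.
    A_eq = np.hstack([-I, I, I, -I])

    # SOC change after each step: cumulative (eta*charge - discharge/eta)*DT/E_B.
    L = np.tril(np.ones((T, T))) * DT / E_B
    soc_rows = np.hstack([ETA * L, -L / ETA, Z, Z])
    A_ub = np.vstack([soc_rows, -soc_rows, -soc_rows[-1:]])
    b_ub = np.concatenate([
        np.full(T, SOC_MAX - SOC_0),  # SOC never above SOC_MAX
        np.full(T, SOC_0 - SOC_MIN),  # SOC never below SOC_MIN
        [0.0],                        # end-of-day SOC back at or above SOC_0
    ])

    bounds = [(0, P_MAX)] * (2 * T) + [(0, None)] * (2 * T)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=net_load,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:T] - res.x[T:2 * T], res.fun


# Toy day of four steps: two cheap, then two expensive, constant 100 kW net load.
toy_price = [0.02, 0.02, 0.08, 0.08]
toy_load = [100.0] * 4
p_batt, lp_cost = lp_schedule(toy_price, toy_load)
no_battery_cost = sum(p * n * DT for p, n in zip(toy_price, toy_load))
```

On this toy day the schedule charges in the cheap steps and discharges in the expensive ones, and `lp_cost` comes out below `no_battery_cost`, which is the same sandwich the dashboard charts show at full scale.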
Our result:
7 · How we built it — the pipeline
Data (data.py): a full year (2016) of 15-min load, PV and price data is
loaded and converted to kW. It is split by time — train on Jan→Sep, test on
Oct→Dec — so the agent is always evaluated on days it never trained on (no data leakage).
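
The time-based split is simple enough to show in full. A minimal sketch, assuming timezone-naive timestamps on a regular 15-minute grid; the function name `split_by_time` is hypothetical and data.py may handle details such as daylight-saving gaps differently.

```python
from datetime import datetime, timedelta

TEST_START = datetime(2016, 10, 1)  # first instant of the held-out period


def split_by_time(timestamps):
    """Return (train_idx, test_idx): everything before 1 Oct 2016 is training."""
    train = [i for i, ts in enumerate(timestamps) if ts < TEST_START]
    test = [i for i, ts in enumerate(timestamps) if ts >= TEST_START]
    return train, test


# One timestamp per 15-minute step of 2016, which is a leap year (366 days).
stamps = [datetime(2016, 1, 1) + timedelta(minutes=15 * k) for k in range(366 * 96)]
train_idx, test_idx = split_by_time(stamps)
```

Splitting on a date instead of shuffling rows matters here because consecutive 15-minute steps are strongly correlated: a random split would leak near-duplicates of test days into training.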
Environment (env.py): the battery/grid physics above, wrapped as a
Gym-style step()/reset() simulator running ~500 steps/second.
Training (train.py, run_best.py): the DQN agent interacts
with the training months for ~100k–250k steps, learning the policy by trial and error
(the headline improved run averages 5 independent seeds).
Evaluation (evaluate.py): the trained policy is rolled out greedily on
the held-out test months and scored on real KPIs — cost, peak, variability — against both
benchmarks.
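
The three KPIs can be computed from a grid-power series in a few lines. This is a sketch under stated assumptions, not evaluate.py: "peak" is taken as the maximum grid power, "variability" as the population standard deviation of grid power, and prices are in EUR/kWh. The names `kpis` and `saving_pct` are hypothetical.

```python
from statistics import pstdev

DT, BETA = 0.25, 0.9  # step length in hours, export credit factor


def kpis(p_grid, price):
    """Cost (EUR), peak grid power (kW) and variability (kW) of one rollout."""
    cost = sum(
        (p if g >= 0 else BETA * p) * g * DT  # exports credited at beta * price
        for g, p in zip(p_grid, price)
    )
    return {"cost": cost, "peak": max(p_grid), "variability": pstdev(p_grid)}


def saving_pct(agent_cost, baseline_cost):
    """Bill reduction relative to the no-battery baseline, in percent."""
    return 100.0 * (baseline_cost - agent_cost) / baseline_cost


example = kpis([100, 300, -100, 100], [0.05] * 4)
```

Running the same two functions on the agent, the no-battery baseline and the LP schedule gives directly comparable rows for a results table.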
This dashboard (dashboard.py): re-runs that rollout and renders every
chart you see from the actual step-by-step results.
One-sentence summary: we built a fast simulator of a solar-plus-battery factory,
trained a Deep Q-Network to drive the battery by trial-and-error against real 2016 price
data, and it learned, on days it had never seen, to cut the electricity bill by about a
third, landing within a single-digit percentage of the perfect-foresight optimum while keeping the grid
peak unchanged.
Generated by iems_drl/dashboard.py. Self-contained — Plotly is
embedded, no internet required.