How to read this dashboard

The grid is the power exchanged with the utility: P_grid = P_load − P_PV + P_battery. Positive = importing (you pay), negative = exporting (you get paid).
The agent is a Deep Q-Network. Every 15 minutes it picks a battery charge/discharge power to minimise the daily electricity bill while keeping the grid exchange smooth and the battery within safe state-of-charge limits.
The strategy it learns is price arbitrage: charge the battery when electricity is cheap (or negatively priced), discharge it when electricity is expensive — shifting the factory's demand in time.
The benchmark (LP) is a perfect-foresight Linear Program: the theoretical cheapest possible bill, dashed in the charts. The agent has no future knowledge beyond day-ahead prices, so closing the gap to the LP is the headline result.
▶ Press Play below to animate any day step-by-step — the live panel narrates what the agent is doing and why at each 15-minute step.

Pick a test day

Press ▶ Play to watch the agent run the battery through the day, one 15-minute step at a time.

Price now

—

Agent decision

—

Battery SOC

—

Grid (agent)

—

LP optimum here

—

1 · The grid — what the utility sees

Flatter and lower is better. The agent (blue) reshapes the raw no-battery profile (orange). The dashed LP optimum is the best any controller could do with perfect foresight; how closely the agent hugs it is the whole story.

2 · The agent's decisions — battery power & state of charge

Green bars above zero = charging (storing energy); below zero = discharging (releasing it). The white line is the battery's state of charge; the dashed line is what the LP would have done. Agent and LP filling/emptying at the same times = the agent learned the optimal timing.

3 · Context — price, load and solar driving the decision

The agent watches the day-ahead price (red): notice the battery charges into the price dips and discharges into the peaks.

Whole test set — cumulative electricity cost

The widening gap between the two curves is the money the agent saves over the full Oct–Dec test period.

How the IEMS works — the whole system, end to end

Every number and equation below is taken directly from the project code (config.py, env.py, dqn.py, lp_benchmark.py). Nothing here is illustrative-only.

1 · What problem are we solving?

An Industrial Energy Management System (IEMS) decides, every 15 minutes, how a factory should use a big battery to cut its electricity bill. The factory has three energy sources/sinks tied together at one electrical bus:

a fixed industrial load it must serve (it cannot choose when to consume),
a solar PV array (≈131 kW peak) whose output depends only on the weather,
a utility grid connection with a price that changes every hour (German day-ahead prices — they can even go negative).

The one thing we can control is a 1600 kWh battery (ESS) rated at 400 kW. Charging it when power is cheap and discharging it when power is expensive shifts the factory's demand in time and lowers the bill — that is the entire job of the agent.

2 · System architecture

The agent only commands the green battery flow. Everything else is exogenous (decided by weather, the factory, and the market). The grid flow is whatever is left over to keep the bus balanced — so by moving the battery, the agent indirectly controls the bill.

3 · The physical model (exact equations from `env.py`)

Power balance: P_grid(t) = P_load(t) − P_PV(t) + P_battery(t). Positive grid = importing (pay), negative = exporting (get paid).
Battery state of charge (efficiency η=0.95, applied on both charging and discharging, so a full round trip returns about 90% of the energy): charging ΔSOC = P_b·η·Δt/E_b, discharging ΔSOC = (P_b/η)·Δt/E_b, with E_b=1600 kWh, Δt=0.25 h. SOC is kept in [15%, 100%].
Cost of a step: price·P_grid·Δt when importing; when exporting it is credited at β·price with β=0.9 (you sell back a bit cheaper than you buy). Negative prices flip the sign — importing during them earns money.
Episode: one day = 96 × 15-min steps. SOC starts at 50% and a terminal penalty forces it back to ≥50% by midnight, so the agent cannot cheat by draining the battery dry.
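
The equations above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in env.py: the function names are hypothetical, the price is assumed to be in € per kWh, and SOC is treated as a fraction between 0 and 1. The SOC limits and the terminal penalty are left out.

```python
# Illustrative sketch only; names are hypothetical, not the env.py API.
ETA = 0.95        # efficiency applied on charging and on discharging
E_B_KWH = 1600.0  # battery capacity E_b
DT_H = 0.25       # one 15-minute step, in hours
BETA = 0.9        # export credit factor


def grid_power(p_load_kw, p_pv_kw, p_batt_kw):
    """Power balance at the bus; positive = importing, negative = exporting."""
    return p_load_kw - p_pv_kw + p_batt_kw


def next_soc(soc, p_batt_kw):
    """SOC (as a fraction) after one step; p_batt_kw > 0 charges, < 0 discharges."""
    if p_batt_kw >= 0:
        delta = p_batt_kw * ETA * DT_H / E_B_KWH
    else:
        delta = (p_batt_kw / ETA) * DT_H / E_B_KWH
    return soc + delta


def step_cost(price_per_kwh, p_grid_kw):
    """Cost of one step (assumed € with price in €/kWh); exports earn BETA * price."""
    energy_kwh = p_grid_kw * DT_H
    if p_grid_kw >= 0:
        return price_per_kwh * energy_kwh
    return BETA * price_per_kwh * energy_kwh
```

For example, charging at 200 kW for one step lifts the SOC from 50% to roughly 53%, while discharging at 200 kW for one step lowers it by slightly more than that, because the efficiency loss is paid in both directions.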

4 · How Deep Reinforcement Learning works here

Reinforcement learning frames control as an agent interacting with an environment in a loop. There is no labelled "correct answer" — the agent learns purely from a reward signal by trial and error.

The three ingredients, exactly as coded:

State sₜ (the agent's eyes — numbers): battery SOC, time-of-day & day-of-week (as sin/cos), the current price & net load, the last 24 h of prices and net load, and (improved preset) the next 24 h of day-ahead prices — which are published in advance, so using them is legitimate, not cheating.
Action aₜ: a discrete battery power setpoint. The improved agent picks from 21 levels between −200 and +200 kW; the thesis agent has only 3 (charge / idle / discharge). Infeasible actions are clipped so SOC limits are never broken.
Reward rₜ = −(σ₁·cost + σ₂·penalty + σ₃·smoothing): the agent is penalised by the euro cost of the step (σ₁=1), by any constraint violation (σ₂=10), and by jerky step-to-step swings in grid power (σ₃=20 for improved). Maximising reward = minimising the bill while keeping the grid draw smooth and feasible.
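
The action set and the reward can be sketched as follows. This is a minimal sketch under stated assumptions, not the project code: the function names are hypothetical, the 21 levels are assumed to be evenly spaced, and the clipping rule shown is one simple way to keep the next SOC inside [15%, 100%]. How the penalty and smoothing terms are measured is not shown here, so they are passed in as plain numbers.

```python
# Illustrative sketch only; names and the clipping rule are assumptions.
SOC_MIN, SOC_MAX = 0.15, 1.0
ETA, E_B_KWH, DT_H = 0.95, 1600.0, 0.25
SIGMA_COST, SIGMA_PENALTY, SIGMA_SMOOTH = 1.0, 10.0, 20.0  # improved preset


def action_levels(n=21, p_max_kw=200.0):
    """Setpoints from -p_max_kw to +p_max_kw (even spacing is an assumption)."""
    step = 2 * p_max_kw / (n - 1)
    return [-p_max_kw + i * step for i in range(n)]


def clip_to_soc_limits(soc, p_batt_kw):
    """Shrink a setpoint so the next SOC stays inside [SOC_MIN, SOC_MAX]."""
    max_charge_kw = (SOC_MAX - soc) * E_B_KWH / (ETA * DT_H)
    max_discharge_kw = (soc - SOC_MIN) * E_B_KWH * ETA / DT_H
    return max(-max_discharge_kw, min(max_charge_kw, p_batt_kw))


def reward(cost, penalty, smoothing):
    """r = -(sigma1 * cost + sigma2 * penalty + sigma3 * smoothing)."""
    return -(SIGMA_COST * cost + SIGMA_PENALTY * penalty + SIGMA_SMOOTH * smoothing)
```

With the battery at 16% SOC, a request to discharge at 200 kW is cut back to about 61 kW, which is the power that brings the SOC exactly to the 15% floor in one step.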

Over hundreds of thousands of steps the agent discovers a policy — a mapping from state to action — that earns the most reward, i.e. the cheapest, smoothest battery schedule. The strategy it converges on is price arbitrage: charge in the cheap hours, discharge in the expensive ones (watch it happen in the animation above).

5 · The DQN agent in detail

"DQN" = Deep Q-Network. A Q-value Q(s,a) estimates the total future reward of taking action a in state s. If we know Q accurately, the best policy is simply "pick the action with the highest Q." DQN learns Q with a neural network (from dqn.py):

Network: a multilayer perceptron — input state → two hidden layers of 128 ReLU units → one Q-value per action. The improved agent adds a Dueling head (separates "how good is this state" from "how good is each action").
Experience replay: every transition (s, a, r, s′) is stored in a 100 000-step buffer; training samples random minibatches of 128 from it, which breaks correlations and stabilises learning.
Target network: a slowly-updated copy of the network provides the learning target, refreshed every 1000 gradient steps — this stops the agent chasing a moving target.
Double DQN (improved): uses the online network to choose the next action and the target network to value it, which reduces DQN's tendency to over-estimate Q-values.
Exploration: ε-greedy — the agent acts randomly with probability ε, which decays from 1.0 → 0.05 over the first 100 000 steps, so it explores early and exploits later.
Optimiser: Adam, learning rate 5e-4, discount γ=0.99, Huber (smooth-L1) loss, gradient clipping at 10.
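
Several of the pieces listed above fit in a short sketch. It is written in plain Python rather than with a neural-network library, and it is not the code in dqn.py: the function names are hypothetical, the ε decay is assumed to be linear (only the start value, end value and duration are given above), and the dueling combination shown is the commonly used Q = V + A − mean(A) form.

```python
# Illustrative sketch only; names and the linear decay shape are assumptions.
import random

GAMMA = 0.99
EPS_START, EPS_END, EPS_DECAY_STEPS = 1.0, 0.05, 100_000


def epsilon(step):
    """Linear decay from EPS_START to EPS_END, constant afterwards."""
    frac = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def select_action(q_values, step, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon(step):
        return rng.randrange(len(q_values))
    return argmax(q_values)


def dueling_q(state_value, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def double_dqn_target(r, q_online_next, q_target_next, done):
    """Online network picks the next action; target network values it."""
    if done:
        return r
    return r + GAMMA * q_target_next[argmax(q_online_next)]
```

In the Double DQN target, the target network's value is read at the action the online network prefers rather than at the target network's own maximum. That decoupling is what tempers the over-estimation.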

For the headline result we train 5 independent agents (different random seeds) and report the average: a single RL run is noisy, so averaging across seeds makes the numbers more trustworthy.

6 · What we compare against — the two benchmarks

⊘ No-battery baseline

The factory with its solar but no battery and no control: it simply imports/exports whatever the net load is, at the market price. This is the "do nothing" reference — the bill the agent has to beat. Lower bound on effort, upper bound on cost.

◆ Perfect-foresight LP optimum

A Linear Program (scipy HiGHS) that is told the entire day's prices, load and PV in advance and computes the mathematically cheapest possible battery schedule. No real controller can beat it. The theoretical best — the target.
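
The no-battery baseline is simple enough to sketch directly: with no battery, the grid flow is just load minus PV, and each step is billed with the import/export rule from the physical model. This is an illustrative sketch, not the project code. The function names are hypothetical and the price is assumed to be in € per kWh.

```python
# Illustrative sketch only; names are hypothetical, not the project API.
DT_H = 0.25  # one 15-minute step, in hours
BETA = 0.9   # export credit factor


def step_cost(price_per_kwh, p_grid_kw):
    """Imports are billed at the price; exports are credited at BETA * price."""
    energy_kwh = p_grid_kw * DT_H
    if p_grid_kw >= 0:
        return price_per_kwh * energy_kwh
    return BETA * price_per_kwh * energy_kwh


def no_battery_bill(prices, load_kw, pv_kw):
    """Total cost when the grid simply covers the net load at every step."""
    total = 0.0
    for price, load, pv in zip(prices, load_kw, pv_kw):
        total += step_cost(price, load - pv)
    return total
```

The same per-step rule applies to the agent and to the LP. The only difference between the three bills is the battery schedule that shapes the grid flow.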

So the agent is sandwiched between the two: it should be far below the no-battery baseline (big saving) and as close as possible to the LP optimum (near-perfect). Our result:

7 · How we built it — the pipeline

Data (data.py): a full year (2016) of 15-min load, PV and price data is loaded and converted to kW. It is split by time — train on Jan→Sep, test on Oct→Dec — so the agent is always evaluated on days it never trained on (no data leakage).
Environment (env.py): the battery/grid physics above, wrapped as a Gym-style step()/reset() simulator running ~500 steps/second.
Training (train.py, run_best.py): the DQN agent interacts with the training months for ~100k–250k steps, learning the policy by trial and error (the headline improved run averages 5 independent seeds).
Evaluation (evaluate.py): the trained policy is rolled out greedily on the held-out test months and scored on real KPIs — cost, peak, variability — against both benchmarks.
This dashboard (dashboard.py): re-runs that rollout and renders every chart you see from the actual step-by-step results.
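
The time-based split in the first step can be sketched as below. This is a minimal sketch, not the code in data.py: the function name is hypothetical, and it assumes the data covers every calendar day of 2016 with 96 steps per day.

```python
# Illustrative sketch only; assumes complete daily coverage of the year.
from datetime import date, timedelta

STEPS_PER_DAY = 96  # 24 h of 15-minute steps


def split_days(year=2016):
    """Return (train_days, test_days): Jan to Sep for training, Oct to Dec for testing."""
    train, test = [], []
    day = date(year, 1, 1)
    while day.year == year:
        (train if day.month <= 9 else test).append(day)
        day += timedelta(days=1)
    return train, test


train_days, test_days = split_days()
test_steps = len(test_days) * STEPS_PER_DAY
```

Splitting by calendar date rather than at random keeps every test day strictly later than every training day, which is what rules out leakage between the two sets.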

One-sentence summary: we built a fast simulator of a solar-plus-battery factory, trained a Deep Q-Network to drive the battery by trial-and-error against real 2016 price data, and it learned — on days it had never seen — to cut the electricity bill by about a third, landing within single digits of the perfect-foresight optimum while keeping the grid peak unchanged.

Generated by iems_drl/dashboard.py. Self-contained — Plotly is embedded, no internet required.