An experimental project investigating how LSTMs handle long-term dependencies through character-level Python code generation. This project uses code as a testbed because it naturally contains complex long-range dependencies like matching parentheses, consistent indentation, and nested structures.
Long-term dependencies pose a fundamental challenge for recurrent neural networks. While LSTMs were designed to address the vanishing gradient problem and capture longer-range dependencies, their practical limits remain an open question.
Why Python code generation?
Python code provides an ideal benchmark for testing long-term dependency learning because it requires the model to:
- Match opening and closing delimiters (parentheses, brackets, braces) across potentially hundreds of characters
- Maintain consistent indentation across function and class bodies spanning multiple lines
- Remember context for variable names, function signatures, and control flow structures
- Handle nested structures like nested function calls, list comprehensions, and conditional blocks
These characteristics make code generation a challenging and revealing test case for LSTM capabilities.
LSTMs must track several types of long-range dependencies when generating syntactically valid Python:
Delimiter matching across many characters:

```python
def function(arg1, arg2, nested_call(
        another_function(x, y, z),
        more_args
)):  # Must close all parentheses correctly
```

Indentation consistency:

```python
def outer_function():
    # 4 spaces
    if condition:
        # 8 spaces
        for item in list:
            # 12 spaces - must maintain this depth
            process(item)
```

Multi-line strings and docstrings:

```python
def example():
    """
    This docstring spans multiple lines
    with various content in between
    """  # Must close with triple quotes
```

Control-flow context:

```python
if condition:
    # ... many lines ...
else:
    # Model must remember we're in an if-else block
```

The model uses a character-level LSTM architecture with the following components:
```
Input (batch_size, seq_length)
        ↓
Embedding Layer (vocab_size → 64 dims)
        ↓
LSTM Layer 1 (256 units, return_sequences=True)
        ↓
Dropout (0.1)
        ↓
LSTM Layer 2 (256 units, return_sequences=True)
        ↓
Dropout (0.1)
        ↓
Dense Layer (vocab_size, L2 regularization)
        ↓
Output (character predictions)
```
| Parameter | Value |
|---|---|
| Sequence Length | 256 characters |
| Batch Size | 512 |
| Embedding Dimension | 64 |
| LSTM Units (per layer) | 256 |
| Number of LSTM Layers | 2 |
| Dropout Rate | 0.1 |
| Learning Rate | 1.8e-4 |
| Optimizer | Adam (clipnorm=1) |
| L2 Regularization | 1e-4 |
| Training Epochs | 300 |
| Precision | Mixed Float16 |
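For reference, a minimal Keras sketch of how a model with these hyperparameters might be assembled. The function name and exact layer arguments are illustrative; the actual definition lives in `train.py`:

```python
import tensorflow as tf

# Mixed float16 precision, as listed in the hyperparameter table.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

def build_model(vocab_size: int) -> tf.keras.Model:
    """Two stacked 256-unit LSTMs over a 64-dim character embedding."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64),          # char ids -> 64-dim vectors
        tf.keras.layers.LSTM(256, return_sequences=True),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.LSTM(256, return_sequences=True),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.Dense(
            vocab_size,
            kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1.8e-4, clipnorm=1.0),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model
```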
- Character-level tokenization: Captures all syntactic details including whitespace and punctuation
- Stateful inference: Maintains hidden states across generation steps for consistent context
- Dropout + L2 regularization: Prevents overfitting on code patterns
- Gradient clipping: Stabilizes training on long sequences
- Mixed precision training: Enables larger batch sizes with limited GPU memory
- Size: ~165,000 lines of Python source code (~6 MB)
- Tokenization: Character-level (including all ASCII letters, digits, punctuation, whitespace)
- Vocabulary: ~95 unique characters
- Split: 95% training, 5% validation
- Sequence windows: 256 characters with 1-character shift for target
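A rough sketch of how the corpus might be encoded and windowed under these settings; the helper name and stride are assumptions, not the exact code in `train.py`:

```python
import numpy as np

SEQ_LENGTH = 256

# Build the character vocabulary and encode the corpus as integer ids.
text = open("data.txt", encoding="utf-8").read()
vocab = sorted(set(text))                          # ~95 unique characters
char2idx = {ch: i for i, ch in enumerate(vocab)}
encoded = np.array([char2idx[ch] for ch in text], dtype=np.int32)

def make_windows(ids: np.ndarray, seq_length: int = SEQ_LENGTH):
    """Pair each 256-character input window with the same window shifted
    by one character, which serves as the next-character target."""
    inputs, targets = [], []
    for start in range(0, len(ids) - seq_length - 1, seq_length):
        inputs.append(ids[start:start + seq_length])
        targets.append(ids[start + 1:start + seq_length + 1])
    return np.array(inputs), np.array(targets)

inputs, targets = make_windows(encoded)
split = int(0.95 * len(inputs))                    # 95% train / 5% validation
train_x, val_x = inputs[:split], inputs[split:]
train_y, val_y = targets[:split], targets[split:]
```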
```
exploring-RNNs/
├── README.md                  # This file
└── python-code-generator/
    ├── train.py               # Training script with model definition
    ├── inference.py           # Stateful text generation
    ├── data.txt               # Training corpus (~165k lines)
    ├── trained_model.keras    # Saved model checkpoint
    └── generated.txt          # Sample generated outputs
```
```bash
pip install tensorflow numpy matplotlib
```

- Python 3.7+
- TensorFlow 2.x
- NumPy
- Matplotlib (for training visualization)
- GPU with CUDA support (recommended)
- Mixed precision (float16) enabled
- Adjust `memory_limit` in `train.py:17` for different GPU configurations
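The snippet below shows one common way to cap GPU memory in TensorFlow; the value here is illustrative, and the actual setting lives at `train.py:17`:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Cap usable GPU memory (in MB); adjust to match your card.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=6144)],
    )
```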
```bash
cd python-code-generator
python train.py
```

The training script will:
- Load and preprocess `data.txt`
- Create character-level vocabulary mappings
- Build and compile the LSTM model
- Train for 300 epochs with validation
- Save the best model to `model_checkpoint.keras`
Training progress includes loss and accuracy metrics for both training and validation sets.
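Schematically, the training loop boils down to something like the following, reusing names from the sketches above; the real script may structure this differently:

```python
model = build_model(vocab_size=len(vocab))

checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "model_checkpoint.keras",
    monitor="val_loss",
    save_best_only=True,        # keep only the best-performing weights
)

history = model.fit(
    train_x, train_y,
    validation_data=(val_x, val_y),
    batch_size=512,
    epochs=300,
    callbacks=[checkpoint],
)
```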
```bash
python inference.py
```

You'll be prompted for:
- Number of characters to generate: How long the output should be (e.g., 1000)
- Temperature: Controls randomness (0.1 = conservative, 1.0 = creative)
- Start string: Initial prompt (e.g., "def function")
Temperature controls the randomness of character sampling:
- Low (0.1-0.3): More deterministic, follows common patterns, syntactically safer
- Medium (0.4-0.7): Balanced creativity and structure
- High (0.8-1.0): More creative but potentially less syntactically valid
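Concretely, temperature rescales the model's output distribution before sampling. A minimal sketch (the helper name is illustrative):

```python
import numpy as np

def sample_next_char(logits: np.ndarray, temperature: float) -> int:
    """Sample a character id after scaling logits by 1/temperature."""
    scaled = logits / max(temperature, 1e-6)   # low T -> sharper, high T -> flatter
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```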
The inference model maintains LSTM hidden states across generation steps, allowing it to:
- Remember opening brackets and quotes
- Maintain indentation context
- Track variable and function names
- Preserve overall code structure
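A simplified view of such a stateful generation loop, assuming a batch-size-1 model built with `stateful=True` LSTM layers; `inference.py` may structure this differently:

```python
import numpy as np

def generate(model, char2idx, idx2char, start_string, num_chars, temperature):
    """Feed one character at a time so hidden states carry context forward."""
    model.reset_states()
    # Prime the hidden state with the start string.
    logits = None
    for ch in start_string:
        logits = model.predict(np.array([[char2idx[ch]]]), verbose=0)[0, -1]
    generated = list(start_string)
    for _ in range(num_chars):
        scaled = logits / max(temperature, 1e-6)
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        next_id = int(np.random.choice(len(probs), p=probs))
        generated.append(idx2char[next_id])
        logits = model.predict(np.array([[next_id]]), verbose=0)[0, -1]
    return "".join(generated)
```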
Below is a sample of generated code (temperature=0.5, 1000 characters):

```python
import the context by an invalid target.
# This is array of the start link of the first because the same context provided.
# if the item of the second back to the {backends to return the non-dataset values and which match
# description is set the array.
if default_scores in self._namespaces:
self._info._get_classe(result, attrs)
else:
self.module_path = task
else:
self._parse_active(
self._tags,
self.filter_state, self.services,
self._selected_timestamp.test_text, test_to_text, timestamp)
def async_remove_module(self, context, context):
"""
Return the test that it uses themes.
:param str: Any of the resource string is used in the contents of the
are of the report loader of the module.
"""
if self.parent_id is not None:
return self._get_input("index %s interface")
except IndexError:
# Comments about collections and strings
if not is is None:
try:
if line.split() == '1':
self._set_index(self._dist, value)
return False
def test_data_to_internal(self, bool):
self._set_context_components()
self.assertEqual(resultset, "--2000 * scheme)
self.assertEqual(len(sub_socket, timeout))
self.assertEqual(results['instanceed_id', '_limit_method'])
```

What the model does well:
- Generates plausible function definitions with the `def` keyword and colons
- Maintains basic indentation patterns within small blocks
- Creates docstrings with opening/closing triple quotes
- Produces variable names and method calls that look "Python-like"
- Generates syntactically valid assertions and comments
- Remembers to close some parentheses and brackets
Where long-term dependencies break down:
- Logical inconsistencies (e.g., `return` after `if` without a proper `try` block)
- Variable scope errors (using undefined variables)
- Unmatched delimiters over longer distances
- Semantic incoherence (function behavior doesn't match docstrings)
- Indentation drift in deeply nested structures
- Invalid syntax combinations (e.g., `if not is is None`)
Interesting behaviors:
- Creates plausible-sounding method names (`_parse_active`, `test_data_to_internal`)
- Generates realistic comment patterns explaining code intent
- Mimics common Python idioms (list comprehensions, context managers)
- Produces test-like code with `assertEqual` patterns
- Exhibits vocabulary from the training domain (dataset, context, state, etc.)
The LSTM successfully learns patterns within ~10-50 characters:
- Function signatures with parameters
- Simple if-else blocks
- Single-line statements
- Basic indentation after colons
The model struggles with dependencies spanning >100 characters:
- Matching opening/closing brackets separated by multiple lines
- Maintaining consistent indentation across entire functions
- Remembering variable definitions from earlier in the sequence
- Coherent logical flow across function bodies
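One quick way to see this empirically is to scan generated samples for delimiter balance. A small diagnostic sketch (not part of the repository; it ignores strings and comments):

```python
def check_delimiters(code: str):
    """Report whether (), [], {} are balanced and the deepest nesting reached."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack, deepest = [], 0
    for ch in code:
        if ch in "([{":
            stack.append(ch)
            deepest = max(deepest, len(stack))
        elif ch in closers:
            if not stack or stack[-1] != closers[ch]:
                return deepest, False          # mismatched or extra closer
            stack.pop()
    return deepest, not stack                  # balanced only if nothing left open

depth, balanced = check_delimiters(open("generated.txt").read())
print(f"max nesting depth: {depth}, delimiters balanced: {balanced}")
```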
This experiment demonstrates the practical limits of LSTM architectures for tasks requiring very long-range context. While the 256-unit LSTM cells can theoretically maintain information across sequences, in practice:
- Information degrades over distances >100 characters
- Syntax is easier than semantics - the model learns surface patterns better than logical structure
- Local coherence beats global coherence - individual lines look valid, but overall structure may not
This project was inspired by research on:
- LSTM capabilities for long-term dependency learning
- Neural code generation and program synthesis
- Character-level language modeling
- Applications of RNNs to structured data (code, markup languages)
This is an experimental and educational research project for exploring LSTM capabilities.
Training data consists of open-source Python code. This project is for educational and research purposes only.