By Bharath Reddy

Best way to read a file is not to read it at all - Python Iteration Protocol




The concept of “iterable objects” is relatively recent in Python, but it has come to permeate the language’s design. An object is considered iterable if it is either a physically stored sequence, or an object that produces one result at a time in the context of an iteration tool like a for loop. In a sense, iterable objects include both physical sequences and virtual sequences computed on demand.


Any object with a __next__ method to advance to a next result, which raises StopIteration at the end of the series of results, is considered an iterator in Python. Any such object may also be stepped through with a for loop or other iteration tool, because all iteration tools normally work internally by calling __next__ on each iteration and catching the StopIteration exception to determine when to exit. Let's see this in an example:


Let's say we have a file called bharath.py that contains these three lines:

First line
Second line
Final line



>>> f = open('bharath.py')
>>> f.__next__()
'First line\n'
>>> f.__next__()
'Second line\n'
>>> f.__next__()
'Final line\n'
>>> f.__next__()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration


The net effect of this magic is that the best way to read a text file line by line today is to not read it at all. Instead, allow the for loop to automatically call __next__ to advance to the next line on each iteration. The file object's iterator does the work of loading lines as you go. The following, for example, reads a file line by line, printing the uppercase version of each line along the way, without ever explicitly reading from the file:



>>> for line in open('bharath.py'):
...     print(line.upper(), end='')  # Calls __next__, catches StopIteration
...
FIRST LINE
SECOND LINE
FINAL LINE

print uses end='' here to suppress adding a \n, because line strings already have one. Without this, our output would be double-spaced. Because it reads one line at a time, the iterator-based version is also immune to memory-explosion issues when the file is larger than the memory available on the machine running the code.
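The same idea carries over to script form. The sketch below is only an illustration under stated assumptions: the helper name longest_line_length and the throwaway temporary file are invented for the example, and a with block is used so the file is closed once the loop finishes. Only one line is held in memory at a time.

```python
import os
import tempfile


def longest_line_length(path):
    """Return the length of the longest line, reading one line at a time."""
    longest = 0
    with open(path) as f:              # the with block closes the file afterwards
        for line in f:                 # the for loop calls __next__ behind the scenes
            longest = max(longest, len(line.rstrip("\n")))
    return longest


# Build a small throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bharath.py")
    with open(path, "w") as f:
        f.write("First line\nSecond line\nFinal line\n")
    longest = longest_line_length(path)

print(longest)
```

Because the function keeps only a running maximum, its memory use does not grow with the size of the file.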


To simplify manual iteration code, Python also provides a built-in function, next, that automatically calls an object’s __next__ method.



>>> f = open('bharath.py')
>>> next(f)            # The next(f) built-in calls f.__next__()
'First line\n'
>>> next(f)
'Second line\n'
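The next built-in also accepts an optional second argument: a default value that is returned instead of raising StopIteration once the iterator is exhausted. The short sketch below uses a list iterator so that it runs without any file on disk.

```python
numbers = iter([10, 20])

first = next(numbers)          # 10
second = next(numbers)         # 20
third = next(numbers, None)    # iterator is exhausted, so the default is returned

print(first, second, third)
```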

The way this works internally in Python is a little more nuanced. When the for loop begins, it first obtains an iterator from the iterable object by passing it to the iter built-in function; the object returned by iter in turn has the required __next__ method. The iter function internally runs the object's __iter__ method, much like next runs __next__.


As a more formal definition, the full iteration protocol is really based on two objects, used in two distinct steps by iteration tools:

  1. The iterable object you request iteration for, whose __iter__ is run by iter

  2. The iterator object returned by the iterable that actually produces values during the iteration, whose __next__ is run by next and raises StopIteration when finished producing results

These steps are orchestrated automatically by iteration tools in most cases, but it helps to understand these two objects’ roles. For example, in some cases these two objects are the same when only a single scan is supported (e.g., files), and the iterator object is often temporary, used internally by the iteration tool.
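Here is a minimal sketch of the two roles. The class names Squares and SquaresIterator are hypothetical, invented for this illustration: the iterable only knows how to hand out a fresh iterator, while the iterator holds the position and raises StopIteration when done.

```python
class SquaresIterator:
    """Iterator: tracks the position and produces one value per __next__ call."""

    def __init__(self, limit):
        self.limit = limit
        self.n = 0

    def __iter__(self):
        # Iterators conventionally return themselves from __iter__.
        return self

    def __next__(self):
        if self.n >= self.limit:
            raise StopIteration      # no more results
        self.n += 1
        return self.n ** 2


class Squares:
    """Iterable: returns a new iterator each time iter() is called on it."""

    def __init__(self, limit):
        self.limit = limit

    def __iter__(self):
        return SquaresIterator(self.limit)


squares = Squares(3)
first_pass = list(squares)     # step 1: iter(squares); step 2: repeated next()
second_pass = list(squares)    # a fresh iterator, so a second scan works
print(first_pass, second_pass)
```

Keeping the two objects separate is what lets the iterable be scanned more than once; an object that returns itself from __iter__ supports only a single pass.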


Too heavy? Let's break this down with a simple example.



>>> L = [1, 2, 3]
>>> I = iter(L)     # Obtain an iterator object from the iterable list L
>>> I.__next__()    # Call iterator's next to advance to next item
1
>>> I.__next__()       
2            
>>> I.__next__()
3
>>> I.__next__()
...error text ...
StopIteration

This initial step is not required for files, because a file object is its own iterator.



>>> f = open('bharath.py')
>>> iter(f) is f
True
>>> iter(f) is f.__iter__()
True
>>> f.__next__()
'First line\n'

Lists and many other built-in objects, though, are not their own iterators:



>>> L = [1, 2, 3]
>>> iter(L) is L
False
>>> L.__next__()
AttributeError: 'list' object has no attribute '__next__'
>>> I = iter(L)
>>> I.__next__()
1
>>> next(I)       # Same as I.__next__()
2
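One practical consequence is sketched below: because each call to iter on a list returns a new iterator, a list supports several independent scans, while a file-like object shares a single position. io.StringIO stands in for a real file here, an assumption made only to keep the example self-contained; like a file, it is its own iterator.

```python
import io

L = [1, 2, 3]
a = iter(L)
b = iter(L)
next(a)                          # advance a only
list_pair = (next(a), next(b))   # a has moved on to 2, b is still at 1

f = io.StringIO("First line\nSecond line\n")
g = iter(f)                      # the same object as f
h = iter(f)                      # the same object again
next(g)                          # consumes 'First line\n' for every name
file_line = next(h)              # so h continues with the second line

print(list_pair, repr(file_line))
```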

So now you understand what happens when you write:



>>> L = [1, 2, 3]
>>> for X in L:
...     print(X, end=' ')
...
1 2 3

Yes, it's exactly the same as calling I = iter(L) and then I.__next__() in a loop until Python catches the error that ends the loop. Remember, an iterator raises StopIteration when it has no more items to produce.
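Spelled out by hand, the loop above behaves roughly like the following sketch. It approximates what the for statement does and is not the interpreter's actual implementation; the seen list is added only so the result can be inspected.

```python
L = [1, 2, 3]
seen = []

I = iter(L)                 # step 1: obtain an iterator from the iterable
while True:
    try:
        X = next(I)         # step 2: ask for the next item
    except StopIteration:   # the iterator is exhausted, so leave the loop
        break
    seen.append(X)
    print(X, end=' ')
print()
```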


This iteration protocol is core to many of Python's superstar features, such as list comprehensions, range, dictionary keys, dictionary values, and generators.
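As a brief illustration of that reach, the sketch below defines a generator function (countdown is a hypothetical name) and checks which objects are their own iterators. Generators are, while range objects and dictionary views hand out separate iterators.

```python
def countdown(n):
    """Generator function: each yield supplies the next value in the series."""
    while n > 0:
        yield n
        n -= 1


gen = countdown(3)
gen_is_own_iterator = iter(gen) is gen     # generators support a single scan
first = next(gen)                          # 3
remaining = [x * 10 for x in gen]          # the comprehension consumes the rest

r = range(3)
range_is_own_iterator = iter(r) is r       # range is an iterable, not an iterator

d = {'a': 1, 'b': 2}
keys = list(iter(d.keys()))                # dictionary views are iterable too

print(first, remaining, keys)
```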

I hope you now have an intuitive understanding of the Python iteration protocol and of how loops in Python work, and can appreciate how elegantly they are implemented. If you liked this article, please share it.
