Lecture 3

The objectives of this lecture are to:

  1. Learn how to perform matrix operations with NumPy.
  2. File input/output for numerical data.

Much of the lecture is based on this tutorial: http://wiki.scipy.org/Tentative_NumPy_Tutorial

Linear Algebra

As we learned in the previous lectures, arithmetic operations using arrays are element-wise. In order to perform matrix operations, NumPy implements a matrix object that is derived from the array object.

The matrix object

In order to create a matrix object there are two general methods:

  1. Copy values from an existing array object.
  2. Create a matrix and initialize it as you would an array.

NumPy provides some special conveniences for initializing matrix objects which you will find useful,

In [ ]:
from numpy import *


# Method 1, *copy* data from an existing array
A = array([[1.0, 2.0], [3.0, 4.0]])
A = matrix(A)
print(type(A))
print(A)

print(8*'-')

# Method 2a, provide matrix element values in an array()-like syntax
A = matrix([[1.0, 2.0], [3.0, 4.0]])
print(A)

print(8*'-')

# Method 2b, provide matrix element values in a more compact format similar to MATLAB/Octave/Scilab etc
A = matrix('1.0 2.0; 3.0 4.0')
print(A)

You will notice that displaying matrix data is exactly the same as a 2D array. For the most part, using Method 1 is inefficient if your objective is to use matrix algebra from the beginning. There is one case where Method 1 is useful; NumPy provides several built-in functions to generate special arrays,

In [ ]:
# the output of all of NumPy's built-in functions are arrays, they are more general than matricies
B = ones((3,3))
print(type(B))

# instead of creating different built-in functions, you should instead initialize a matrix using the 
# array output
B = matrix(B)
print(type(B))
print(B)

One important NumPy function is identity() in that we frequently need to use the identity matrix in matrix algebra,

In [ ]:
I = identity(3)
print(type(I))
print(I)

I = matrix(I)
print(type(I))
print(I)

Sometimes you will run into a situation where you want to perform both elementwise and matrix operations on the same data. Method 1 is not ideal in that it copies data from an existing array to make a matrix,

In [ ]:
Aarr = array([[1.0, 2.0], [3.0, 4.0]])
Amat = matrix(Aarr)

# Aarr's data will *not* be changed
Amat[0,0] = 0.0

print(Aarr)
print(Amat)

What solution to this problem would be consistent with past array methods? A view! NumPy provides two functions to create views of array and matrix objects for this exact purpose: asmatrix() and asarray,

In [ ]:
Aarr = array([[1.0, 2.0], [3.0, 4.0]])
Amat = asmatrix(Aarr)

# Aarr's data *will* be changed
Amat[0,0] = 0.0

print(Aarr)
print(Amat)

Matrix operations and linear algebra

Arithmetic operations with matrix objects results in matrix algebra operations that you are familiar with,

In [ ]:
A = matrix('1.0 2.0; 3.0 4.0')
I = asmatrix(identity(2))

print(A+I)
print(8*'-')
print(A-I)
print(8*'-')
print(A*I)
print(8*'-')
print(A**2)

Additionaly, there are many attributes of the matrix class that are defined for convenience,

In [ ]:
# the `T` attribute of a `matrix` is a *view* of its transpose
print(A.T)

print(8*'-')

# the `I` attribute of a `matrix` is a its inverse, it is computed when needed (for obvious reasons)
print(A*A.I)

For extended functionality, such as factorizations and decompositions, the linalg module of NumPy provides linear algebra functionality,

In [ ]:
# find the determinant of a matrix
linalg.det(A)
In [ ]:
# find the inverse of a matrix, equivalent to A.I
linalg.inv(A)
In [ ]:
# eigendecomposition of a square matrix
linalg.eig(A)
In [ ]:
# QR factorization, note that a more robust method is used than Gram-Schmit
linalg.qr(A)
In [ ]:
# solving a linear system A*x = b, as you learned in your Linear Algebra course
# this is more efficient than A.I*B
b = matrix("1.0 ; 2.0")
linalg.solve(A,b)

File Input/Output in NumPy

NumPy provides a relatively simple but powerful set of functions for reading and writing numerical data files:

  • save(filename, array_like)
  • savez(filename, label1 = array_like1, label2 = array_like2, ...)
  • load(filename)
  • savetxt(filename, array_like, format)
  • loadtxt(filename)

You will learn about the basic usage of these functions and one intermediate-level concept, memory-mapped file access, that is useful for dealing with very large data-sets.

Saving and loading single NumPy arrays

The most simple approach to saving numerical data with NumPy is to save each array in its own file,

In [ ]:
# let's create a relatively large array
data = random.random((1000,1000))

# create a `string` object containing the desired file name
filename = "my_array"

# use the `save()` function
save(filename, data)

After executing this code, you will find a file "my_array.npy" is your current working directory. If you try and open the file, you will find that is it a raw binary file and is not directly readable. In many cases you might want to be able to read the data without opening a Python session; this can be accomplished with savetxt() but this convenience is not without a cost...

In [ ]:
# save the array data to a human-readable text file
savetxt(filename+".txt", data)

If you look in your working directory, there is a new file "my_array.txt". The savetxt() function does not add a filetype suffix so you may choose what is appropriate (.txt, .csv, etc). In this case I am using the default formatting to output the data, but there is quite a bit of flexibility in the formatting that you may configure. This option is especially useful for exporting data to another software package such as spreadsheet software.

The cost for this readability is the size of the data file; in your working directory type ls -lh my_array.*,

$ ls -lh my_array.*
-rw-rw-r-- 1 nasser nasser 7.7M Jun 20 15:47 my_array.npy
-rw-rw-r-- 1 nasser nasser  24M Jun 20 16:08 my_array.txt

The size of the text file is almost three times that of the binary file. The text representation of each value in the array clearly requires more data than the direct binary representation and a delimiter (a space, comma, etc) is required in the text file for the reader to distinguish each value. Thus the vast majority of the time we will use the more efficient save() function.

Loading saved files is a relatively simple task, if the file is in the binary ".npy" format use load(),

In [ ]:
array1 = load(filename+".npy")

print(array1.shape)

Else if it is in a text format use loadtxt(),

In [ ]:
array2 = loadtxt(filename+".txt")

print(array2.shape)

In the previous examples we have chosen to use the array datatype, but any array-like data type is compatible with these functions such as matrix.

As you have seen, single array input/output is very very simple, but frequently we wish to save many arrays and doing so in a single file would be convenient.

Saving a loading multiple NumPy arrays

Saving and loading data files with multiple arrays is slightly more complicated and can be accomplished using savez() to save and load() (as before),

In [ ]:
# using array1 and array2 from the previous example
savez("my_arrays", large_array=array1, small_array=array1[::10, ::10])

# an equivalent function call, but now the keyword for array1 will default to arr_0
# and for array2 will default to arr_1
#savez("my_arrays", array1, array1[::10, ::10])

As before, the initial argument is the file name, but now additional arguments require assigning the arrays you want to store to keywords. As many arrays as you wish may be assigned and the keywords are optional. If they are not provided a default label arr_0, arr_1, ... are assigned. Check your working directory and you will find a data file "my_arrays.npz", where the ".npz" suffix denotes that this is a different format than ".npy".

This saves the array data in an uncompressed format, the variant savez_compressed() only differs in that it saves the data in a compressed format. Typically with numerical data, compression does not perform well unless there are many repeated values.

To load the data, use the same syntax as before,

In [ ]:
arrays = load("my_arrays"+".npz")

type(arrays)

Now the load() function returns a single NpzFile object which is derived from the dict object. The NpzObject behaves very similar to a dictionary where the keys are the keywords from the savez() call and the data associated with the keys are the arrays that were stored.

In [ ]:
# the list of keys is useful
keys = arrays.keys()
print(keys)

array1 = arrays[keys[0]]
array2 = arrays[keys[1]]

print(array1.shape)
print(array2.shape)

Working with large arrays -- memory-mapped data access

The load() function allows memory-mapped access to data files where only a small segment of a file on-disk is loaded into main memory. For large to very large files this results in significant speed-up of array manipulations. A simple example of this is an array containing time series data of the state of a system. We typically will need to manipulate one or a few snapshots at a time, so there is no need to load the whole array at once. The memory-mapping functionality of NumPy is transparent to the user; simply enable it through setting the keyword mmap_mode of the load() function to one of these four values,

`r`Open existing file for reading only.
`r+`Open existing file for reading and writing.
`w+`Create or overwrite existing file for reading and writing.
`c`Copy-on-write: assignments affect data in memory, but changes are not saved to disk. The file on disk is read-only.

The purpose of "r" and "r+" are obvious, but "w+" and "c" require some explanation. The "c" option us useful when you want to manipulate and explore array data without affecting the data on-disk. Using this option frequently is good practice so that you do not accidentilly change important data that you have stored. Consider removing filesystem write access for an extra measure of caution,

$ touch file
$ ls -l file
-rw-rw-r-- 1 nasser nasser 0 Jun 20 17:14 file
$ chmod -w file
$ rm file
rm: remove write-protected regular empty file ‘file’?

The operating system will not allow you to modify this file and removing it requires an extra confirmation! The w+ is similar to the r+ option, except it overwrites the existing file, so it is rarely used.

This functionality only works with single array ".npy" files, so let's open the "my_array.npy" file from the previous example,

In [ ]:
# load the array stored in "my_array.npy" for read-only access the using memory-mapping mode
arr_mmap = load("my_array.npy",mmap_mode='r')

type(arr_mmap)

The memmap object is array-like, we may manipulate it directly but it is more understandable to use asarray() or asmatrix() to create a view of it that has a familiar type,

In [ ]:
arr = asarray(arr_mmap)
print(type(arr))

mat = asmatrix(arr_mmap)
print(type(mat))