I love C. I’ve written a little bit of C code in my time - both UNIX userland and kernel stuff. I co-wrote OpenBSD’s rum(4) IEEE 802.11a/b/g wireless driver for Ralink USB devices [article here] and also made large contributions to OpenRCS and OpenCVS [articles here, here and here]. I’m also the author of the small, portable and efficient BitTorrent implementation, Unworkable, which is part of our work at P2P Research. So I am relatively familiar with the language.
I’ve been hacking Python code for around two years now, really developing a taste for it from my day job. I would not consider myself a Python guru by any stretch, but I’ve worked with many different parts of the standard library, and used enough of the features (generators, lambdas, list comprehensions, classes, etc.) that I reckon I have a pretty solid handle on what it offers.
The majority of the crawling and data analysis software developed here at P2P Research is written in Python - with a little bit of C here and there, for performance. I suppose that the system features our stuff uses can be broken down into the following categories:
- String manipulation / parsing.
- Fast dynamic data structures. Lists and dictionaries, at a high level, including sorting etc.
- Networking. Specifically, a lot of HTTP is spoken.
- Threading. For increased throughput.
- File I/O. For archival purposes.
- Database. We use PostgreSQL for some reporting and analysis.
I’m going to briefly compare the two languages on each of these items. All of these things can be achieved relatively straightforwardly in both C and Python. Consider how many network servers, text editors and databases are written purely in C. The POSIX and ANSI standards actually give you a pretty good set of library functions for doing these things, too - apart from the data structure area, I suppose. There are also mature interfaces available for working with databases.
What Python really gives you that C does not, in my opinion, are the following:
- Largely eliminates the headaches of memory management.
- Similarly, makes string manipulation much less painful, while keeping the familiar printf-style format specifiers. Consider the following C snippet, followed by the Python equivalent:
```c
/* Format a HTTP 1.0 GET request safely in C */
l = snprintf(request, GETSTRINGLEN,
    "GET %s%s HTTP/1.0\r\nHost: %s\r\nUser-agent: Unworkable/%s\r\n\r\n", path,
    params, host, UNWORKABLE_VERSION);
if (l == -1 || l >= GETSTRINGLEN)
    goto trunc;
/* ... */
trunc:
    trace("announce: string truncation detected");
    xfree(params);
    xfree(request);
    xfree(tparams);
    return (-1);
```
```python
# Format a HTTP 1.0 GET request safely in Python
request = "GET %s%s HTTP/1.0\r\nHost: %s\r\nUser-agent: Unworkable/%s\r\n\r\n" % (
    path, params, host, UNWORKABLE_VERSION)
```
The big difference in this case is really the amount of care you need to take with memory cleanup and error checking in C. Python is far more lenient than C when it comes to string and memory manipulation, which saves a great deal of complexity.
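To make the point concrete, here is a minimal sketch of the same request formatting wrapped in a function. The name `build_get_request` and its `version` parameter (standing in for UNWORKABLE_VERSION) are made up for illustration; the point is only that the result string grows to whatever size is needed, so there is no buffer length to check and nothing to free.

```python
def build_get_request(host, path, params="", version="0.0"):
    """Return an HTTP/1.0 GET request as a string.

    Hypothetical helper: the name and the `version` argument are
    illustrative assumptions, not part of Unworkable.
    """
    # The string is allocated at whatever size the arguments require,
    # so there is no truncation case to handle.
    return ("GET %s%s HTTP/1.0\r\n"
            "Host: %s\r\n"
            "User-agent: Unworkable/%s\r\n"
            "\r\n") % (path, params, host, version)
```

Even a parameter string of several kilobytes goes through unchanged, which is exactly the case the C version has to guard against with its length check.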
- There are good, relatively straightforward implementations of various data structures for C. Well-known examples are the venerable sys/queue.h for various sorts of linked lists, and the similar sys/tree.h for red-black trees or splay trees, which are typically used to implement dictionaries.
But these C macros, while extremely helpful, are still tricky. It is not obvious, for example, how to let an object (in C, something declared with the struct keyword) be a member of an arbitrary set of TAILQs. In fact, you need a fairly convoluted definition, not to mention complex management code:
```c
#include <sys/queue.h>

/* An actual node, which can be used in arbitrary lists */
struct node {
    char *key;
};

/* Separated list structure for managing nodes */
struct node_list_entry {
    TAILQ_ENTRY(node_list_entry) node_list;
    struct node *item;
};
```
It makes you appreciate Python code like this:
```python
mylist = []
mylist.append("foo")
mylist.append(1)
mylist.sort()  # works in Python 2; Python 3 raises TypeError for mixed str/int
```
And after investigating what is involved in getting dictionary-like storage from C (left as an exercise to the reader), code like this:
```python
mydict = {}
mydict['foo'] = 'bar'
del mydict['foo']
```
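As a rough illustration of the "member of arbitrary lists" problem from the C side, here is a small sketch, assuming Python 3. The `Node` class is a made-up stand-in for the C `struct node` above, and `demo` is just an illustrative name. The same object sits in two lists and a dictionary at once, with no separate entry structure to manage.

```python
class Node:
    """Hypothetical stand-in for the C `struct node` shown earlier."""

    def __init__(self, key):
        self.key = key
        self.hits = 0


def demo():
    foo, bar = Node("foo"), Node("bar")

    # One object can belong to any number of containers at once.
    pending = [foo, bar]
    active = [foo]
    by_key = {n.key: n for n in pending}

    # A change made through the dictionary is visible through the lists,
    # because every container holds a reference to the same object.
    by_key["foo"].hits += 1

    # Sorting by an attribute needs only a key function.
    ordered = sorted(pending, key=lambda n: n.key)

    # Removing from one container leaves the others untouched.
    del by_key["bar"]

    return active[0].hits, [n.key for n in ordered], sorted(by_key)
```

The containers only store references, which is roughly what the `struct node_list_entry` indirection achieves in C, except that here the language does the bookkeeping.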
- The TCP/IP stacks in all major operating systems are written in C, and so are a good number of extremely popular network clients and servers (Apache, Sendmail, OpenSSH). One could perhaps even argue that networking, particularly very low-level networking, is one of the things that C is best suited for. However, just opening a TCP socket safely is quite a lot of C code:
```c
/* C snippet to connect to a remote host via TCP */
struct addrinfo hints, *res, *res0;
int error, sockfd;

memset(&hints, 0, sizeof(hints));
hints.ai_family = PF_INET;
hints.ai_socktype = SOCK_STREAM;
error = getaddrinfo(host, port, &hints, &res0);
if (error) {
    /* handle error */
}
res = res0;
sockfd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
if (sockfd == -1) {
    /* handle error */
}
if (connect(sockfd, res->ai_addr, res->ai_addrlen) == -1) {
    /* handle error */
}
freeaddrinfo(res0);
return (sockfd);
```
Now compare this to the Python equivalent:
```python
import socket

HOST, PORT = "www.example.com", 80  # placeholder values for illustration
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect((HOST, PORT))
```
When it comes to HTTP, or other protocols, the difference is even greater. Of course, much of this can be attributed to string and memory handling. To be fair, implementing a basic HTTP/1.0 client in C is not that hard - I did it in under 500 lines of code in Unworkable. However, Python’s standard library - whether via urllib, urllib2 or httplib directly - just makes it at least an order of magnitude less of a headache compared to C.
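Here is a minimal sketch of what an HTTP GET looks like with the standard library, assuming Python 3, where httplib is called `http.client`. To keep it self-contained it talks to a throwaway local server rather than a real site; `HelloHandler`, `fetch` and `demo` are names invented for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a fixed body."""

    def do_GET(self):
        body = b"hello from the test server\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch(host, port, path="/", timeout=5):
    """Issue a GET request and return (status code, body bytes)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        return resp.status, resp.read()
    finally:
        conn.close()


def demo():
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", port)
    finally:
        server.shutdown()
        server.server_close()
```

The client side is the five lines inside `fetch`; request formatting, header parsing and reading a body of unknown length are all handled by the library.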
- In the realm of threading, it seems pretty clear to me that the POSIX threads (pthreads) interface has won. The API is available on all POSIX-compliant operating systems. I don’t have a huge amount of experience with using it through C - a few years ago I did some very simple stuff with it. While not impossible, it is complicated and tricky to deal with. On the other hand, Python offers its own threading module, loosely based on Java’s API. I find it very easy to use threads in Python. Perhaps the most striking feature is that the threading module supports both an object-oriented paradigm, where you extend the Thread class with your own, and a functional approach, where you pass a function as the thread’s target. The functional approach makes great sense to me - I very much like the idea. Creating a thread like this is as simple as:
```python
# Simple Python threads example, using functional paradigm
import threading

def worker():
    while True:
        # do work then break
        break

t = threading.Thread(target=worker)
t.start()
```
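For comparison, here is a sketch of the object-oriented style, assuming Python 3 (where the Queue module is named `queue`). The `SquareWorker` class and the `square_all` function are invented for illustration: a few worker threads pull numbers off a shared queue and push results onto another.

```python
import queue
import threading


class SquareWorker(threading.Thread):
    """Hypothetical worker: squares numbers taken from a job queue."""

    def __init__(self, jobs, results):
        super().__init__()
        self.jobs = jobs
        self.results = results

    def run(self):
        while True:
            n = self.jobs.get()
            if n is None:  # sentinel: no more work for this thread
                break
            self.results.put((n, n * n))


def square_all(numbers, num_threads=4):
    jobs, results = queue.Queue(), queue.Queue()
    workers = [SquareWorker(jobs, results) for _ in range(num_threads)]
    for w in workers:
        w.start()
    for n in numbers:
        jobs.put(n)
    for _ in workers:
        jobs.put(None)  # one sentinel per worker
    for w in workers:
        w.join()
    # All workers have exited, so the results queue is no longer changing.
    out = {}
    while not results.empty():
        n, squared = results.get()
        out[n] = squared
    return out
```

The queues do their own locking, so the workers never touch a mutex directly. That is a large part of why this feels easier than raw pthreads.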
- File I/O is an area where straight C really isn’t too bad. You have your POSIX interface, via open(2), read(2), write(2), etc., and you have your ANSI buffered I/O functions with fopen(3), fread(3), fwrite(3), etc. Many of the shell commands for file system manipulation map very closely to libc calls: for example, mkdir(2), dirname(3), stat(2) and so on. Python - once again mostly thanks to handling the memory management for you - helps a lot when you are reading from a file whose size is unknown (for example, a pipe or a network socket).
I would also mention that Python’s standard library has a concept of ‘file-like objects’ which are essentially opaque data buffers which can be accessed through exactly the same interfaces as actual files. Common examples are StringIO, urllib and urllib2.
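A small sketch of the file-like object idea, assuming Python 3, where StringIO lives in the `io` module. The `count_lines` and `demo` names are made up for this example. The same function works on an in-memory buffer and on a real temporary file, because it only relies on the shared file interface.

```python
import io
import tempfile


def count_lines(fileobj):
    """Count lines in anything that can be iterated like a file."""
    return sum(1 for _ in fileobj)


def demo():
    text = "alpha\nbeta\ngamma\n"

    # An in-memory buffer that behaves like an open text file.
    in_memory = count_lines(io.StringIO(text))

    # A real file on disk, read through the same function.
    with tempfile.TemporaryFile(mode="w+") as f:
        f.write(text)
        f.seek(0)
        on_disk = count_lines(f)

    return in_memory, on_disk
```

Neither caller needs to know the size of the data in advance, and `count_lines` never needs to know where the data came from.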
- When it comes to working with databases, Python has the usual advantage of making it easy to deal with dynamic result sets. Additionally, abstractions like DB API 2 and some of the advanced language features, such as list comprehensions and generators, can greatly reduce the amount of code required for filtering and processing data from databases. Furthermore, I have found that psycopg2 (the website of which is unfortunately in bad shape) works extremely well in a threaded environment.
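To sketch how DB API 2, generators and list comprehensions fit together, here is a small example, assuming Python 3. It uses the standard library's sqlite3 module as a stand-in so that it runs without a PostgreSQL server; psycopg2 exposes the same cursor interface but uses `%s` placeholders instead of `?`. The `peers` table, its columns, and the `rows` and `demo` names are all invented for illustration.

```python
import sqlite3


def rows(cursor, batch_size=100):
    """Generator that yields rows from a DB API 2 cursor in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            return
        for row in batch:
            yield row


def demo():
    conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE peers (host TEXT, bytes_up INTEGER)")
        cur.executemany(
            "INSERT INTO peers VALUES (?, ?)",
            [("a.example", 10), ("b.example", 2500), ("c.example", 900)],
        )
        conn.commit()

        cur.execute("SELECT host, bytes_up FROM peers ORDER BY host")
        # Filter the result set with a list comprehension over the generator.
        return [host for host, bytes_up in rows(cur) if bytes_up >= 500]
    finally:
        conn.close()
```

The result set can be any size; the generator pulls it in batches and the comprehension keeps only the rows of interest, with no manual allocation anywhere.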
In conclusion, Python allows you to write complicated, useful applications with fewer bugs, much faster than in C. It removes many (but not all) headaches associated with memory management and data structures. Many of the portability issues are taken care of for you. Essentially, with Python you stand on the shoulders of giants. While C is still extremely useful and important, Python makes excellent sense for many classes of program.