UPDATE 02/10/2014: Added Section on Sessions
The web is awesome. As a platform, it has such a low barrier to entry that nearly anyone can throw up a web page in an afternoon, and this fact has enabled probably one of the greatest information renaissances in human history.
But this low barrier to entry is not without its disadvantages. Because it can be so easy to build something for the web (especially with WYSIWYG editors and great IDEs like Visual Studio, where you can create even a fairly complex application just by dragging and dropping stuff onto a page), I come across more and more "developers" who don't even have a basic understanding of the fundamental components of the web. The web is a platform, and just like any other platform (OS X, Windows, Linux, iOS, Android, etc.), you need to understand how the platform works in order to develop quality applications.
So, here, I'm compiling a list of what I feel are bare-essential topics every would-be web developer should know and understand:
HTTP is the foundational protocol on which the entire web is based, so it's only appropriate to start here. HTTP is an acronym for "HyperText Transfer Protocol", and it's essentially the playbook for how web servers and web clients interact with each other. The current working version is HTTP 1.1, but HTTP 2.0 is on the horizon, which will integrate many aspects of the SPDY protocol, developed by Google.
An HTTP Client is an agent that issues an HTTP request. The most well-known HTTP Client is, of course, the web browser (Internet Explorer, Chrome, Firefox, etc.). An HTTP Server is an agent that responds to an HTTP request. If you've ever heard of Apache, Nginx, IIS, etc., these are all HTTP Servers.
However, it is important to realize that the client-server relationship is not set in stone. A server can often be a client itself and vice versa. The terms "client" and "server" apply to whatever is requesting or responding, respectively, in the given context.
An HTTP session is initiated with a request. For example, when you type a URL into your browser's address bar, you are instructing your browser (the client) to issue a request for a resource on a server. The server receives the request and sends back a response to the client, in this case, an HTML document.
Clients issue requests via one of several "verbs". The most common are GET and POST, but there are also others, such as HEAD, PUT, DELETE, and PATCH. The verb used both describes the request being made and instructs the server on how to respond. For instance, HEAD tells the server that all the client needs is the response headers, while GET would necessitate a response that includes both headers and a body. POST indicates that there is a request payload: data the server should use in order to serve the request and create a response.
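To make the verbs a bit more concrete, here is a minimal sketch of what a request looks like on the wire. The `build_request` helper, the host `example.com`, and the paths are all made up for illustration; real clients send more headers than this.

```python
def build_request(method, path, host, headers=None, body=b""):
    """Serialize an HTTP/1.1 request into the bytes sent over the wire."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # A payload needs a length so the server knows where it ends.
        lines.append(f"Content-Length: {len(body)}")
    # A blank line separates the headers from the (optional) body.
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + body


# GET asks for a resource and carries no payload.
get_request = build_request("GET", "/index.html", "example.com")

# POST carries a payload the server uses to service the request.
post_request = build_request(
    "POST",
    "/comments",
    "example.com",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    body=b"author=alice&text=hello",
)

print(get_request.decode("ascii"))
print(post_request.decode("ascii"))
```

The only difference between the two is the verb on the first line and the presence of a body after the blank line.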
Requests are issued with headers. These are just name-value pairs that serve as hints to the server for constructing a response. The term "hints" is used here because the server is not forced to comply. For example, since I'm in the United States and a native English speaker, my browser sends the request header Accept-Language: en-US. If I access a U.S. website, it will of course happily respond with a web page in English, but a Chinese website may not have a version in English for me, and would therefore still respond with a web page in Chinese. However, if the Chinese site were, indeed, localized (that is, it has versions for languages other than its main one), I would receive the page in English, as I requested.
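Here's a rough sketch of how a server might treat Accept-Language as a hint. The function names and the set of available languages are assumptions for illustration, and real frameworks handle more cases (wildcards, for instance) than this does.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, most preferred first."""
    weighted = []
    for part in header.split(","):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without an explicit q-value defaults to 1.0
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        weighted.append((quality, tag))
    # sorted() is stable, so ties keep the order the client listed them in.
    ranked = sorted(weighted, key=lambda item: -item[0])
    return [tag for quality, tag in ranked if quality > 0]


def choose_language(header, available, default):
    """Pick the best available language, or fall back to the site's default."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # let "en-us" fall back to plain "en"
        if primary in available:
            return primary
    return default  # the header was only a hint; the server is free to ignore it


# A localized Chinese site honors the hint; an unlocalized one cannot.
print(choose_language("en-US,en;q=0.9", {"zh", "en"}, default="zh"))  # en
print(choose_language("en-US,en;q=0.9", {"zh"}, default="zh"))        # zh
```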
Response headers are similar to request headers in that they, too, are just name-value pairs. However, these serve as hints to the client, rather than the server. One common example would be the Expires header, which tells the client how long the resource is "good" for by giving the date and time at which it goes stale. If the expire date is sufficiently far in the future, the client may decide to cache the response from the server so that it does not need to request it again later. Again, compliance is not forced; the client may choose not to cache now or ever.
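As a sketch of the client's side of this, the check below decides whether a cached copy can still be used. The `is_fresh` name is made up, and real browser caches weigh other headers as well; this only looks at Expires.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """Return True if a cached response may still be reused at time `now`."""
    try:
        expires_at = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        # An unparseable date is treated as already expired.
        return False
    if expires_at.tzinfo is None:
        # Assume UTC when the header carries no usable timezone.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now < expires_at


now = datetime(2015, 10, 21, 7, 0, tzinfo=timezone.utc)
print(is_fresh("Wed, 21 Oct 2015 07:28:00 GMT", now))  # True: reuse the cached copy
print(is_fresh("Wed, 21 Oct 2015 06:00:00 GMT", now))  # False: ask the server again
```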
The HTTP protocol delineates a set of status codes that describe a server's response. Just about everyone who has drawn air in the last 20 years is familiar with 404 Not Found, an error code that means the requested resource could not be located on the server. Status codes are defined in ranges:
1XX- Informational
2XX- Success
3XX- Redirection
4XX- Client Error
5XX- Server Error
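Since the first digit alone tells you the kind of response, a client can classify any code with a simple lookup. This is just a sketch; `describe_status` is a made-up helper, while `HTTPStatus` comes from Python's standard library.

```python
from http import HTTPStatus

# The first digit of a status code identifies its range.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def describe_status(code):
    """Return a short description of a status code and the range it falls in."""
    category = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not every number in a range is a registered status code.
        phrase = "(unregistered code)"
    return f"{code} {phrase}: {category}"


print(describe_status(200))  # 200 OK: Success
print(describe_status(404))  # 404 Not Found: Client Error
print(describe_status(503))  # 503 Service Unavailable: Server Error
```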
HTTP is a stateless protocol. This means that each request and response lives in its own universe. It doesn't matter if the client has requested 1 resource or 100 from the server; the next request is always treated as if it were the first ever. Closely related are the topics of sessions and cookies, which provide the illusion of state to web applications.
When the client issues a request, a connection is opened to the server. Originally, connections were not persistent; once the server sent the response, the connection was closed, and would need to be reopened with the next request. HTTP/1.1 made persistent (keep-alive) connections the default, allowing a connection to remain open for some defined period of time to reduce request latency (the time it takes to contact the server and initiate a connection). However, from a philosophical standpoint, it's best to consider a connection as closed once the response is received.
REST, short for "REpresentational State Transfer", is separate from but very closely related to HTTP. It is formally a networked-application architectural style that centers around HATEOAS (Hypermedia As The Engine of Application State), which is just a fancy way of saying that clients and servers should interact entirely via hypermedia, a logical extension of the term "hypertext" (as in HyperText Markup Language, or HTML) that includes non-textual media such as video and audio as well. REST is technically protocol-agnostic, but in nearly all cases, it uses HTTP. It is defined by constraints, which include:
- Client-Server: There is clear separation of concerns between a service and a consumer determined by a technical contract they share.
- Stateless: Services do not store context between request/response exchanges with consumers, and every request must contain all information necessary to service the request.
- Cacheable: Consumers can cache responses from services, reducing unnecessary communication and promoting scalability and performance.
- Uniform Contract: Consumers access service resources in a standardized way that promotes composition and reusability.
- Layered System: A solution is composed of layers such that the consumer does not need to connect directly to the service, promoting scalability through load-balancing or shared caches.
- Code-on-Demand (optional): Services may defer execution of logic by transferring executable code, allowing features to be added to consumers without requiring a formal upgrade.
It's all a little heady, honestly, but it's a core principle of how the web and HTTP in general operate.
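If HATEOAS still feels abstract, a small sketch may help. The order resource, its "links" layout, and the `find_link` helper below are all hypothetical; REST doesn't mandate any particular format. The point is only that the client discovers what it can do next from the response itself, rather than hard-coding URLs.

```python
# A hypothetical JSON representation of an order, as a server might return it.
# The "links" layout is an assumption for illustration, not a standard.
ORDER = {
    "id": 42,
    "status": "unpaid",
    "links": [
        {"rel": "self", "href": "/orders/42", "method": "GET"},
        {"rel": "payment", "href": "/orders/42/payment", "method": "POST"},
        {"rel": "cancel", "href": "/orders/42", "method": "DELETE"},
    ],
}


def find_link(representation, rel):
    """Return the link with the given relation, or None if it is not offered."""
    for link in representation.get("links", []):
        if link["rel"] == rel:
            return link
    return None


# The client follows whatever the server advertises for "payment".
payment = find_link(ORDER, "payment")
print(payment["method"], payment["href"])  # POST /orders/42/payment

# An action the server did not advertise is simply not available right now.
print(find_link(ORDER, "refund"))  # None
```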
See: Architectural Styles and the Design of Network-based Software Architectures by Roy Thomas Fielding
As discussed in the HTTP section above, the HTTP protocol is stateless. Each request is unique: it is not affected by any previous request and must contain all information necessary for the server to service the request. However, applications need state, so sessions were created as a way to layer state on top of HTTP.
A session exists in two parts: a server-side piece that's persisted in some way (usually a database) and a client-side piece, usually in the form of a cookie. In the typical scenario, the application on the server will create a new record in the session table of its database. That record will have an id, referred to as the session id. The server will then send the response to the client, along with a Set-Cookie response header, which will usually just contain this session id. When the client, in this case a web browser, receives the response, it sees the Set-Cookie header and uses its value to store a text file on the user's filesystem that holds this value. This file is often called a cookie, but the cookie is really the data in the file. On the next request the browser makes to this server, it sees that it has a cookie, so it sends the cookie back to the server along with the request using the Cookie header. The application on the server reads this header and uses the session id it includes to load the appropriate record from the session table, restoring the state of the last request.
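The round trip described above can be sketched in a few lines. Everything here is simplified and assumed for illustration: the session "table" is just an in-memory dictionary, the cookie is named `session_id`, and `handle_request` stands in for a real web application.

```python
import secrets
from http.cookies import SimpleCookie

SESSIONS = {}  # in-memory stand-in for the session table in a database


def handle_request(request_headers):
    """Return (response_headers, session) for one request."""
    cookie = SimpleCookie(request_headers.get("Cookie", ""))
    morsel = cookie.get("session_id")
    session_id = morsel.value if morsel else None

    response_headers = {}
    if session_id not in SESSIONS:
        # No usable session: create a record and hand its id to the client.
        session_id = secrets.token_urlsafe(32)
        SESSIONS[session_id] = {"visits": 0}
        response_headers["Set-Cookie"] = f"session_id={session_id}; Path=/"

    session = SESSIONS[session_id]
    session["visits"] += 1
    return response_headers, session


# First request: the client has no cookie, so the server sets one.
first_headers, first_session = handle_request({})

# The browser stores the name=value pair and sends it back next time.
cookie_pair = first_headers["Set-Cookie"].split(";")[0]
second_headers, second_session = handle_request({"Cookie": cookie_pair})

print(second_session["visits"])  # 2
```

The second request carries no state of its own beyond the id; everything else is looked up on the server.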
Session Security (or Lack Thereof)
Since cookies are just plain-text values, they present a security hole in applications. One should never send anything as a cookie other than the session id, and definitely not anything sensitive like a username or password. However, even with just the session id exposed, an application is vulnerable to what's known as session hijacking.
Since HTTP is stateless and all information needed to service a request must be sent with the request, requests can be replayed. In other words, any client can send the exact same request and get the exact same response. An attack of this nature is called a replay attack and covers more than just session hijacking. However, it follows from this idea (if you can recreate the same request, you can get the same response) that all one needs to steal a session is the cookie containing the session id.
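To see why the cookie is all an attacker needs, consider this sketch. The session table, the session id `3f9a1c`, and the `current_user` helper are made up for illustration. The server has nothing to go on except the Cookie header, so it cannot tell the two requests apart.

```python
SESSIONS = {"3f9a1c": {"user": "alice"}}  # hypothetical session table


def current_user(request_headers):
    """Look up the user purely from the session id in the Cookie header."""
    cookie_header = request_headers.get("Cookie", "")
    for pair in cookie_header.split(";"):
        name, _, value = pair.strip().partition("=")
        if name == "session_id" and value in SESSIONS:
            return SESSIONS[value]["user"]
    return None


victim_request = {"Cookie": "session_id=3f9a1c", "User-Agent": "Victim's browser"}
attacker_request = {"Cookie": "session_id=3f9a1c", "User-Agent": "Attacker's script"}

print(current_user(victim_request))    # alice
print(current_user(attacker_request))  # alice
```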
HTTPS, or HTTP Secure, is a wrapper around HTTP to help prevent things like session hijacking and replay attacks. Instead of sending plain-text requests and responses, the server and client encrypt their communication before sending it over the wire, using TLS (formerly SSL). Getting that started relies on what's called a public key and a private key. Information encrypted with the public key can only be decrypted with the matching private key. The public key is, well, public, so it is not, itself, secret. That's fine, because knowing it only lets you encrypt data, not decrypt it. The private key is not shared with the rest of the world; only the end that needs to decrypt the data should know it.
When an HTTPS connection is established, the client and server must complete a handshake. The server sends the client a certificate containing its public key, and the client checks that the certificate was issued for the site it is trying to reach by an authority it trusts. The two then agree on a shared secret key for this connection, with the server using its private key to prove it really owns the certificate, and without that secret ever crossing the wire unprotected. In most cases, the client doesn't need a key pair of its own. From then on, both requests and responses are encrypted with the shared key: the client encrypts the request, the server decrypts it and processes it as normal, and then the server encrypts the response for the client to decrypt.
In the case of a cookie, this means that it is no longer transmitted in plain text, which makes it harder (harder, but not impossible) to hijack, as one must now be able to obtain the keys protecting the connection, or compromise the client or the server itself, in order to get the value.
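From the client's side, nearly all of this is handled by a library. Below is a minimal sketch using Python's standard `ssl` module; the helper names are made up, and any host you pass to `fetch_over_https` is your own choice. Real applications would normally use a full HTTP client rather than raw sockets.

```python
import socket
import ssl


def make_client_context():
    """Create a TLS context with the library's recommended client defaults."""
    # The defaults verify the server's certificate and check that it
    # matches the hostname being contacted.
    return ssl.create_default_context()


def fetch_over_https(host, path="/", timeout=10):
    """Send one GET request over TLS and return the raw response bytes."""
    context = make_client_context()
    with socket.create_connection((host, 443), timeout=timeout) as raw_sock:
        # wrap_socket performs the handshake; nothing is sent in plain text after it.
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
            tls_sock.sendall(request.encode("ascii"))
            chunks = []
            while True:
                chunk = tls_sock.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
    return b"".join(chunks)


context = make_client_context()
print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True
```

Notice that the HTTP request itself is exactly the same text as it would be over plain HTTP; only the socket it travels through has changed.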
Secure and HttpOnly
The one big problem with sending cookies over HTTPS is what happens when they aren't sent over HTTPS. While an HTTPS connection is open, all communication is encrypted, including cookies, but if the connection is switched to plain HTTP, the client and server still send the cookie headers, only now in plain text again. To fix this, the Set-Cookie response header has the Secure attribute. If this text is in the Set-Cookie header string, then the cookie being set will only be sent over HTTPS. If the connection is switched to plain HTTP, the client stops sending the cookie, so it is never exposed as plain text.
The HttpOnly attribute works in a similar way, but it restricts who can read the cookie rather than how it is sent. It's a little inappropriately named, though, since it has nothing to do with HTTP versus HTTPS: it means the cookie is available only to the HTTP layer, and not to client-side scripts. Cookies are blindly sent with every request to their site, just in case they are needed, and by default any JavaScript running in a page can read that page's cookies as well. This created a very easy security hole. All a malicious script injected into a page (through cross-site scripting, for example) needed to do to obtain a cookie for a particular site was read it and send the value off to a server the attacker controls. The HttpOnly attribute puts a stop to this by instructing the browser to never expose the cookie to scripts. The cookie is still sent along with requests, including AJAX requests, as usual.
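Setting both attributes is a one-liner in most frameworks. Here is a sketch using Python's standard `http.cookies` module; the cookie name `session_id` and the helper function are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def session_cookie_header(session_id):
    """Build the value of a Set-Cookie header for a session cookie."""
    cookie = SimpleCookie()
    cookie["session_id"] = session_id
    cookie["session_id"]["path"] = "/"
    cookie["session_id"]["secure"] = True    # only send over HTTPS
    cookie["session_id"]["httponly"] = True  # not readable from client-side scripts
    return cookie["session_id"].OutputString()


header_value = session_cookie_header("abc123")
print("Set-Cookie: " + header_value)
```

Both attributes are plain flags in the header string; they carry no value of their own.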
Somewhere along the line, cookies became a scare-word plastered across nightly news stations. Despite what many of those clueless reporters parroted incorrectly, cookies are not harmful. They are not scripts. They're just data, and in most cases they're data that only has any meaning within a particular context (like a session id). However, because some people disable cookies or cookies may still not be supported for some other reason on a client, the concept of a "cookieless" session was created. In this scenario, data must still be transferred back and forth between the server and client in order to maintain a sense of state, but instead of using cookies, the session id or other data is placed into the query string of the request. However, there's one huge flaw with this approach. The query string is part of the URL, and URLs leak easily: they are saved in browser history and server logs, passed along to other sites in the Referer header, and copied and pasted into emails and chats, even when the connection itself uses HTTPS. So if anyone else merely goes to the same URL, the session is instantly hijacked. It is not recommended to use cookieless sessions, for any reason.
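A quick sketch shows how little it takes. The URL, the `session_id` parameter name, and the helper below are made up for illustration; the point is that the session id goes wherever the URL goes.

```python
from urllib.parse import parse_qs, urlsplit


def session_id_from_url(url):
    """Extract the session id a cookieless application would read from a URL."""
    query = parse_qs(urlsplit(url).query)
    values = query.get("session_id")
    return values[0] if values else None


# A link the user copies from the address bar and shares with a friend.
shared_link = "https://shop.example/cart?session_id=3f9a1c&item=7"

# Whoever opens the link presents the same session id as the original user.
print(session_id_from_url(shared_link))  # 3f9a1c
```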
Stay tuned. More to come.