Secret Life of HTML



"Secret Life of HTML" is a combination poetry repository and inverse search engine. By "inverse search engine" I mean a search application that indexes only the parts of the web that a standard search engine ignores: the commented-out text that the browser does not render and that stays invisible when an Internet user visits a webpage. The back-end is a database of roughly seven million commented-out lines of HTML scraped from the web. On the front-end, a user can search the database for a term or expression and see only the commented-out HTML that contains it. Users are encouraged to use lines from the scrape to make their own poems for upload and display on the site, and they can browse nine poems that I wrote using the comments.


Why integrate the poetry repository with the "inverse search engine"? I found it necessary to write a framework for locating lines of interest, and the project naturally evolved into a platform for exploring the secret life of HTML documents in a systematic and robust way. The principle of opening up the source of a document for use in art conflicted with my idea of using a physical medium such as a book, so I decided it was more important to keep my source open and to write an interface that lets the comment-exploring platform be used in a participatory way. There is something intrinsically poetic about the way HTML comments exist in the source of web documents without being part of the visible experience, so poetry seemed the right medium for transforming HTML comments into a structured artistic form.



I scraped the HTML using Heritrix, an open-source, Java-based crawler that accepts configuration and seed specifications. I crawled from a number of major poetry repositories and from Rhizome, but the results contained disappointingly little valuable commented-out text, so I ultimately augmented them with about 14 gigabytes of example files hosted on the Internet Archive. I processed the output files with a Perl script that used a regular expression to decide which lines were comments. The resulting file of comment lines is what the search-engine front-end searches. The search algorithm is trivial and was written in PHP. The rest of the site is HTML, JavaScript and basic PHP. The primary challenges on that end were the varying encodings of the input and finding the correct way to process multilingual, multi-encoding, commented-out HTML for display.
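To make the pipeline above concrete, here is a minimal sketch in Python. It is not the Perl script or the PHP search used on the site. The regular expression, the function names (decode_page, extract_comment_lines, search) and the UTF-8-then-Latin-1 decoding fallback are all my own assumptions for illustration. The sketch also matches whole comment blocks and then splits them into lines, which may differ from the line-by-line test the original script applied.

```python
import re

# Assumed pattern: one HTML comment, possibly spanning several lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def decode_page(raw):
    """Decode crawled bytes to text.

    Assumption: try UTF-8 first and fall back to Latin-1, which accepts
    any byte sequence. The real crawl needed more careful handling.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def extract_comment_lines(html):
    """Return the non-empty, stripped lines found inside HTML comments."""
    lines = []
    for match in COMMENT_RE.finditer(html):
        for line in match.group(1).splitlines():
            line = line.strip()
            if line:
                lines.append(line)
    return lines


def search(lines, term):
    """Trivial search: case-insensitive substring match over comment lines."""
    needle = term.lower()
    return [line for line in lines if needle in line.lower()]


if __name__ == "__main__":
    page = b"<html><!-- hidden note -->\n<p>visible</p><!--\nTODO: fix nav\n--></html>"
    comment_lines = extract_comment_lines(decode_page(page))
    for hit in search(comment_lines, "todo"):
        print(hit)
```

Visible text never reaches the index in this sketch, so a search only ever returns material the browser would have hidden.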


If I were doing this project in another capacity, I'd love to mimic a real search engine more closely, linking to a range of documents and tying each comment directly to its source URL. I'd also love to have a broader and more dynamic slice of the web to search. I think the idea of an "inverse search engine" is interesting, but it was not feasible to realize fully at the scale of this project. The original idea was that poems would generate themselves through searching, but the output of the engine isn't structured enough to be satisfactory in that capacity. To take things to the next level, a more sophisticated poetry-generating algorithm would considerably strengthen the poetry side of the project.
