Drupal: is that site hacked?

Some months back I needed to find out whether a Drupal site had been hacked, that is, how much of its code had been modified. Installing a copy of the site was not straightforward: it had hardcoded absolute paths in custom modules, among other annoying things.

http://dgo.to/hacked in conjunction with http://dgo.to/diff is a well-known way to find out whether core or contrib code has been modified. That solution requires a site that can bootstrap and run. As I said, it was difficult to install the site locally. In fact, I refused to spend my time making it work and dealing with its cumbersome admin area. I didn't need to get the site working, I just needed to know how much it was hacked, so I thought of a solution based on Drush make.

Drush make also works in reverse: it can generate a makefile from an existing site. Simply run drush generate-makefile sitename.make. The idea was to get a fresh copy of the same versions of core and the contributed projects, and compare it with the site to detect which files had been altered.

So the approach is as follows:

  1. Generate a makefile from the original site. The resulting makefile describes the projects (and their versions) used to build the site. It may not be complete in some cases. For example, generate-makefile doesn't consider libraries or external projects not hosted on drupal.org. It may also be less effective if projects are git clones (and git_deploy is not enabled) or, even worse, CVS checkouts. So you may need to edit the resulting makefile and adjust it.
  2. Build the makefile into another location to get a vanilla copy of the same code.
  3. Compare both directory trees. Here you can use two different approaches: Unix's diff or a Python script I wrote for this very purpose.

And here are the basic command lines:

:/var/www$ cd htdocs-original
:/var/www/htdocs-original$ drush generate-makefile ../hackedsite.make
:/var/www/htdocs-original$ cd ..
:/var/www$ drush make hackedsite.make htdocs-vanilla
:/var/www$ diff -r -q --exclude=sites/default/files --exclude=translations htdocs-original htdocs-vanilla

Here's a snippet of the output I get:

Only in htdocs-vanilla/sites/all: README.txt
Only in htdocs-original/sites/all/modules: artistas
Files htdocs-original/sites/all/modules/calendar/CHANGELOG.txt and htdocs-vanilla/sites/all/modules/calendar/CHANGELOG.txt differ
Files htdocs-original/sites/all/modules/calendar/LICENSE.txt and htdocs-vanilla/sites/all/modules/calendar/LICENSE.txt differ
Files htdocs-original/sites/all/modules/calendar/calendar.css and htdocs-vanilla/sites/all/modules/calendar/calendar.css differ
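
To get a feel for how much was changed, the diff output can be grouped and counted. The following is a minimal Python sketch under some assumptions: the output comes from diff -r -q with the original tree as the first argument, the messages are in English, and no path contains ": " or " and ". The names summarize and modified_per_directory are made up for this illustration.

```python
import posixpath
import re
from collections import Counter

# The two message shapes printed by `diff -r -q` (English locale assumed).
ONLY_IN = re.compile(r"^Only in (?P<dir>.+?): (?P<name>.+)$")
DIFFER = re.compile(r"^Files (?P<left>.+?) and (?P<right>.+) differ$")


def _inside(path, root):
    """True if path is root itself or lies below it."""
    root = root.rstrip("/")
    return path == root or path.startswith(root + "/")


def summarize(diff_output, original_root, vanilla_root):
    """Sort the lines of `diff -r -q` output into three lists of relative paths."""
    summary = {"only_original": [], "only_vanilla": [], "modified": []}
    for line in diff_output.splitlines():
        only = ONLY_IN.match(line)
        differ = DIFFER.match(line)
        if only:
            path = posixpath.join(only.group("dir"), only.group("name"))
            if _inside(path, original_root):
                summary["only_original"].append(
                    posixpath.relpath(path, original_root))
            elif _inside(path, vanilla_root):
                summary["only_vanilla"].append(
                    posixpath.relpath(path, vanilla_root))
        elif differ:
            # The left-hand file belongs to the first tree given to diff,
            # which is assumed to be the original site.
            summary["modified"].append(
                posixpath.relpath(differ.group("left"), original_root))
    return summary


def modified_per_directory(summary):
    """Count modified files per containing directory."""
    return Counter(posixpath.dirname(path) for path in summary["modified"])
```

Redirect the diff output to a file, read it into a string and pass it to summarize together with the two directory names (htdocs-original and htdocs-vanilla in the commands above). Entries under only_original are usually custom code, like the artistas module in the snippet, while modified entries are the candidates for hacked core or contrib files.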

In the above example I used Unix's diff. My Python script works essentially the same way but provides some advantages and extra features.
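
That script is not reproduced here. As a rough idea of what such a comparison can look like, here is a minimal sketch that uses only the Python standard library. It is not the script mentioned above: the function names and the exclude parameter are invented for this example, and directories are excluded by their path relative to the tree root.

```python
import filecmp
import os


def list_files(root, exclude=()):
    """Return the set of file paths below root, relative to root.

    Directories whose relative path appears in `exclude` are skipped.
    """
    skip = {os.path.normpath(p) for p in exclude}
    found = set()
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        # Prune excluded directories in place so os.walk does not enter them.
        dirnames[:] = [
            d for d in dirnames
            if os.path.normpath(os.path.join(rel_dir, d)) not in skip
        ]
        for name in filenames:
            found.add(os.path.normpath(os.path.join(rel_dir, name)))
    return found


def compare_trees(original, vanilla, exclude=()):
    """Return (only_in_original, only_in_vanilla, modified) as sorted lists."""
    orig_files = list_files(original, exclude)
    vanilla_files = list_files(vanilla, exclude)
    modified = sorted(
        path for path in orig_files & vanilla_files
        # shallow=False compares file contents instead of trusting
        # size and modification time.
        if not filecmp.cmp(os.path.join(original, path),
                           os.path.join(vanilla, path), shallow=False)
    )
    return (sorted(orig_files - vanilla_files),
            sorted(vanilla_files - orig_files),
            modified)


if __name__ == "__main__":
    # Directory names taken from the commands above.
    only_orig, only_vanilla, modified = compare_trees(
        "htdocs-original", "htdocs-vanilla",
        exclude=("sites/default/files",),
    )
    for label, paths in (("Only in original", only_orig),
                         ("Only in vanilla", only_vanilla),
                         ("Modified", modified)):
        for path in paths:
            print("%s: %s" % (label, path))
```

Having the results as Python lists instead of text makes it easy to filter them further, for example to ignore files such as LICENSE.txt that often differ without being real modifications.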