Apache Tika VS Boredom. Bypassing Arbitrary File Upload Restrictions
While we were preparing some material for one of our clients, they asked us for some insight into how to prevent arbitrary file uploads.
Apart from the usual “just use a white-list for the allowed extensions” approach, they wanted us to focus on analyzing the uploaded files in order to determine whether or not they should be allowed.
To achieve this, they suggested an Apache library called Tika.
Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Taken from its GitHub repo.
It seems like a robust way to perform file parsing and metadata detection. However, when it is used as a file type detection mechanism, it is easy to make it fail miserably.
Let’s take a look.
Setting up the environment
It’s fairly easy to get it running: just clone the repo and let Maven do its thing.
git clone https://github.com/apache/tika.git tika-trunk
cd tika-trunk
mvn install -DskipTests
Go get a coffee, Maven is gonna take some time fetching the dependencies and building your JAR… In fact, grow your own coffee, harvest it, roast, grind and brew it yourself. Maybe once you’re done Maven will be finished as well.
Now change to the JAR file folder and execute the Tika server:
cd tika-server/target
java -jar tika-server-2.0.0-SNAPSHOT.jar
Note: The tika-server version may vary in your environment.
If successful, you should see something like this:
Good, so it’s up and serving at localhost:9998. Browsing to that address shows the Apache Tika Server endpoints with all the supported methods and a brief description of what to expect from each of them.
Giving Tika a try
The first endpoint of the Tika Server API is the one we’re interested in: /detect/stream
It lets you PUT a file and should detect what kind of file it is receiving. Fairly simple and straightforward.
All right, let’s try it.
curl -T some_random_img.png http://localhost:9998/detect/stream -v
Which should return something like this:
Nice and simple! Tika detected the file we uploaded as a PNG image. So far so good.
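For those who prefer scripting it, the same request can be sketched in Python with only the standard library. This is a minimal sketch and not part of our original test: the helper names are ours, the base URL assumes the local server started above, the explicit Content-Type is just a neutral choice on our side, and we assume the endpoint answers with the detected media type as plain text.

```python
import urllib.request

# Assumption: a Tika server is listening here, as in the setup above.
TIKA_URL = "http://localhost:9998"


def build_detect_request(data: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw file bytes."""
    return urllib.request.Request(
        f"{base_url}/detect/stream",
        data=data,
        method="PUT",
        # Neutral content type chosen for this sketch; curl -T sends none.
        headers={"Content-Type": "application/octet-stream"},
    )


def detect_type(data: bytes, base_url: str = TIKA_URL) -> str:
    """Send the bytes to /detect/stream and return the server's answer as text."""
    with urllib.request.urlopen(build_detect_request(data, base_url)) as resp:
        return resp.read().decode("utf-8").strip()
```

With the server running, calling detect_type(open("some_random_img.png", "rb").read()) should be roughly equivalent to the curl command above.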
But, is this always right? Is Tika always correct when detecting the file type?
Tika take this
A common way to bypass upload filters that rely on the so-called Magic Bytes is to add the typical GIF87a; or GIF89a; header to the restricted file before uploading it.
Tika is no different from other tools, and it will get the type wrong as soon as those bytes are added to the file.
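To see why this kind of trick works at all, here is a deliberately naive magic-bytes sniffer in Python. It is only an illustration: the signature table and function names are made up for this sketch, the PHP snippet is a harmless placeholder, and this is not how Tika is implemented.

```python
# Hypothetical signature table for a naive sniffer; this is NOT Tika's code.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}


def naive_sniff(data: bytes) -> str:
    """Guess the media type by looking only at the first bytes of the file."""
    for magic, mime in MAGIC.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"


def add_gif_header(payload: bytes, header: bytes = b"GIF89a;") -> bytes:
    """Prepend the GIF magic bytes to an arbitrary payload."""
    return header + payload


payload = b"<?php echo 'hello'; ?>"  # placeholder for the restricted file
print(naive_sniff(payload))                  # application/octet-stream
print(naive_sniff(add_gif_header(payload)))  # image/gif
```

Any detector that stops reading once the leading signature matches will make the same mistake, no matter how big the name behind it is.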
However, Tika is not yet defeated. There’s another endpoint that we should be able to bypass.
/tika to the rescue
Attempting to PUT that very same file to the /tika endpoint will return a fairly different result:
Humm… no like?
It seems the /tika endpoint tries to parse the whole file, and realizes it is something weird.
To bypass it, prepending a whole GIF to the code does the trick.
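Building such a file is trivial. The sketch below glues a payload onto a complete GIF in Python. The carrier is a widely circulated 1x1 GIF that we picked for brevity, not the image used in our test, and the payload is again a harmless placeholder. We have not checked whether a carrier this small satisfies every Tika version.

```python
import base64

# A widely circulated 1x1 GIF, base64-encoded (42 bytes once decoded).
# Assumption: any complete GIF can play the carrier role; this one is just short.
TINY_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def make_polyglot(payload: bytes, carrier: bytes = TINY_GIF) -> bytes:
    """Return the full carrier image followed by the payload bytes."""
    return carrier + payload


polyglot = make_polyglot(b"<?php echo 'hello'; ?>")  # placeholder payload

# Write it out so it can be PUT with curl -T, as in the commands above.
with open("polyglot.gif", "wb") as f:
    f.write(polyglot)
```

The difference with the previous trick is that everything up to the GIF trailer byte is a real image, so a parser has something genuine to chew on before it reaches the appended code.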
And then attempting to PUT it against both endpoints:
Nice! Tika’s checks were bypassed :)
Conclusions
Sometimes, the big names in the game give us a false sense of security.
For sure Apache Tika is a great utility to perform metadata analysis and, in some environments, file detection. However, when it comes to security, it shouldn’t be relied on as the only method to avoid unrestricted file uploads.
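As a rough idea of what "not the only method" could look like, the sketch below combines the extension white-list mentioned at the beginning with the detected type. Everything in it is an assumption made for illustration: the table, the function name and the policy itself. It is one possible layer, not a complete defense.

```python
import os

# Hypothetical allow-list: extension -> media type the detector must report.
ALLOWED = {
    ".png": "image/png",
    ".gif": "image/gif",
    ".jpg": "image/jpeg",
}


def is_upload_acceptable(filename: str, detected_type: str) -> bool:
    """Accept only allow-listed extensions whose detected type also agrees."""
    ext = os.path.splitext(filename)[1].lower()
    expected = ALLOWED.get(ext)
    return expected is not None and expected == detected_type


print(is_upload_acceptable("shell.php", "image/gif"))  # False
print(is_upload_acceptable("cat.gif", "image/gif"))    # True
```

A GIF with code appended and a .gif name still passes this check. The idea is only that a file stored under an allow-listed image extension is less likely to be handed to an interpreter, and even that depends on how the server is configured.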
This small and rather simplistic article is the result of a few minutes of research done in order to better advise one of our clients.