Dropbox's search capabilities have been given a significant boost for the second time in as many months. The company says that it can now search for text inside PDFs and even image files like JPG and PNG.
Dropbox search became much more powerful last month, when the company deployed a new engine based on machine learning. The company says that it is now bringing optical character recognition (OCR) capabilities to search for the first time.
Image formats (like JPEG, PNG, or GIF) are generally not indexable because they have no text content, while text-based document formats (like TXT, DOCX, or HTML) are generally indexable. PDF files fall in-between because they can contain a mixture of text and image content. Automatic image text recognition is able to intelligently distinguish between all of these documents to categorize data contained within.
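The three-way split described above can be sketched as a lookup by file extension. This is only an illustration: the function name, category labels, and extension sets are hypothetical, and nothing here is Dropbox's actual classifier, which the company says distinguishes documents automatically rather than by extension alone.

```python
from pathlib import Path

# Hypothetical extension sets, taken from the examples in the quote above.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}
TEXT_EXTENSIONS = {".txt", ".docx", ".html"}


def classify_for_indexing(filename):
    """Return a rough indexing category for a file, based on its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in TEXT_EXTENSIONS:
        return "indexable"   # already has text content
    if suffix in IMAGE_EXTENSIONS:
        return "needs_ocr"   # no text content until OCR extracts some
    if suffix == ".pdf":
        return "mixed"       # may hold text, images, or both
    return "unknown"


for name in ["notes.txt", "receipt.JPG", "scan.pdf", "archive.zip"]:
    print(name, "->", classify_for_indexing(name))
```

A real system would have to inspect the contents of a PDF to decide whether its pages are text or scanned images; the extension only says that either is possible.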
So now, when a user searches for English text that appears in one of these files, the file will show up in the search results.
The Verge notes that the feature is, however, limited to the more expensive subscription tiers.
The new feature works with English text and is available now to Dropbox Business Advanced and Enterprise users, and should be available to Dropbox Professional subscribers in the coming months.
It uses the same technology first deployed in the company’s mobile app last year. If you used the app to photograph a document, it would run OCR at the same time, pulling out the text. But that only worked on a small subset of your documents.
By implementing OCR capabilities directly into the search engine, Dropbox can now search text within all of your PDF and image files, no matter how they were scanned or photographed.
The company says this new Dropbox search feature will make a huge difference to users.
The potential benefit of automatically recognizing text in images (including PDFs containing images) is tremendous. People have stored more than 20 billion image and PDF files in Dropbox. Of those files, 10-20% are photos of documents—like receipts and whiteboard images—as opposed to documents themselves. These are now candidates for automatic image text recognition. Similarly, 25% of these PDFs are scans of documents that are also candidates for automatic text recognition.
The company says that because OCR is so computationally intensive, it had to impose one important limitation on the new search feature.
Some PDF documents have a lot of pages, and processing those files is thus more costly. Fortunately, for long documents, we can take advantage of the fact that even indexing a few pages is likely to make the document much more accessible from searches. So we looked at the distribution of page counts across a sampling of PDFs to figure out how many pages we would index at most per file. It turns out that half of the PDFs only have 1 page, and roughly 90% have 10 pages or less. So we went with a cap of 10 pages—the first 10 in every document. This means that we index almost 90% of documents completely, and we index enough pages of the remaining documents to make them searchable.
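The reasoning behind the 10-page cap can be illustrated with a small calculation. The sample below is made up to roughly match the shape Dropbox describes (half the files have one page, about 90% have 10 or fewer); the function names and data are assumptions for illustration, not Dropbox's code or figures.

```python
def coverage_at_cap(page_counts, cap):
    """Fraction of documents that would be indexed completely under the cap."""
    return sum(1 for pages in page_counts if pages <= cap) / len(page_counts)


def pages_to_process(page_counts, cap):
    """Total pages sent to OCR when only the first `cap` pages are indexed."""
    return sum(min(pages, cap) for pages in page_counts)


def smallest_cap_for_coverage(page_counts, target):
    """Smallest cap that fully indexes at least `target` of the documents."""
    for cap in sorted(set(page_counts)):
        if coverage_at_cap(page_counts, cap) >= target:
            return cap
    return max(page_counts)


# Hypothetical sample of PDF page counts, not real Dropbox data.
sample_page_counts = [1, 1, 1, 1, 1, 2, 3, 5, 10, 40]

cap = smallest_cap_for_coverage(sample_page_counts, 0.9)
print("cap:", cap)
print("fully indexed:", coverage_at_cap(sample_page_counts, cap))
print("pages processed:", pages_to_process(sample_page_counts, cap),
      "of", sum(sample_page_counts))
```

In this toy sample, a cap of 10 fully indexes nine of the ten files while processing 35 of the 65 total pages, which is the trade-off the quote describes: a few long documents account for much of the OCR cost, and capping them saves most of it.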
My colleague Bradley Chambers recently explained the three reasons he left Dropbox for iCloud Drive and never looked back. Personally, however, Dropbox remains my primary cloud storage, mostly because I find it syncs far faster than any of the many alternatives I’ve tried.