In this post we share the high-level technical concepts behind the creation and implementation of the droxIT story. Each of the following sections is devoted to one processing step, explained in logical order from data to visualization. Gaining insights from the vast amounts of information readily available over the internet requires several processing steps. A set of unordered documents needs to be organized into a system that allows highlights to be extracted and condensed into something we can understand at a glance.
clinicaltrials.gov has a search-based interface that lets you narrow down a subset of studies that might be of interest, but it offers no means of downloading the whole set. Hence crawling the individual studies is the only way to acquire the data. Fortunately, all studies are accessible via this URL: https://clinicaltrials.gov/ct2/results?pg=1
By extracting the study links from each page and incrementing the pg query parameter, all studies can easily be downloaded with an automated HTTP client - in our case Python's standard client.
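The crawl loop described above could look roughly like the following sketch. It uses only Python's standard library. The URL pattern comes from this post; the `/ct2/show/` path fragment used to recognize study links, the class and function names, and the stop condition are assumptions made for illustration, and the site's layout may differ from what is assumed here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# URL pattern taken from the post; {page} is the incremented pg parameter.
RESULTS_URL = "https://clinicaltrials.gov/ct2/results?pg={page}"


class StudyLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like study links."""

    def __init__(self, marker="/ct2/show/"):
        super().__init__()
        self.marker = marker  # assumed path fragment of a study page
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep each matching link once, in order of appearance.
        if href and self.marker in href and href not in self.links:
            self.links.append(href)


def extract_study_links(html, base="https://clinicaltrials.gov"):
    """Return absolute study URLs found in one result page."""
    parser = StudyLinkParser()
    parser.feed(html)
    return [urljoin(base, href) for href in parser.links]


def crawl(max_pages):
    """Yield study URLs page by page; stop at the first page without links."""
    for page in range(1, max_pages + 1):
        with urlopen(RESULTS_URL.format(page=page)) as response:
            html = response.read().decode("utf-8", errors="replace")
        links = extract_study_links(html)
        if not links:
            break
        yield from links
```

Downloading each study page would then be one more `urlopen` call per yielded URL. A polite crawler would also pause between requests.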
The crawling leaves us with a set of HTML pages - a format that contains a lot of information that is of no use to us. To ease further processing, we want to extract the relevant information and store it in JSON format.
The individual study pages can be viewed in a tabular presentation that embeds the interesting parts in a <table> element.
We used the lxml package to write an HTML parser that extracts the textual data from the table's rows and creates tuples of section header and content. The header is the content of a row's first column; the content is taken from the second column.
These can easily be saved in a JSON dictionary for further processing.
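As a rough illustration of the row-to-tuple extraction, here is a minimal sketch. To stay dependency-free it uses the standard library's `html.parser` rather than lxml, so it is not the parser described above. The class and function names are hypothetical, and nested tables are not handled.

```python
import json
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def study_to_dict(html):
    """Map first-column headers to second-column content."""
    parser = TableRowParser()
    parser.feed(html)
    return {row[0]: row[1] for row in parser.rows if len(row) >= 2 and row[0]}


if __name__ == "__main__":
    page = "<table><tr><th>Condition</th><td>Hepatitis C</td></tr></table>"
    print(json.dumps(study_to_dict(page), indent=2))
```

Rows with fewer than two cells are skipped because they carry no header/content pair.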
From now on we will call the newly created documents our raw data and completely ignore the HTML files.
The first step in analyzing the data is finding potential collocations: fixed combinations of words (in our case two or three of them) that bear significance - terms like 'Hepatitis C' or 'immunologic reaction'.
For this step we first need to tokenize our content, that is, turn it into a list of textual entities. Valid tokens are words and abbreviations (U.S.A), punctuation marks, braces and numbers. Next, Bayesian statistics are used to determine whether a series of words merely tends to appear in this sequence, like many verb-noun combinations (e.g. 'build a house'), or whether it is indeed a collocation.
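A tokenizer for the token classes listed above might be sketched with a single regular expression. The pattern below is an assumption for illustration, not the tokenizer we actually used. It only covers ASCII letters and silently skips anything it does not recognize.

```python
import re

# One alternative per token class; the first matching alternative wins.
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}[A-Za-z]?     # abbreviations such as U.S.A or U.S.A.
    | \d+(?:[.,]\d+)*               # numbers, including 1,000 or 3.5
    | [A-Za-z]+(?:[-'][A-Za-z]+)*   # words, allowing inner hyphens and apostrophes
    | [^\w\s]                       # a single punctuation mark or brace
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into abbreviations, numbers, words and punctuation."""
    return TOKEN_PATTERN.findall(text)
```

For example, `tokenize("Hepatitis C (U.S.A)")` yields `['Hepatitis', 'C', '(', 'U.S.A', ')']`.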
The process delivers many candidates that require filtering. Collocations can theoretically be sequences of arbitrary length, and their words don't all have to be adjacent in the text. For simplicity and performance we only look for collocations of two or three words (bigrams and trigrams) whose terms appear next to each other. First, all collocations whose number of occurrences is below a threshold are discarded, because their apparent statistical significance might be an artifact of their low counts. Any candidates containing a number, a punctuation mark or a brace are discarded as well. Second, all bigrams that contain any stop words are discarded. In our case stop words are the words that are most frequent in natural language and make up approximately 70% of everything we write (e.g. 'the', 'of', 'a'). With trigrams the situation becomes trickier: terms like 'University of Neverland' would be discarded by the aforementioned approach. So in this case we filter out only trigrams that have a stop word at the beginning or end.
Finally, we keep only the candidates with the highest probability of being collocations.
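The filtering rules can be sketched as follows. The stop word list is a tiny illustrative sample, and the threshold value and all function names are assumptions. The input is an already tokenized list of strings.

```python
from collections import Counter

# Tiny illustrative sample; a real stop word list is much longer.
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "to", "for"}


def ngrams(tokens, n):
    """All runs of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def is_word(token):
    """True for words and abbreviations, False for numbers and punctuation."""
    return token.replace(".", "").replace("-", "").replace("'", "").isalpha()


def keep_candidate(gram):
    """Apply the number, punctuation and stop word rules to one n-gram."""
    if not all(is_word(token) for token in gram):
        return False
    lowered = [token.lower() for token in gram]
    if len(gram) == 2:
        # Bigrams: no stop word anywhere.
        return not any(token in STOP_WORDS for token in lowered)
    # Trigrams: a stop word is only allowed in the middle.
    return lowered[0] not in STOP_WORDS and lowered[-1] not in STOP_WORDS


def candidate_counts(tokens, n, min_count=2):
    """Count n-grams and keep those that pass the threshold and the rules."""
    counts = Counter(ngrams(tokens, n))
    return {gram: count for gram, count in counts.items()
            if count >= min_count and keep_candidate(gram)}
```

With these rules 'University of Neverland' survives as a trigram, while the bigrams 'University of' and 'of Neverland' are dropped.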
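This post does not spell out the scoring itself, so the following sketch swaps in pointwise mutual information (PMI) as a simple stand-in for ranking bigram candidates; it is not the Bayesian approach mentioned earlier. The function names are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / total_bigrams
        # Probability of the pair if both words occurred independently.
        p_independent = (unigrams[first] / total_unigrams) * (unigrams[second] / total_unigrams)
        scores[(first, second)] = math.log2(p_pair / p_independent)
    return scores


def top_candidates(scores, k):
    """Return the k highest-scoring pairs."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

PMI tends to overrate pairs of rare words, which is one reason a frequency threshold is applied before ranking.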
The next step is part-of-speech (POS) tagging: taking sentences and annotating their parts with their function - are they verbs, nouns, adjectives, names etc. This step would normally be done before collocation detection because it can enhance the process by making it possible to filter on combinations of parts of speech. Unfortunately, the dataset we're examining does not lend itself to POS tagging because only very few fields hold complete sentences. Most pieces of text we encounter are partial sentences at best.
We still perform this step because it is part of entity recognition. Any names that were tagged are potential entities that can possibly be categorized. The results are still shaky and should be supported by an entity database in the future.
To get our bearings in the data without tainting it with any preconceptions or knowledge about the shape of the individual study files, we simply count the frequencies of the terms contained in a document.
To do that we take only the words and abbreviations found after tokenizing our textual data and count their frequency across the whole document as well as within each section they appear in.
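Counting at both levels is straightforward with `collections.Counter`. In this sketch a study is assumed to be a dictionary mapping section headers to lists of word tokens. Lowercasing the terms is also an assumption, made so that 'Hepatitis' and 'hepatitis' are counted together.

```python
from collections import Counter


def term_frequencies(study):
    """Return (whole-document counts, per-section counts) for one study.

    study maps a section header to the list of word tokens in that section.
    """
    per_section = {
        section: Counter(token.lower() for token in tokens)
        for section, tokens in study.items()
    }
    whole_document = Counter()
    for counts in per_section.values():
        whole_document.update(counts)  # adds the section counts to the total
    return whole_document, per_section
```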
We chose Elasticsearch as our data storage and analysis solution. To facilitate efficient queries we opted for a parent-child relationship within our index structure. We indexed our raw documents as parents and the individual terms we extracted as children. Specifically, we indexed the terms and collocations multiple times: once with their count across the whole document, and once per section they appear in, with the number of times they appear within that section.
The rationale behind this structure is to allow queries that need the whole document, e.g. finding all documents containing a set of terms, and then to aggregate word frequencies over the associated child documents. Another benefit is the ability to associate the results of future analyses with the same parent documents and hence incrementally enhance our query capabilities.
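To make the structure more concrete, the sketch below builds the child documents for one study and a query body that finds parents by their children. It only produces plain dictionaries and does not talk to a cluster. All field names (`parent`, `term`, `scope`, `section`, `count`) and the child type name are hypothetical, and how the parent link is actually declared depends on the Elasticsearch version in use. The `has_child` query is a standard part of the Elasticsearch query DSL.

```python
def child_documents(parent_id, whole_document, per_section):
    """Build one child per term for the document and one per term and section."""
    docs = [
        {"parent": parent_id, "term": term, "scope": "document", "count": count}
        for term, count in whole_document.items()
    ]
    for section, counts in per_section.items():
        docs.extend(
            {"parent": parent_id, "term": term, "scope": "section",
             "section": section, "count": count}
            for term, count in counts.items()
        )
    return docs


def documents_containing(terms, child_type="term"):
    """Query body matching parents that have a child for every given term."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"has_child": {"type": child_type,
                                   "query": {"term": {"term": term}}}}
                    for term in terms
                ]
            }
        }
    }
```

Storing the per-document and per-section counts as separate children keeps each child small and lets later analysis steps attach their own children to the same parent.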
To deliver our results to any front end we wrote an API server in Node.js. It offers a RESTful interface to the back-end services. Since we rely on a highly granular microservice architecture to deliver our results, the API's endpoints are freely configurable at startup. This offers a great deal of flexibility when it comes to deployment schemes, especially in the cloud.
The modular approach allows a high degree of independence between the individual modules. This makes it easy to introduce new services and to change almost anything within a module without impacting the rest of the architecture.
The last piece of our stack is our front end. So far this entails visualizations on this website.
For starters we opted for D3.js for our data visualization tasks. It's a very powerful and flexible framework that delivers stunning visualizations.
Creating a full stack of data processing goodness takes a lot of expertise in different areas to deliver the desired results. Aiming for the full stack early on, instead of perfecting a single stage, gives you the flexibility to react to any obstacles or inspirations you may encounter during development and lets you see any change you make reflected in the final result. It's an ongoing evolution instead of a grind towards a distant goal - the very embodiment of agile development.