Writing Genie queries

Genie is a query language for the Database of Experimental Results. It resembles a well-known database language called SQL which is popular for relational databases; however Genie has been enhanced to work with our nested data model in a manner similar to the languages described in papers by Roth et al.(1) and Pistor et al.(2).

Genie is still under construction; some common database operations such as projection and join have not been implemented yet, but it is already useful.

An example
Ground rules
Criteria
More examples
How to read the output
Conclusion
References

An example

Here is a simple example to illustrate the general idea:

Ex. 1	QUERY reference_info
	WHERE year >= 1990

This query is asking for the literature citations of all the papers in the database which were published during or after 1990. The first line identifies which table is to be searched, and the second is specifying the criteria that each item must meet in order to be selected. In this case, only records which have a year value greater than or equal to 1990 will appear in the result.

As you can see, it is essential to have a copy of the current schema handy in order to know what fields are available for each table. Please read Understanding the schema if you have not already done so; it explains what the schema is for and how to interpret it. The schema is expected to change from time to time in response to the needs of the user community, so be sure to use the current version.

Ground rules

The QUERY clause identifies which table is to be searched. Currently you can only query one table at a time, but a future version of Genie will be able to deal with several tables simultaneously.
The WHERE clause specifies the criteria that each record must meet in order to appear in your result. It can be omitted if you don't want to impose any restrictions. The criteria can be simple or quite complicated; see the next section for a more detailed discussion.
Currently there is no way to print just a few fields from the selected records; you automatically get the entire record in your output. A future version of Genie will be able to perform this type of operation using a REPORT clause.
Genie is now case-insensitive in most situations. This means that it does not matter whether you type your query using upper or lower case letters. Extra spaces, blank lines, and indentation generally don't matter either (except in the middle of a word, of course).
All literal constants except numbers must be enclosed in single quotes (apostrophes), 'like this'. A literal constant is any explicit mention of a specific value for a field. The quotes are needed so Genie can distinguish literal constants from field names.
You can use a # sign to indicate that the rest of a line is a comment, for humans only, that Genie should ignore.

Criteria

Simple fields

Recall from the Understanding the schema document that a field can have either a simple or composite data type. Simple types contain just a single value, which can be String, Integer, or Real. Criteria for simple fields look like this:

	fieldname  op  value

where op can be any of the comparison operators =, !=, <>, >, >=, <, <= (both != and <> mean not equal to, so you can use whichever you prefer.) Value can be a literal constant (remember to put single quotes around it, if it's a String) or even another field name. There are also two special values, INFINITY and NULL, which specifies a missing or blank value.

Integer and Real values will be compared numerically, and in addition to the simple comparisons listed above, you can also build arithmetic expressions with them using +, -, *, and /, as well as () for grouping.

String values are compared alphabetically (like in a dictionary), except that all blanks and underscore characters are ignored, and there is no distinction between upper and lower case. But sometimes you just want to search for a particular word, or part of a word, without trying to match the whole value exactly; this is especially true for long text descriptions. Genie deals with this situation by providing a special function called CONTAINS, which determines whether or not the first string you mention contains the second as a substring. To use it, write something like this:

	CONTAINS (string1, string2)

where string1 and string2 can be either field names or literal constants (remember quotes!). Again there is no distinction between upper and lower case, but this time all punctuation is ignored in addition to blanks and underscores. That is, only letters and numerals are used in the search.

Records

Criteria for Records usually need to refer to the subfields within the Record. To do this, write the name of the Record followed by a dot (period) and then the name of the subfield you want, e.g.

	region.origin

There may also be occasions when you are dealing with the Record as a whole (e.g. see the section below on Overlaps and Covers), and you need to build a literal constant for the whole Record. This is accomplished using square brackets [] to represent the Record grouping with the field values listed inside, in the same order as they appear in the schema, separated by commas. For example, a literal DNA region could be written as

	['humhbb_orig', 8658, 8677]

Variants

Variants are treated in much the same way as Records, except that you need to establish which "flavor" you are talking about before you start referring to subfields that belong to a particular flavor. This is done using the IS operator:

	DNA_fragment IS beta_g_region

Quotes are not needed around the flavor because it is not a literal constant.

Sets and Lists

Dealing with Sets and Lists is the trickiest part of writing Genie queries. Since we are using a nested data model that produces a nested schema, we have to expect nested queries, too. The key thing to remember is that a subquery can be used anywhere a field name or table name could be used. That is, any query can be enclosed in parentheses () and nested inside another query, because when the subquery is evaluated the result is a table with rows and columns, which is no different from a Set. It could just as easily have been a value for some field with a Set datatype.

Currently there are two operators that work on Sets and Lists: EXISTS, which determines whether or not the following Set or List has any rows in it, and COUNT, which finds out exactly how many rows there are. Whenever you see in the schema that a field you want to specify lies inside a Set or List, you need to open a new nesting level by writing something like

	EXISTS (
	    QUERY setname
	    WHERE ...
	)

Between the parentheses, the subquery is in its own little world, where setname is taking the role usually played by a top-level table. You might think that to refer to the subfields inside setname you would need to write setname.subfield the way you do for records. However, this is not correct because the QUERY line takes care of establishing the setname for the whole subquery, the same way it does for top-level tables. Thus it's enough to write just subfield.

Boolean operators

A WHERE clause at any nesting level can contain multiple criteria linked together by the standard Boolean operators AND, OR, and NOT, as well as () for grouping.

Overlaps and Covers

In addition to the general kinds of criteria discussed above, Genie also provides some specialized functions for dealing with DNA fragments. Currently there are just two: OVERLAPS, which determines whether or not two beta-globin regions overlap at all in the multiple alignment, and COVERS, which determines whether the first region completely contains the second one. These are useful for specifying the general vicinity of a fragment in the multiple alignment, without mandating its exact endpoints. To use them, write something like this:

	OVERLAPS (region1, region2)

where region1 and region2 can be either field names or literal constants.

More examples

This query retrieves all binding assay experiments contributed by Ahmed ElSherbini that included methylation interference assays with K562 extract using a probe from HUMHBB, where a protection effect was observed.

Ex. 2	# This is a comment.
	QUERY binding_assay
	WHERE CONTAINS (contributor, 'sherbini')
	  AND assay IS methylation_interference
	  AND assay.probe IS beta_g_region
	  AND assay.probe.region.origin = 'humhbb_orig'
	  AND assay.source IS nuclear_extract
	  AND assay.source.cell = 'K562'
	  AND EXISTS (
	        QUERY assay.regional_effect
	        WHERE effect = 'protection'
	      )

Here is a query to retrieve all conserved sequences from an area corresponding to the human HS2 region.

Ex. 3	QUERY conserved
	WHERE OVERLAPS (region, ['humhbb_orig', 8486, 8860])

This one finds all the conserved sequences in the database that are at least 25 bases long.

Ex. 4	QUERY conserved
	WHERE region.stop - region.start + 1 >= 25

This example retrieves all binding assay experiments where the probe (for gelshifts) or one of the regional effects (for non-gelshifts) overlapped the AP1 region (corresponding to 8658-8677 in the human sequence).

Ex. 5	QUERY binding_assay
	WHERE ( assay IS gelshift
	        AND assay.probe IS beta_g_region
	        AND OVERLAPS (assay.probe.region, ['humhbb_orig', 8658, 8677])
	      )
	      OR
	      ( assay IS NOT gelshift
	        AND EXISTS (
	              QUERY assay.regional_effect
	              WHERE OVERLAPS (region, ['humhbb_orig', 8658, 8677])
	            )
	      )

This query retrieves all DNA transfer experiments whose construct included a wild-type segment corresponding to part of the HS4 region in humans.

Ex. 6	QUERY DNA_transfer_experiment
	WHERE EXISTS (
	        QUERY construct
	        WHERE EXISTS (
	                QUERY construct_segment
	                WHERE segment.DNA_fragment IS beta_g_region
	                  AND NOT EXISTS (
	                            QUERY segment.DNA_fragment.mutation
	                          )
	                  AND OVERLAPS (segment.DNA_fragment.region,
	                                ['humhbb_orig', 951, 1234])
	              )
	      )

This one retrieves all DNA transfer experiments whose construct contained an analog of the entire human HS2 core, but used a non-globin reporter gene.

Ex. 7	QUERY DNA_transfer_experiment
	WHERE EXISTS (
	        QUERY construct
	        WHERE EXISTS (
	                QUERY construct_segment
	                WHERE segment.DNA_fragment IS beta_g_region
	                  AND COVERS (segment.DNA_fragment.region,
	                              ['humhbb_orig', 8486, 8860])
	              )
	          AND EXISTS (
	                QUERY construct_segment
	                WHERE segment.DNA_fragment IS NOT beta_g_region
	                  AND EXISTS (
	                        QUERY segment.feature
	                        WHERE feature_element IS reporter
	                      )
	              )
	      )

The following query retrieves all DNA transfer experiments whose construct contained a segment with a beta-globin region that overlapped the AP1 region (corresponding to 8658-8677 in the human sequence) and was completely contained in the region corresponding to 7750-9230 in the human sequence, but no other segments overlapped the LCR (corresponding to anything < 15,000 in the human sequence). In addition, one of the segments in this construct must have featured the epsilon-globin gene as a promoter.

Ex. 8	QUERY DNA_transfer_experiment
	WHERE EXISTS (
	        QUERY construct
	        WHERE EXISTS (
	                QUERY construct_segment
	                WHERE segment.DNA_fragment IS beta_g_region
	                  AND OVERLAPS (segment.DNA_fragment.region,
	                                ['humhbb_orig', 8658, 8677])
	                  AND COVERS (['humhbb_orig', 7750, 9230],
	                              segment.DNA_fragment.region)
	              )
	          AND COUNT (
	                QUERY construct_segment
	                WHERE segment.DNA_fragment IS beta_g_region
	                  AND OVERLAPS (segment.DNA_fragment.region,
	                                ['humhbb_orig', -INFINITY, 15000])
	              ) < 2
	          AND EXISTS (
	                QUERY construct_segment
	                WHERE EXISTS (
	                        QUERY segment.feature
	                        WHERE feature_element IS promoter
	                          AND feature_element.gene = 'epsilon_globin'
	                      )
	              )
	      )

Here is one that retrieves all DNA transfer experiments using a wild-type construct (i.e., no mutations, and all segments were from the beta-like globin cluster except possibly reporters), with some segment overlapping the HUMHBB region from -2686 to 10410, where experiments were performed measuring expression levels in transgenic mice.

Ex. 9	QUERY DNA_transfer_experiment
	WHERE EXISTS (
	        QUERY construct
	        WHERE NOT EXISTS (     #no beta_g_regions with mutations
	                    QUERY construct_segment
	                    WHERE segment.DNA_fragment IS beta_g_region
	                      AND EXISTS (segment.DNA_fragment.mutation)
	                  )
	          AND NOT EXISTS (     #no non_beta_g_regions or oligos, except reporters
	                    QUERY construct_segment
	                    WHERE segment.DNA_fragment IS NOT beta_g_region
	                      AND NOT EXISTS (
	                                QUERY segment.feature
	                                WHERE feature_element IS reporter
	                              )
	                  )
	          AND EXISTS (         #some segment overlaps the given region
	                QUERY construct_segment
	                WHERE segment.DNA_fragment IS beta_g_region
	                  AND OVERLAPS (segment.DNA_fragment.region,
	                                ['humhbb_orig', -2686, 10410])
	              )
	      )
	  AND EXISTS (
	        QUERY experiment
	        WHERE type_of_assay IS transgenic_mouse
	          AND EXISTS (
	                QUERY type_of_assay.result
	                WHERE measurement IS expr_level
	              )
	      )

How to read the output

Your query output will come in several sections. The first line will report how many rows were selected by your query; this does not count the nested subtables, just the top level rows. E.g., how many DNA transfer experiments or binding assay experiments matched your criteria?

The next part will display the selected rows using the same language that Genie uses for data input. This language uses statements of the form

	fieldname = value

for simple data types, and nested groups of the form

	BEGIN fieldname
	   |
	   |
	   |
	END fieldname

for Records and for each row of a Set or List. Variants also include the tag to identify which "flavor" was used:

	BEGIN fieldname AS tag
	   |
	   |
	   |
	END fieldname

The data language is relatively simple to understand compared to the schema or query languages, but it can be very tedious to read, especially if there are a lot of records. To make it easier to see what you've got, the Globin Gene Server provides a facility for drawing your output in a graphical format. Currently it can handle either DNA transfer experiments or binding assays .

Finally, the last part is a schema for the selected records, which (in the current version of Genie) is just a restatement of the table you were selecting from. This can be used by other programs that read and further process your output, like the drawing programs for example.

Conclusion

We hope that you will find this Database of Experimental Results to be a useful and powerful tool. Of course, it is only as good as the data that is put into it, and so we encourage you to share your own experimental results by sending them to us via electronic mail for inclusion in the database.

Questions, suggestions, and comments are welcome; please contact Cathy Riemer.

References:

(1) Mark A. Roth, Henry F. Korth and Don S. Batory, SQL/NF: A Query Language for non-1NF Relational Databases, Inform. Systems 12.1: 99-114 (1987)

(2) P. Pistor and R. Traunmueller, A Database Language for Sets, Lists and Tables, Inform. Systems 11.4: 323-336 (1986)

6/96

Writing Genie queries

Table of Contents