Usage

A typical usage pattern looks like the following. This example finds the sentences (identified by Genia) that are contained in a ce:para XML element.

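# Spark imports used below. The query functions (GetAQAnnotations, FilterType,
# ContainedIn, ...) are assumed to be already imported from the AQ package.
import pyspark.sql
from pyspark import StorageLevel

# Get a SparkSession
spark = pyspark.sql.SparkSession.builder.getOrCreate()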

# Set a reasonable value for shuffle partitions (helps with join performance)
spark.conf.set("spark.sql.shuffle.partitions", 4)

# Read in some annotations. This reads the Original Markup annotations (think XML).
omAnnots = GetAQAnnotations(spark.read.parquet("./tests/resources/om/"),
                            props=["*"],
                            decodeProps=["*"],
                            numPartitions=int(spark.conf.get("spark.sql.shuffle.partitions"))) \
                            .persist(StorageLevel.DISK_ONLY)

# Read in some more annotations. This reads the NLP annotations generated by Genia.
geAnnots = GetAQAnnotations(spark.read.parquet("./tests/resources/genia/"),
                            props=["orig", "lemma", "pos"],
                            decodeProps=["orig", "lemma"],
                            numPartitions=int(spark.conf.get("spark.sql.shuffle.partitions"))) \
                            .persist(StorageLevel.DISK_ONLY)

# Find those sentence annotations contained in a ce:para (XML element)
sentenceParaAnnots = ContainedIn(FilterType(geAnnots,'sentence'),FilterType(omAnnots,'ce:para'))
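
To make the containment test concrete, here is a minimal pure-Python sketch of what "contained in" is assumed to mean on character offsets. The Annot tuple, its field names, and the contained_in helper are hypothetical stand-ins for illustration and are not part of the AQ API; the real ContainedIn runs as a Spark join and may differ in its details.

```python
from collections import namedtuple

# Hypothetical, simplified annotation record (not the AQ schema).
Annot = namedtuple("Annot", ["doc_id", "annot_type", "start", "end"])

def contained_in(left, right):
    """Keep annotations in `left` whose span lies inside some annotation in
    `right` from the same document. Assumed rule:
    right.start <= left.start and left.end <= right.end."""
    return [
        a for a in left
        if any(
            a.doc_id == b.doc_id and b.start <= a.start and a.end <= b.end
            for b in right
        )
    ]

paras = [Annot("doc1", "ce:para", 0, 100)]
sentences = [
    Annot("doc1", "sentence", 10, 40),   # inside the paragraph
    Annot("doc1", "sentence", 90, 130),  # runs past the end of the paragraph
    Annot("doc2", "sentence", 10, 40),   # same offsets, different document
]

# Only the first sentence satisfies the assumed containment rule.
print(contained_in(sentences, paras))
```

Under these assumptions, a sentence that merely overlaps a paragraph, or that sits in another document, is dropped.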

Query examples

To get you started, here are some more examples using the various query functions. You can combine them to express more complex queries.

# Find the annotations "simple" that are within 25 characters after an annotation "the".
results = After(FilterProperty(geAnnots,"orig","simple"), FilterProperty(geAnnots,"orig","the"),25)

# Find the annotations "the" that are within 25 characters before an annotation "simple"
results = Before(FilterProperty(geAnnots,"orig","the"), FilterProperty(geAnnots,"orig","simple"),25)
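
The distance argument to After and Before can be pictured with a small pure-Python sketch. It assumes the distance is measured in characters from the end offset of the earlier annotation to the start offset of the later one; that reading, the Annot tuple, and the after_within helper are illustrative assumptions, not the library's implementation.

```python
from collections import namedtuple

# Hypothetical, simplified annotation record (not the AQ schema).
Annot = namedtuple("Annot", ["doc_id", "text", "start", "end"])

def after_within(left, right, dist):
    """Keep annotations in `left` that start at or after the end of some
    annotation in `right` (same document) and no more than `dist`
    characters later."""
    return [
        a for a in left
        if any(
            a.doc_id == b.doc_id and 0 <= a.start - b.end <= dist
            for b in right
        )
    ]

the = [Annot("doc1", "the", 0, 3)]
simple = [
    Annot("doc1", "simple", 4, 10),   # 1 character after "the"
    Annot("doc1", "simple", 60, 66),  # 57 characters after "the"
]

# Only the first "simple" falls within 25 characters after "the".
print(after_within(simple, the, 25))
```

Before is the mirror image under the same assumption: swap the roles so the returned annotation ends before the other one starts.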

# Find the annotations "simple" that are after an annotation "the" and before an annotation "end"
results = Between(FilterProperty(geAnnots,"orig","simple"), FilterProperty(geAnnots,"orig","the"), FilterProperty(geAnnots,"orig","end"))

# find those annotations "sentence" that are contained in "ce:para"
results = ContainedIn(FilterType(geAnnots,"sentence"), FilterType(omAnnots,"ce:para"))

# Find those annotations "sentence" that contain an annotation "simple"
results = Contains(FilterType(geAnnots,"sentence"), FilterProperty(geAnnots,"orig","simple"))

# filter the property field "orig" with the value "simulator". A single value or an array of values can be used for the filter comparison.
results = FilterProperty(geAnnots,"orig","simulator")

# filter the annotation sets with the value "ge"
results = FilterSet(geAnnots,"ge")

# filter the annotation types with the value "sentence"
results = FilterType(geAnnots,"sentence")

# Return the "sentence" annotations that follow a "sentence" containing an annotation "simple", staying within the enclosing "ce:para"
sentences = Contains(FilterType(geAnnots,"sentence"),FilterProperty(geAnnots,"orig","simple"))
results = Following(FilterType(geAnnots,"sentence"), sentences,FilterType(omAnnots,"ce:para"))

# Return the "sentence" annotations that precede a "sentence" containing an annotation "simple", staying within the enclosing "ce:para"
sentences = Contains(FilterType(geAnnots,"sentence"),FilterProperty(geAnnots,"orig","simple"))
results = Preceding(FilterType(geAnnots,"sentence"), sentences,FilterType(omAnnots,"ce:para"))

# find the annotations "basic" that are in the same document as the annotation "simple" and also match values on the specified property "parentId"
results = MatchProperty(FilterProperty(geAnnots,"orig","basic"), FilterProperty(geAnnots,"orig","simple"),"parentId")

# Filter the property field "orig" using the regular expression "^sim*"
results = RegexProperty(geAnnots,"orig","^sim*")
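
It helps to check what the pattern "^sim*" actually matches, since `*` applies only to the preceding `m`. The sketch below uses Python's `re.search` on a few made-up values; whether RegexProperty applies the same find-style matching is an assumption, so treat this as an illustration of the pattern rather than of the library.

```python
import re

# "^sim*" means: starts with "si", followed by zero or more "m" characters.
loose = re.compile(r"^sim*")
# "^sim" requires the literal prefix "sim".
strict = re.compile(r"^sim")

values = ["simple", "simulator", "side", "basic"]  # made-up sample values

loose_matches = [v for v in values if loose.search(v)]
strict_matches = [v for v in values if strict.search(v)]

print(loose_matches)   # includes "side", because zero "m" characters are allowed
print(strict_matches)  # only values that begin with "sim"
```

If the intent is "values beginning with sim", a pattern such as "^sim" or "^sim.*" expresses that more directly.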

# Using Sequence, find the annotations "the" that are within 50 characters before an annotation "simple"
results = Sequence(FilterProperty(geAnnots,"orig","the"), FilterProperty(geAnnots,"orig","simple"), 50)