
Spanish FrameNet

Overview

  • Spanish FrameNet (SFN) is a research project developed at the Autonomous University of Barcelona (Spain) and the International Computer Science Institute (Berkeley, CA), in cooperation with the FrameNet Project.
    • The Spanish FrameNet Project is creating an online lexical resource for Spanish, based on frame semantics and supported by corpus evidence.
    • The "starter lexicon" is available to the public and contains more than 1,000 lexical items (verbs, predicative nouns, adjectives, adverbs, prepositions, and entities) representative of a wide range of semantic domains.
    • The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through:
      • human-approved and automatically annotated example sentences and
      • automatic capture and organization of the annotation results.
    • The Spanish FrameNet database will be stored in a platform-independent format that can be displayed and queried via the web and other interfaces.
    • Previous funding was provided by the Ministry of Economy and Competitiveness of Spain (Grants Nr. FFI2017-84460-P, FFI2014-56444-C2-1-P and FFI2011-23231), the Department of Science and Innovation (FFI2008-0875), and the Department of Science and Technology (TIC2002-01338). Additional funding has also been provided by the Fundación Comillas, the Department of Education (TSI2005-01200), and the Autonomous University of Barcelona (PNL2004-49 and PRP2006-04).

      Basic Concepts

      • A semantic frame is a script-like structure of inferences, linked by linguistic convention to the meanings of linguistic units (in our case, lexical units).

      • Each frame identifies a set of frame elements (FEs): the participants and props in the frame.

      • A frame semantic description of a lexical unit identifies the frames which underlie a given meaning and specifies the ways in which FEs, and constellations of FEs, are realized in constructions headed by the word.

      • Valence descriptions provide, for each word sense, information about the sets of combinations of:
        • FEs,
        • grammatical functions and
        • phrase types attested in the corpus.
      • The annotated sentences are the building blocks of the database. These are marked up in XML and form the basis of the lexical entries. This format supports searching by lemma, frame, frame element, and combinations of these.
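
The grouping of annotated sentences into valence patterns can be sketched as follows. This is only an illustration: the record layout, the FE names, and labels such as `Ext`, `Obj`, `Dep`, `NP`, and `PP` are assumptions made for the example, not the SFN XML schema, and the sentences are invented.

```python
from collections import defaultdict
from typing import NamedTuple


class Realization(NamedTuple):
    fe: str  # frame element name (hypothetical label)
    gf: str  # grammatical function (hypothetical label)
    pt: str  # phrase type (hypothetical label)


def valence_patterns(annotations):
    """Group sentences by (lemma, frame) and by their set of FE/GF/PT triples."""
    patterns = defaultdict(lambda: defaultdict(list))
    for lemma, frame, sentence, realizations in annotations:
        # Sentences with the same set of triples share one valence pattern.
        patterns[(lemma, frame)][frozenset(realizations)].append(sentence)
    return patterns


# Toy annotations invented for this example; not SFN data.
AGENT = Realization("Agent", "Ext", "NP")
PATIENT = Realization("Patient", "Obj", "NP")
GOAL = Realization("Goal", "Dep", "PP")

ANNOTATIONS = [
    ("empujar", "CAUSE_TO_MOVE", "Juan empujó la puerta", [AGENT, PATIENT]),
    ("empujar", "CAUSE_TO_MOVE", "María empujó el carro", [AGENT, PATIENT]),
    ("empujar", "CAUSE_TO_MOVE", "Juan empujó la mesa hacia la pared",
     [AGENT, PATIENT, GOAL]),
]

patterns = valence_patterns(ANNOTATIONS)
for pattern, sentences in patterns[("empujar", "CAUSE_TO_MOVE")].items():
    print(sorted(r.fe for r in pattern), len(sentences))
```

Keying each pattern on a set of triples means that the order in which constituents appear in a sentence does not split a pattern in two.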

      Dictionary & Thesaurus

      • The Spanish FrameNet database will act both as a dictionary and a thesaurus.
        • The dictionary features include:
          • definitions,
          • tables showing how frame elements are syntactically expressed in sentences containing each word,
          • annotated examples from the corpus:
            • human-approved and
            • automatically annotated, and
          • an alphabetical index.
        • Like a thesaurus, words are linked to the semantic frames in which they participate, and frames, in turn, are linked to wordlists and to related frames.
      • The Spanish FrameNet project is based on the evidence offered by a 350 million-word corpus which includes both New World and European Spanish.
      • The semantic and syntactic annotation is carried out with the system developed by the Berkeley FrameNet Project, whose input consists of files that have been extracted from the corpus, POS-tagged, lemmatized, and chunked.
      • Each Spanish FrameNet entry will provide links to other lexical resources, including Spanish WordNet synsets.
      • The project's deliverables will consist of the Spanish FrameNet database itself:
        • lexical entries for individual word senses,
        • frame descriptions, and
        • annotated subcorpora.

      Frame Elements

      External FEs
      • External FEs are realized outside of the maximal phrase headed by the target lexeme.
      • Externals satisfy an FE requirement of a target word in the following syntactic contexts:
        • subjects of finite target verbs, A Juan le encanta [la paella]External (John loves paella) or target nouns and adjectives, by virtue of their grammatical relation to a support verb, such as the subjects of [El presidente]External les dio un ultimátum a los terroristas (The president gave the terrorists an ultimatum), [Venezuela]External es rica en tradiciones (lit. Venezuela is rich in traditions), etc.;
        • subjects (or objects) of controlling structures, [Los políticos]External decidieron bajar los impuestos (Politicians decided to lower taxes), [Le]External obligaron a firmar el contrato (They forced him to sign the contract);
        • "extracted" constituents, etc.


      Implicit FEs
      • Some FEs are conceptually "understood", but are not expressed in relevant positions in the sentence. In order for example sentences representing the same valence to be grouped together automatically, we introduced tags to bear the annotation for the missing FEs. We distinguish three types of implicit FEs (existential, anaphoric, and constructionally licensed):
        • Existential implicit FEs include the missing objects of ¿Ya has comido? (Have you eaten yet?) and Bebes demasiado (You drink too much).
        • Anaphoric implicit FEs include the missing objects of Ellos decidirán (They'll decide) and ¿Comprendes? (Do you understand?).
        • Constructionally licensed omissions, e.g., subject null instantiation, the implicit subjects of imperatives, "patients" of passive sentences, etc.


      Conflated FEs
      • Many lexemes that can express several FEs as separate constituents can also express them as single constituents in which information about two FEs is conflated. In Le han nombrado director general (They have named him general director) the person and the office show up as separate constituents; but in Han nombrado al director general (They named the general director) we find a single constituent identifying both.


      Incorporated FEs
      • We sometimes find FEs which are typically expressed as separate constituents in the valence patterns of one lexeme, but are typically incorporated in other lexemes in the same frame. For example, in the Shoot projectiles frame, the Firearm is a separate constituent in Les dispararon con una ametralladora (They shot at them with a machine gun), but incorporated in Les ametrallaron (They machinegunned them).

      Complex Frames

      • The frames underlying some lexical units are best understood as comprising more than one frame, through simple frame inheritance, multiple frame inheritance with specifications of FE binding, and frame composition:
        • Frame Inheritance
          The combinatorial properties of, say, empujar (push) are not only those determined by the unique meaning of the verb and the immediate frame in which it participates (CAUSE_TO_MOVE), but also by the fact that it is an action verb (involving agent, patient, optional instrument, etc.), and that it is an event verb (allowing specification of temporal and locational parameters). Thus properties of more general frames are inherited by more specific ones.
        • Multiple Frame Inheritance
          Compare despreciar (scorn), admirar (admire), criticar (criticize), and adular (flatter). All of these verbs belong to a class of JUDGMENT verbs, involving one person passing judgment on the behavior of another. Of these, criticar and adular are also speaking verbs, and in these cases the Judge in the JUDGMENT frame is also the Speaker in the SPEAKING frame. In addition, while criticar allows non-identity of the Evaluee of the JUDGMENT frame and the Addressee of the SPEAKING frame, as in Él me criticó en la prensa (He criticized me in the newspapers), adular requires a single participant for both FEs.
          Consider the sense of discutir (argue), which inherits from both the frames DISPUTE and CONVERSATION. While discutir, like all CONVERSATION words, involves "reciprocal talk", some of its properties are inherited from the grammar of disputing or fighting. In this, it differs from other conversation verbs, like charlar (chat), comentar (comment), etc. Both DISPUTE and CONVERSATION frames, in turn, inherit the RECIPROCITY frame, which allows variable syntactic realization of the participants: either joint, as in Ellos discutieron (They argued), or disjoint, as in Él discutió con ella (He argued with her).  CONVERSATION is also heir to the SPEAKING frame, while DISPUTE inherits the ATTACKING frame.
        • Frame Composition
          In numerous cases a frame is complex because it contains another frame as one of its parts.  Compare Sacudió el mantel (He shook the tablecloth) and Sacudió las migas del mantel (He shook the crumbs out of the tablecloth).  The shaking (i.e. direct manipulation) is applied to the direct object in the first sentence, but is only a component of the full scene associated with the second sentence. We believe that Frame Semantics, combined with a feature-value representation of event structure, will provide new insights into much current work on this type of regular polysemy, which deals with patterns of valence variation.
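
The multiple-inheritance links described above for discutir can be pictured as a small directed graph. The sketch below is only an illustration of that description: the dictionary layout and the `ancestors` helper are assumptions made for the example, not part of the SFN database.

```python
# Inheritance links as stated in the text for discutir (argue).
# Treating the lexical unit itself as a node is a simplification.
PARENTS = {
    "discutir": ["DISPUTE", "CONVERSATION"],
    "DISPUTE": ["RECIPROCITY", "ATTACKING"],
    "CONVERSATION": ["RECIPROCITY", "SPEAKING"],
}


def ancestors(node, parents=PARENTS):
    """Return every frame reachable through inheritance links, breadth-first."""
    seen = []
    queue = list(parents.get(node, []))
    while queue:
        frame = queue.pop(0)
        if frame not in seen:  # RECIPROCITY is reachable by two paths
            seen.append(frame)
            queue.extend(parents.get(frame, []))
    return seen


print(ancestors("discutir"))
print(ancestors("CONVERSATION"))
```

Because RECIPROCITY is inherited through both DISPUTE and CONVERSATION, the traversal has to remember the frames it has already visited.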

      Corpus

        • These texts of various origins and genres make a grand total of 937 million words.
        • Spanish FrameNet wishes to acknowledge the support of Anthropos Editorial (Barcelona, Spain), Diario ABC (Madrid, Spain), and El Mundo (Madrid, Spain), which made it possible for this research project to use excerpts of their texts and publications as the evidential basis for the inquiry into the behavior of Spanish words.
        • The SFN Corpus includes the following subcorpora:
            • The Spanish corpus of the Sketch Engine has also been used to complement the subcorpora of certain lexical units.

             

            Taggers and chunkers

            Tagging

            The SFN Corpus has been tagged and lemmatized by using an electronic dictionary of Spanish with 634,503 wordforms, expanded from a dictionary with 113,825 lemmas:
            • 86,104 single-word lexical units, like unir (unite), inmoralidad (immorality), allí (there), etc.;
            • 25,721 multi-word lexical units (MWLU), like muerte cerebral (brain death), carga de profundidad (depth charge), ácido graso no saturado (unsaturated fatty acid), etc.

            Each form in the expanded dictionary is associated with a set of tags that supply information about:
            • the lemma,
            • the POS, and
            • the inflectional properties of:
              • verbs (mood, tense, person, and number), and
              • nouns, adjectives, and past participles (gender and/or number).

            The output of the tagger is a set of finite state automata (FSAs) -- one per sentence -- whose transitions are tagged with the wordform information supplied by the electronic dictionary (see Fig. 1). The FSAs are displayed using a graphical interface that also allows us to create or redraw FSAs and finite state transducers (FSTs).

            Fig. 1. The sentence Envió la propuesta al ministro de defensa ('send.3sg the proposal to the defense minister') tagged and converted into an FSA.
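
As a rough picture of the tagger output, a sentence can be modeled as a linear lattice with one set of alternative dictionary analyses per wordform. The sketch below assumes a toy dictionary and hypothetical tag names (`DET`, `CLI`, `N`, `V`, `V:PP`). Only the analyses of la and propuesta follow the description in this section; it does not reproduce the SFN dictionary format or its FSA interface.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    lemma: str
    pos: str       # hypothetical tag names
    features: str  # inflectional properties, e.g. "f.sg"


# Toy dictionary for illustration only.
DICTIONARY = {
    "envió": [Analysis("enviar", "V", "ind.past.3.sg")],
    "la": [
        Analysis("el", "DET", "f.sg"),
        Analysis("lo", "CLI", "f.sg"),
        Analysis("la", "N", "m.sg"),
    ],
    "propuesta": [
        Analysis("proponer", "V:PP", "f.sg"),
        Analysis("propuesta", "N", "f.sg"),
    ],
}


def tag_sentence(sentence):
    """Return one list of alternative analyses per token."""
    lattice = []
    for token in sentence.lower().split():
        # Unknown forms get a placeholder analysis instead of failing.
        lattice.append(DICTIONARY.get(token, [Analysis(token, "UNKNOWN", "")]))
    return lattice


def count_readings(lattice):
    """Number of distinct paths through the lattice."""
    total = 1
    for alternatives in lattice:
        total *= len(alternatives)
    return total


lattice = tag_sentence("Envió la propuesta")
print(count_readings(lattice))
```

The number of paths grows multiplicatively with each ambiguous form, which is why local disambiguation rules are applied before further processing.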

             

            Resolving ambiguities

            Wordform-level ambiguities are represented in deterministic finite state automata (DFSAs) as different possible transitions between two consecutive states. Graphically, ambiguities are shown as different transitions inside a state, which is represented as a box (see Fig. 1). For instance, the second state of the DFSA of Fig. 1 shows the ambiguities of the form la, which is tagged as:

            • a feminine singular determiner associated with the lemma el (the);
            • a feminine singular clitic pronoun associated with the lemma lo (it, him); or
            • a masculine singular noun associated with the lemma la (the musical note A).

            Similarly, the third state of the DFSA of Fig. 1 shows the ambiguities of propuesta (proposal), which is tagged as:

            • a past participle of the verb proponer (propose); and as
            • a singular form of the feminine noun propuesta.

            Ambiguities like the ones just mentioned can be resolved by intersecting the tagged DFSAs (see Fig. 1) with FSTs that specify local disambiguating conditions. For instance, past participles are not usually preceded by masculine or feminine clitic pronouns unless they are followed by nouns. Thus the following FST would resolve the ambiguity of the NP la propuesta (the proposal) in the sentence of Fig. 1.

            ( <DET>  + <CLI:m> + <CLI:f> ) (<V:PP> + <N> ) !<N>  --->  <DET> <N>

            This FST can be further refined by specifying local conditions of gender and number agreement inside the NPs.
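
The effect of the rule above can be approximated in a few lines. The sketch works position by position on sets of tags rather than on a real transducer, and it reads `+` as alternation and `!<N>` as "the next token has no noun reading". Both the data layout and that reading of the notation are assumptions.

```python
def apply_np_rule(lattice):
    """Apply (DET + CLI:m + CLI:f) (V:PP + N) !N ---> DET N to a tag lattice.

    lattice: list of sets of tags, one set per token.
    Returns a new list; readings not named in the rule are left untouched.
    """
    result = [set(tags) for tags in lattice]
    for i in range(len(lattice) - 1):
        first, second = lattice[i], lattice[i + 1]
        following = lattice[i + 2] if i + 2 < len(lattice) else set()
        if "DET" in first and "N" in second and "N" not in following:
            result[i] -= {"CLI:m", "CLI:f"}  # keep the determiner reading
            result[i + 1] -= {"V:PP"}        # keep the noun reading
    return result


# Envió la propuesta al ministro, with hypothetical tag sets per token.
SENTENCE = [{"V"}, {"DET", "CLI:f", "N"}, {"V:PP", "N"}, {"PREP"}, {"N"}]

for tags in apply_np_rule(SENTENCE):
    print(sorted(tags))
```

The noun reading of la (the musical note) survives because the rule does not mention it. Removing it would take a further condition, such as the gender and number agreement checks mentioned above.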

            Multi-word verb recognition

            Multi-word verbs are recognized by intersecting the output of the tagger with FSTs that specify their relevant lexical properties. MWLU recognition can sometimes be part of the disambiguation process. For instance, a multi-word verb (MLV) can be recognized even when clitic pronouns, adverbs, or other lexical material appear between the verbal head and the rest of the verb (see Fig. 4), by intersecting the output FSA of the tagger with an FST that specifies the position and the type of lexical material that can appear inside an MLV (see Fig. 3). MWLU recognition and the elimination of ambiguities related to lexical forms inside multi-word verbs are carried out simultaneously with 2,000 FSTs that specify lexically relevant properties of multi-word verbs. For example, the FST in Fig. 3 chunks the multi-word verb dar por sentado (take for granted). The transducer transforms the chunked expression so that the multi-word verb is converted into a single lemma: Fig. 4 shows the FSA of the sentence Max da siempre por sentado demasiadas cosas (lit. Max takes always for granted too many things), and Fig. 5 shows the transduced FSA, where da siempre por sentado has been detected and converted into a single lexical unit, i.e., a single transition in the resulting FSA. The adverb siempre (always) has been moved after the verb, which is the default position in Spanish. Note that the transduction process has also removed word ambiguity inside the chunk. After the chunking process, these forms are no longer ambiguous, because the lexical construction that defines the FST removes all ambiguities after identifying the corresponding LU.

             

            Fig. 3. Subsequential FST that detects the multi-word verb dar por sentado. The tag PALABRA  (word) matches any word sequence, from 0 up to 4 words.

             

            Fig. 4. Input DFSA of the sentence Max da siempre por sentado demasiadas cosas (Max takes always for granted too many things) before the transduction.

             

            Fig. 5. Output DFSA of the sentence Max da siempre por sentado demasiadas cosas (Max takes always for granted too many things) after the intersection and transduction. The recognized multi-word verb appears in the green box.
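
The transduction shown in Figs. 3 to 5 can be imitated with a simple scan over lemmatized tokens. The sketch below is a simplification, not the SFN transducer: it handles only dar por sentado, matches the fixed forms por and sentado, and represents tokens as (form, lemma) pairs, all of which are assumptions made for the example. The gap limit of four words follows the caption of Fig. 3.

```python
MAX_GAP = 4  # the PALABRA tag in Fig. 3 matches 0 to 4 words


def merge_dar_por_sentado(tokens):
    """Merge 'dar ... por sentado' into one unit; tokens are (form, lemma) pairs."""
    out = []
    i = 0
    while i < len(tokens):
        form, lemma = tokens[i]
        match_at = None
        if lemma == "dar":
            # Look for 'por sentado' after at most MAX_GAP intervening words.
            for gap in range(MAX_GAP + 1):
                j = i + 1 + gap
                if (j + 1 < len(tokens)
                        and tokens[j][0] == "por"
                        and tokens[j + 1][0] == "sentado"):
                    match_at = j
                    break
        if match_at is None:
            out.append((form, lemma))
            i += 1
        else:
            # One token for the whole verb; intervening words move after it.
            out.append((f"{form} por sentado", "dar por sentado"))
            out.extend(tokens[i + 1:match_at])
            i = match_at + 2
    return out


SENTENCE = [
    ("Max", "Max"), ("da", "dar"), ("siempre", "siempre"), ("por", "por"),
    ("sentado", "sentado"), ("demasiadas", "demasiado"), ("cosas", "cosa"),
]

for form, lemma in merge_dar_por_sentado(SENTENCE):
    print(form, lemma)
```

As in Fig. 5, the recognized verb becomes a single unit with the lemma dar por sentado, and the adverb siempre follows it.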

             

            Chunking

            Chunking is carried out by using FSTs that specify:

            • all single and multi-word verbal forms (except passive forms), like estamos estudiando (are studying), ha estado trabajando (has been working), etc.;
            • noun phrases (NPs), and
            • prepositional phrases (PN).

             

            Advantages of the SFN tagging and chunking system

            The advantages of the SFN tagging and chunking system over other systems that also use electronic dictionaries are threefold. First, the SFN system includes the largest Spanish electronic dictionary of MWLUs available:

            • 25,721 nouns, adjectives, adverbs, and predicative prepositional phrases, like en peligro (in danger), etc., which are automatically expanded into 55,000 inflected MWLU forms.

            Second, the SFN system includes 3,009 FSTs for 2,043 multi-word verbs, like dar por sentado (take for granted), etc. (cf. Fig. 5). These FSTs allow both:

            • recognition of MWLUs (even if there is lexical material such as clitic pronouns, adverbs, etc., between the head and the rest of the multi-word verb, as shown above), and
            • simultaneous disambiguation of the constituents of the lexical forms involved in the recognized MWLUs.

            Third, the lexical coverage of our electronic dictionaries of both single-word LUs and MWLUs has been checked against the SFN Corpus, which has 436 million words. This process allowed us both to add missing entries and to eliminate entries that usually appear in traditional Spanish dictionaries but do not really exist in the present-day language (whether written or oral). Eliminating nonexistent lexical forms adapts the size of the dictionary to present-day Spanish, thus avoiding the noise that such forms cause in natural language processing.

            Papers

            SFN Papers
            Papers
            Other FrameNet papers
            http://sato.fm.senshu-u.ac.jp/_web/papers/paper.html
            Other Publications

            People

             

            Principal Investigator:
            Carlos Subirats (CV)

            System Analyst:
            Marc Ortega (Autonomous University of Barcelona)

            Linguists:
            Julia Bernd (International Computer Science Institute), Álvaro Bueno (Universidad Complutense de Madrid), Michael Ellsworth (International Computer Science Institute)

            Consultants:
            Collin Baker (International Computer Science Institute), Isabel Verdaguer (University of Barcelona)