The NOISE database is built from the historical trade indexes of the British Chamber of Commerce for Switzerland (1921–1958) and related sources. It provides structured, searchable information on companies and individuals active in Anglo-Swiss trade during the first half of the twentieth century.
The data was created in several stages:
Digitization and Publication
The original trade indexes were digitized by the Swiss Economic Archives (SWA) and published on e-manuscripta, making the scanned pages openly accessible.
Corpus Building with Transkribus
The scanned volumes were processed in Transkribus to build the research corpus. This included OCR and the manual correction of text regions (“snippet boxing”) to ensure reliable segmentation and transcription.
Export to ALTO XML
The corrected transcriptions were exported in ALTO XML format, providing a structured representation of text regions on each page.
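As a rough illustration of what this structured representation offers, the sketch below reads the text lines out of a small ALTO document using only the Python standard library. The sample XML, the IDs, and the helper names (`local_name`, `text_lines`) are invented for this example. It does not show the project's actual export or the simple-alto-parser API.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened ALTO page; real exports carry coordinates,
# styles and metadata in addition to the text content.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Example"/>
            <SP/>
            <String CONTENT="Company"/>
          </TextLine>
          <TextLine ID="l2">
            <String CONTENT="Basle"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def text_lines(alto_xml):
    """Return (line ID, line text) pairs for every TextLine in the document."""
    root = ET.fromstring(alto_xml)
    lines = []
    for element in root.iter():
        if local_name(element.tag) != "TextLine":
            continue
        # Each word is a String element; its text sits in the CONTENT attribute.
        words = [
            child.get("CONTENT", "")
            for child in element
            if local_name(child.tag) == "String"
        ]
        lines.append((element.get("ID"), " ".join(words)))
    return lines


for line_id, text in text_lines(SAMPLE_ALTO):
    print(line_id, text)
```

Matching on the local tag name instead of a fixed namespace is a convenience choice here, since ALTO files from different tool versions may declare different namespace URIs.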
Pattern- and Dictionary-Based Parsing
Using custom parsing routines, transcription fragments were marked as typed entities (e.g. location, goods, company name). This step combined manually built dictionaries and regular expression patterns.
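The combination of dictionaries and patterns can be pictured with a minimal sketch like the following. The dictionaries, the company-suffix pattern, the comma-based splitting, and the function name `tag_fragment` are all assumptions made for illustration. The project's real dictionaries and rules are more extensive and are not reproduced here.

```python
import re

# Hypothetical, tiny dictionaries; the project's own lists were built manually.
LOCATIONS = {"basel", "basle", "zurich", "london"}
GOODS = {"watches", "chocolate", "machinery"}

# Assumed pattern: a piece ending in a common legal-form suffix is a company name.
COMPANY_SUFFIX = re.compile(r"(?:\bLtd\.?|\bA\.-?G\.?|\bS\.A\.?|& Co\.?)$")


def tag_fragment(fragment):
    """Split a transcription fragment on commas and assign a type to each piece."""
    tagged = []
    for piece in fragment.split(","):
        piece = piece.strip()
        if not piece:
            continue
        key = piece.lower()
        if key in LOCATIONS:
            entity_type = "location"      # dictionary lookup
        elif key in GOODS:
            entity_type = "goods"         # dictionary lookup
        elif COMPANY_SUFFIX.search(piece):
            entity_type = "company_name"  # regular expression match
        else:
            entity_type = "unknown"       # left for manual review
        tagged.append((piece, entity_type))
    return tagged


print(tag_fragment("Example Trading Co. Ltd., Basle, watches"))
```

Pieces that match neither a dictionary nor a pattern are kept as "unknown" rather than dropped, so that they can be reviewed and used to extend the dictionaries.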
Development of the Parser
During this process, the simple-alto-parser Python package was developed to streamline dictionary building and pattern application for similar projects.
Entity Grouping and Identification
All typed entities were grouped and matched to unique identifiers. For example, “Basel” and “Basle” were recognized as variants of the same location and linked to a single GeoNames ID. This process was applied to multiple entity types (locations, companies, goods, etc.).
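The grouping of spelling variants can be sketched as a lookup from a normalized name to a shared identifier. The variant table and the identifiers `LOC-001` and `LOC-002` are placeholders invented for this example. In the actual data, locations are linked to GeoNames IDs, and the matching procedure used by the project may differ from this simplified version.

```python
import unicodedata

# Hypothetical variant table; the identifiers are placeholders, not GeoNames IDs.
VARIANT_TO_ID = {
    "basel": "LOC-001",
    "basle": "LOC-001",
    "bale": "LOC-001",
    "zurich": "LOC-002",
    "zuerich": "LOC-002",
}


def normalize(name):
    """Lower-case a name and strip diacritics so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.casefold().strip()


def group_by_id(mentions):
    """Group raw mentions under their shared identifier (None if unmatched)."""
    groups = {}
    for mention in mentions:
        identifier = VARIANT_TO_ID.get(normalize(mention))
        groups.setdefault(identifier, []).append(mention)
    return groups


print(group_by_id(["Basel", "Basle", "Bâle", "Zürich", "Olten"]))
```

Mentions without a known identifier end up under `None`, which makes it easy to see which variants still need to be added to the table.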
Curation and Publication
The curated data was structured according to FAIR data principles, published on Zenodo, and stored in a MongoDB database. This database powers the search and company views on this website.